Introduction to HTML and XPath
  • 04 Jun 2020


Introduction to HTML

Web pages are written in a markup language called HTML. A browser, like Chrome or Firefox, uses the HTML to produce the visual webpage you see when you visit that page. Everything on the page, such as an image, link, or paragraph of text, can be found in the HTML. Here’s an example of some simple HTML.

<ul class="bookstore">
    <li class="book">
        <img src="BookCover.jpg"/>
        <div class="title" lang="en">Harry Potter</div>
        <span class="author">J K. Rowling</span>
        <span class="year">2005</span>
        <span class="price">29.99</span>
    </li>
</ul>

Here’s how that HTML might be rendered on a web page.
[Image: an example of how the sample HTML might look on a webpage.]

Element

Each object on a web page is contained by an element in the HTML. For example, in the image above you can see a blue price: 29.99. That price is contained in a <span> element in the example HTML.

Attribute / Value

Some HTML elements have an attribute that describes that element. An attribute usually has a value.

In the example HTML above, the <span> element that contains the price has an attribute called class with the value “price”:

<span class="price">29.99</span>
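
As a rough illustration, the Python sketch below reads that single element with the standard-library xml.etree.ElementTree module and prints its name, its attribute, and its content. Python is not part of the original article; it is used here only as a convenient way to inspect the markup, and it works because this snippet happens to be well-formed XML. Real-world HTML is often not well-formed and would likely need a dedicated HTML parser.

```python
import xml.etree.ElementTree as ET

# The price element from the sample HTML, parsed on its own.
span = ET.fromstring('<span class="price">29.99</span>')

print(span.tag)           # element name: span
print(span.attrib)        # attributes as a dict: {'class': 'price'}
print(span.get("class"))  # value of the class attribute: price
print(span.text)          # content of the element: 29.99
```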

Parent / Child / Sibling

Some HTML elements contain other elements. In the HTML example above, the <img> element is contained in the <li> element. In other words, the <img> element is a child of the <li> element. This is typically shown by it being indented under its parent element (<li class="book">):

[Image: how child elements are indented under their parent element in HTML/XML.]

A parent is any element that contains one or more child elements. The <div class="title" lang="en"> element is also a child of the <li> element. It's also a sibling of the <img> element, since they share a parent element:

[Image: how sibling child elements are indented under their parent element in HTML/XML.]
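
The parent, child, and sibling relationships can also be seen by walking the sample HTML in code. The sketch below is an illustration only, assuming Python's standard-library xml.etree.ElementTree module (not something the article itself uses); the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<ul class="bookstore">
    <li class="book">
        <img src="BookCover.jpg"/>
        <div class="title" lang="en">Harry Potter</div>
        <span class="author">J K. Rowling</span>
        <span class="year">2005</span>
        <span class="price">29.99</span>
    </li>
</ul>
"""

ul = ET.fromstring(SAMPLE.strip())  # the outermost element, <ul>
li = ul[0]                          # the only child of <ul>

# Iterating over an element yields its direct children.
# All of these share <li> as their parent, so they are siblings.
children = [child.tag for child in li]
print(children)  # ['img', 'div', 'span', 'span', 'span']
```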

Introduction to XPaths

Simple XPaths

Let's start with a simple analogy: each building in the real world has a street address, which can be used to find that building’s location on a map. Similarly, each object on a web page (such as an image, piece of text, or a price) has an XPath (the address), which can be used to find that object’s location in the web page’s HTML.

We’ll use the previous HTML example as a reference. The author shown in the sample HTML is contained in the <span class="author"> element. This is what an XPath for that element would look like:

/ul/li/span

The XPath traces the path through the HTML from each parent to its child: from <ul> to <li> to <span>. You don't need to list any of the target's siblings, only the direct parent-to-child line to the target element, <span class="author">, which will capture the data J K. Rowling. Note that this path matches every <span> child of the <li>; the author is simply the first match.
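
To see what this path selects, here is a minimal sketch using Python's standard-library xml.etree.ElementTree module, which is an assumption of this example rather than a tool the article prescribes. ElementTree supports only a subset of XPath and does not accept absolute paths on an element, so /ul/li/span is written relative to the parsed <ul> element as ./li/span.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<ul class="bookstore">
    <li class="book">
        <img src="BookCover.jpg"/>
        <div class="title" lang="en">Harry Potter</div>
        <span class="author">J K. Rowling</span>
        <span class="year">2005</span>
        <span class="price">29.99</span>
    </li>
</ul>
"""

ul = ET.fromstring(SAMPLE.strip())

# Equivalent of /ul/li/span, expressed relative to the <ul> root.
all_spans = [s.text for s in ul.findall("./li/span")]
print(all_spans)  # ['J K. Rowling', '2005', '29.99']

# find() returns only the first match, which is the author.
first_span = ul.find("./li/span")
print(first_span.text)  # J K. Rowling
```

The path matches all three <span> children, and the author is the first of them.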

What if there is more than one child element?

Let’s take another look at our HTML sample:

<ul class="bookstore">
    <li class="book">
        <img src="BookCover.jpg"/>
        <div class="title" lang="en">Harry Potter</div>
        <span class="author">J K. Rowling</span>
        <span class="year">2005</span>
        <span class="price">29.99</span>
    </li>
</ul>

An XPath that targets the price, or third <span> element, would look like this:

/ul/li/span[3]
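
A small sketch of the numbered form, again assuming Python's standard-library xml.etree.ElementTree module as a stand-in XPath evaluator. Because ElementTree does not accept absolute paths on an element, /ul/li/span[3] is written relative to the <ul> element as ./li/span[3]. Positions start at 1 and count only siblings with the same element name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<ul class="bookstore">
    <li class="book">
        <img src="BookCover.jpg"/>
        <div class="title" lang="en">Harry Potter</div>
        <span class="author">J K. Rowling</span>
        <span class="year">2005</span>
        <span class="price">29.99</span>
    </li>
</ul>
"""

ul = ET.fromstring(SAMPLE.strip())

# Equivalent of /ul/li/span[3]: the third <span> child of <li>.
third_span = ul.find("./li/span[3]")
print(third_span.get("class"))  # price
print(third_span.text)          # 29.99

# The <img> and <div> siblings are not counted, so span[1] is the author.
first_span = ul.find("./li/span[1]")
print(first_span.text)          # J K. Rowling
```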

Attributes

Instead of numbers, you can also target a specific element by using its attributes. In the HTML sample above, the third <span> element has a class attribute with the value “price”. Let’s write an XPath that targets that span so we can harvest the price:

/ul[1]/li[1]/span[@class="price"]

The XPath above uses the @ symbol to target the named attribute within a <span> element.
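
The attribute test can be tried out with the sketch below, which assumes Python's standard-library xml.etree.ElementTree module as the evaluator. ElementTree handles only a subset of XPath and needs the path written relative to the parsed <ul> element, so the leading /ul[1] step becomes a dot.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<ul class="bookstore">
    <li class="book">
        <img src="BookCover.jpg"/>
        <div class="title" lang="en">Harry Potter</div>
        <span class="author">J K. Rowling</span>
        <span class="year">2005</span>
        <span class="price">29.99</span>
    </li>
</ul>
"""

ul = ET.fromstring(SAMPLE.strip())

# Select the <span> whose class attribute equals "price",
# under the first <li> child of the <ul> root.
matches = ul.findall("./li[1]/span[@class='price']")
prices = [m.text for m in matches]
print(prices)  # ['29.99']

# A different attribute value selects a different sibling.
year = ul.find("./li[1]/span[@class='year']")
print(year.text)  # 2005
```

Unlike a position number, the attribute test keeps working if the order of the <span> elements changes.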

Short XPaths

So far, we’ve been using long XPaths. These start from the top of the HTML and go through all the elements to get to the targeted element. If an XPath describes the target element specifically enough, perhaps by using its attributes as we did above, it doesn’t need to start at the top of the HTML. Let’s shorten the XPath we’ve written:

//span[@class="price"]

The XPath above will find any <span> element whose class attribute has the value “price”, regardless of its location in the HTML or on the web page.

Note: The short XPath always starts with a double-forward slash (“//”) instead of a single forward slash (“/”).
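
To illustrate the "regardless of its location" point, the sketch below runs the short form against the sample HTML and against a copy nested inside extra elements. It assumes Python's standard-library xml.etree.ElementTree module, where the search is written as .//span[@class='price'] (a leading dot is required when searching from an element). The extra <div> and <section> wrappers are invented for this example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<ul class="bookstore">
    <li class="book">
        <img src="BookCover.jpg"/>
        <div class="title" lang="en">Harry Potter</div>
        <span class="author">J K. Rowling</span>
        <span class="year">2005</span>
        <span class="price">29.99</span>
    </li>
</ul>
"""

SHORT_PATH = ".//span[@class='price']"

# The original sample: the price sits two levels below the root.
flat = ET.fromstring(SAMPLE.strip())
flat_price = flat.find(SHORT_PATH).text
print(flat_price)  # 29.99

# The same markup wrapped in hypothetical outer elements:
# the short path still finds the price without being changed.
nested = ET.fromstring("<div><section>" + SAMPLE.strip() + "</section></div>")
nested_price = nested.find(SHORT_PATH).text
print(nested_price)  # 29.99
```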
