Hypertext Markup Language (HTML) is the standard language used to create documents designed to be displayed in a web browser. HTML uses nested elements to define and organize the content of a page, including images, links, and text. A web browser, like Chrome or Firefox, reads an HTML document to render the webpage you see when you visit a URL.
HTML Element
An HTML element consists of an opening tag, the content, and a closing tag. These tags can be used to create a bulleted list, add a table, embed pictures and videos into the website, and much more. Below is a list of common tags you might see while web scraping. This is not an all-encompassing list and there are many more tags, so do not be alarmed if in your HTML journey you discover something new.
Tag | Tag name | Description |
---|---|---|
<p></p> | Paragraph tag | Used to define paragraph content. |
<a></a> | Anchor tag | Used to link one page to another page. |
<li></li> | List item tag | Used to define a single item in a list. |
<ul></ul> | Unordered List tag | Used to list the content without order. |
<ol></ol> | Ordered List tag | Used to list the content in numeric order. |
<img> | Image tag | Used to add an image. Note that this is a single tag with no closing tag: the image source is given in the tag's attributes, so there is no content to close off. |
<table></table> | Table tag | Used to create a table. |
<th></th> | Table header tag | Used to define the column header in a table. |
<tr></tr> | Table row tag | Used to define a row in a table. |
<td></td> | Table cell tag | Used to define the content cells in a table. |
<form></form> | Form tag | Used to create an HTML form. |
<h1></h1> | Heading 1 tag | Used to define the main heading. This should only be used once per page. |
<h2></h2> | Heading 2 tag | Used to define headings, usually a sub-heading. |
<h3></h3> | Heading 3 tag | Used to define headings, usually a smaller sub-heading. |
<em></em> | Emphasis tag | Used to emphasize one or more words (typically rendered in italics). |
<b></b> | Bold tag | Used to define bold text. |
<div></div> | Division tag | Typically used to divide the page into different sections. |
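To make the opening tag, content, and closing tag concrete, here is a minimal sketch that uses Python's standard-library `html.parser` module to log each piece of an element as it is read. The `TagLogger` class and `element_parts` function are hypothetical names chosen for this illustration, and the sample markup is made up.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each part of an element in the order the parser sees it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for an opening tag such as <p> or <img src="...">.
        self.events.append(("open", tag))

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only text.
        if data.strip():
            self.events.append(("content", data.strip()))

    def handle_endtag(self, tag):
        # Called for a closing tag such as </p>.
        self.events.append(("close", tag))


def element_parts(html):
    """Return the (kind, value) events found in an HTML snippet."""
    parser = TagLogger()
    parser.feed(html)
    parser.close()
    return parser.events


print(element_parts("<p>Hello, world</p>"))
# [('open', 'p'), ('content', 'Hello, world'), ('close', 'p')]

print(element_parts('<img src="logo.png">'))
# [('open', 'img')]
```

The second call shows why the image tag stands alone: the parser reports an opening tag and nothing else, because there is no content or closing tag to find.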
Attribute / Value
An HTML element can include an attribute that controls or modifies its behavior. Attributes generally display as name-value pairs, where the attribute name and value are separated using =.
In the example HTML below, the <span> element includes the class attribute with the value “price”:
<span class="price">29.99</span>
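As a rough illustration of how a program sees the same name-value pair, the sketch below parses the snippet with Python's standard-library `xml.etree.ElementTree`. This is an assumption made for the example: `ElementTree` is an XML parser and only works here because the snippet is well formed, whereas real pages often need a dedicated HTML parser.

```python
import xml.etree.ElementTree as ET

# The snippet from the example above.
snippet = '<span class="price">29.99</span>'
element = ET.fromstring(snippet)

print(element.tag)           # span
print(element.attrib)        # {'class': 'price'}
print(element.get("class"))  # price
print(element.text)          # 29.99

# The element's content is text, so convert it before doing arithmetic.
price = float(element.text)
```

The attribute name (`class`) and its value (`price`) describe the element, while the captured data (`29.99`) is the element's content.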
HTML Structure
HTML documents use a hierarchy similar to a family tree to organize content, so it's common to use family terms to describe the relationships between elements. A parent is any element that contains one or more child elements, and child elements are typically indented under their parent element.
Let's start at the top:
- Parent element:
  - html is the parent of head and body.
  - head is the parent of title, while body is the parent of h1 and p.
- Sibling element:
  - head and body are siblings.
  - h1 and p are siblings.
- Child element:
  - title is a child of head.
  - h1 and p are children of body.
- Grandparent/ancestor:
  - html is a grandparent, or ancestor, of title, h1, and p.
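The relationships listed above can be checked programmatically. The following is a minimal sketch, assuming a small sample page whose structure matches the list; the page text and the helper names (`children`, `siblings`, `ancestors`) are hypothetical and exist only for this illustration. It uses Python's standard-library `xml.etree.ElementTree`, which requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical page whose structure matches the relationships listed above.
PAGE = """
<html>
  <head>
    <title>Sample page</title>
  </head>
  <body>
    <h1>Heading</h1>
    <p>Paragraph text.</p>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# ElementTree does not store parent links, so build a child -> parent map.
parents = {child: parent for parent in root.iter() for child in parent}


def children(element):
    """Tag names of the element's direct children."""
    return [child.tag for child in element]


def siblings(element):
    """Tag names of the other elements that share this element's parent."""
    parent = parents.get(element)
    if parent is None:
        return []
    return [child.tag for child in parent if child is not element]


def ancestors(element):
    """Tag names of the parent, grandparent, and so on, nearest first."""
    chain = []
    while element in parents:
        element = parents[element]
        chain.append(element.tag)
    return chain


h1 = root.find("body/h1")
title = root.find("head/title")

print(children(root))    # ['head', 'body']
print(siblings(h1))      # ['p']
print(ancestors(title))  # ['head', 'html']
```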
Why it matters
The better you understand HTML, the easier it is to create effective web harvesting agents. A grasp of HTML elements and their place within a hierarchy allows you to create and customize XPath expressions that facilitate consistent and clean data capture.
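As a hedged illustration of how elements, attributes, and hierarchy combine in an XPath expression, the sketch below pulls names and prices from a made-up product listing. The page content, the `product` class, and the `extract_products` function are assumptions for this example. It relies on Python's standard-library `xml.etree.ElementTree`, which supports only a limited subset of XPath and needs well-formed markup, so it is not a description of how any particular harvesting tool evaluates expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing page; only the price markup comes from the example above.
PAGE = """
<html>
  <body>
    <div class="product">
      <h2>Widget</h2>
      <span class="price">29.99</span>
    </div>
    <div class="product">
      <h2>Gadget</h2>
      <span class="price">4.50</span>
    </div>
  </body>
</html>
"""


def extract_products(markup):
    """Return (name, price) pairs for every product block in the markup."""
    root = ET.fromstring(markup.strip())
    products = []
    # ".//div[@class='product']": any div descendant whose class is "product".
    for div in root.findall(".//div[@class='product']"):
        # Paths below are relative to each product div (its children).
        name = div.findtext("h2")
        price = div.findtext("span[@class='price']")
        products.append((name, float(price)))
    return products


print(extract_products(PAGE))
# [('Widget', 29.99), ('Gadget', 4.5)]
```

Anchoring the inner paths to each parent `div` keeps every price paired with the right name, which is the kind of consistency that knowing the hierarchy makes possible.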