Hypertext Markup Language (HTML) is the standard language used to create documents designed for display in a web browser. HTML uses nested elements to define and organize the content of a page, including images, links, and text. A web browser, like Chrome or Firefox, reads an HTML document to render the webpage you see when you visit a URL.
An HTML element typically consists of an opening tag, the content, and a closing tag. These tags can be used to create a bulleted list, add a table, embed pictures and videos into the website, and much more. Below is a list of common tags you might see while web scraping. The list is not exhaustive, so do not be alarmed if you discover something new in your HTML journey.
|Opening tag|Name|Closing tag|Description|
|<p>|Paragraph tag|</p>|Used to define paragraph content.|
|<a>|Anchor tag|</a>|Used to link one page to another page.|
|<li>|List item tag|</li>|Used to define an individual item in a list.|
|<ul>|Unordered List tag|</ul>|Used to list the content without order.|
|<ol>|Ordered List tag|</ol>|Used to list the content in numeric order.|
|<img>|Image tag||Used to add image elements. Note that this is a single tag: the image source is specified within the tag itself, so there is nothing to close off.|
|<table>|Table tag|</table>|Used to create a table.|
|<th>|Table header tag|</th>|Used to define the column header in a table.|
|<tr>|Table row tag|</tr>|Used to define a row in a table.|
|<td>|Table cell tag|</td>|Used to define the content cells in a table.|
|<form>|Form tag|</form>|Used to create an HTML form.|
|<h1>|Heading 1 tag|</h1>|Used to define the top-level heading. This should only be used once per page.|
|<h2>|Heading 2 tag|</h2>|Used to define headings, usually a sub-heading.|
|<h3>|Heading 3 tag|</h3>|Used to define headings, usually a smaller sub-heading.|
|<em>|Emphasis tag|</em>|Used to add emphasis (italics) to one or more words.|
|<b>|Bold tag|</b>|Used to define bold text.|
|<div>|Divider tag|</div>|Typically used to divide the website into different sections.|
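To see the opening tag, content, and closing tag pattern in practice, here is a minimal sketch that uses Python's standard `html.parser` module to log each piece in the order it appears. The snippet being parsed, the `TagLogger` class, and the `log_tags` helper are made up for illustration and are not part of any particular scraping tool.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each opening tag, text run, and closing tag in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only runs between tags
            self.events.append(("text", text))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))


# Hypothetical snippet: a two-item unordered list followed by an image.
SNIPPET = '<ul><li>First</li><li>Second</li></ul><img src="logo.png">'


def log_tags(html):
    """Parse the markup and return the recorded (kind, value) events."""
    parser = TagLogger()
    parser.feed(html)
    parser.close()
    return parser.events


for kind, value in log_tags(SNIPPET):
    print(kind, value)
```

Each `li` element produces an open event, a text event, and a close event, while `img` produces only an open event because it has no closing tag.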
Attribute / Value
An HTML element can include an attribute that controls or modifies its behavior. Attributes generally appear as name-value pairs, where the attribute name and value are separated by an equals sign (=).
For example, a <span> element can include the class attribute along with the value assigned to it.
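As a rough illustration of name-value pairs, the sketch below parses a single `span` element and collects its attributes into a dictionary. The attribute values (`price`, `total`), the `AttributeCollector` class, and the `collect_attributes` helper are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect the attributes of every opening tag as a dictionary."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.found.append((tag, dict(attrs)))


# Hypothetical element: the attribute values are assumptions for illustration.
SNIPPET = '<span class="price" id="total">19.99</span>'


def collect_attributes(html):
    """Return a list of (tag, attributes) tuples for the given markup."""
    collector = AttributeCollector()
    collector.feed(html)
    collector.close()
    return collector.found


for tag, attributes in collect_attributes(SNIPPET):
    for name, value in attributes.items():
        print(f"{tag}: {name} = {value}")
```

Here `class` and `id` are the attribute names, and the quoted text after each equals sign is the value.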
HTML documents use a hierarchy similar to a family tree to organize content, so it's common to use family terms to describe the relationship between elements. A parent is any element that contains one or more child elements, and child elements are typically indented under their parent element.
Consider a simple page in which the html element contains head and body, head contains title, and body contains h1 and p. Let's start at the top:
- Parent element
  - html is the parent of head and body.
  - head is the parent of title, while body is the parent of h1 and p.
- Sibling element
  - head and body are siblings.
  - h1 and p are siblings.
- Child element
  - title is a child of head.
  - h1 and p are children of body.
- Grandparent (ancestor) element
  - html is a grandparent, or ancestor, of title, h1, and p.
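The family relationships listed above can be checked with a short sketch. It assumes a small, well-formed page (the text content is invented) and uses Python's standard `xml.etree.ElementTree` module, which only accepts well-formed markup; real-world HTML often needs a more forgiving parser. The helper names `children_by_parent` and `descendant_tags` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with the structure described in the list above.
PAGE = """<html>
  <head>
    <title>Example page</title>
  </head>
  <body>
    <h1>Main heading</h1>
    <p>Some paragraph text.</p>
  </body>
</html>"""


def children_by_parent(markup):
    """Map each parent tag to the tags of its direct children."""
    root = ET.fromstring(markup)
    return {
        element.tag: [child.tag for child in element]
        for element in root.iter()
        if len(element) > 0  # keep only elements that have children
    }


def descendant_tags(markup):
    """List every tag nested anywhere under the root element."""
    root = ET.fromstring(markup)
    return [element.tag for element in root.iter()][1:]  # skip the root itself


for parent, children in children_by_parent(PAGE).items():
    # Children that share a parent are siblings of one another.
    print(f"{parent} is parent to {', '.join(children)}")

print("html is an ancestor of", ", ".join(descendant_tags(PAGE)))
```

Elements that appear in the same list of children, such as head and body, are siblings.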
Why it matters
The better you understand HTML, the easier it is to create effective web harvesting agents. A grasp of HTML elements and their place within a hierarchy allows you to create and customize XPath expressions that facilitate consistent and clean data capture.
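To give a feel for how hierarchy and attributes feed into XPath, here is a minimal sketch using the limited XPath subset supported by Python's standard `xml.etree.ElementTree` module. The page content, the class names, and the `extract` helper are assumptions made for illustration; a dedicated scraping tool will usually offer fuller XPath support.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: tag layout and class names are invented for this example.
PAGE = """<html>
  <body>
    <h1>Products</h1>
    <ul>
      <li class="item">Widget</li>
      <li class="item">Gadget</li>
      <li class="note">Prices exclude tax</li>
    </ul>
  </body>
</html>"""


def extract(markup):
    """Return the page heading and the text of every li with class 'item'."""
    root = ET.fromstring(markup)
    # Walk the hierarchy explicitly: html (root) -> body -> h1
    heading = root.findtext("./body/h1")
    # Match li elements at any depth, filtered by an attribute value
    items = [li.text for li in root.findall(".//li[@class='item']")]
    return heading, items


heading, items = extract(PAGE)
print(heading)
print(items)
```

The first expression relies on the parent and child relationships, while the second relies on an attribute value, so the note item is left out of the results.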