Introduction to HTML

Updated on 14 May 2024

Hypertext Markup Language (HTML) is the standard language used to create documents designed to be displayed in a web browser. HTML uses nested elements to define and organize the content of a page, including images, links, and text. A web browser, like Chrome or Firefox, reads an HTML document to render the webpage you see when you visit a URL.

HTML Element

An HTML element typically consists of an opening tag, the content, and a closing tag. Tags can be used to create a bulleted list, add a table, embed pictures and videos in a page, and much more. Below is a list of common tags you might see while web scraping. The list is not exhaustive, so do not be alarmed if you come across tags that are not shown here.

Tag	Tag name	Description
`<p></p>`	Paragraph tag	Used to define paragraph content.
`<a></a>`	Anchor tag	Used to link one page to another page.
`<li></li>`	List item tag	Used to define a single item in a list.
`<ul></ul>`	Unordered List tag	Used to create a bulleted (unordered) list.
`<ol></ol>`	Ordered List tag	Used to create a numbered (ordered) list.
`<img>`	Image tag	Used to embed an image. Note that this is a single tag with no closing tag: the image is referenced by an attribute on the tag itself, so there is no content to close off.
`<table></table>`	Table tag	Used to create a table.
`<th></th>`	Table header tag	Used to define a column header cell in a table.
`<tr></tr>`	Table row tag	Used to define a row in a table.
`<td></td>`	Table cell tag	Used to define a content cell in a table.
`<form></form>`	Form tag	Used to create an HTML form.
`<h1></h1>`	Heading 1 tag	Used to define the top-level heading. By convention, it should be used only once per page.
`<h2></h2>`	Heading 2 tag	Used to define a second-level heading, usually a sub-heading.
`<h3></h3>`	Heading 3 tag	Used to define a third-level heading, usually a smaller sub-heading.
`<em></em>`	Emphasis tag	Used to emphasize a word or phrase. Browsers typically render it in italics.
`<b></b>`	Bold tag	Used to define bold text.
`<div></div>`	Division tag	Typically used to divide the page into different sections.
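
To make the opening tag, content, and closing tag structure concrete, here is a minimal sketch using Python's standard-library `html.parser` module. The `TagLogger` class, the `log_tags` helper, and the sample markup are illustrative assumptions and are not part of any particular scraping tool.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records each opening tag, piece of content, and closing tag in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.events.append(("content", data.strip()))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))


def log_tags(html):
    """Hypothetical helper: parse a snippet and return the recorded events."""
    parser = TagLogger()
    parser.feed(html)
    parser.close()
    return parser.events


list_events = log_tags("<ul><li>Apples</li><li>Pears</li></ul>")
for event in list_events:
    print(event)

# An <img> tag has no closing tag, so only an "open" event is recorded for it.
image_events = log_tags('<p>Hi</p><img src="logo.png">')
print(image_events)
# [('open', 'p'), ('content', 'Hi'), ('close', 'p'), ('open', 'img')]
```

The first call shows how each `<li>` element sits between the opening and closing `<ul>` tags, which is the nesting pattern that the rest of this article builds on.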

Attribute / Value

An HTML element can include one or more attributes that control or modify its behavior. Attributes are generally written as name-value pairs inside the opening tag, with the attribute name and value separated by an equals sign (=).

In the example HTML below, the <span> element includes the class attribute with the value "price":

<span class="price">29.99</span>
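
As a rough illustration of how a program sees this name-value pair, the sketch below parses the same snippet with Python's standard-library `html.parser`. The `AttributeCollector` class name is an assumption made for this example.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects each opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.found.append((tag, dict(attrs)))


collector = AttributeCollector()
collector.feed('<span class="price">29.99</span>')
collector.close()

tag, attributes = collector.found[0]
print(tag)                  # span
print(attributes["class"])  # price
```

The attribute belongs to the opening tag, while 29.99 is the element's content. Scraping rules often rely on attributes such as class to locate the content they need.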

HTML Structure

HTML documents use a hierarchy similar to a family tree to organize content, so it's common to use family terms to describe the relationships between elements. A parent is any element that contains one or more child elements, and child elements are typically indented under their parent element.

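Consider a minimal page in which the html element contains head and body, head contains title, and body contains h1 and p. Let's start at the top: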

Parent elements:
- html is the parent of head and body.
- head is the parent of title, while body is the parent of h1 and p.
Sibling elements:
- head and body are siblings.
- h1 and p are siblings.
Child elements:
- title is a child of head.
- h1 and p are children of body.
Grandparent/ancestor:
- html is the grandparent, or ancestor, of title, h1, and p.
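
The relationships listed above can also be checked programmatically. The sketch below is a minimal illustration, assuming a sample page (the `SAMPLE` string and its placeholder text are hypothetical) that matches the structure described. It uses Python's standard-library `xml.etree.ElementTree`, which is an XML parser and only works here because the sample is well-formed. Real-world HTML often needs a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; children are indented under their parent.
SAMPLE = """<html>
  <head>
    <title>Sample page</title>
  </head>
  <body>
    <h1>Heading</h1>
    <p>Paragraph text.</p>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)


def children(element):
    """Return the tag names of an element's direct children."""
    return [child.tag for child in element]


# Map every child element to its parent so we can walk upward.
parent_of = {child: parent for parent in root.iter() for child in parent}

body = root.find("body")
title = root.find("head/title")

print(root.tag, children(root))          # html ['head', 'body']
print(body.tag, children(body))          # body ['h1', 'p']
print(parent_of[title].tag)              # head
print(parent_of[parent_of[title]].tag)   # html
```

Walking from title up to head and then to html mirrors the child, parent, and grandparent terms used in the list.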

Why it matters

The better you understand HTML, the easier it is to create effective web harvesting agents. A grasp of HTML elements and their place within a hierarchy allows you to create and customize XPath expressions that facilitate consistent and clean data capture.
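
As a small preview of how hierarchy and attributes combine in an XPath expression, here is a sketch that uses the limited XPath subset supported by Python's standard-library `xml.etree.ElementTree`. The `PAGE` markup, the product names, and the second price are invented for illustration, and a harvesting tool may evaluate XPath differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product listing used only for this example.
PAGE = """<html>
  <body>
    <div class="product">
      <h2>Widget</h2>
      <span class="price">29.99</span>
    </div>
    <div class="product">
      <h2>Gadget</h2>
      <span class="price">14.50</span>
    </div>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# Find every span with class "price" that is a child of a div with class "product".
path = ".//div[@class='product']/span[@class='price']"
prices = [span.text for span in root.findall(path)]

print(prices)  # ['29.99', '14.50']
```

The expression reads from parent to child: it first selects the div elements by their class attribute, then steps down to the span inside each one.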

What's Next

Introduction to XPath