Introduction to HTML
  • 14 May 2024

Hypertext Markup Language (HTML) is the standard language for creating documents designed to be displayed in a web browser. HTML uses nested elements to define and organize a page's content, including text, images, and links. A web browser, such as Chrome or Firefox, reads an HTML document to render the webpage you see when you visit a URL.

HTML Element

An HTML element usually consists of an opening tag, the content, and a closing tag. Elements can be used to create a bulleted list, add a table, embed pictures and videos in a page, and much more. Below is a list of common tags you might see while web scraping. The list is not exhaustive, so do not be alarmed if you come across a tag that is not shown here.

| Tag | Tag name | Description |
| --- | --- | --- |
| <p></p> | Paragraph tag | Defines a paragraph of text. |
| <a></a> | Anchor tag | Creates a hyperlink, for example from one page to another. |
| <li></li> | List item tag | Defines a single item inside a list. |
| <ul></ul> | Unordered list tag | Creates a bulleted list whose items have no particular order. |
| <ol></ol> | Ordered list tag | Creates a numbered list whose items appear in order. |
| <img> | Image tag | Embeds an image. Note that this is a single tag with no closing tag: the image is referenced through the tag's src attribute, so there is no content to close off. |
| <table></table> | Table tag | Creates a table. |
| <th></th> | Table header tag | Defines a header cell in a table, typically a column header. |
| <tr></tr> | Table row tag | Defines a row in a table. |
| <td></td> | Table cell tag | Defines a data cell in a table. |
| <form></form> | Form tag | Creates an HTML form. |
| <h1></h1> | Heading 1 tag | Defines the top-level heading. It is typically used only once per page. |
| <h2></h2> | Heading 2 tag | Defines a second-level heading, usually a sub-heading. |
| <h3></h3> | Heading 3 tag | Defines a third-level heading, usually a smaller sub-heading. |
| <em></em> | Emphasis tag | Marks emphasized text, which browsers typically render in italics. |
| <b></b> | Bold tag | Marks text to be displayed in bold. |
| <div></div> | Division tag | Typically used to divide a page into sections. |
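
To make the opening tag, content, and closing tag pattern concrete, here is a minimal sketch that uses Python's standard html.parser module to log each piece as it is encountered. The TagLogger class and the snippet (including the fruit.png file name) are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each opening tag, text run, and closing tag in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text between tags
            self.events.append(("content", data.strip()))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))


# Hypothetical snippet: a list with two items, followed by an image.
snippet = '<ul><li>Apples</li><li>Pears</li></ul><img src="fruit.png">'

logger = TagLogger()
logger.feed(snippet)
logger.close()

for kind, value in logger.events:
    print(kind, value)
```

In the printed events, each li element shows up as an open, content, close triple, while img produces only an open event because it has no closing tag.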

Attribute / Value

An HTML element can include an attribute that controls or modifies its behavior. Attributes generally display as name-value pairs, where the attribute name and value are separated using =.

In the example HTML below, the <span> element includes the class attribute with the value "price":

<span class="price">29.99</span>
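
As a rough illustration of how a scraper might read that name-value pair, the sketch below parses the same span snippet with Python's standard xml.etree.ElementTree module. This works only because the snippet happens to be well-formed; it is an assumption of this example, and real-world pages often need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# The span example from above. It is well-formed, so an XML parser can read it.
element = ET.fromstring('<span class="price">29.99</span>')

print(element.tag)           # the element name
print(element.attrib)        # every attribute as a name -> value dictionary
print(element.get("class"))  # the value of a single attribute
print(element.text)          # the content between the opening and closing tags
```

The attribute value ("price") describes the element, while the content ("29.99") is the data a scraper would usually want to capture.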

HTML Structure

HTML documents use a hierarchy similar to a family tree to organize content, so it's common to use family terms to describe the relationships between elements. A parent is any element that contains one or more child elements, and child elements are typically indented under their parent element.

[Image: example HTML document tree with the html, head, title, body, h1, and p elements]

Let's start at the top:

  • Parent element
    • html is the parent of head and body.
    • head is the parent of title, while body is the parent of h1 and p.
  • Sibling element
    • head and body are siblings.
    • h1 and p are siblings.
  • Child element
    • title is a child of head.
    • h1 and p are children of body.
  • Grandparent/ancestor
    • html is the grandparent (an ancestor) of title, h1, and p.
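
The relationships listed above can be checked programmatically. The following sketch assumes a small, well-formed page with the same structure (the text inside each element is placeholder content) and uses Python's standard xml.etree.ElementTree module. The parent_of dictionary and child_tags helper are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with the structure described above; the text is placeholder.
page = """
<html>
  <head>
    <title>Example page</title>
  </head>
  <body>
    <h1>Heading</h1>
    <p>Paragraph text.</p>
  </body>
</html>
"""

root = ET.fromstring(page.strip())

# ElementTree elements do not store a link to their parent,
# so build a child -> parent lookup by walking the whole tree.
parent_of = {child: parent for parent in root.iter() for child in parent}


def child_tags(element):
    """Return the tag names of an element's direct children."""
    return [child.tag for child in element]


body = root.find("body")
title = root.find("head/title")

print(child_tags(root))                 # children of html
print(child_tags(body))                 # children of body (siblings of each other)
print(parent_of[title].tag)             # parent of title
print(parent_of[parent_of[title]].tag)  # grandparent of title
```

Walking up two levels from title reaches html, which matches the grandparent relationship described in the list.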

Why it matters

The better you understand HTML, the easier it is to create effective web harvesting agents. A grasp of HTML elements and their place within a hierarchy allows you to create and customize XPath expressions that facilitate consistent and clean data capture.
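
As an illustration of how elements, attributes, and hierarchy combine in an XPath expression, the sketch below runs two simple expressions against a hypothetical product listing. It relies on the limited XPath subset built into Python's standard xml.etree.ElementTree module; the page content, class names, and variable names are assumptions made for this example, and a harvesting tool may support a richer XPath dialect.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product listing used only for this example.
page = """
<html>
  <body>
    <div class="product">
      <h2>Widget</h2>
      <span class="price">29.99</span>
    </div>
    <div class="product">
      <h2>Gadget</h2>
      <span class="price">14.50</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(page.strip())

# Select every span with class="price", at any depth below the root.
prices = [span.text for span in root.findall(".//span[@class='price']")]

# Select each product div by its position in the hierarchy, then read the
# name and price relative to that div so the two values stay paired.
products = [
    (div.findtext("h2"), div.findtext("span[@class='price']"))
    for div in root.findall("./body/div[@class='product']")
]

print(prices)
print(products)
```

The second expression anchors on the parent div first, which is one way to keep related values together when a page lists many similar items.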

