Well, it might sound challenging at first, but with the right guidance, parsing HTML with regex can become easy.
Whether you’re a developer aiming to extract specific content from web pages or a data enthusiast looking for efficient methods to sift through massive amounts of web data, understanding the basics of parsing HTML with Regex is essential.
This blog goes deep into this technique, offering insights, examples, and best practices for those keen on mastering the art of HTML parsing using regular expressions.
What will you learn from this article?
I am assuming that you have already installed Python 3.x on your computer. If not, then please install it from here.
Come, let us explore the art of HTML parsing using Python and Regex!
A regular expression, or regex, is a sequence of characters that forms a search pattern for matching strings. It is a powerful tool for text processing and data extraction, and it is supported by almost every language, including Python, JavaScript, and Java. It also has strong community support, which makes it easy to find help when writing search and match patterns.
There are five types of Regular Expressions:
Let’s say we have this text.
text = "I have a cat and a catcher. The cat is cute."
Our task is to search for all occurrences of the word “cat” in the above-given text string.
We are going to execute this task using the re library of Python.
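In this case, the pattern will be r'\bcat\b'. Let me explain the step-by-step breakdown of this pattern. The leading \b matches a word boundary, which is the position between a word character and a non-word character. The characters cat match that literal text. The trailing \b requires another word boundary, so "cat" only matches as a whole word and not as part of a longer word such as "catcher".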
import re

text = "I have a cat and a catcher. The cat is cute."
pattern = r'\bcat\b'

# findall() returns a list of every non-overlapping match
matches = re.findall(pattern, text)
print(matches)
In this example, we used the re.findall() function from the re module in Python to find all matches of the regular expression pattern \bcat\b in the text string. The function returned a list containing every match of the word "cat". The word "catcher" is skipped because the trailing \b requires a word boundary right after "cat".
The output will look like this.
['cat', 'cat']
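To see what the word boundaries contribute, here is a small sketch that runs the same search with and without \b on the same text. The variable names loose_matches and strict_matches are only illustrative.

```python
import re

text = "I have a cat and a catcher. The cat is cute."

# Without word boundaries, "cat" also matches inside "catcher".
loose_matches = re.findall(r'cat', text)

# With \b on both sides, only the standalone word "cat" matches.
strict_matches = re.findall(r'\bcat\b', text)

print(loose_matches)   # ['cat', 'cat', 'cat']
print(strict_matches)  # ['cat', 'cat']
```

The extra match in the first list comes from the "cat" at the start of "catcher".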
This is just a simple example for beginners. Of course, regular expressions get more complex when they are applied to real HTML code. Now, let's test our skills by parsing HTML using regex with a more complex example.
We are going to scrape a website in this section. We are going to download HTML code from the target website and then parse data out of it completely using Regex.
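For the sake of this tutorial, I am going to use this website. We will use two Python libraries to execute this task: requests, a third-party library for downloading the page, and re, the regex module that ships with Python's standard library.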
It is always better to decide in advance what exactly we want to scrape from the website.
We are going to scrape two things from this page: the title and the price of each book.
I will make a GET request to the target website in order to download all the HTML data from the website. For that, I will be using the requests library.
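import requests
import re

l = []
o = {}

# Send a GET request to the website
target_url = 'http://books.toscrape.com/'
response = requests.get(target_url)

# Extract the HTML content from the response
html_content = response.text
Here is what we have done in the above code. We imported the requests and re libraries, created an empty list l and an empty dictionary o to hold the results, sent a GET request to the target URL, and stored the HTML of the response in html_content.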
Now, we have to design a pattern through which we can extract the title and the price of the book from the HTML content. First, let’s focus on the title of the book.
The title is stored inside the h3 tag. Inside it there is an a tag which holds the title. So, the title pattern should look like this.
title_pattern = r'<h3><a.*?>(.*?)</a></h3>'
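I know you might be wondering how I created this pattern, right? Let me explain this pattern by breaking it down.
<h3>: This is a literal string that matches the opening h3 tag.
<a.*?>: This matches the opening a tag along with any attributes it carries, such as href.
(.*?): This is a capturing group that matches the text between the opening and closing a tags in a non-greedy (minimal) way.
</a></h3>: These are literal strings that match the closing a and h3 tags.
So, the title_pattern is designed to match the entire HTML element for the book title, including the opening and closing h3 tags, the a tag with any attributes, and the text within the a tags, which represents the book title. The captured text within the parentheses (.*?) is then used to extract the actual title of the book using the re.findall() function in Python.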
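Before running the pattern against a live page, it can help to try it on a small string. The sketch below assumes the title pattern takes the form <h3><a.*?>(.*?)</a></h3> as described above. The sample_html string is a made-up snippet that only imitates the h3 and a structure, not markup copied from the site.

```python
import re

# Hypothetical snippet modeled on the h3 > a structure described above.
sample_html = (
    '<h3><a href="catalogue/book-one/index.html" title="Book One">Book One</a></h3>'
    '<h3><a href="catalogue/book-two/index.html" title="Book Two">Book Two</a></h3>'
)

# <a.*?> skips over the attributes, (.*?) captures the link text.
title_pattern = r'<h3><a.*?>(.*?)</a></h3>'

titles = re.findall(title_pattern, sample_html)
print(titles)  # ['Book One', 'Book Two']
```

Because the pattern has exactly one capturing group, re.findall() returns only the captured text rather than the whole matched element.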
Now, let’s shift our focus to the price of the book.
The price is stored inside the p tag with class price_color. So, we have to create a pattern that starts with <p class="price_color"> and ends with </p>.
price_pattern = r'<p class="price_color">(.*?)</p>'
This one is pretty straightforward compared to the other one. But let me again break it down for you.
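<p class="price_color">: This is a literal string that matches the opening <p> tag with the attribute class="price_color", which represents the HTML element that contains the book price.
(.*?): This is a capturing group that matches the text between the opening and closing <p> tags. Inside the parentheses, the dot matches any character except a newline, and *? is a non-greedy quantifier that matches as few characters as possible.
</p>: This is a literal string that matches the closing </p> tag.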
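The non-greedy quantifier matters as soon as more than one price appears on the same line. The following sketch compares (.*?) with the greedy (.*) on a made-up string. The sample_html value and the names lazy_pattern and greedy_pattern are assumptions for illustration, and the currency symbol is left out to keep the sample ASCII-only.

```python
import re

# Hypothetical snippet with two price elements on a single line.
sample_html = '<p class="price_color">51.77</p><p class="price_color">53.74</p>'

lazy_pattern = r'<p class="price_color">(.*?)</p>'
greedy_pattern = r'<p class="price_color">(.*)</p>'

# The lazy version stops at the first closing </p> it can reach.
lazy = re.findall(lazy_pattern, sample_html)

# The greedy version runs on to the last closing </p> on the line.
greedy = re.findall(greedy_pattern, sample_html)

print(lazy)    # ['51.77', '53.74']
print(greedy)  # ['51.77</p><p class="price_color">53.74']
```

With the greedy version, both elements collapse into a single oversized match, which is why the patterns in this tutorial use (.*?).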
So, the price_pattern is designed to match the entire HTML element for the book price, including the opening <p> tag with the class="price_color" attribute, the text within the <p> tags, which represents the book price, and the closing </p> tag. The captured text within the parentheses (.*?) is then used to extract the actual price of the book using the re.findall() function in Python.
import requests
import re

# Initialize lists to store titles and prices
titles = []
prices = []

# Send a GET request to the website
target_url = 'http://books.toscrape.com/'
response = requests.get(target_url)

# Extract the HTML content from the response
html_content = response.text

# Define regular expression patterns for title and price
title_pattern = r'<h3><a.*?>(.*?)</a></h3>'
price_pattern = r'<p class="price_color">(.*?)</p>'

# Find all matches of title and price patterns in the HTML content
titles = re.findall(title_pattern, html_content)
prices = re.findall(price_pattern, html_content)

# Combine titles and prices in a list of dictionaries
book_data = []
for title, price in zip(titles, prices):
    book_data.append({"Title": title, "Price": price})

# Print the result
for book in book_data:
    print(f"Title: {book['Title']}, Price: {book['Price']}")
Since the titles and prices variables are lists, we have to run a for loop to pick out each title with its corresponding price and store them as a dictionary inside the list l.
for i in range(len(titles)):
    o["Title"] = titles[i]
    o["Price"] = prices[i]
    l.append(o)
    # Start a fresh dictionary for the next book
    o = {}

print(l)
This way we will get all the prices and titles of all the books present on the page.
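The index-based loop assumes that both lists have the same length. As a rough sketch, the same pairing can be written with zip(), along with a length check that flags a pattern that missed some elements. The sample lists below are made up and stand in for the re.findall() results, and lengths_match is a hypothetical name.

```python
# Hypothetical sample lists standing in for the re.findall() results.
titles = ["Book One", "Book Two", "Book Three"]
prices = ["51.77", "53.74"]  # one price missing on purpose

# A mismatch usually means one of the patterns failed on some element.
lengths_match = len(titles) == len(prices)

# zip() pairs items by position and stops at the shorter list,
# so it never raises an IndexError the way titles[i] / prices[i] can.
book_data = [{"Title": t, "Price": p} for t, p in zip(titles, prices)]

print(lengths_match)  # False
print(book_data)
```

Here the third title is silently dropped, so checking lengths_match before trusting the result is a cheap safeguard.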
You can scrape many more things like ratings, product URLs, etc. using regex. But for the current scenario, the code will look like this.
import requests
import re

l = []
o = {}

# Send a GET request to the website
target_url = 'http://books.toscrape.com/'
response = requests.get(target_url)

# Extract the HTML content from the response
html_content = response.text

# Define regular expression patterns for title and price
title_pattern = r'<h3><a.*?>(.*?)</a></h3>'
price_pattern = r'<p class="price_color">(.*?)</p>'

# Find all matches of title and price patterns in the HTML content
titles = re.findall(title_pattern, html_content)
prices = re.findall(price_pattern, html_content)

for i in range(len(titles)):
    o["Title"] = titles[i]
    o["Price"] = prices[i]
    l.append(o)
    o = {}

print(l)
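Ratings and product URLs, mentioned earlier, can be pulled out in the same way. The sketch below is only an illustration: the sample_html string is made up, and the star-rating class name and the href layout are assumptions about how a listing page might mark up these fields, so the patterns should be checked against the real HTML before use.

```python
import re

# Hypothetical snippet; the class and attribute names are assumptions
# about how a listing page might mark up ratings and links.
sample_html = (
    '<article class="product_pod">'
    '<p class="star-rating Three"></p>'
    '<h3><a href="catalogue/book-one/index.html" title="Book One">Book One</a></h3>'
    '</article>'
    '<article class="product_pod">'
    '<p class="star-rating Five"></p>'
    '<h3><a href="catalogue/book-two/index.html" title="Book Two">Book Two</a></h3>'
    '</article>'
)

# (\w+) captures the word that follows "star-rating " in the class value.
rating_pattern = r'<p class="star-rating (\w+)">'

# (.*?) captures the href value up to the next double quote.
url_pattern = r'<h3><a href="(.*?)"'

ratings = re.findall(rating_pattern, sample_html)
urls = re.findall(url_pattern, sample_html)

print(ratings)  # ['Three', 'Five']
print(urls)     # ['catalogue/book-one/index.html', 'catalogue/book-two/index.html']
```

Both patterns follow the same recipe as the title and price patterns: fixed text around the value, and one capturing group for the part to keep.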
In this guide, we learned how you can parse HTML with Regex. For newcomers, regular expressions may initially seem daunting, but with consistent practice, their power and flexibility become unmistakable.
Regular expressions stand as a potent tool, especially when dealing with multifaceted data structures. Our previous article on web scraping Amazon data & pricing using Python showcased the use of regex in extracting product images, offering further insights into the versatility of this method. For a deeper dive and more real-world examples, I recommend giving it a read.
I hope you liked this little tutorial on parsing HTML with Regex. If you did, please do not forget to share it with your friends and on social media.