Prof. Frenzel · Oct 5, 2024
#KB Web Scraping — Fundamentals

Dear fellow Data Scientists,

Access to timely and relevant data is key to building effective models. Web scraping provides a flexible method for gathering such data, particularly when APIs are unavailable or limited. This technique allows us to collect information like product reviews or sentiment data, which can set our models apart.

In this article, written by Ryan Shihabi, we'll cover the basics of web scraping, focusing on how to extract and prepare web data. From understanding HTML structure to parsing it with BeautifulSoup, this guide offers a hands-on approach to data collection.

The Structure of HTML

HTML (HyperText Markup Language) organizes data in a tag-based format: a piece of data either lives inside a tag itself or between an opening and a closing tag. Let's break it down with a simple example:

[Image: webscraper.io demo HTML structure]

Data inside a tag is called an attribute, and attributes provide a great way to filter for specific information when scraping.

[Image: example h4 tag]

Data between the opening and closing tags is referred to as the text (or value) and typically holds the information we want to scrape.
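For example, in the single line below (a simplified fragment modeled on the demo site), class="price" is an attribute of the <h4> tag, while the dollar amount between the opening and closing tags is its text:

<h4 class="price">$356.49</h4>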

Web Scraping with BeautifulSoup

Setup

Once you understand the HTML structure, the next step is to install BeautifulSoup, a popular Python library for parsing HTML, along with the requests library for fetching pages:

pip install beautifulsoup4 requests

Next, import the requests and bs4 packages into your file:

import requests
from bs4 import BeautifulSoup

Requesting the Web Page

In order to get the HTML for a website, we must send what is called an HTTP GET request. This is, simply put, a request that retrieves data from a web server. Once we send the request and receive a response, we will have the HTML of that website.

URL = "https://webscraper.io/test-sites/e-commerce/allinone"
page = requests.get(URL)
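In practice, it is worth confirming the request actually succeeded before parsing; a slightly more defensive variant of the call above might look like this:

page = requests.get(URL, timeout=10)  # avoid hanging forever on a slow server
page.raise_for_status()               # raises requests.HTTPError on 4xx/5xx responses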

Parsing the HTML

With our newly returned HTML, we hand it to BeautifulSoup to parse. Parsing just means making the data easily searchable by a program.

soup = BeautifulSoup(page.text, 'html.parser')
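As a quick sanity check that the parse worked, you can peek at something simple, such as the page title:

# The parsed tree should now expose the page's <title> tag
print(soup.title.text)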

Filtering Parsed Data

With our HTML data parsed by BeautifulSoup, we can now look for the values we are trying to scrape.

The website I am using is a demo site specifically tailored to practice web scraping: https://webscraper.io/test-sites/e-commerce/allinone

On this website, I will grab the name, price, and rating of three laptops.

[Image: webscraper.io demo e-commerce site]

Let’s take a look at the HTML and see where these values are located.

I like to use Inspect Element in Chrome to look around the HTML.

The goal of looking through the HTML is to find a specific attribute that is unique to the tag holding the values we want to scrape.

  • It looks like the price of each card can be found in an <h4> tag with the class attribute "price".
  • The name of the product can be found in an <a> tag with the class attribute "title".
  • Another valuable piece of information, the rating, can be found in a <p> tag under an attribute called "data-rating" (all three are sketched below).
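Putting these together, each product card looks roughly like the sketch below. This is a simplified reconstruction based on the tags and attributes just listed (the real markup carries additional classes); the <div class="card-body"> wrapper will come into play in the find_all() section further down:

<div class="card-body">
  <h4 class="price">$356.49</h4>
  <a class="title">Lenovo V110-15...</a>
  <p data-rating="2">...</p>
</div>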

Finding tags with unique attributes allows us to filter the parsed BeautifulSoup data and grab our values. There is a method called find(), which returns the first matching tag and takes two common inputs:

  1. The tag type: “h4”, “p”, “a”, etc.
  2. The attribute: class_, id, or a dictionary

Scraping One Tag

# Extract price
price = soup.find('h4', class_='price').text.strip()
# Extract product title
product_title = soup.find('a', class_='title').text.strip()
# Extract rating
rating = soup.find('p', {'data-rating': True})['data-rating']

Note that it is important to grab the text value by specifying .text after the tag has been located in the parsed HTML.

The common attributes class and id get their own keyword arguments (class_, id). For any other attribute, place the attribute name as a dictionary key; setting its value to True matches any tag that has the attribute, whatever its value.

The strip() method simply removes leading and trailing whitespace, such as newlines and extra spaces.
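One caveat worth knowing: find() returns None when nothing matches, and calling .text on None raises an AttributeError. A minimal guard might look like this:

# Only read .text if the tag was actually found
price_tag = soup.find('h4', class_='price')
price = price_tag.text.strip() if price_tag else None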

Scraping Multiple Tags

Replacing find() with find_all() returns a list of all matches instead of just the first. Note that it is better to run find_all() on the tag that contains all of the data you want to scrape; this way each result can be searched on its own.

In this case, I found all the product cards are contained in a <div> tag with the class name “card-body”.

products = soup.find_all('div', class_='card-body')

for product in products:
    # Extract price
    price = product.find('h4', class_='price').text.strip()

    # Extract product title
    product_title = product.find('a', class_='title').text.strip()

    # Extract rating (from 'data-rating' attribute)
    rating = product.find('p', {'data-rating': True})['data-rating']

    print(price, product_title, rating)

Output:

$356.49 Lenovo V110-15... 2
$93.99 Samsung Galaxy 3
$1299 MSI GL62VR 7RF... 3

Data Filtering

Before placing these values into a dataset, it is a good idea to filter out any unnecessary characters and cast each value to its correct data type.

The price values contain a dollar sign and are stored as strings. In order to convert them to floats, we must first remove the dollar sign. It is important to ensure this pattern is consistent; if it is not, you will have to check for more conditions.

# Text data
price = product.find('h4', class_='price').text.strip()

# Slicing the original price text to drop the first character ($)
price_float = float(price[1:])
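If the pattern does vary, say some prices carry thousands separators, a slightly more defensive sketch (assuming only a leading currency symbol and commas need removing) is:

# Strip a leading dollar sign and any thousands separators before casting
price_float = float(price.lstrip('$').replace(',', ''))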

The description text sometimes includes an ellipsis (…). This can be removed with the replace() string method, which takes two parameters:

  1. The string to find
  2. The string to replace the found string with

# Product name
product_title = product.find('a', class_='title').text.strip()

# Replace the ellipsis with an empty string
product_title_filtered = product_title.replace("...", "")

Filtered output:

356.49 Lenovo V110-15 2
93.99 Samsung Galaxy 3
1299 MSI GL62VR 7RF 3
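To turn these filtered values into an actual dataset, one option (assuming pandas, which is not covered above) is to collect each card as a dictionary and build a DataFrame at the end:

import pandas as pd

rows = []
for product in products:
    rows.append({
        'title': product.find('a', class_='title').text.strip().replace("...", ""),
        'price': float(product.find('h4', class_='price').text.strip()[1:]),
        'rating': int(product.find('p', {'data-rating': True})['data-rating']),
    })

df = pd.DataFrame(rows)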

Common Problems

  • Timeouts: If you send too many requests to a website, you will probably receive a timeout response with no data. Look into the website's request limits and set your request intervals accordingly. More on this in the best practices section.
  • JavaScript Rendering: If you go to scrape a website and the request returns next to nothing, the application may render its content in the browser after the initial response. To counteract this, you will need a browser-automation tool such as Selenium, which scrapes the page as the browser sees it rather than from the raw response (see the sketch below).
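As a rough sketch of the Selenium route (assuming the selenium package and a Chrome installation; Selenium 4 manages the driver automatically):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get(URL)  # the browser executes the page's JavaScript
soup = BeautifulSoup(driver.page_source, 'html.parser')  # DOM after rendering
driver.quit()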

Best Practices

If you need to gather a ton of information quickly, please read up on the site's scraping and request limits. If a website only allows 20 requests a minute, make sure you time your request intervals accordingly. That can be achieved with something as simple as a wait or sleep call:

from time import sleep

# Sleep for 1 second
sleep(1)
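For the 20-requests-per-minute example above, that works out to one request every three seconds. A minimal pacing sketch (the list of URLs is hypothetical):

from time import sleep
import requests

urls = []  # hypothetical list of pages to fetch

for url in urls:
    page = requests.get(url, timeout=10)
    # ... parse page.text with BeautifulSoup here ...
    sleep(3)  # 60 seconds / 20 requests = one request every 3 seconds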

These are the essential steps for getting started with web scraping; you will continue to learn more through the specific projects you apply it to. Below are further resources that may align with the project you want to use the BeautifulSoup library for.

Further Resources And Specific Examples

For additional information and more advanced techniques in web scraping, explore these resources:
