Web Scraping with Python using Beautiful Soup and Selenium

12 May 2019

Web scraping is the process of programmatically extracting information from web pages. You would typically use a technique like web scraping when you need to retrieve information from a website that does not have an API.

Web scraping can be useful if you want to automate tedious, repetitive tasks. Perhaps you have to access an internal company portal or partner website each day to view some data for a report, or your Python project needs to regularly check data on another website that does not have an API.

In this article, we'll look at how you can use Python to build a simple web scraping tool. We'll then move on to more advanced techniques, such as getting information from websites that add data dynamically with JavaScript after the initial page has loaded, or that require you to log in before you can access the data.

Caution

Before getting started there are two important things you need to be aware of:

You should always have consent before scraping data from a website. You should get permission in writing from the owner or ensure that the content on their website is not protected by a clause in their website terms of use. You may be breaking the law if you do not respect any copyright that they have claimed on their website content.

Make sure that you rate limit your requests so as not to overwhelm the server you are requesting data from. If you do not add delays between your requests, your script could submit hundreds or thousands of requests to a web server in immediate succession, which is effectively a Denial of Service attack.

In general, just be considerate towards the website or service that you are getting data from.

Quick Start

Okay, let's look at a simple example to fetch data from a normal public website. To get started, you probably want to look at creating and activating a virtual environment for your new project:

$ virtualenv -p python3 webscraper
$ cd webscraper
$ source bin/activate
(webscraper) $ mkdir webscraper && cd webscraper

If you're unfamiliar with virtual environments, you can learn more about them here.

Once your virtual environment is set up and active we can start by installing the Beautiful Soup library, which we'll use to parse web page data and extract the parts that we are interested in. To install Beautiful Soup enter the following into your terminal:

(webscraper) $ pip install beautifulsoup4

Now we can write some simple Python code to fetch a web page, create a Beautiful Soup object from that and extract some basic information:

from urllib.request import urlopen
from bs4 import BeautifulSoup


# Request the page and parse the returned HTML
website = urlopen('http://www.example.com')
page = BeautifulSoup(website.read(), 'html.parser')

# Find the first <h1> element in the page
heading = page.find('h1')

# Print the text inside an HTML element
print(heading.text)

Let's walk through the above code. The first line imports urlopen from the urllib.request module, which we'll use to request a web page from a URL. This module is part of the Python 3 standard library (in Python 2 it was called urllib2), so we do not need to install it with pip. The second line imports the BeautifulSoup class from the Beautiful Soup library that we installed. We then call urlopen() to fetch a web page and create a Beautiful Soup object from the HTML that we've retrieved. Finally, we find the first h1 element in the page source code and print the text within it. The above example uses the standard library, but you may prefer to use the popular requests library to request your pages.

Now that we've created a Beautiful Soup object called page, we can look at how to extract various elements of this web page.

Basic Examples

Here are some common operations that we can perform to extract certain types of data from our page object.

Find a single element

Calling the find() method on page returns the first matching HTML element in the page as a single object, or None if nothing matches:

# Find the first <p> in the page
p = page.find('p')

Find multiple elements

Here we can use the find_all() method to create a list of all the <p> elements in the page.

# Create a list containing all the <p> elements in the page
paragraphs = page.find_all('p')

Find all links in a page

Again, we can use the find_all() method to create a list that contains all of the <a> elements in the page. We can then extract the URL from each <a> tag's href attribute and put that into another list called urls:

# Create a list of all <a> elements
links = page.find_all('a')

# Create a list and fill it with urls from the link elements
urls = []
for link in links:
    urls.append(link.get('href'))

# Do same as code above but more concisely with a list comprehension
urls = [link.get('href') for link in links]
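Not every href is ready to request as-is: link.get('href') returns None for <a> tags that have no href attribute, and many links are relative paths. As a minimal sketch, the hypothetical helper absolute_urls() below (it is not part of Beautiful Soup) uses urljoin() from the standard library to resolve each href against an assumed base URL and skips missing values. The base URL and the sample hrefs are made up for illustration:

```python
from urllib.parse import urljoin


def absolute_urls(base_url, hrefs):
    """Resolve hrefs against base_url, skipping links that have no href."""
    return [urljoin(base_url, href) for href in hrefs if href]


# Example hrefs as they might come back from link.get('href')
hrefs = ['/about', None, 'contact.html', 'https://www.iana.org/domains']
urls = absolute_urls('https://www.example.com/blog/', hrefs)
print(urls)
```

In a real scraper you would pass the URL of the page you fetched as base_url, so that relative links resolve against the page they came from.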

Just remember that if you are then going to request all of these URLs for parsing, you should set a delay between each request so as not to overwhelm the server with potentially hundreds or thousands of requests in immediate succession. To instruct our code to wait, we can import the time module at the top of our code and then use its sleep() function to pause for one second between requests:

import time

# requests is a third-party library: pip install requests
import requests
from bs4 import BeautifulSoup

# request a web page and create a BeautifulSoup object...
# extract urls from all <a> elements...

# request a url, create a BeautifulSoup object, wait one sec before repeating
pages = []
for url in urls:
    web_page = requests.get(url)
    page = BeautifulSoup(web_page.text, 'html.parser')
    pages.append(page)
    time.sleep(1)

Find elements by class

Here are some examples of how you can locate an element by its class:

# Create a list of all elements that have the class of 'card'
cards = page.find_all(class_='card')

# Create a list of all <div> elements that have the class of 'card'
cards = page.find_all('div', class_='card')

# Create a list of all elements that have the class 'card', using CSS selector syntax
cards = page.select('.card')

# Create a list of all elements that have the class 'large' AND 'blue'
only_large_blues = page.select('.large.blue')

# Create a list of all elements that have either class of 'small' OR 'red'
any_small_reds = page.select('.small, .red')

The select() method allows you to search for elements using CSS selector syntax, which is also handy when you need to search for an element by ID.

Find an element by ID

You can use select() to find an element by ID using the CSS selector syntax like this:

link = page.select('#link1')

# Print the text of the first item in the list
print(link[0].text)

Note that select() will always return a list, so make sure to handle the result accordingly.
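Because select() returns an empty list when nothing matches, indexing the result with [0] raises an IndexError for an ID that is not on the page. The sketch below uses a small made-up HTML snippet to show one way to guard against that. It also shows select_one(), which returns the first match, or None when there is none:

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration
html = '<p><a id="link1" href="/one">First</a></p>'
page = BeautifulSoup(html, 'html.parser')

# select() always returns a list, so check it before indexing
matches = page.select('#link1')
text = matches[0].text if matches else None

# select_one() returns a single element, or None if nothing matches
missing = page.select_one('#does-not-exist')
print(text, missing)
```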

Extract element attributes

You can get the value of any attribute of an element using get(). This could be useful for getting attributes like id, class, href and so on. For multi-valued attributes such as class, get() returns a list of values:

# Find the first <p> element in page
p = page.find('p')

# Print a list of values in p's class attribute
print(p.get('class'))
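To illustrate what get() returns in different cases, here is a small sketch using a made-up HTML fragment. It assumes the html.parser backend used elsewhere in this article. The class attribute comes back as a list, a single-valued attribute such as id comes back as a string, and an attribute that is not present gives None:

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration
html = '<p id="intro" class="lead large">Hello</p>'
p = BeautifulSoup(html, 'html.parser').find('p')

classes = p.get('class')    # multi-valued attribute: a list of class names
element_id = p.get('id')    # single-valued attribute: a string
title = p.get('title')      # attribute not present: None
print(classes, element_id, title)
```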

Extract data from a table

Tables can be a little tricky to work with. Let's have a look at a simple example. Assume we're accessing a web page that has the following HTML:

<!DOCTYPE html>
<html lang="en">

<head>
    <title>Dashboard</title>
</head>

<body>
    <table>
        <thead>
            <tr>
                <th>Country</th>
                <th>Visitors</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>United States</td>
                <td>1,503</td>
            </tr>
            <tr>
                <td>India</td>
                <td>987</td>
            </tr>
            <tr>
                <td>United Kingdom</td>
                <td>786</td>
            </tr>
        </tbody>
    </table>
</body>

</html>

First, we need to extract each table row from our page object. We can do this by using the select() method. Here we are looking for any <tr> elements that are nested within any <tbody> elements that are nested within any <table> elements:

rows = page.select('table tbody tr')
data = {}
for row in rows:
    tds = row.select('td')
    if tds:
        data[tds[0].text] = tds[1].text

Once we have the rows we create a dictionary called data and then go through each row, find the <td> elements in the row, and add a new item to data where the key is the first cell in the row (country) and the value for that key is the second cell in the row (number of visitors).

Using a dictionary like this works well for a two-column table, but you will need to adjust your code for tables that have more columns, for example by storing each row as a list of cell values.
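One possible way to handle wider tables is to read the column names from the header row and build one dictionary per table row. The sketch below is only an illustration: the three-column table, including the Signups column and its values, is invented for this example and is not part of the page shown above.

```python
from bs4 import BeautifulSoup

# Made-up three-column table used only for this illustration
html = """
<table>
  <thead><tr><th>Country</th><th>Visitors</th><th>Signups</th></tr></thead>
  <tbody>
    <tr><td>United States</td><td>1,503</td><td>120</td></tr>
    <tr><td>India</td><td>987</td><td>75</td></tr>
  </tbody>
</table>
"""
page = BeautifulSoup(html, 'html.parser')

# Use the header cells as dictionary keys
headers = [th.get_text(strip=True) for th in page.select('table thead th')]

# Build one dictionary per body row, skipping rows with an unexpected cell count
records = []
for row in page.select('table tbody tr'):
    cells = [td.get_text(strip=True) for td in row.select('td')]
    if len(cells) == len(headers):
        records.append(dict(zip(headers, cells)))

print(records)
```

Each item in records then maps every column name to that row's cell text, so the same code works regardless of how many columns the table has.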

The above examples should be enough to handle most common web scraping scenarios. For a full list of methods you can use with Beautiful Soup objects, you can refer to their documentation.

Advanced examples

Sometimes you might have data that is dynamically generated after a website has loaded or requires you to be logged in before you can access it. We'll take a look at how you can handle these situations in this section.

Dynamic content

In some cases, you might want to scrape pages with content that is loaded using JavaScript after the initial page has loaded. In this case, you'll need to use a tool like Selenium to act as your web browser.

Selenium lets you control a real web browser with code, and there is a version that we can control using Python. You can tell it to request a web page, fill out form fields, click a button and so on. So we can use Selenium to get to a point where the data we want has been loaded into the web page's Document Object Model (DOM), the browser's live representation of the page.

Using Selenium is a bit slower than something more lightweight, like urllib or requests, because Selenium has the additional overhead of opening up an actual browser window and so on. However, the difference in speed will only be significant if your web scraper is requesting a lot of pages. For relatively simple scraping projects it should be fine.

To install Selenium, enter the following into your terminal:

(webscraper) $ pip install selenium

Now we'll write some example code that no longer uses urllib to request web pages. Instead, it opens a new Firefox browser, navigates to a website, clicks on a button with the id of show-stats, waits for two seconds and only then creates a Beautiful Soup object from the current state of the web page's source code:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup


# Create a new Firefox browser object
browser = webdriver.Firefox()

try:
    # Go to a website, click the element with the id 'show-stats' and wait 2 secs
    browser.get('https://www.example.com')
    browser.find_element(By.ID, 'show-stats').click()
    time.sleep(2)

    # Create BeautifulSoup object from page source.
    page = BeautifulSoup(browser.page_source, 'html.parser')

    # Parse and extract the data that you need.
    rows = page.select('table#stats tbody tr')
    data = {}
    for row in rows:
        tds = row.select('td')
        if tds:
            data[tds[0].text] = tds[1].text
except Exception as e:
    print(e)
finally:
    browser.quit()

As your code starts to become more complex, you should consider putting it into try/except/finally blocks, so that if your code encounters an error at some point you can tell it what to do instead of having it stop running at the point of the exception. For example, it can be annoying when your code opens a browser, encounters an error and stops running without closing the browser. The above example shows how we can write except and finally blocks to handle this situation.

In the above example, Python will attempt to run the code in the try block. If it encounters an error (exception), it will run the code in the except block (i.e. print the variable containing the exception that was raised). The code in the finally block always runs afterwards, whether the try block completed successfully or an exception was raised, so our browser is closed in either case. If you instead put browser.quit() in the normal Python flow below your try / except, it would be skipped whenever an exception escapes, for example one that except Exception does not catch (such as KeyboardInterrupt) or one raised inside the except block itself.
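The order in which the blocks run can be checked without opening a real browser. In the sketch below, FakeBrowser is a hypothetical stand-in written for this illustration (it is not a Selenium class) whose get() method always fails, and the events list simply records which steps were reached:

```python
events = []


class FakeBrowser:
    """Hypothetical stand-in for a Selenium browser; no real browser is opened."""

    def get(self, url):
        events.append('get')
        raise RuntimeError('page failed to load')

    def quit(self):
        events.append('quit')


browser = FakeBrowser()

try:
    browser.get('https://www.example.com')
    events.append('parsed')  # skipped because get() raised an exception
except Exception as e:
    events.append('error: ' + str(e))
finally:
    browser.quit()  # runs whether or not an exception was raised

print(events)
```

The 'parsed' step never appears in events, while 'quit' is always the last entry, which is the behaviour we rely on to close the browser.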

Logging in to a website

In some cases, you may need to log in to a website before you can access the content. In this case, we could tell Selenium to navigate to the login page, enter our username and password and log in.

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

# Create a new Firefox browser object
browser = webdriver.Firefox()

try:
    browser.get('https://www.example.com')

    # Find the log in link and click it
    login = browser.find_element(By.LINK_TEXT, 'Log in')
    login.click()

    # Find the username field and type the username into it
    username = browser.find_element(By.CSS_SELECTOR, 'input[name=username]')
    username.send_keys("USERNAME")

    # Find the password field and type the password into it
    password = browser.find_element(By.CSS_SELECTOR, 'input[name=password]')
    password.send_keys("PASSWORD")

    # Find the submit button and click it
    submit = browser.find_element(By.CSS_SELECTOR, 'button[type=submit]')
    submit.click()

    # You should now be logged in. Navigate to the required page
    # and extract the data you are looking for.

    # Find the log out link and click it (assumes the link text is 'Log out')
    logout = browser.find_element(By.LINK_TEXT, 'Log out')
    logout.click()
except Exception as e:
    print(e)
finally:
    browser.quit()

You should avoid hard-coding any sensitive credentials like usernames or passwords into your code, especially if you're going to store it in a shared source code repository such as a hosted Git repository, as someone could then come across your credentials and gain access to your account.

Instead, you should look at using environment variables to keep your credentials separate from your source code. This is considered good practice when you need to pass any sensitive credentials like usernames, passwords or API keys into your code.
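As a minimal sketch of this approach, the hypothetical helper below reads the login details from two environment variables. The variable names SCRAPER_USERNAME and SCRAPER_PASSWORD are assumptions made for this example, so use whatever names suit your project:

```python
import os


def load_credentials(env=None):
    """Read login details from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    username = env.get('SCRAPER_USERNAME')
    password = env.get('SCRAPER_PASSWORD')
    if not username or not password:
        raise RuntimeError('Set SCRAPER_USERNAME and SCRAPER_PASSWORD first')
    return username, password
```

You could then call username, password = load_credentials() before the login step and pass those values to send_keys() instead of string literals. The optional env argument only exists so that the function can be tried out with a plain dictionary.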

Scheduling your script

Now that you're able to scrape data, you may want to schedule your script to run at certain intervals, say once a day or once an hour. To do this, you'll need to have your script running on a server somewhere, because your laptop or computer may not always be powered on or online. Serverless compute services like AWS Lambda, Google Cloud Functions or Microsoft Azure Functions are ideal for this kind of situation. Setting up your script on one of these services is beyond the scope of this article, but you can refer to each provider's documentation for details on how to use their services.

Saving data

Now that you've scraped your data you could simply print it out to your terminal and copy and paste it from there. Alternatively, you might want to save this data into a file somewhere for future reference.

To save the results from our data dictionary above to a simple CSV file we can write some code that looks like this.

import csv

with open('results.csv', 'w', newline='') as file:
    writer = csv.writer(file)  # quotes values that contain commas, like "1,503"
    for key, value in data.items():
        writer.writerow([key, value])