Web Scraper: Python vs Rust

Error handling

Web scraping is about as error-prone as it gets. Pages might not exist, HTML elements might not always be there… So a language that handles errors and edge cases gracefully at runtime, without crashing, is a huge plus.

[Python]

In Python, the most common way to handle errors is with an if-statement, like so:

import requests
import bs4 as bs

response = requests.get(URL)  # URL points at the page to scrape
if response.status_code == 200:
    content = response.content

    soup = bs.BeautifulSoup(content, 'lxml')
    articles = soup.find_all('article')
    if articles:
        ...

In itself, handling edge cases with if-statements is not so bad and feels pretty natural. But it has some drawbacks:

  • It makes the code harder to read, as it is difficult to separate business logic from error handling.
  • if-statements are most of the time added in reaction to a bug, which makes coding slower and less fun.
  • The way errors are signalled varies between packages (status codes, exceptions, empty results), so the if-based handling logic has to vary too, which makes coding very tedious.
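Python's other idiom is try/except, which matters for failures that surface as exceptions rather than status codes (timeouts, DNS errors, connection resets). A minimal sketch, reusing the books.toscrape.com URL from the examples further down:

import requests
import bs4 as bs

URL = "http://books.toscrape.com/catalogue/page-1.html"

try:
    response = requests.get(URL, timeout=10)
    response.raise_for_status()  # turns 4xx/5xx responses into exceptions
except requests.exceptions.RequestException as err:
    print(f"request failed: {err}")
else:
    soup = bs.BeautifulSoup(response.content, 'lxml')
    articles = soup.find_all('article')
    ...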

[Rust]

In Rust, fallible functions return either a success or an error, and you have to deal with both cases before the code will compile. This makes the code far more robust against errors at runtime. In practice, it might look like this:

let name = match node.find(Name("h3")).next() {
    Some(h3) => h3.text(),
    None => "".to_string(),
};
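The same fallback can also be expressed with Option combinators instead of an explicit match; a sketch of an equivalent one-liner:

// Equivalent to the match above: use the <h3> text if present, otherwise "".
let name = node
    .find(Name("h3"))
    .next()
    .map(|h3| h3.text())
    .unwrap_or_default();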

Rust also has other ways to handle a success/error Result, such as the ? operator, which propagates the error to the caller (syntactically reminiscent of TypeScript's optional chaining, though the semantics differ):

let response = reqwest::get(&url).await?.text().await?;
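The ? operator only compiles inside a function whose return type can absorb the error, so the line above has to live in something like the following (a sketch with a hypothetical fetch_page helper; the code in the annex uses the same pattern):

// `?` propagates any reqwest error to the caller instead of panicking.
async fn fetch_page(url: &str) -> Result<String, reqwest::Error> {
    let body = reqwest::get(url).await?.text().await?;
    Ok(body)
}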

The drawback is that developing in Rust is slower, as you have to deal with every edge case during development.

Sometimes it can even be tricky to understand which edge case you are actually dealing with. Note that you HAVE to deal with every edge case in Rust; otherwise, the code will not compile. That is how serious Rust is about error handling.

Async scraping

During scraping, most of the time is spent downloading files rather than computing.

With a synchronous runtime, however, pages are scraped one by one, and therefore downloaded one by one. Each download takes time and idles the whole process. If we can avoid waiting for each download to complete before starting the next, we gain a lot of efficiency.

[Python]

In Python, this is possible using the asyncio library, and it might look like this:

import asyncio
import requests
import bs4 as bs
import csv

URL = "http://books.toscrape.com/catalogue/page-%d.html"


async def get_book(url, spamwriter):
    # NB: requests.get is blocking, so this call does not yield to the event loop
    response = requests.get(url)
    if response.status_code == 200:
        content = response.content
        soup = bs.BeautifulSoup(content, 'lxml')
        articles = soup.find_all('article')

        for article in articles:
            information = [url]
            information.append(article.find(
                'p', class_='price_color').text)
            information.append(article.find('h3').find('a').get('title'))
            spamwriter.writerow(information)


async def main():
    with open('./test_async_python.csv', 'w') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',')
        tasks = []
        for i in range(1, 50):
            tasks.append(asyncio.create_task(
                get_book(URL % i, spamwriter)))

        for task in tasks:
            await task

asyncio.run(main())

Python does provide the async/await syntax, which makes asynchronous code easier to read and write.
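One caveat, also raised in the comments below: requests is a blocking library, so the coroutines above still end up downloading one page at a time. A non-blocking variant, sketched here with httpx (one of the libraries suggested in the comments) and asyncio.gather, might look roughly like this:

import asyncio
import csv

import bs4 as bs
import httpx

URL = "http://books.toscrape.com/catalogue/page-%d.html"


async def get_book(client, url, spamwriter):
    response = await client.get(url)  # non-blocking: yields to the event loop
    if response.status_code == 200:
        soup = bs.BeautifulSoup(response.content, 'lxml')
        for article in soup.find_all('article'):
            information = [url,
                           article.find('p', class_='price_color').text,
                           article.find('h3').find('a').get('title')]
            spamwriter.writerow(information)


async def main():
    with open('./test_async_python.csv', 'w') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',')
        async with httpx.AsyncClient() as client:
            # schedule all the requests at once and let the downloads overlap
            await asyncio.gather(*(get_book(client, URL % i, spamwriter)
                                   for i in range(1, 50)))

asyncio.run(main())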

[Rust]

Rust, unlike Python, was designed with concurrency in mind: it is thread-safe and extremely efficient. The fact that the language is, by nature, very fast makes it great for coroutines. The code might look like this:

use csv::Writer;
use select::document::Document;
use select::predicate::{Attr, Name};
use std::fs::OpenOptions;

async fn test(i: &i32) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let url = format!("http://books.toscrape.com/catalogue/page-{}.html", i);
    let response = reqwest::get(&url).await?.text().await?;
    // append this page's rows to a shared CSV file (each task opens it in append mode)
    let file = OpenOptions::new()
        .write(true)
        .create(true)
        .append(true)
        .open("test2.csv")
        .unwrap();
    let mut wtr = Writer::from_writer(file);

    let document = Document::from(response.as_str());

    for node in document.find(Name("article")) {
        let name = match node.find(Name("h3")).next() {
            Some(h3) => h3.find(Name("a")).next().unwrap().text(),
            None => "".to_string(),
        };
        let price = node
            .find(Attr("class", "price_color"))
            .next()
            .unwrap()
            .text();

        println!("{:#?} ", url);
        wtr.write_record(&[&url, &price, &name]).unwrap();
    }

    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {

    let mut handles: std::vec::Vec<_> = Vec::new();
    for i in 1..50 {
        let job = tokio::spawn(async move { test(&i).await });
        handles.push(job);
    }

    let mut results = Vec::new();
    for job in handles {
        results.push(job.await);
    }

    Ok(())
}
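As written, main collects the join results but never inspects them, so a failing page goes unnoticed. One way to surface those errors is to check each join result as it comes back; a sketch that could replace the collection loop above:

for job in handles {
    // the outer `?` catches a panicked or cancelled task (tokio's JoinError),
    // the inner Err is whatever `test` itself returned
    if let Err(err) = job.await? {
        eprintln!("page failed: {}", err);
    }
}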

Performance

Performance test of scraping the catalogue pages of http://books.toscrape.com/catalogue/page-1.html onwards (pages 1 to 49, as in the loops above):

Name                  CPU Usage   Time (s)
Synchronous Python    5%          44.363
Synchronous Rust      7%          55
Async Python          63%         2.5
Async Rust            107%        3.7

Performance is pretty similar between the two, because web scraping is mostly I/O bound.

Annex

Synchronous Python code

import requests
import bs4 as bs
import csv
URL = "http://books.toscrape.com/catalogue/page-%d.html"

with open('./test_python.csv', 'w') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    for i in range(1, 50):
        response = requests.get(URL % i)
        if response.status_code == 200:
            content = response.content
            soup = bs.BeautifulSoup(content, 'lxml')
            articles = soup.find_all('article')

            for article in articles:
                information = []
                information.append(article.find(
                    'p', class_='price_color').text)
                information.append(article.find('h3').find('a').get('title'))
                spamwriter.writerow(information)

Synchronous Rust code:

use csv::Writer;
use select::document::Document;
use select::predicate::{Attr, Name};
use std::fs::OpenOptions;

async fn test(i: &i32) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let url = format!("http://books.toscrape.com/catalogue/page-{}.html", i);
    let response = reqwest::get(&url).await?.text().await?;
    let file = OpenOptions::new()
        .write(true)
        .create(true)
        .append(true)
        .open("test2.csv")
        .unwrap();
    let mut wtr = Writer::from_writer(file);

    let document = Document::from(response.as_str());

    for node in document.find(Name("article")) {
        let name = match node.find(Name("h3")).next() {
            Some(h3) => h3.find(Name("a")).next().unwrap().text(),
            None => "".to_string(),
        };
        let price = node
            .find(Attr("class", "price_color"))
            .next()
            .unwrap()
            .text();

        // println!("{:#?} ", url);
        wtr.write_record(&[&url, &price, &name]).unwrap();
    }

    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    for i in 1..50 {
        // awaiting each page before requesting the next keeps this run effectively synchronous
        test(&i).await.unwrap();
    }
    Ok(())
}



Comments

Eurico98

Great comparison.

James Fenerson

Hey, good article. I think you need to post more)

Zefeng Zhu

requests is developed in the blocking I/O world. And your async function can not really give full play to the strength of python's asynchronous function. Maybe you can take a look at aiohttp or httpx.

Minenash

Hey, nice write up! One question, you say that async rust crushes async python (and this makes sense), but the chart has Python for 2.5s at 63%, and Rust for 3.7s at 107%.

Haixuan Xavier Tao

Yeah, sorry I forgot to update it. it used to be true but then someone on reddit told me to use httpx for python and it changed the winner. https://www.reddit.com/r/rust/comments/lrq8le/webscraperpythonvsrustfromabeginnerstand/ I'm going to update it.

Thanks

Joel Bernal Ortiz

Hello, the Python code is also not updated, right?