Introduction to Web Scraping with Java - JSoup

Web scraping can be very useful, whether it's for collecting information for analysis, recording statistics, offering a service that uses third-party information, or feeding data to machine learning and deep learning models.

What is web scraping?

There seems to be a widespread misunderstanding that web scraping is the same as web crawling, so let's get that out of the way first.

There is a distinct difference between the two:

Web crawling refers to the process of searching or "crawling" the web for any kind of information. This is what search engines like Google, Yahoo or Bing rely on when showing us the results of our search queries. 

Web scraping refers to the process of collecting information from specific websites with predefined and tailored automated software.

Caution

While web scraping by itself is a legitimate way to extract information from a website, depending on how you use it, it may be deemed illegal.
There are some scenarios in which you need to be cautious:

  • Web scraping can be considered a denial-of-service attack - sending too many requests while scraping data puts a heavy load on the server and can prevent legitimate users from accessing the website.
  • Disregard of copyright laws and Terms of Service - Since many people, organizations, and companies develop web scrapers to collect information, websites like Amazon, eBay, LinkedIn, Instagram, and Facebook face major problems with them. This is why most of them prohibit the use of scrapers on their data, requiring you to obtain written permission before collecting it.
  • Web scraping can be used in an abusive manner - Scrapers can act like bots, with some frameworks even offering tools that can fill and submit forms. This can be used to automate spam and even attack websites. This is one of the reasons why CAPTCHA exists.

If you're considering building a powerful scraper, make sure to take the points above into account and abide by the applicable laws and regulations.
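If you do build a scraper, throttling your own requests is an easy way to stay on the polite side of the first point above. Here is a minimal sketch in plain Java - the class and method names are illustrative, not part of any library:

```java
import java.time.Duration;
import java.time.Instant;

// Minimal courtesy rate limiter: ensures consecutive requests are at least
// `minInterval` apart. An illustrative sketch, not part of JSoup.
public class PoliteDelay {
    private final Duration minInterval;
    private Instant lastRequest = Instant.EPOCH;

    public PoliteDelay(Duration minInterval) {
        this.minInterval = minInterval;
    }

    // Blocks until enough time has passed since the previous call.
    public synchronized void acquire() {
        Duration elapsed = Duration.between(lastRequest, Instant.now());
        if (elapsed.compareTo(minInterval) < 0) {
            try {
                Thread.sleep(minInterval.minus(elapsed).toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        lastRequest = Instant.now();
    }
}
```

A scraper would call acquire() before each request - for example, with Duration.ofSeconds(1) to leave at least a second between hits to the same server.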

Web scraping frameworks

Like with most technologies nowadays, there are multiple frameworks to choose from for extracting information from a website. The most popular ones include JSoup, HTMLUnit, and Selenium WebDriver - we will cover JSoup in this article.

JSoup

JSoup is an open-source Java library which provides a powerful API for data extraction. You can use it to parse HTML from URLs, files, and Strings, as well as to manipulate HTML elements and attributes.

Using JSoup to parse a String

Parsing a String is the simplest way to parse using JSoup.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSoupExample {

    public static void main(String[] args) {
        String html = "<html><head><title>Website title</title></head><body><p>Sample paragraph number 1 </p><p>Sample paragraph number 2</p></body></html>";

        Document doc = Jsoup.parse(html);
        System.out.println(doc.title());

        Elements paragraphs = doc.getElementsByTag("p");
        for (Element paragraph : paragraphs) {
            System.out.println(paragraph.text());
        }
    }
}

This is pretty straightforward. Calling the parse() method parses the input HTML into a Document object, whose methods let us manipulate and extract the data.

In our example, we first simply print out the title. Afterwards, we get all elements with the tag "p", which are all paragraphs. Then, we print out the text of each paragraph individually.

Running our method, we're greeted with: 

Website title

Sample paragraph number 1

Sample paragraph number 2

Using JSoup to parse via URL

Parsing URLs is a bit different from parsing Strings, but the same principles apply:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSoupExample {

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://www.wikipedia.org").get();

        Elements titles = doc.getElementsByClass("other-project");
        for (Element title : titles) {
            System.out.println(title.text());
        }
    }
}

To scrape a URL, we call the connect() method with our URL as the parameter. Then, by calling get(), we obtain the HTML from the connection.
This example yields:

Commons Freely usable photos & more

Wikivoyage Free travel guide

Wiktionary Free dictionary

Wikibooks Free textbooks

Wikinews Free news source

Wikidata Free knowledge base

Wikiversity Free course materials

Wikiquote Free quote compendium

MediaWiki Free & open wiki application

Wikisource Free library

Wikispecies Free species directory

Meta-Wiki Community coordination & documentation

As you can see, the program scraped all elements whose class is other-project.


This approach is probably the most widely used in real projects, so let's look at some other examples of scraping via a URL.

All links from a URL

public void allLinksInUrl() throws IOException {
    Document doc = Jsoup.connect("https://www.wikipedia.org").get();

    Elements links = doc.select("a[href]");
    for (Element link : links) {
        System.out.println("\nLink : " + link.attr("href"));
        System.out.println("Text : " + link.text());
    }
}

Running this will give us a long list:

Link : //en.wikipedia.org/

Text : English 5 678 000+ articles

Link : //ja.wikipedia.org/

Text : 日本語 1 112 000+ 記事

Link : //es.wikipedia.org/

Text : Español 1 430 000+ artículos

Link : //de.wikipedia.org/

Text : Deutsch 2 197 000+ Artikel

Link : //ru.wikipedia.org/

Text : Русский 1 482 000+ статей

Link : //it.wikipedia.org/

Text : Italiano 1 447 000+ voci

Link : //fr.wikipedia.org/

Text : Français 2 000 000+ articles

Link : //zh.wikipedia.org/

Text : 中文 1 013 000+ 條目

<!--A bunch of other languages -->


Link : //www.wiktionary.org/

Text : Wiktionary Free dictionary

Link : //www.wikibooks.org/

Text : Wikibooks Free textbooks

Link : //www.wikinews.org/

Text : Wikinews Free news source

Link : //www.wikidata.org/

Text : Wikidata Free knowledge base

Link : //www.wikiversity.org/

Text : Wikiversity Free course materials

Link : //www.wikiquote.org/

Text : Wikiquote Free quote compendium

Link : //www.mediawiki.org/

Text : MediaWiki Free & open wiki application

Link : //www.wikisource.org/

Text : Wikisource Free library

Link : //species.wikimedia.org/

Text : Wikispecies Free species directory

Link : //meta.wikimedia.org/

Text : Meta-Wiki Community coordination & documentation

Link : https://creativecommons.org/licenses/by-sa/3.0/

Text : Creative Commons Attribution-ShareAlike License

Link : //meta.wikimedia.org/wiki/Terms_of_Use

Text : Terms of Use

Link : //meta.wikimedia.org/wiki/Privacy_policy

Text : Privacy Policy

Similarly, you can obtain the image count, meta information, form parameters, you name it - making this a great way to gather statistical data.

Using JSoup to parse a file

public void parseFile() throws URISyntaxException, IOException {
    URL path = ClassLoader.getSystemResource("page.html");
    File inputFile = new File(path.toURI());

    Document document = Jsoup.parse(inputFile, "UTF-8");
    System.out.println(document.title());
    // parse the document in any way
}

By parsing a file rather than a URL, we avoid sending a request to the website and potentially overloading its servers every time we run our application. While this approach has many more limitations - the data is static, which makes it unfit for a lot of tasks - it offers a more legitimate and less harmful way to parse data.

You can parse this document in any way, just like we've gone over above.
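One way to get the best of both worlds is to download a page once, cache it locally, and parse the file on subsequent runs. Here is a sketch using only the standard library - the class name and the Supplier-based design are our own, not part of JSoup:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

// Fetch-once page cache: the fetcher is only invoked when no cached copy
// exists, so repeated runs never hit the remote server.
public class PageCache {
    public static String fetchOrLoad(Path cacheFile, Supplier<String> fetcher) throws IOException {
        if (Files.exists(cacheFile)) {
            return Files.readString(cacheFile);
        }
        String html = fetcher.get();
        Files.writeString(cacheFile, html);
        return html;
    }
}
```

With JSoup, the fetcher could be a lambda wrapping Jsoup.connect(url).get().outerHtml() (with its IOException handled inside the lambda), and the returned String can be handed to Jsoup.parse() as in the first example.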

Setting Attribute Values

Just like loading and obtaining data from Strings, URLs, and files, we can also modify data and input forms.

For example, when we visit Amazon and click the logo at the top left, we are redirected to the index page of the website.

If we wish to change this behavior, we can do so easily:

public void setAttributes() throws IOException {
    Document doc = Jsoup.connect("https://www.amazon.com").get();

    Element element = doc.getElementById("nav-logo");
    System.out.println("Element: " + element.outerHtml());

    element.children().attr("href", "notamazon.org");
    System.out.println("Element with set attribute: " + element.outerHtml());
}

By looking up the logo element by its id, we can observe the HTML around and inside it. We can access its children and change their attributes, too.

Element: <div id="nav-logo">

<a href="/ref=nav_logo/135-9898877-2038645" class="nav-logo-link" tabindex="6"> <span class="nav-logo-base nav-sprite">Amazon</span> <span class="nav-logo-ext nav-sprite"></span> <span class="nav-logo-locale nav-sprite"></span> </a>

<a href="/gp/prime/ref=nav_logo_prime_join/135-9898877-2038645" aria-label="" tabindex="7" class="nav-logo-tagline nav-sprite nav-prime-try"> Try Prime </a>

</div>


Element with set attribute: <div id="nav-logo">

<a href="notamazon.org" class="nav-logo-link" tabindex="6"> <span class="nav-logo-base nav-sprite">Amazon</span> <span class="nav-logo-ext nav-sprite"></span> <span class="nav-logo-locale nav-sprite"></span> </a>

<a href="notamazon.org" aria-label="" tabindex="7" class="nav-logo-tagline nav-sprite nav-prime-try"> Try Prime </a>

</div>

By default, both <a> children had href attributes pointing to their respective links. After setting the attributes to our value, we can see that the children's href attributes were updated.

Adding or Removing Classes

In addition to setting attribute values, we can add or remove classes easily by updating our previous example:

public void changePage() throws IOException {
    Document doc = Jsoup.connect("https://www.amazon.com").get();

    Element element = doc.getElementById("nav-logo");
    System.out.println("Original Element: " + element.outerHtml());

    // Setting attributes
    element.children().attr("href", "notamazon.org");
    System.out.println("Element with set attribute: " + element.outerHtml());

    // Adding classes
    element.addClass("someClass");
    System.out.println("Element with added class: " + element.outerHtml());

    // Removing classes
    element.removeClass("someClass");
    System.out.println("Element with removed class: " + element.outerHtml());
}

Running this code prints the updated information:

Original Element: <div id="nav-logo">

<a href="/ref=nav_logo/135-1285661-0204513" class="nav-logo-link" tabindex="6"> <span class="nav-logo-base nav-sprite">Amazon</span> <span class="nav-logo-ext nav-sprite"></span> <span class="nav-logo-locale nav-sprite"></span> </a>

<a href="/gp/prime/ref=nav_logo_prime_join/135-1285661-0204513" aria-label="" tabindex="7" class="nav-logo-tagline nav-sprite nav-prime-try"> Try Prime </a>

</div>


Element with set attribute: <div id="nav-logo">

<a href="notamazon.org" class="nav-logo-link" tabindex="6"> <span class="nav-logo-base nav-sprite">Amazon</span> <span class="nav-logo-ext nav-sprite"></span> <span class="nav-logo-locale nav-sprite"></span> </a>

<a href="notamazon.org" aria-label="" tabindex="7" class="nav-logo-tagline nav-sprite nav-prime-try"> Try Prime </a>

</div>


Element with added class: <div id="nav-logo" class="someClass">

<a href="notamazon.org" class="nav-logo-link" tabindex="6"> <span class="nav-logo-base nav-sprite">Amazon</span> <span class="nav-logo-ext nav-sprite"></span> <span class="nav-logo-locale nav-sprite"></span> </a>

<a href="notamazon.org" aria-label="" tabindex="7" class="nav-logo-tagline nav-sprite nav-prime-try"> Try Prime </a>

</div>


Element with removed class: <div id="nav-logo">

<a href="notamazon.org" class="nav-logo-link" tabindex="6"> <span class="nav-logo-base nav-sprite">Amazon</span> <span class="nav-logo-ext nav-sprite"></span> <span class="nav-logo-locale nav-sprite"></span> </a>

<a href="notamazon.org" aria-label="" tabindex="7" class="nav-logo-tagline nav-sprite nav-prime-try"> Try Prime </a>

</div>

You can either save this new state as a .html page on your local machine or send it as an HTTP request to the website - though be wary with the latter, since improper use can be illegal.
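Saving the modified state locally is a one-liner with java.nio. Here `html` would be the serialized page, e.g. the result of calling outerHtml() on the Document; the class and file names are arbitrary:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SavePage {
    // Writes serialized HTML (e.g. doc.outerHtml()) to a local file
    // and returns the path it was written to.
    public static Path save(String html, String fileName) throws IOException {
        return Files.writeString(Path.of(fileName), html);
    }
}
```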

Conclusion

Web scraping can be useful for many purposes, but it should be done in accordance with the law. In this article, we've introduced JSoup, a popular web scraping framework, and explored different ways to parse information with it.