Web scraping can be very useful, whether you're collecting information for analysis, recording statistics, offering a service built on third-party data, or feeding data to neural networks and deep learning models.
What is web scraping?
There seems to be a widespread misunderstanding that web scraping is the same as web crawling, so let's get that out of the way first.
There is a distinct difference between the two:
Web crawling refers to the process of searching or "crawling" the web for any kind of information. This is what search engines like Google, Yahoo or Bing rely on when showing us the results of our search queries.
Web scraping refers to the process of collecting information from specific websites with predefined and tailored automated software.
Caution
While web scraping by itself is a legitimate way to extract information from a website, depending on how you use it, it may be deemed illegal. There are some scenarios in which you need to be cautious:
- Web scraping can amount to a denial-of-service attack - Sending too many requests while scraping data can put a heavy load on the server and prevent legitimate users from accessing the website.
- Disregard of copyright laws and Terms of Service - Because so many people, organizations, and companies develop web scrapers to collect information, websites like Amazon, eBay, LinkedIn, Instagram, and Facebook face major problems with them. This is why most of these sites prohibit scraping their data, requiring you to obtain written permission before collecting it.
- Web scraping can be used in an abusive manner - Scrapers can act like bots, with some frameworks even offering tools that can fill and submit forms. This can be used to automate spam and even attack websites. This is one of the reasons why CAPTCHA exists.
If you're considering building a powerful scraper, keep the points above in mind and abide by laws and regulations.
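One practical precaution against overloading a server is to throttle your own requests. Below is a minimal sketch of such a throttle; the one-second delay and the PoliteScraper name are illustrative choices, not part of any particular framework:

```java
import java.util.List;

public class PoliteScraper {
    // Minimum pause between requests, in milliseconds (illustrative value)
    private static final long DELAY_MS = 1000;

    private long lastRequestTime = 0;

    // Blocks until at least DELAY_MS has passed since the previous request
    public void throttle() throws InterruptedException {
        long elapsed = System.currentTimeMillis() - lastRequestTime;
        if (elapsed < DELAY_MS) {
            Thread.sleep(DELAY_MS - elapsed);
        }
        lastRequestTime = System.currentTimeMillis();
    }

    public static void main(String[] args) throws InterruptedException {
        PoliteScraper scraper = new PoliteScraper();
        for (String url : List.of("https://example.com/a", "https://example.com/b")) {
            scraper.throttle();
            System.out.println("Fetching " + url); // replace with the actual request
        }
    }
}
```

Calling throttle() before each request guarantees a minimum gap between hits to the target site, regardless of how fast the rest of your code runs.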
Web scraping frameworks
Like with most technologies nowadays, there are multiple frameworks to choose from to extract information from a website. The most popular ones include JSoup, HTMLUnit, and Selenium WebDriver - we will cover JSoup in this article.
JSoup
JSoup is an open source project which provides a powerful API for data extraction. You can use it to parse HTML from URLs, files, and Strings. It can also manipulate HTML elements or attributes.
Using JSoup to parse a String
Parsing a String is the simplest way to parse using JSoup.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSoupExample {
    public static void main(String[] args) {
        String html = "<html><head><title>Website title</title></head><body><p>Sample paragraph number 1 </p><p>Sample paragraph number 2</p></body></html>";
        Document doc = Jsoup.parse(html);
        System.out.println(doc.title());
        Elements paragraphs = doc.getElementsByTag("p");
        for (Element paragraph : paragraphs) {
            System.out.println(paragraph.text());
        }
    }
}
This is pretty straightforward. Calling the parse() method parses the input HTML into a Document object; calling methods on that object lets us manipulate and extract data.
In our example, we first simply print out the title. Afterwards, we get all elements with the tag "p", which are all paragraphs. Then, we print out the text of each paragraph individually.
Running our method, we're greeted with:
Website title
Sample paragraph number 1
Sample paragraph number 2
Using JSoup to parse via URL
Parsing a URL is a bit different from parsing a String, but the same principle applies:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class JSoupExample {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://www.wikipedia.org").get();
        Elements titles = doc.getElementsByClass("other-project");
        for (Element title : titles) {
            System.out.println(title.text());
        }
    }
}
To scrape a URL, we call the connect() method with the URL as its parameter. Calling get() on the resulting connection then fetches and parses the HTML.
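Note that connect() returns a Connection object that can be configured before the request is sent. As a sketch, the user-agent string and timeout values below are illustrative choices; userAgent() and timeout() are part of jsoup's Connection API:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class ConfiguredConnection {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://www.wikipedia.org")
                .userAgent("MyScraperBot/1.0") // identify your scraper (illustrative value)
                .timeout(5000)                 // give up after 5 seconds
                .get();
        System.out.println(doc.title());
    }
}
```

Identifying your scraper and setting a sensible timeout is both polite to the target site and safer for your own application.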
This example will yield us:
Commons Freely usable photos & more
Wikivoyage Free travel guide
Wiktionary Free dictionary
Wikibooks Free textbooks
Wikinews Free news source
Wikidata Free knowledge base
Wikiversity Free course materials
Wikiquote Free quote compendium
MediaWiki Free & open wiki application
Wikisource Free library
Wikispecies Free species directory
Meta-Wiki Community coordination & documentation
As you can see, the program scraped all elements whose class is other-project.
This approach is probably the most common in real projects, so let's look at some other examples of scraping via a URL.
All links from a URL
public void allLinksInUrl() throws IOException {
    Document doc = Jsoup.connect("https://www.wikipedia.org").get();
    Elements links = doc.select("a[href]");
    for (Element link : links) {
        System.out.println("\nLink : " + link.attr("href"));
        System.out.println("Text : " + link.text());
    }
}
Running this will give us a long list:
Link : //en.wikipedia.org/
Text : English 5 678 000+ articles
Link : //ja.wikipedia.org/
Text : 日本語 1 112 000+ 記事
Link : //es.wikipedia.org/
Text : Español 1 430 000+ artículos
Link : //de.wikipedia.org/
Text : Deutsch 2 197 000+ Artikel
Link : //ru.wikipedia.org/
Text : Русский 1 482 000+ статей
Link : //it.wikipedia.org/
Text : Italiano 1 447 000+ voci
Link : //fr.wikipedia.org/
Text : Français 2 000 000+ articles
Link : //zh.wikipedia.org/
Text : 中文 1 013 000+ 條目
... (other language links omitted)
Link : //www.wiktionary.org/
Text : Wiktionary Free dictionary
Link : //www.wikibooks.org/
Text : Wikibooks Free textbooks
Link : //www.wikinews.org/
Text : Wikinews Free news source
Link : //www.wikidata.org/
Text : Wikidata Free knowledge base
Link : //www.wikiversity.org/
Text : Wikiversity Free course materials
Link : //www.wikiquote.org/
Text : Wikiquote Free quote compendium
Link : //www.mediawiki.org/
Text : MediaWiki Free & open wiki application
Link : //www.wikisource.org/
Text : Wikisource Free library
Link : //species.wikimedia.org/
Text : Wikispecies Free species directory
Link : //meta.wikimedia.org/
Text : Meta-Wiki Community coordination & documentation
Link : https://creativecommons.org/licenses/by-sa/3.0/
Text : Creative Commons Attribution-ShareAlike License
Link : //meta.wikimedia.org/wiki/Terms_of_Use
Text : Terms of Use
Link : //meta.wikimedia.org/wiki/Privacy_policy
Text : Privacy Policy
Similarly, you can obtain image counts, meta information, form parameters, you name it, making this a great way to gather statistical data.
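For instance, counting images or reading meta tags takes only a selector each. The sketch below parses a small inline page instead of a live URL so it runs without a network connection; the page content is made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PageStats {
    // Counts <img> elements that actually carry a src attribute
    public static int imageCount(Document doc) {
        return doc.select("img[src]").size();
    }

    public static void main(String[] args) {
        // A small inline page stands in for a fetched document
        String html = "<html><head><meta name=\"description\" content=\"demo page\"></head>"
                + "<body><img src=\"a.png\"><img src=\"b.png\"></body></html>";
        Document doc = Jsoup.parse(html);

        System.out.println("Image count: " + imageCount(doc));

        // Print every named <meta> tag and its content
        for (Element meta : doc.select("meta[name]")) {
            System.out.println(meta.attr("name") + " = " + meta.attr("content"));
        }
    }
}
```

The same two selectors work unchanged on a Document obtained via connect(), so you can swap the inline HTML for a real page.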
Using JSoup to parse a file
public void parseFile() throws URISyntaxException, IOException {
    URL path = ClassLoader.getSystemResource("page.html");
    File inputFile = new File(path.toURI());
    Document document = Jsoup.parse(inputFile, "UTF-8");
    System.out.println(document.title());
    // parse the document in any of the ways shown above
}
By parsing a local file rather than a URL, we avoid sending a request to the website, and the potential server load, every time we run the application. Although the data is static, which makes this approach unfit for many tasks, it is a more considerate and less harmful way to experiment with parsing.
You can parse this document in any way, just like we've gone over above.
Setting Attribute Values
Just like loading and obtaining data from Strings, URLs, and files, we can also modify data and input forms.
For example, when we visit Amazon and click the logo at the top left, we are redirected to the site's index page.
If we wish to change this behavior, we can do so easily:
public void setAttributes() throws IOException {
    Document doc = Jsoup.connect("https://www.amazon.com").get();
    Element element = doc.getElementById("nav-logo");
    System.out.println("Element: " + element.outerHtml());
    element.children().attr("href", "notamazon.org");
    System.out.println("Element with set attribute: " + element.outerHtml());
}
By getting the element by its id, we can inspect the HTML around and inside it. We can also access its children and change their attributes.
Element: <div id="nav-logo">
<a href="/ref=nav_logo/135-9898877-2038645" class="nav-logo-link" tabindex="6"> <span class="nav-logo-base nav-sprite">Amazon</span> <span class="nav-logo-ext nav-sprite"></span> <span class="nav-logo-locale nav-sprite"></span> </a>
<a href="/gp/prime/ref=nav_logo_prime_join/135-9898877-2038645" aria-label="" tabindex="7" class="nav-logo-tagline nav-sprite nav-prime-try"> Try Prime </a>
</div>
Element with set attribute: <div id="nav-logo">
<a href="notamazon.org" class="nav-logo-link" tabindex="6"> <span class="nav-logo-base nav-sprite">Amazon</span> <span class="nav-logo-ext nav-sprite"></span> <span class="nav-logo-locale nav-sprite"></span> </a>
<a href="notamazon.org" aria-label="" tabindex="7" class="nav-logo-tagline nav-sprite nav-prime-try"> Try Prime </a>
</div>
By default, both <a> children had href attributes pointing to their respective links. After setting the attribute to our own value, we can see that the href attributes of both children were updated.
Adding or Removing Classes
In addition to setting attribute values, we can add or remove classes easily by updating our previous example:
public void changePage() throws IOException {
    Document doc = Jsoup.connect("https://www.amazon.com").get();
    Element element = doc.getElementById("nav-logo");
    System.out.println("Original Element: " + element.outerHtml());

    // Setting attributes
    element.children().attr("href", "notamazon.org");
    System.out.println("Element with set attribute: " + element.outerHtml());

    // Adding classes
    element.addClass("someClass");
    System.out.println("Element with added class: " + element.outerHtml());

    // Removing classes
    element.removeClass("someClass");
    System.out.println("Element with removed class: " + element.outerHtml());
}
Running this code prints the updated information:
Original Element: <div id="nav-logo">
<a href="/ref=nav_logo/135-1285661-0204513" class="nav-logo-link" tabindex="6"> <span class="nav-logo-base nav-sprite">Amazon</span> <span class="nav-logo-ext nav-sprite"></span> <span class="nav-logo-locale nav-sprite"></span> </a>
<a href="/gp/prime/ref=nav_logo_prime_join/135-1285661-0204513" aria-label="" tabindex="7" class="nav-logo-tagline nav-sprite nav-prime-try"> Try Prime </a>
</div>
Element with set attribute: <div id="nav-logo">
<a href="notamazon.org" class="nav-logo-link" tabindex="6"> <span class="nav-logo-base nav-sprite">Amazon</span> <span class="nav-logo-ext nav-sprite"></span> <span class="nav-logo-locale nav-sprite"></span> </a>
<a href="notamazon.org" aria-label="" tabindex="7" class="nav-logo-tagline nav-sprite nav-prime-try"> Try Prime </a>
</div>
Element with added class: <div id="nav-logo" class="someClass">
<a href="notamazon.org" class="nav-logo-link" tabindex="6"> <span class="nav-logo-base nav-sprite">Amazon</span> <span class="nav-logo-ext nav-sprite"></span> <span class="nav-logo-locale nav-sprite"></span> </a>
<a href="notamazon.org" aria-label="" tabindex="7" class="nav-logo-tagline nav-sprite nav-prime-try"> Try Prime </a>
</div>
Element with removed class: <div id="nav-logo">
<a href="notamazon.org" class="nav-logo-link" tabindex="6"> <span class="nav-logo-base nav-sprite">Amazon</span> <span class="nav-logo-ext nav-sprite"></span> <span class="nav-logo-locale nav-sprite"></span> </a>
<a href="notamazon.org" aria-label="" tabindex="7" class="nav-logo-tagline nav-sprite nav-prime-try"> Try Prime </a>
</div>
You can either save this new state as an .html page on your local machine or send it as an HTTP request to the website, though be wary with the latter, since improper use can be illegal.
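Saving the modified state locally is straightforward: serialize the Document with outerHtml() and write it to a file. The sketch below uses an inline page in place of the modified Amazon document; the file name is an arbitrary choice:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SavePage {
    // Serializes a (possibly modified) Document to a local HTML file
    public static Path save(Document doc, Path out) throws IOException {
        return Files.writeString(out, doc.outerHtml());
    }

    public static void main(String[] args) throws IOException {
        // An inline page stands in for the modified document
        Document doc = Jsoup.parse(
                "<html><head><title>Modified copy</title></head><body><p>Saved locally</p></body></html>");
        Path out = save(doc, Path.of("page-copy.html"));
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

The saved file contains jsoup's pretty-printed serialization of the document, including any attribute or class changes you made.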
Conclusion
Web scraping can be useful for many purposes, but it should be done in accordance with the law. In this article, we introduced JSoup, a popular web scraping framework, and explored several ways to parse information with it.