
Java Thread Programming - Simultaneous Website Crawling

Java Thread: Exercise-6 with Solution

Write a Java program to implement a concurrent web crawler that crawls multiple websites simultaneously using threads.

Note:

jsoup: Java HTML Parser

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

Download and install jsoup from https://jsoup.org/.
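
For reference, here is a minimal, self-contained sketch of the two jsoup calls the crawler below relies on: fetching a page and selecting elements with a CSS selector. The class name JsoupQuickStart and the example URL are only illustrative.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupQuickStart {
  public static void main(String[] args) throws Exception {
    // Fetch the page and parse it into a DOM-like Document
    Document doc = Jsoup.connect("https://example.com").get();
    System.out.println("Title: " + doc.title());

    // CSS selector: every anchor element that carries an href attribute
    Elements links = doc.select("a[href]");
    System.out.println("Links found: " + links.size());
  }
}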

Sample Solution:

Java Code:

import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Web_Crawler {
  private static final int MAX_DEPTH = 2; // Maximum depth for crawling
  private static final int MAX_THREADS = 4; // Maximum number of threads

  // HashSet is not thread-safe, and this set is updated from several
  // crawler threads at once, so use a concurrent set instead.
  private final Set<String> visitedUrls = ConcurrentHashMap.newKeySet();

  public void crawl(String url, int depth) {
    // add() returns false if the URL is already present, so the
    // contains-then-add race between crawler threads is avoided.
    if (depth > MAX_DEPTH || !visitedUrls.add(url)) {
      return;
    }

    System.out.println("Crawling: " + url);

    try {
      Document document = Jsoup.connect(url).get();
      processPage(document);

      Elements links = document.select("a[href]");
      for (Element link : links) {
        String nextUrl = link.absUrl("href");
        crawl(nextUrl, depth + 1);
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

  public void processPage(Document document) {
    // Process the web page content as needed
    System.out.println("Processing: " + document.title());
  }

  public void startCrawling(String[] seedUrls) {
    ExecutorService executor = Executors.newFixedThreadPool(MAX_THREADS);

    for (String url : seedUrls) {
      executor.execute(() -> crawl(url, 0));
    }

    executor.shutdown();

    try {
      executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
    } catch (InterruptedException e) {
      e.printStackTrace();
    }

    System.out.println("Crawling completed.");
  }

  public static void main(String[] args) {
    // Add URLs here
    String[] seedUrls = {
      "https://example.com",
      "https://www.wikipedia.org"
    };

    Web_Crawler webCrawler = new Web_Crawler();
    webCrawler.startCrawling(seedUrls);
  }
}
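
The solution above waits effectively indefinitely (Long.MAX_VALUE nanoseconds) for the pool to drain. A common hardening, shown here only as a sketch (the ShutdownHelper name and the 60-second timeout are illustrative, not part of the solution), waits a bounded time and then force-cancels whatever is still running. The sample output below comes from the original program, not from this variant.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

class ShutdownHelper {
  // Stop accepting new tasks, wait a bounded time for in-flight crawls,
  // then interrupt anything still outstanding.
  static void shutdownAndWait(ExecutorService executor) throws InterruptedException {
    executor.shutdown();
    if (!executor.awaitTermination(60, TimeUnit.SECONDS)) {
      executor.shutdownNow(); // cancels still-running crawl tasks
    }
  }
}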

Sample Output:

Crawling: https://www.wikipedia.org
Crawling: https://example.com
Processing: Wikipedia
Crawling: https://en.wikipedia.org/
Processing: Example Domain
Crawling: https://www.iana.org/domains/example
Processing: Wikipedia, the free encyclopedia
Crawling: https://en.wikipedia.org/wiki/Main_Page#bodyContent
Processing: Wikipedia, the free encyclopedia
Crawling: https://en.wikipedia.org/wiki/Main_Page
Processing: Wikipedia, the free encyclopedia
Crawling: https://en.wikipedia.org/wiki/Wikipedia:Contents
Processing: Wikipedia:Contents - Wikipedia
Crawling: https://en.wikipedia.org/wiki/Portal:Current_events
Processing: Portal:Current events - Wikipedia
Crawling: https://en.wikipedia.org/wiki/Special:Random
Processing: IANA-managed Reserved Domains
Crawling: http://www.iana.org/
Processing: Papilio birchallii - Wikipedia
Crawling: https://en.wikipedia.org/wiki/Wikipedia:About
Processing: Wikipedia:About - Wikipedia
Crawling: https://en.wikipedia.org/wiki/Wikipedia:Contact_us
Processing: Internet Assigned Numbers Authority
Crawling: http://www.iana.org/domains
Processing: Wikipedia:Contact us - Wikipedia
Crawling: https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
Processing: Domain Name Services
Crawling: http://www.iana.org/protocols
Processing: Make your donation now - Wikimedia Foundation
Crawling: https://en.wikipedia.org/wiki/Help:Contents
Processing: Help:Contents - Wikipedia
Crawling: https://en.wikipedia.org/wiki/Help:Introduction
Processing: Help:Introduction - Wikipedia


Explanation:

In the above exercise,

  • The Web_Crawler class crawls web pages. It has two constants:
  • MAX_DEPTH: Represents the maximum depth to which the crawler explores links on a web page.
  • MAX_THREADS: Represents the maximum number of threads to use for crawling.
  • The class maintains a thread-safe Set<String> called visitedUrls (a concurrent set created with ConcurrentHashMap.newKeySet()) to keep track of the URLs visited during crawling, since several crawler threads update it at once.
  • The crawl(String url, int depth) method crawls a given URL up to a specified depth. If the current depth exceeds MAX_DEPTH, or if visitedUrls.add(url) returns false because the URL has already been visited, the method returns immediately. Otherwise it prints a message indicating that the URL is being crawled, fetches the page with the Jsoup library, processes it, and recursively crawls each absolute link found on the page.
  • The processPage(Document document) method represents web page processing. In this example, it simply prints the document title. You can customize this method to perform specific operations on web page content; a sketch of one such customization appears after this list.
  • The startCrawling(String[] seedUrls) method initiates the crawling process. It creates a fixed-size thread pool of MAX_THREADS threads with Executors.newFixedThreadPool(), then submits a crawl task for each seed URL in the seedUrls array to the pool for concurrent execution.
  • After submitting all the tasks, the method shuts down the executor, waits for all the tasks to complete using executor.awaitTermination(), and prints a completion message.
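
As an illustration of the customization mentioned above, processPage() could be swapped for a version that also prints the page's top-level headings. This is only a sketch: it is a drop-in replacement for the method inside Web_Crawler and assumes the Document and Element imports already present in the solution.

  // Print the title plus every h1/h2 heading found on the page.
  public void processPage(Document document) {
    System.out.println("Processing: " + document.title());
    for (Element heading : document.select("h1, h2")) {
      System.out.println("  Heading: " + heading.text());
    }
  }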


