Java Thread Programming - Simultaneous Website Crawling
Java Thread: Exercise-6 with Solution
Write a Java program to implement a concurrent web crawler that crawls multiple websites simultaneously using threads.
Note:
jsoup: Java HTML Parser
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup from the official site, https://jsoup.org, and add the jar to your classpath.
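Before looking at the full crawler, a minimal jsoup snippet may help (a sketch only; the URL is just a placeholder). It shows the two calls the solution relies on: connect(...).get() to fetch and parse a page, and select("a[href]") plus absUrl("href") to enumerate its outgoing links.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupDemo {
    public static void main(String[] args) throws IOException {
        // Fetch and parse a page (placeholder URL)
        Document doc = Jsoup.connect("https://example.com").get();
        System.out.println("Title: " + doc.title());

        // CSS selector: every anchor tag that has an href attribute
        for (Element link : doc.select("a[href]")) {
            // absUrl() resolves relative links against the page's base URL
            System.out.println(link.absUrl("href"));
        }
    }
}
Compile and run it with the jsoup jar on the classpath.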
Sample Solution:
Java Code:
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Web_Crawler {
    private static final int MAX_DEPTH = 2; // Maximum depth for crawling
    private static final int MAX_THREADS = 4; // Maximum number of threads

    // Thread-safe set: several crawler threads record visited URLs concurrently
    private final Set<String> visitedUrls = ConcurrentHashMap.newKeySet();

    public void crawl(String url, int depth) {
        // Set.add() returns false for a duplicate, so checking and marking
        // a URL as visited is a single atomic step across threads
        if (depth > MAX_DEPTH || !visitedUrls.add(url)) {
            return;
        }
        System.out.println("Crawling: " + url);
        try {
            Document document = Jsoup.connect(url).get();
            processPage(document);
            // Follow every hyperlink on the page, one level deeper
            Elements links = document.select("a[href]");
            for (Element link : links) {
                String nextUrl = link.absUrl("href");
                crawl(nextUrl, depth + 1);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void processPage(Document document) {
        // Process the web page content as needed
        System.out.println("Processing: " + document.title());
    }

    public void startCrawling(String[] seedUrls) {
        ExecutorService executor = Executors.newFixedThreadPool(MAX_THREADS);
        for (String url : seedUrls) {
            executor.execute(() -> crawl(url, 0));
        }
        executor.shutdown();
        try {
            // Effectively wait forever for all crawl tasks to finish
            executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        System.out.println("Crawling completed.");
    }

    public static void main(String[] args) {
        // Add URLs here
        String[] seedUrls = {
            "https://example.com",
            "https://www.wikipedia.org"
        };
        Web_Crawler webCrawler = new Web_Crawler();
        webCrawler.startCrawling(seedUrls);
    }
}
Sample Output:
Crawling: https://www.wikipedia.org
Crawling: https://example.com
Processing: Wikipedia
Crawling: https://en.wikipedia.org/
Processing: Example Domain
Crawling: https://www.iana.org/domains/example
Processing: Wikipedia, the free encyclopedia
Crawling: https://en.wikipedia.org/wiki/Main_Page#bodyContent
Processing: Wikipedia, the free encyclopedia
Crawling: https://en.wikipedia.org/wiki/Main_Page
Processing: Wikipedia, the free encyclopedia
Crawling: https://en.wikipedia.org/wiki/Wikipedia:Contents
Processing: Wikipedia:Contents - Wikipedia
Crawling: https://en.wikipedia.org/wiki/Portal:Current_events
Processing: Portal:Current events - Wikipedia
Crawling: https://en.wikipedia.org/wiki/Special:Random
Processing: IANA-managed Reserved Domains
Crawling: http://www.iana.org/
Processing: Papilio birchallii - Wikipedia
Crawling: https://en.wikipedia.org/wiki/Wikipedia:About
Processing: Wikipedia:About - Wikipedia
Crawling: https://en.wikipedia.org/wiki/Wikipedia:Contact_us
Processing: Internet Assigned Numbers Authority
Crawling: http://www.iana.org/domains
Processing: Wikipedia:Contact us - Wikipedia
Crawling: https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
Processing: Domain Name Services
Crawling: http://www.iana.org/protocols
Processing: Make your donation now - Wikimedia Foundation
Crawling: https://en.wikipedia.org/wiki/Help:Contents
Processing: Help:Contents - Wikipedia
Crawling: https://en.wikipedia.org/wiki/Help:Introduction
Processing: Help:Introduction - Wikipedia
Explanation:
In the above exercise,
- The Web_Crawler class crawls web pages. It has two constants:
- MAX_DEPTH: Represents the maximum depth to which the crawler explores links on a web page.
- MAX_THREADS: Represents the maximum number of threads to use for crawling.
- The class maintains a thread-safe Set<String> called visitedUrls (backed by ConcurrentHashMap.newKeySet()) to keep track of the URLs visited during crawling, so that threads working on different seeds never process the same URL twice.
- The crawl(String url, int depth) method crawls a given URL up to a specified depth. If the current depth exceeds MAX_DEPTH, or if the URL has already been visited (Set.add() returns false for a duplicate, making the check-and-mark step atomic), the method returns. Otherwise, it prints a message indicating that the URL is being crawled, retrieves the page using the jsoup library, processes it, and recursively crawls every link found on the page at depth + 1.
- The processPage(Document document) method represents web page processing. In this example, it simply prints the document title. You can customize this method to perform specific operations on web page content; one such customization is sketched after this list.
- The startCrawling(String[] seedUrls) method initiates the crawling process. It creates a fixed-size thread pool using ExecutorService and Executors.newFixedThreadPool() with the number of threads specified by MAX_THREADS. It then submits a crawl task for each seed URL in the seedUrls array to the thread pool for concurrent execution.
- After submitting all the tasks, the method shuts down the executor, waits for all the tasks to complete using executor.awaitTermination(), and prints a completion message. (A variant with a bounded wait is sketched below.)
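As an illustration of customizing processPage(), the sketch below (not part of the original solution; the selectors are just examples) extracts the page's meta description and top-level headings. It is a drop-in replacement inside the Web_Crawler class and uses only jsoup calls already imported above.
public void processPage(Document document) {
    System.out.println("Processing: " + document.title());

    // Content of <meta name="description">; attr() returns "" if no match
    String description = document.select("meta[name=description]").attr("content");
    if (!description.isEmpty()) {
        System.out.println("  Description: " + description);
    }

    // Visible text of every <h1> heading on the page
    for (Element h1 : document.select("h1")) {
        System.out.println("  Heading: " + h1.text());
    }
}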
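The awaitTermination() call in the solution effectively waits forever (Long.MAX_VALUE nanoseconds). A common alternative, sketched here under the assumption that a 60-second deadline suits your workload, bounds the wait and interrupts stragglers; it follows the standard shutdown pattern from the ExecutorService documentation and would replace the shutdown block in startCrawling().
executor.shutdown(); // stop accepting new tasks
try {
    // Give queued crawl tasks up to 60 seconds to finish
    if (!executor.awaitTermination(60, TimeUnit.SECONDS)) {
        executor.shutdownNow(); // interrupt tasks still running
    }
} catch (InterruptedException e) {
    executor.shutdownNow();
    Thread.currentThread().interrupt(); // preserve the interrupt status
}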