w3resource

Python Project - Basic URL Crawler for Extracting URLs

Basic URL Crawler:

Develop a program that crawls a website and extracts URLs.

Input values:

  • Starting URL: The URL from which the crawler will start.
  • Depth (optional): The number of link levels the crawler follows from the starting URL.
  • Optional Parameters:
    • Domain restriction: Whether to restrict crawling to the same domain as the starting URL.
    • File types to include or exclude (e.g., only HTML pages).
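The inputs above could be collected from the command line. The following is a minimal sketch using argparse; the flag names (--depth, --all-domains, --include-ext) are assumptions made for illustration and are not part of the original project description.

```python
import argparse

def build_parser():
    """Build a parser for the crawler inputs (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Basic URL crawler")
    parser.add_argument("starting_url", help="URL from which the crawler starts")
    parser.add_argument("--depth", type=int, default=1,
                        help="number of link levels to follow (default: 1)")
    # Domain restriction is on by default; this flag switches it off.
    parser.add_argument("--all-domains", dest="domain_restriction",
                        action="store_false",
                        help="also follow links that leave the starting domain")
    parser.add_argument("--include-ext", nargs="*", default=[],
                        help="file extensions to include, e.g. .html .htm")
    return parser

# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["http://example.com", "--depth", "2"])
print(args.starting_url, args.depth, args.domain_restriction, args.include_ext)
```

Keeping domain restriction enabled unless the user opts out mirrors the default used later in crawl_website.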

Output values:

  • Extracted URLs: A list of URLs found during the crawling process.
  • Status Messages:
    • Progress updates.
    • Error messages if the crawling fails (e.g., invalid URL, network issues).
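One way to produce the "invalid URL" error message before any request is sent is to check the starting URL with urllib.parse. This is only a sketch: the function name validate_start_url and the message wording are assumptions, not part of the original solution.

```python
from urllib.parse import urlparse

def validate_start_url(url):
    """Return (ok, message) for a candidate starting URL (hypothetical helper)."""
    parts = urlparse(url)
    # Only HTTP and HTTPS pages can be fetched by this kind of crawler.
    if parts.scheme not in ("http", "https"):
        return False, f"Error: unsupported or missing scheme in {url!r}"
    # A URL such as "http://" has a scheme but no host to contact.
    if not parts.netloc:
        return False, f"Error: no host found in {url!r}"
    return True, f"Starting URL: {url}"

for candidate in ["http://example.com", "example.com", "ftp://example.com"]:
    ok, message = validate_start_url(candidate)
    print(message)
```

Network problems cannot be detected this way; they only show up when the page is actually requested.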

Examples:

Example 1: Basic Crawling from a Starting URL

Input:

  • Starting URL: http://example.com

Output:

  • List of extracted URLs:

http://example.com/page1
http://example.com/page2
http://example.com/about
http://example.com/contact

Example Console Output:

Starting URL: http://example.com
Crawling depth: 1
Crawling http://example.com...
Found URL: http://example.com/page1
Found URL: http://example.com/page2
Found URL: http://example.com/about
Found URL: http://example.com/contact
Crawling completed.

Example 2: Crawling with Depth Restriction

Input:

  • Starting URL: http://example.com
  • Depth: 2

Output:

  • List of extracted URLs:

http://example.com/page1
http://example.com/page2
http://example.com/about
http://example.com/contact
http://example.com/page1/subpage1
http://example.com/page2/subpage2

Example Console Output:

Starting URL: http://example.com
Crawling depth: 2
Crawling http://example.com...
Found URL: http://example.com/page1
Found URL: http://example.com/page2
Found URL: http://example.com/about
Found URL: http://example.com/contact
Crawling http://example.com/page1...
Found URL: http://example.com/page1/subpage1
Crawling http://example.com/page2...
Found URL: http://example.com/page2/subpage2
Crawling completed.
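The effect of the depth setting can be illustrated without any network access. The sketch below walks a hypothetical in-memory link graph (the LINKS dictionary is made up to mirror this example) level by level with a queue. It is an illustration of depth limiting under those assumptions, not the solution presented later in this article.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page URL -> links on that page.
LINKS = {
    "http://example.com": [
        "http://example.com/page1",
        "http://example.com/page2",
        "http://example.com/about",
        "http://example.com/contact",
    ],
    "http://example.com/page1": ["http://example.com/page1/subpage1"],
    "http://example.com/page2": ["http://example.com/page2/subpage2"],
}

def crawl_levels(start, max_depth):
    """Return URLs discovered within max_depth levels of start, in discovery order."""
    seen = {start}
    found = []
    queue = deque([(start, 1)])  # (page to read, its depth level)
    while queue:
        url, level = queue.popleft()
        if level > max_depth:
            continue  # too deep: the page is recorded but its links are not followed
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                found.append(link)
                queue.append((link, level + 1))
    return found

print(len(crawl_levels("http://example.com", 1)))  # 4 URLs at depth 1
print(len(crawl_levels("http://example.com", 2)))  # 6 URLs at depth 2
```

With a depth of 1 only the links on the starting page are reported; with a depth of 2 the two subpages are added, in the same order as the list above.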

Example 3: Domain Restriction

Input:

  • Starting URL: http://example.com
  • Domain restriction: Yes

Output:

  • List of extracted URLs (only from the same domain):

http://example.com/page1
http://example.com/page2
http://example.com/about
http://example.com/contact

Example Console Output:

Starting URL: http://example.com
Crawling depth: 1
Domain restriction: Yes
Crawling http://example.com...
Found URL: http://example.com/page1
Found URL: http://example.com/page2
Found URL: http://example.com/about
Found URL: http://example.com/contact
Crawling completed.
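A domain check of this kind can be sketched by comparing the host part of each link with that of the starting URL. The helper name same_domain and the external link in the sample list are assumptions made for illustration.

```python
from urllib.parse import urlparse

def same_domain(url, starting_url):
    """True when url has the same host (netloc) as starting_url."""
    return urlparse(url).netloc.lower() == urlparse(starting_url).netloc.lower()

links = [
    "http://example.com/page1",
    "http://other.example.org/page",  # hypothetical external link
    "http://example.com/about",
]
kept = [u for u in links if same_domain(u, "http://example.com")]
print(kept)
```

Note that this comparison is strict: www.example.com and example.com count as different hosts, so a real crawler may need a looser rule.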

Here are two different solutions for building a basic URL crawler that crawls a website and extracts URLs. The first solution uses the requests and BeautifulSoup libraries to perform a simple crawl, while the second solution uses the Scrapy framework for more advanced crawling capabilities.

Prerequisites for Both Solutions:

Install Required Python Libraries:

pip install requests beautifulsoup4 scrapy

Solution 1: Basic URL Crawler Using requests and BeautifulSoup

This solution uses the requests library to fetch web pages and BeautifulSoup to parse the HTML content and extract URLs.

Code:

# Solution 1: Basic URL Crawler Using 'requests' and 'BeautifulSoup'

import requests  # Library for making HTTP requests
from bs4 import BeautifulSoup  # Library for parsing HTML content
from urllib.parse import urljoin, urlparse  # Functions to handle URL joining and parsing

def crawl_website(starting_url, depth=1, domain_restriction=True):
    """Crawl a website starting from a given URL to extract URLs."""
    # Set to store all discovered URLs
    visited_urls = set()
    # Domain of the starting URL, computed once for the domain check
    start_domain = urlparse(starting_url).netloc

    # Helper function to recursively crawl the website
    def crawl(url, current_depth):
        """Recursively crawl the website up to the specified depth."""
        if current_depth > depth:  # Stop crawling if the maximum depth is exceeded
            return
        print(f"Crawling {url}...")

        try:
            # Send a GET request; the timeout keeps a stalled server from hanging the crawl
            response = requests.get(url, timeout=10)
            # Treat HTTP error codes (4xx, 5xx) as failures instead of parsing error pages
            response.raise_for_status()
            # Parse the content using BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')

            # Iterate over all <a> tags to find URLs
            for link in soup.find_all('a', href=True):
                # Resolve the full URL
                full_url = urljoin(url, link['href'])
                # Skip links that are not web pages (e.g., mailto: or javascript:)
                if urlparse(full_url).scheme not in ('http', 'https'):
                    continue
                # Check domain restriction
                if domain_restriction and urlparse(full_url).netloc != start_domain:
                    continue

                # Add the discovered URL to the set
                if full_url not in visited_urls:
                    print(f"Found URL: {full_url}")
                    visited_urls.add(full_url)
                    # Recursively crawl the discovered URL
                    crawl(full_url, current_depth + 1)

        except requests.RequestException as e:
            # Report the failure and continue with the remaining URLs
            print(f"Error crawling {url}: {e}")

    # Start crawling from the starting URL
    crawl(starting_url, 1)
    print("Crawling completed.")
    return visited_urls

# Example usage
starting_url = "https://www.python.org"
crawled_urls = crawl_website(starting_url, depth=2, domain_restriction=True)
print("Extracted URLs:", crawled_urls)

Output:

Crawling https://www.python.org...
Found URL: https://www.python.org#content
Crawling https://www.python.org#content...
Found URL: https://www.python.org#python-network
Found URL: https://www.python.org/
Found URL: https://www.python.org/psf/
Found URL: https://www.python.org/jobs/
Found URL: https://www.python.org/community-landing/
Found URL: https://www.python.org#top
Found URL: https://www.python.org#site-map
Found URL: https://www.python.org
Found URL: https://www.python.org/community/irc/
Found URL: https://www.python.org/about/
Found URL: https://www.python.org/about/apps/
Found URL: https://www.python.org/about/quotes/
Found URL: https://www.python.org/about/gettingstarted/
Found URL: https://www.python.org/about/help/
Found URL: https://www.python.org/downloads/
Found URL: https://www.python.org/downloads/source/
Found URL: https://www.python.org/downloads/windows/
...

Explanation:

  • Function crawl_website:
    • Takes a starting URL, depth, and domain restriction as inputs to control the crawling process.
    • Uses a recursive helper function crawl to navigate through the website up to the specified depth.
  • Recursive Crawling:
    • For each URL, it sends a GET request to fetch the HTML content, uses BeautifulSoup to parse the HTML, and iterates over all <a> tags to find links.
    • Resolves relative URLs to absolute URLs using urljoin.
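    • Adds each new URL to a set to avoid duplicates and recursively crawls it if it is within the domain and depth limits. URLs that differ only in their fragment (for example, #content) count as distinct, so the same page can be fetched more than once, as the output above shows.
  • Domain Restriction and Error Handling:
    • When domain restriction is enabled, skips any link whose host differs from that of the starting URL.
    • Catches requests.RequestException so that a network failure on one page is reported without stopping the rest of the crawl.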
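In the output above, links such as https://www.python.org#content are reported as separate URLs even though they point to the same page. A possible refinement, sketched below, is to drop the fragment before recording a URL and to filter by file extension, which the input list mentions but the solution does not implement. The helper names normalize and looks_like_html and the list of allowed extensions are assumptions.

```python
import posixpath
from urllib.parse import urldefrag, urljoin, urlparse

def normalize(base_url, href):
    """Resolve href against base_url and drop any #fragment."""
    absolute = urljoin(base_url, href)
    return urldefrag(absolute).url

def looks_like_html(url, allowed=(".html", ".htm", "")):
    """Rough file-type filter based on the path extension (hypothetical policy)."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    return ext in allowed

print(normalize("https://www.python.org", "#content"))
print(normalize("https://www.python.org/about/", "apps/#top"))
print(looks_like_html("https://www.python.org/about/"))
print(looks_like_html("https://example.com/manual.pdf"))
```

Inside crawl, normalize(url, link['href']) could take the place of the plain urljoin call, and looks_like_html could be checked before a URL is added to the set. The extension test is only a heuristic: a path whose last segment contains a dot but has no real file extension would be rejected, and the server's Content-Type header is a more reliable signal.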

