Build a URL Scraper in Python to Extract URLs from Webpages
URL Scraper:
Build a program that extracts URLs from a given webpage.
Input value:
The URL of a webpage from which URLs need to be extracted.
Output value:
A list of URLs extracted from the given webpage.
Example:
Input value: Enter the URL of the webpage: https://www.example.com
Output value:
URLs extracted from https://www.example.com:
1. https://www.iana.org/domains/example
Solution 1: URL Scraper Using requests and BeautifulSoup
This solution uses the requests library to fetch the webpage content and BeautifulSoup from bs4 to parse the HTML and extract all URLs.
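Note that requests and beautifulsoup4 are third-party packages; if they are not already available, they can usually be installed with pip (for example, pip install requests beautifulsoup4).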
Code:
import requests  # Import requests to make HTTP requests
from bs4 import BeautifulSoup  # Import BeautifulSoup for HTML parsing
def extract_urls(webpage_url):
    """Extracts all URLs from a given webpage."""
    try:
        # Send a GET request to the webpage (the timeout prevents hanging indefinitely)
        response = requests.get(webpage_url, timeout=10)
        response.raise_for_status()  # Raise an error if the request was unsuccessful
        # Parse the webpage content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find all anchor tags with href attribute
        anchor_tags = soup.find_all('a', href=True)
        
        # Extract URLs from the anchor tags
        urls = [tag['href'] for tag in anchor_tags if tag['href'].startswith('http')]
        
        # Print the extracted URLs
        print(f"URLs extracted from {webpage_url}:")
        for idx, url in enumerate(urls, 1):
            print(f"{idx}. {url}")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching webpage: {e}")
# Input: Get URL from user
webpage_url = input("Enter the URL of the webpage: ")
extract_urls(webpage_url)  # Call the function to extract URLs 
Output:
Enter the URL of the webpage: https://www.example.com
URLs extracted from https://www.example.com:
1. https://www.iana.org/domains/example
Explanation:
- Imports requests for making HTTP requests to fetch webpage content.
 - Imports BeautifulSoup from bs4 for parsing HTML content.
 - extract_urls(webpage_url) function:
 - Sends a GET request to the provided URL.
 - Parses the response using BeautifulSoup to find all anchor tags (<a>).
 - Extracts URLs that start with "http" to ensure they are absolute URLs.
 - Prints the list of extracted URLs.
 - Error Handling:
 - Catches and prints any exceptions related to HTTP requests.
 - Input from User:
 - Takes a URL as input from the user and calls the extract_urls() function.
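The list comprehension above keeps only href values that start with "http", so relative links such as /about or page.html are dropped. If those are wanted as well, a common approach is to resolve every href against the page URL with urllib.parse.urljoin from the standard library. The snippet below is a minimal sketch of that idea under the same requests/BeautifulSoup setup; the function name extract_all_urls is only illustrative.

from urllib.parse import urljoin  # Standard-library helper for resolving relative URLs

import requests
from bs4 import BeautifulSoup

def extract_all_urls(webpage_url):
    """Extract every link, resolving relative hrefs against the page URL."""
    response = requests.get(webpage_url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    urls = []
    for tag in soup.find_all('a', href=True):
        href = tag['href'].strip()
        if href.startswith(('#', 'mailto:', 'javascript:')):
            continue  # Skip in-page anchors and non-HTTP schemes
        urls.append(urljoin(webpage_url, href))  # e.g. /about -> https://www.example.com/about
    return urls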
 
Solution 2: URL Scraper Using urllib and re (Regular Expressions)
This solution uses the urllib library to fetch webpage content and regular expressions to extract URLs directly from the HTML.
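Both urllib and re are part of Python's standard library, so this solution needs no extra installation.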
Code:
import urllib.request  # Import urllib.request to handle HTTP requests
import urllib.error  # Import urllib.error for the URLError exception caught below
import re  # Import re for regular expression matching
def extract_urls(webpage_url):
    """Extracts all URLs from a given webpage."""
    try:
        # Open the URL and read the webpage content
        with urllib.request.urlopen(webpage_url) as response:
            html_content = response.read().decode('utf-8')  # Decode the content to a string format
        
        # Regular expression to find all URLs in the HTML content
        urls = re.findall(r'href=["\'](http[s]?://[^\s"\'<>]+)["\']', html_content)
        
        # Print the extracted URLs
        print(f"URLs extracted from {webpage_url}:")
        for idx, url in enumerate(urls, 1):
            print(f"{idx}. {url}")
    except urllib.error.URLError as e:
        print(f"Error fetching webpage: {e}")
# Input: Get URL from user
webpage_url = input("Enter the URL of the webpage: ")
extract_urls(webpage_url)  # Call the function to extract URLs
Output:
Enter the URL of the webpage: https://www.example.com
URLs extracted from https://www.example.com:
1. https://www.iana.org/domains/example
Explanation:
- Imports urllib.request to handle HTTP requests and fetch webpage content.
 - Imports re for using regular expressions to match patterns in the HTML content.
 - extract_urls(webpage_url) function:
 - Opens the URL using urllib.request.urlopen() and reads the webpage content.
 - Uses re.findall() with a regular expression to find all URLs in the HTML content.
 - Prints the list of extracted URLs.
 - Error Handling:
 - Catches and prints any exceptions related to URL errors.
 - Input from User:
 - Takes a URL as input from the user and calls the extract_urls() function.
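Because re.findall() returns every match, a page that links to the same address several times will list it repeatedly. A simple way to drop duplicates while preserving the original order is dict.fromkeys(); the helper below is a minimal sketch, and the name unique_urls is only illustrative.

def unique_urls(urls):
    """Return the URLs in their original order with duplicates removed."""
    return list(dict.fromkeys(urls))  # dict keys preserve insertion order (Python 3.7+)

# For example:
# unique_urls(['https://a.example', 'https://b.example', 'https://a.example'])
# -> ['https://a.example', 'https://b.example']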
 
Summary:
Solution 1 (requests and BeautifulSoup): uses requests to fetch the page and BeautifulSoup to parse the HTML, which is the more Pythonic approach. It is easier to read and maintain and copes better with complex or messy HTML structures.
Solution 2 (urllib and regular expressions): uses the standard-library urllib together with a regular expression, a lightweight approach that works well for simple URL extraction. However, regular expressions do not handle complex or malformed HTML as robustly as BeautifulSoup.
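A middle ground between the two is the standard library's html.parser module, which parses HTML properly without any third-party dependency or hand-written regular expression. The sketch below shows the idea; the class name LinkCollector is only illustrative, and it collects absolute http/https links the same way Solution 1 does.

import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags while the HTML is parsed."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value.startswith('http'):
                    self.urls.append(value)

with urllib.request.urlopen('https://www.example.com') as response:
    parser = LinkCollector()
    parser.feed(response.read().decode('utf-8'))

print(parser.urls)  # e.g. ['https://www.iana.org/domains/example']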