Python Web Scraping Exercises

Python Web Scraping: Download and display the content of robots.txt for en.wikipedia.org

Python Web Scraping: Exercise-2 with Solution

Write a Python program to download and display the content of robots.txt for en.wikipedia.org.

Sample Solution:

Python Code:

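import requests
# Download robots.txt from the English Wikipedia host.
response = requests.get("https://en.wikipedia.org/robots.txt")
robots_text = response.text
# Print a heading line, then the downloaded content itself.
print("robots.txt for http://www.wikipedia.org/")
print(robots_text)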

Sample Output:


robots.txt for http://www.wikipedia.org/
# robots.txt for http://www.wikipedia.org/ and friends
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot

User-agent: Orthogaffe

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Zealbot
Disallow: /
Disallow: /wiki/Wikipedia:Article_Incubator
Disallow: /wiki/Wikipedia%3AArticle_Incubator
Disallow: /wiki/Wikipedia_talk:Article_Incubator
Disallow: /wiki/Wikipedia_talk%3AArticle_Incubator
Disallow: /wiki/Category:Noindexed_pages
Disallow: /wiki/Category%3ANoindexed_pages
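
The output above is a list of per-agent rules. As a rough sketch of how such rules can be interpreted, the block below feeds a short excerpt of that output to the standard library's urllib.robotparser and asks whether a given agent may fetch a page. The excerpt is copied from the sample output (the live file may differ), and the helper name build_parser, the agent name ExampleBot, and the page URL are illustrative assumptions, not part of the exercise.

```python
from urllib.robotparser import RobotFileParser

# Excerpt copied from the sample output above; the live file may differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: MJ12bot
Disallow: /

User-agent: UbiCrawler
Disallow: /
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded from robots.txt text (hypothetical helper)."""
    parser = RobotFileParser()
    # parse() expects an iterable of lines, so split the text first.
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
page = "https://en.wikipedia.org/wiki/Web_scraping"  # illustrative URL

# MJ12bot is disallowed from "/", so every path is off limits for it.
print(parser.can_fetch("MJ12bot", page))     # False
# ExampleBot (a made-up agent) matches no entry in the excerpt, so it is allowed.
print(parser.can_fetch("ExampleBot", page))  # True
```

In a real program the text passed to build_parser would be the downloaded content rather than a hard-coded excerpt.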


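The sample solution hard-codes the robots.txt address. Because robots.txt is served from the root of a host, the address can also be derived from any page URL on that host. The following is a minimal sketch under some assumptions: it uses the standard library's urllib instead of requests, and the function names, the timeout value, and the User-Agent string are all hypothetical choices made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen


def robots_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (with any port); drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots_txt(page_url, timeout=10):
    """Download robots.txt for page_url's host (performs a network request)."""
    request = Request(
        robots_url(page_url),
        # Hypothetical identifier; many sites expect a descriptive User-Agent.
        headers={"User-Agent": "robots-exercise/0.1"},
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


print(robots_url("https://en.wikipedia.org/wiki/Web_scraping"))
# https://en.wikipedia.org/robots.txt
```

Calling fetch_robots_txt("https://en.wikipedia.org/wiki/Web_scraping") would then return the file's text, provided the network is reachable and the server accepts the request.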