w3resource

Python Web Scraping - Exercises, Practice, Solution

Web Scraping

Web scraping, or web data extraction, is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Python Requests module:

Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. There’s no need to manually add query strings to your URLs, or to form-encode your POST data.
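
As a rough illustration of that last point, the sketch below builds a GET and a POST request without sending them, so the encoding Requests applies can be inspected offline. The httpbin.org URLs and the field names are placeholders chosen for this example, not part of the exercises.

```python
import requests

# Build (but do not send) two requests so the encoding can be inspected offline.
# The URLs and field names are placeholders chosen for this sketch.
get_req = requests.Request(
    "GET",
    "https://httpbin.org/get",
    params={"q": "web scraping", "page": 2},  # becomes the query string
).prepare()

post_req = requests.Request(
    "POST",
    "https://httpbin.org/post",
    data={"name": "Ada Lovelace", "lang": "python"},  # becomes a form-encoded body
).prepare()

print(get_req.url)                       # URL with the query string appended
print(post_req.body)                     # form-encoded POST body
print(post_req.headers["Content-Type"])  # content type chosen by Requests
```

In everyday use the same params= and data= arguments are passed straight to requests.get() and requests.post().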

Python Web Scraping [27 exercises with solutions]


1. Write a Python program to test if a given page is found or not on the server.

2. Write a Python program to download and display the content of robots.txt for en.wikipedia.org.

3. Write a Python program to get the number of datasets currently listed on data.gov.

4. Write a Python program to convert an address (like "1600 Amphitheatre Parkway, Mountain View, CA") into geographic coordinates (like latitude 37.423021 and longitude -122.083739).
Geocoding: Geocoding is the process of converting addresses (like "1600 Amphitheatre Parkway, Mountain View, CA") into geographic coordinates (like latitude 37.423021 and longitude -122.083739), which you can use to place markers on a map, or position the map.

5. Write a Python program to display the name of the most recently added dataset on data.gov.

6. Write a Python program to extract h1 tag from example.com.

7. Write a Python program to extract and display all the header tags from en.wikipedia.org/wiki/Main_Page.
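
One possible approach to the parsing half of this exercise, sketched with only the standard library's html.parser so it can be tried on any HTML string. The names HeadingCollector and extract_headings are made up for this sketch; downloading the page is left to the caller.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collect (tag, text) pairs for every h1-h6 element."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # heading tag currently open, if any
        self._chunks = []     # text pieces seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._current = tag
            self._chunks = []

    def handle_data(self, data):
        if self._current is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._chunks).split())
            self.headings.append((tag, text))
            self._current = None


def extract_headings(html):
    """Return a list of (tag, text) tuples in document order."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings
```

With Requests installed, the function could be fed the live page, for example extract_headings(requests.get("https://en.wikipedia.org/wiki/Main_Page").text).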

8. Write a Python program to extract and display all the image links from en.wikipedia.org/wiki/Peter_Jeffrey_(RAAF_officer).
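
A minimal sketch of the link-extraction step, again using only the standard library. It assumes that "image links" means the src attribute of each img tag, resolved against the page URL; ImageLinkCollector and extract_image_links are hypothetical names introduced here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkCollector(HTMLParser):
    """Collect absolute URLs from the src attribute of every <img> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative and scheme-relative URLs against the page URL.
                self.links.append(urljoin(self.base_url, src))


def extract_image_links(html, base_url):
    """Return the image URLs found in html, in document order."""
    parser = ImageLinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

The HTML text itself could come from requests.get(url).text, with the same url passed as base_url.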

9. Write a Python program to get 90 days of visits broken down by browser for all sites on data.gov.

10. Write a Python program that retrieves an arbitrary Wikipedia page on "Python" and creates a list of links on that page.

11. Write a Python program to check whether a page contains a title or not.
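
One way the check itself might look, assuming "contains a title" means a title element with non-blank text. TitleFinder, get_title and has_title are names invented for this sketch, and the HTML is expected to be fetched separately.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Record the text of the first <title> element, if there is one."""

    def __init__(self):
        super().__init__()
        self.title = None
        self._in_title = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is considered.
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self.title = "".join(self._chunks).strip()
            self._in_title = False


def get_title(html):
    """Return the page title as a string, or None if no <title> was found."""
    finder = TitleFinder()
    finder.feed(html)
    finder.close()
    return finder.title


def has_title(html):
    """Return True only when a non-blank title is present."""
    return bool(get_title(html))
```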

12. Write a Python program to list all language names and number of related articles in the order they appear in wikipedia.org.

13. Write a Python program to get the number of people visiting a U.S. government website right now.
Source: https://analytics.usa.gov/data/live/realtime.json

14. Write a Python program to get the number of security alerts issued by US-CERT in the current year.
Source: https://www.us-cert.gov/ncas/alerts

15. Write a Python program to get the number of Pinterest accounts maintained by U.S. State Department embassies and missions.
Source: https://www.state.gov/r/pa/ode/socialmedia/

16. Write a Python program to get the number of followers of a given Twitter account.

17. Write a Python program to get the number of accounts a given Twitter account is following.

18. Write a Python program to get the number of posts on Twitter liked by a given account.

19. Write a Python program to count the number of tweets by a given Twitter account.

20. Write a Python program to scrape the number of tweets of a given Twitter account.

21. Write a Python program to find the live weather report (temperature, wind speed, description and weather) of a given city.

22. Write a Python program to display the date, days, title, city and country of the next 25 Hackevents.

23. Write a Python program to download IMDB's Top 250 data (movie name, initial release, director name and stars).

24. Write a Python program to get movie name, year and a brief summary of the top 10 random movies.

25. Write a Python program to get the number of magnitude 4.5+ earthquakes detected worldwide by the USGS.
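
A sketch of the counting step only. It assumes the data arrives as a GeoJSON feed in which each earthquake is an entry in a top-level "features" list with its magnitude stored under properties.mag; the exercise does not name a feed URL, so none is hard-coded. The sample data below is hand-written for illustration and is not real USGS output.

```python
import json


def count_earthquakes(feed, min_magnitude=4.5):
    """Count features whose magnitude is at least min_magnitude."""
    count = 0
    for feature in feed.get("features", []):
        magnitude = feature.get("properties", {}).get("mag")
        # Entries without a magnitude are skipped.
        if magnitude is not None and magnitude >= min_magnitude:
            count += 1
    return count


# Hypothetical sample in the assumed feed shape (not real data).
sample_feed = json.loads("""
{"features": [
  {"properties": {"mag": 4.6, "place": "sample A"}},
  {"properties": {"mag": 5.2, "place": "sample B"}},
  {"properties": {"mag": 3.9, "place": "sample C"}},
  {"properties": {"mag": null, "place": "sample D"}}
]}
""")

print(count_earthquakes(sample_feed))  # prints 2
```

With a real feed, the dictionary would typically come from requests.get(feed_url).json().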

26. Write a Python program to display the contents of different attributes, such as status_code, headers, url, history, encoding, reason, cookies, elapsed, request and content, of a specified resource.
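
A small sketch of how the listed attributes might be gathered in one pass. The helper describe_response is a name invented here; it works on any object exposing those attribute names, so a stand-in object is used to keep the example runnable without network access.

```python
from types import SimpleNamespace

# Attribute names taken from the exercise text.
ATTRIBUTES = (
    "status_code", "headers", "url", "history", "encoding",
    "reason", "cookies", "elapsed", "request", "content",
)


def describe_response(response):
    """Return {attribute name: value}; missing attributes map to None."""
    return {name: getattr(response, name, None) for name in ATTRIBUTES}


# Stand-in object so the sketch runs offline; its values are made up.
fake = SimpleNamespace(
    status_code=200,
    reason="OK",
    url="https://example.com/",
    encoding="UTF-8",
    content=b"<html></html>",
)

for name, value in describe_response(fake).items():
    print(f"{name}: {value!r}")
```

With Requests installed, describe_response(requests.get("https://example.com")) would report the same attributes for a real response.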

27. Write a Python program to verify SSL certificates for HTTPS requests using the Requests module.
Note: Requests verifies SSL certificates for HTTPS requests, just like a web browser. By default, SSL verification is enabled, and Requests will throw an SSLError if it's unable to verify the certificate.
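
The behaviour described in the note can be wrapped in a small helper, sketched below. The function name certificate_is_valid and the 10-second timeout are choices made for this example; other failures, such as DNS errors or timeouts, are deliberately left to propagate.

```python
import requests


def certificate_is_valid(url, timeout=10):
    """Return True if the HTTPS certificate for url verifies, False on SSLError."""
    try:
        # verify=True is already the default; it is spelled out here for clarity.
        requests.get(url, verify=True, timeout=timeout)
    except requests.exceptions.SSLError:
        return False
    return True
```

Calling certificate_is_valid with the URL of a site that presents an expired or self-signed certificate should return False, while a site with a valid certificate should return True.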

More to come!


Python: Tips of the Day

Python: How to install pip on Windows?

Python 2.7.9+ and 3.4+

Good news! Python 3.4 (released March 2014) and Python 2.7.9 (released December 2014) ship with Pip. This is the best feature of any Python release. It makes the community's wealth of libraries accessible to everyone. Newbies are no longer excluded from using community libraries by the prohibitive difficulty of setup. In shipping with a package manager, Python joins Ruby, Node.js, Haskell, Perl, Go, and almost every other contemporary language with a majority open-source community. Thank you, Python.

If you do find that pip is not available when using Python 3.4+ or Python 2.7.9+, simply execute, for example:

py -3 -m ensurepip
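
Before running that command, it can be worth checking from Python whether pip is already importable. The sketch below does so with importlib; module_available is a helper name introduced for this example.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the named top-level module can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


print("pip importable:", module_available("pip"))
print("ensurepip importable:", module_available("ensurepip"))
# The same bootstrap step, written for whichever interpreter runs this script.
print("To bootstrap pip:", sys.executable, "-m ensurepip")
```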

Of course, that doesn't mean Python packaging is a solved problem. The experience remains frustrating. I discuss this in the Stack Overflow question "Does Python have a package/module management system?".

And, alas, for everyone using Python 2.7.8 or earlier (a sizable portion of the community), there's no plan to ship Pip to you. Manual instructions follow.

Python 2 ≤ 2.7.8 and Python 3 ≤ 3.3

Flying in the face of its 'batteries included' motto, Python ships without a package manager. To make matters worse, Pip was, until recently, ironically difficult to install.

Official instructions

Per https://pip.pypa.io/en/stable/installing/#do-i-need-to-install-pip:

Download get-pip.py, being careful to save it as a .py file rather than .txt. Then, run it from the command prompt:

python get-pip.py

You may need an administrator command prompt to do this. Follow the Microsoft TechNet guide "Start a Command Prompt as an Administrator".

This installs the pip package, which (in Windows) contains ...\Scripts\pip.exe. That path must be in the PATH environment variable to use pip from the command line (see the second part of 'Alternative instructions' for adding it to your PATH).

Alternative instructions

The official documentation tells users to install Pip and each of its dependencies from source. That's tedious for the experienced and prohibitively difficult for newbies.

For our sake, Christoph Gohlke prepares Windows installers (.msi) for popular Python packages. He builds installers for all Python versions, both 32-bit and 64-bit. You need to:

  1. Install setuptools
  2. Install pip

For me, this installed Pip at C:\Python27\Scripts\pip.exe. Find pip.exe on your computer, then add its folder (for example, C:\Python27\Scripts) to your path (Start / Edit environment variables). Now you should be able to run pip from the command line. Try installing a package:

pip install httpie
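
If the pip command is still not found, a short script can show where this interpreter keeps its scripts and whether that folder is on PATH. This is only a diagnostic sketch; scripts_dir_on_path is a helper name made up here.

```python
import os
import shutil
import sysconfig


def scripts_dir_on_path(scripts_dir=None, path=None):
    """Return True if scripts_dir is one of the entries in the PATH string."""
    scripts_dir = scripts_dir or sysconfig.get_path("scripts")
    path = os.environ.get("PATH", "") if path is None else path
    # Normalise case and separators so equivalent spellings compare equal.
    wanted = os.path.normcase(os.path.normpath(scripts_dir))
    entries = [
        os.path.normcase(os.path.normpath(entry))
        for entry in path.split(os.pathsep)
        if entry
    ]
    return wanted in entries


print("pip executable found at:", shutil.which("pip"))
print("Scripts folder:", sysconfig.get_path("scripts"))
print("Scripts folder on PATH:", scripts_dir_on_path())
```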

There you go (hopefully)! Solutions for common problems are given below:

Proxy problems

If you work in an office, you might be behind an HTTP proxy. If so, set the environment variables http_proxy and https_proxy. Most Python applications (and other free software) respect these. Example syntax:

http://proxy_url:port
http://username:password@proxy_url:port
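
For illustration, the same variables can be set from Python for the current process. The sketch below assembles a proxy URL in the form shown above and percent-encodes the credentials so that characters such as "@" or ":" in a password do not break the URL. The host, port and credentials are placeholders, and build_proxy_url is a helper invented for this example.

```python
import os
from urllib.parse import quote


def build_proxy_url(host, port, username=None, password=None):
    """Build a proxy URL in the http://[user:password@]host:port form."""
    credentials = ""
    if username is not None:
        credentials = quote(username, safe="")
        if password is not None:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    return f"http://{credentials}{host}:{port}"


# Placeholder values; substitute the details of your own proxy.
proxy = build_proxy_url("proxy_url", 8080, "username", "p@ss:word")

# Changes to os.environ affect this process and its child processes only.
os.environ["http_proxy"] = proxy
os.environ["https_proxy"] = proxy
print(proxy)
```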

If you're really unlucky, your proxy might be a Microsoft NTLM proxy. Free software can't cope. The only solution is to install a free-software-friendly proxy that forwards to the nasty proxy, such as CNTLM: http://cntlm.sourceforge.net/

Unable to find vcvarsall.bat

Python modules can be partly written in C or C++. Pip tries to compile from source. If you don't have a C/C++ compiler installed and configured, you'll see this cryptic error message.

Error: Unable to find vcvarsall.bat

You can fix that by installing a C++ compiler such as MinGW or Visual C++. Microsoft actually ships one specifically for use with Python: Microsoft Visual C++ Compiler for Python 2.7.

Often though it's easier to check Christoph's site for your package.

Ref: https://bit.ly/2B0ch3y