urllib is the Python module for working with URLs. Its main purpose is to fetch URLs: through the urlopen function it can retrieve resources over a variety of protocols, which makes it a common tool for data-collection tasks in Python for Data Science.
urllib is a package consisting of several modules for working with URLs:
- request, for opening and reading URLs.
- parse, for parsing URLs.
- error, containing the exceptions raised by request.
- robotparser, for parsing robots.txt files.
urllib is part of the Python standard library, so it does not need to be installed separately; it is available in any standard Python environment and can be imported directly.
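A quick way to confirm that the package and its submodules are available is simply to import them:

```python
# urllib ships with the standard library; a successful import
# confirms it is available in the current environment.
import urllib.request
import urllib.parse
import urllib.error
import urllib.robotparser

print(urllib.request.__name__)  # urllib.request
```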
Now we will look into some modules in detail.
The request module defines functions and classes for opening URLs. One simple way to open a URL is shown below.
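A minimal sketch using urlopen: in practice you would pass an http(s) URL such as "https://www.python.org/", but here a data: URL (which urlopen also supports) is used so the example runs without network access.

```python
from urllib.request import urlopen

# For a real web page you would write urlopen("https://www.python.org/");
# the data: URL below keeps the example self-contained and offline.
with urlopen("data:text/plain,Hello%20urllib") as response:
    body = response.read()  # read() returns the raw bytes of the resource

print(body)  # b'Hello urllib'
```

The object returned by urlopen behaves like a file and works as a context manager, so using `with` ensures it is closed after reading.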
The parse module defines functions for manipulating URLs and their component parts, either to build them up or to break them down. The main focus is on splitting a URL into its smaller components, and on joining different URL components back into a URL string.
Other functions of urllib.parse include urlencode, quote and unquote.
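The splitting and joining described above can be sketched as follows; the URLs are illustrative examples:

```python
from urllib.parse import urlparse, urljoin, urlencode

# Split a URL into its named components.
parts = urlparse("https://www.example.com/path/page?query=python#section")
print(parts.scheme)  # https
print(parts.netloc)  # www.example.com
print(parts.path)    # /path/page

# Join a base URL with a relative reference.
print(urljoin("https://www.example.com/path/page", "other"))
# https://www.example.com/path/other

# Encode a dict as a query string.
print(urlencode({"q": "urllib", "page": 2}))  # q=urllib&page=2
```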
The error module defines the exception classes raised by urllib.request. When an error occurs while fetching a URL, this module supplies the exception to be raised. The following exceptions are raised:
- URLError – raised for errors in URLs, or when a URL cannot be fetched because of a connectivity problem. It has a ‘reason’ attribute that tells the user the reason for the error.
- HTTPError – raised for HTTP-specific errors such as authentication failures. It is a subclass of URLError. Common errors are ‘404’ (page not found), ‘403’ (request forbidden) and ‘401’ (authentication required).
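The two exceptions are typically handled together, with the more specific HTTPError caught first. A sketch (the URL is an example; an unreachable host triggers URLError, while a reachable server answering with a 4xx/5xx status triggers HTTPError):

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    # Example URL; replace with the resource you actually want to fetch.
    with urlopen("https://www.example.com/missing-page") as response:
        print(response.status)
except HTTPError as e:
    # HTTPError carries the HTTP status code in e.code.
    print("HTTP error:", e.code)
except URLError as e:
    # URLError covers lower-level failures, e.g. DNS or connection errors.
    print("Failed to reach the server:", e.reason)
```

Because HTTPError is a subclass of URLError, the `except HTTPError` clause must come before `except URLError`, or it would never run.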
The robotparser module provides a single class, RobotFileParser, which answers questions about whether a particular user agent may fetch a URL on a site that publishes a robots.txt file. The robots.txt file informs web crawlers which parts of the server should not be accessed.
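A small sketch of RobotFileParser: normally you would call set_url("https://www.example.com/robots.txt") followed by read() to fetch a real file, but here the rules are parsed from a list of lines so the example runs offline (the rules and URLs are made up for illustration).

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() accepts the lines of a robots.txt file directly.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# can_fetch(useragent, url) applies the parsed rules.
print(rp.can_fetch("*", "https://www.example.com/public/page"))   # True
print(rp.can_fetch("*", "https://www.example.com/private/page"))  # False
```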