27. Python's mechanize Library

Generally, a user can view a website either through a browser or by retrieving its source code with one of several tools; the Linux program wget is a popular choice. In Python, browsing the Internet means retrieving and parsing a website's HTML source code. In this tutorial, we'll learn how to use the mechanize library for this purpose.
To use the mechanize library, download its tar.gz file, extract the archive, and install it by running python setup.py install from the extracted directory.
Mechanize's primary class, Browser, allows the manipulation of anything that can be manipulated inside a browser. Let's see an example that views the source code of a website using the mechanize library (saved as mech1.py):
    #!/usr/bin/env python
    # Program to view source code using mechanize
    import mechanize

    def page_view(url):
        try:
            # create browser object
            browser = mechanize.Browser()
            # uncomment the next line to skip the robots.txt check
            #browser.set_handle_robots(False)
            page = browser.open(url)
            src_code = page.read()
            # print source code
            print(src_code)
        except Exception:
            print("Error in browsing...")

    url = "http://www.syngress.com/"
    page_view(url)

Now, in the script mech1.py, change the url to https://www.google.com. What do you see? "Error in browsing..." Let's analyse the error more closely. Remove the try and except statements from the code above and run it again. It still doesn't work, but this time you will see the detailed error message.
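As an alternative to deleting the try and except statements, the handler itself can report what went wrong. The following is only a minimal sketch: the open_url parameter and the failing_opener stand-in are hypothetical names introduced here so the example runs without a network connection, and the error text is simulated rather than taken from a real run.

```python
def page_view(url, open_url):
    """Return the page source, or None after reporting the real error."""
    try:
        # open_url is any callable that returns an object with read(),
        # for example the open method of a mechanize Browser object
        return open_url(url).read()
    except Exception as exc:
        # show the exception type and message instead of hiding them
        print("Error in browsing: %s: %s" % (type(exc).__name__, exc))
        return None


def failing_opener(url):
    # hypothetical stand-in for browser.open that always fails
    raise IOError("simulated failure for %s" % url)


page_view("https://www.google.com", failing_opener)
```

With mechanize installed, passing mechanize.Browser().open as open_url should make the same handler print the library's own error message.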

As we can see in the error, there is something about the robots.txt file. Do you know what a robots.txt file is? Using this file, a website can tell search engines such as Google and Bing which pages to crawl and which not to crawl. Hence, if you have a website and you don't want Google to crawl a particular webpage (perhaps one meant for internal use), you can specify that in the robots.txt file.
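To make the idea concrete, here is a small sketch of how a crawler might check a robots.txt file before fetching a page. It uses the standard library's urllib.robotparser module (named robotparser in Python 2) rather than mechanize. The robots.txt content, the crawler name MyCrawler, and the example.com URLs are made up for illustration.

```python
from urllib import robotparser  # in Python 2 the module is named robotparser

# Hypothetical robots.txt content, made up for this example:
# every crawler is asked to stay out of /internal/
ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "MyCrawler", "http://example.com/index.html"))
# prints True
print(is_allowed(ROBOTS_TXT, "MyCrawler", "http://example.com/internal/report.html"))
# prints False
```

A well-behaved client performs a check like this before each request and skips any URL for which the answer is False.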
Now, back to the problem. The error above is raised because the website's robots.txt file disallows our browser from visiting its pages, and mechanize obeys that file by default. So, what should we do? We instruct our mechanize browser object to skip the robots.txt check. To do that, simply uncomment (or add) the following line in mech1.py, right after the browser object is created:
    browser.set_handle_robots(False)
Now, if you visit Google.com, the script prints the page's HTML source code instead of the error.
