23. Implementing Web Scraping

Let's try to extract some data from the e-commerce giant, Amazon. We will search for "Protein Bars" and related products, and then scrape data from the search results that we get.

Above we have a screenshot of the webpage with the search results. The first step is to identify the HTML tags that hold the data we want to scrape.
  • For Item Name: Right click on Product Name → Inspect element
  • For Item Price: Right click on Product Price → Inspect element
For the item price, the following HTML tag is used:
<span class="a-size-base a-color-price a-text-bold">
So, inside the span tag, we have to look for the class attribute with the value a-size-base a-color-price a-text-bold.
Similarly, for the item name, the following HTML tag is used:
<h2 class="a-size-medium s-inline s-access-title a-text-normal" ...>
    ITEM_NAME
    </h2>
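
Before pointing the script at the live site, it can help to try the two lookups on a small piece of markup. The snippet below is a minimal sketch in Python 3: the HTML string, the product names and the prices are made up for illustration, and only the tag and class names come from the tags identified above.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written markup that imitates the two tags identified
# above. Real search pages are far larger and may change over time.
SAMPLE_HTML = """
<div>
  <h2 class="a-size-medium s-inline s-access-title a-text-normal">Sample Bar A</h2>
  <span class="a-size-base a-color-price a-text-bold">100.00</span>
  <h2 class="a-size-medium s-inline s-access-title a-text-normal">Sample Bar B</h2>
  <span class="a-size-base a-color-price a-text-bold">250.00</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 1 told us which tag and class hold each value; step 2 looks them up.
names = soup.find_all("h2", {"class": "a-size-medium s-inline s-access-title a-text-normal"})
prices = soup.find_all("span", {"class": "a-size-base a-color-price a-text-bold"})

# get_text(strip=True) returns the tag's text without surrounding whitespace.
name_texts = [tag.get_text(strip=True) for tag in names]
price_texts = [tag.get_text(strip=True) for tag in prices]

print(name_texts)   # ['Sample Bar A', 'Sample Bar B']
print(price_texts)  # ['100.00', '250.00']
```

Note that the class string is matched as a whole, so it has to be written exactly as it appears in the page, with single spaces between the class names.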
Now, let's write the program/script for extracting the data.

Program/Script for Web Scraping

    #!/usr/bin/env python

    import requests
    from bs4 import BeautifulSoup

    # URL of the search page
    url = "http://www.amazon.in/s/ref=nb_sb_ss_i_4_8?url=search-alias%3Daps&field-keywords=protein+bars&sprefix=protein+%2Caps%2C718&crid=1SW4WFJE8O22T&rh=i%3Aaps%2Ck%3Aprotein+bars"

    r = requests.get(url)                            # fetch the search page using requests
    soup = BeautifulSoup(r.content, "html.parser")   # parse the content into a BeautifulSoup object 'soup'

    # Item Name
    # 'find_all' returns every tag that matches the criteria given in the parentheses
    i_name = soup.find_all("h2", {"class": "a-size-medium s-inline s-access-title a-text-normal"})

    # Item Price
    i_price = soup.find_all("span", {"class": "a-size-base a-color-price a-text-bold"})

    # Now print item name and price
    # 'zip' walks through the names and the prices in parallel
    for name, price in zip(i_name, i_price):
        print("Item Name: " + name.text)
        print("Item Price: " + price.text)
        print('-' * 70)
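
The search URL used in the script looks cryptic, but it is only a path followed by URL-encoded query parameters. As a rough sketch (Python 3, standard library only), the parameters can be decoded to see which part carries the search keywords. The keyword "energy bars" at the end is a made-up example, and whether the site accepts a URL with only that parameter is not something this sketch checks.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The same search URL as in the script above, split over several lines.
url = ("http://www.amazon.in/s/ref=nb_sb_ss_i_4_8?url=search-alias%3Daps"
       "&field-keywords=protein+bars&sprefix=protein+%2Caps%2C718"
       "&crid=1SW4WFJE8O22T&rh=i%3Aaps%2Ck%3Aprotein+bars")

parts = urlsplit(url)
params = parse_qs(parts.query)   # decodes %3D, %2C, '+' and so on

for key, values in params.items():
    print(key, "=", values[0])

# The search words live in the 'field-keywords' parameter.
print(params["field-keywords"])  # ['protein bars']

# Encoding works the other way round: spaces become '+'.
# "energy bars" is a hypothetical keyword used only for illustration.
encoded = urlencode({"field-keywords": "energy bars"})
print(encoded)                   # field-keywords=energy+bars
```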
Covering all the technicalities and features of the BeautifulSoup module in a single tutorial is impossible, so we recommend that you read the official documentation here.

Note: You might get confused here because the prices of some products are not displayed correctly. This is because the class name we have used for price extraction is different for some items (those that are on offer), so you need to change the class name for such items. Also, 'zip' pairs names and prices purely by position, so a single price that is not matched shifts the prices of all the items after it.
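
One possible way around the problem described in the note is to look up the name and the price inside each search result separately, and to try more than one class name for the price. The sketch below (Python 3) only illustrates the idea: the "result-item" container class and the "offer-price" class are invented for this example and are not the real class names used by the site, so you would have to find the actual ones with Inspect element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: "result-item" and "offer-price" are invented for
# this sketch and are NOT real class names from the site.
SAMPLE_HTML = """
<div class="result-item">
  <h2 class="a-size-medium s-inline s-access-title a-text-normal">Sample Bar A</h2>
  <span class="a-size-base a-color-price a-text-bold">100.00</span>
</div>
<div class="result-item">
  <h2 class="a-size-medium s-inline s-access-title a-text-normal">Sample Bar B</h2>
  <span class="offer-price">80.00</span>
</div>
<div class="result-item">
  <h2 class="a-size-medium s-inline s-access-title a-text-normal">Sample Bar C</h2>
</div>
"""

# Price classes to try, in order. The second one is an assumption.
PRICE_CLASSES = ["a-size-base a-color-price a-text-bold", "offer-price"]


def extract_items(html):
    """Return a list of (name, price) tuples; price is None if not found."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for container in soup.find_all("div", {"class": "result-item"}):
        name_tag = container.find("h2")
        price_tag = None
        for css_class in PRICE_CLASSES:
            price_tag = container.find("span", {"class": css_class})
            if price_tag is not None:
                break  # stop at the first class that matches
        items.append((
            name_tag.get_text(strip=True) if name_tag else None,
            price_tag.get_text(strip=True) if price_tag else None,
        ))
    return items


items = extract_items(SAMPLE_HTML)
for name, price in items:
    print(name, "->", price)
```

Because each price is searched for inside its own result, an item without a matching price gets None instead of taking the price of the next item.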


So now you know the basic approach to scraping data from a website. The BeautifulSoup module provides a lot of other functionality too, but by adapting the URL, the tags and the class names in the above script/program, you can scrape data from many websites.
Remember the 2 steps: identify the HTML tag, and then use the program to scrape.


