How to Scrape Product Pages using Python grequests and BeautifulSoup

Time:10-01

from bs4 import BeautifulSoup
import grequests
import pandas as pd
    
# STEP 1: Create List of URLs from main archive page
def get_urls():
    urls = []
    for x in range(1,3):
        urls.append(f'http://books.toscrape.com/catalogue/page-{x}.html')
        print(f'Getting page url: {x}', urls)
    return urls

# STEP 2: Async Load HTML Content from page range in step 1
def get_data(urls):
    reqs = [grequests.get(link) for link in urls]
    print('AsyncRequest object > reqs:', reqs)
    resp = grequests.map(reqs)
    print('Status Code > resp (info on page):', resp, '\n')
    return resp

# STEP 3: Extract title, price, product url and thumbnail url from each response in resp (one response per scraped archive page).
def parse(resp):
    productlist = []

    for r in resp:
        #print(r.request.url)
        sp = BeautifulSoup(r.text, 'lxml')
        items = sp.find_all('article', {'class': 'product_pod'})
        #print('Items:\n', items)

        for item in items:
            product = {
            'title' : item.find('h3').text.strip(),
            'price': item.find('p', {'class': 'price_color'}).text.strip(),
            'single_url': 'https://books.toscrape.com/catalogue/' + item.find('a').attrs['href'],
            'thumbnail': 'https://books.toscrape.com/' + item.find('img', {'class': 'thumbnail'}).attrs['src'],
            }
            productlist.append(product)
            print('Added: ', product)
            
    return productlist

urls = get_urls() # (Step 1)
resp = get_data(urls) # (Step 2)
df = pd.DataFrame(parse(resp)) # (Step 3)
df.to_csv('books.csv', index=False)

The script above works as expected: it asynchronously scrapes the main archive page or pages of https://books.toscrape.com/ using grequests and BeautifulSoup.

Within the archive page it extracts the following book information:

  • title
  • price
  • single product url
  • thumbnail url
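The two URL fields come from relative links on the archive page. As a minimal sketch, the standard library's urllib.parse.urljoin could resolve them against the page URL instead of concatenating strings. The href and src values below are illustrative assumptions about the markup, not values taken from the script's output.

```python
from urllib.parse import urljoin

# URL of the archive page the links were found on (as built by get_urls()).
page_url = 'http://books.toscrape.com/catalogue/page-1.html'

# Assumed examples of relative links as they might appear in the HTML.
book_href = 'a-light-in-the-attic_1000/index.html'
thumb_src = '../media/cache/example.jpg'  # hypothetical file name

# urljoin resolves relative paths (including '..') against the page URL.
single_url = urljoin(page_url, book_href)
thumbnail = urljoin(page_url, thumb_src)

print(single_url)
print(thumbnail)
```

Inside parse(), the response's own address (r.url) could serve as page_url, so the same code would work for any archive page.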

Issue

I need a way to extract further information, such as the UPC, from the single product pages and associate it back with the main list productlist.

Single Product Page Example: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
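One possible approach, sketched under assumptions: run a second grequests.map over the single_url values, parse each response into a dict of detail fields, and merge those dicts into productlist by position. grequests.map returns its results in request order, with None in place of a failed request, so positional merging is feasible. The helper name merge_details and the sample data below are hypothetical.

```python
def merge_details(productlist, details):
    """Merge detail dicts into product dicts by position.

    `details` must be aligned with `productlist`; an entry may be None
    when the detail page could not be fetched or parsed.
    """
    merged = []
    for product, detail in zip(productlist, details):
        row = dict(product)       # copy so the input is not mutated
        row.update(detail or {})  # add UPC etc. when available
        merged.append(row)
    return merged


# Hypothetical sample data standing in for real scrape results.
productlist = [
    {'title': 'Book A', 'single_url': 'https://example.com/a'},
    {'title': 'Book B', 'single_url': 'https://example.com/b'},
]
details = [{'upc': 'abc123'}, None]  # the second request failed

merged = merge_details(productlist, details)
print(merged)
```

In the script above, the detail responses could come from get_data([p['single_url'] for p in productlist]), with each non-None response parsed into a dict before merging.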

CodePudding user response:

The single product page holds the information you need (UPC, product type, number of reviews, etc.) in a table:

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"
}
r = requests.get("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", headers=headers)
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("article", class_="product_page")

header = [th.get_text(strip=True) for th in table.tr.select("th")][1:]
header.insert(0, 'S.No')

all_data = []
for row in table.select("tr:has(td)"):
    tds = [td.get_text(strip=True) for td in row.select("td")]
    all_data.append(tds)

df = pd.DataFrame(all_data, columns=header)
print(df)

output:
                      S.No
0         a897fe39b1053632
1                    Books
2                   £51.77
3                   £51.77
4                    £0.00
5  In stock (22 available)
6                        0
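The DataFrame above has a single unlabelled column of values. As a sketch of an alternative, each th can be paired with the td in the same row to get labelled fields, which are easier to attach to a product record. The HTML below is a hand-written stand-in assumed to mirror the product information table, and parse_product_table is a hypothetical helper name; html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the product information table (assumed structure).
html = """
<table>
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Number of reviews</th><td>0</td></tr>
</table>
"""


def parse_product_table(markup):
    """Return {header text: cell text} for rows that have both th and td."""
    soup = BeautifulSoup(markup, 'html.parser')
    info = {}
    for row in soup.find_all('tr'):
        th, td = row.find('th'), row.find('td')
        if th is not None and td is not None:
            info[th.get_text(strip=True)] = td.get_text(strip=True)
    return info


info = parse_product_table(html)
print(info['UPC'])
```

The same function could be given r.text from a real product page response, provided the page's table follows this th/td layout.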

CodePudding user response:

Clicking a book card on the archive page opens that book's own page at a new URL, and that page is where the data you want lives. Following those links is easy with Scrapy. Here is a working solution:

CODE:

from scrapy import Spider
from scrapy.http import Request


def product_info(response, value):
    return response.xpath('//th[text()="' + value + '"]/following-sibling::td/text()').extract_first()


class BooksSpider(Spider):
    name = 'book'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/catalogue/page-' + str(x) + '.html' for x in range(1,3)]

    def parse(self, response):
        books = response.xpath('//h3/a/@href').extract()
        for book in books:
            absolute_url = response.urljoin(book)
            yield Request(absolute_url, callback=self.parse_book)


    def parse_book(self, response):
        title = response.css('h1::text').extract_first()
        price = response.xpath('//*[@class="price_color"]/text()').extract_first()

        image_url = response.xpath('//img/@src').extract_first()
        image_url = image_url.replace('../..', 'http://books.toscrape.com/')

        rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
        rating = rating.replace('star-rating ', '')
       
        # product information data 
        upc = product_info(response, 'UPC')
        product_type =  product_info(response, 'Product Type')
        price_without_tax = product_info(response, 'Price (excl. tax)')
        price_with_tax = product_info(response, 'Price (incl. tax)')
        tax = product_info(response, 'Tax')
        availability = product_info(response, 'Availability')
        number_of_reviews = product_info(response, 'Number of reviews')

        yield {
            'title': title,
            'price': price,
            'image_url': image_url,
            'rating': rating,
            'upc': upc,
            'product_type': product_type,
            'price_without_tax': price_without_tax,
            'price_with_tax': price_with_tax,
            'tax': tax,
            'availability': availability,
            'number_of_reviews': number_of_reviews,
            'url': response.url
            }

Output:

{'title': 'Wall and Piece', 'price': '£44.18', 'image_url': 'http://books.toscrape.com//media/cache/df/34/df346322ddfdd3b4da0e34cad17f49dc.jpg', 'rating': 'Four', 'upc': 'ccd9ffa25efabdea', 'product_type': 'Books', 'price_without_tax': '£44.18', 'price_with_tax': '£44.18', 'tax': '£0.00', 'availability': 'In stock (18 available)', 'number_of_reviews': '0', 'url': 'https://books.toscrape.com/catalogue/wall-and-piece_971/index.html'}
2021-10-01 00:58:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/catalogue/worlds-elsewhere-journeys-around-shakespeares-globe_972/index.html>
{'title': 'Worlds Elsewhere: Journeys Around Shakespeare’s Globe', 'price': '£40.30', 'image_url': 'http://books.toscrape.com//media/cache/7b/d9/7bd93db091d736d0c6ff9d578e3ba3d7.jpg', 'rating': 'Five', 'upc': '4c28def39d850cdf', 'product_type': 'Books', 'price_without_tax': '£40.30', 'price_with_tax': '£40.30', 'tax': '£0.00', 'availability': 'In stock (18 available)', 
'number_of_reviews': '0', 'url': 'https://books.toscrape.com/catalogue/worlds-elsewhere-journeys-around-shakespeares-globe_972/index.html'}
2021-10-01 00:58:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/how-music-works_979/index.html> (referer: https://books.toscrape.com/catalogue/page-2.html)
2021-10-01 00:58:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/in-her-wake_980/index.html> (referer: https://books.toscrape.com/catalogue/page-2.html)
2021-10-01 00:58:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/catalogue/how-music-works_979/index.html>
{'title': 'How Music Works', 'price': '£37.32', 'image_url': 'http://books.toscrape.com//media/cache/1d/40/1d4087ff0a63f09fae9cd8433d21c2c4.jpg', 'rating': 'Two', 'upc': '327f68a59745c102', 'product_type': 'Books', 'price_without_tax': '£37.32', 'price_with_tax': '£37.32', 'tax': '£0.00', 'availability': 'In stock (19 available)', 'number_of_reviews': '0', 'url': 'https://books.toscrape.com/catalogue/how-music-works_979/index.html'}
2021-10-01 00:58:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/catalogue/in-her-wake_980/index.html>
{'title': 'In Her Wake', 'price': '£12.84', 'image_url': 'http://books.toscrape.com//media/cache/27/92/2792ef951651ff1eae40a410cac41e0f.jpg', 'rating': 'One', 'upc': '23356462d1320d61', 
'product_type': 'Books', 'price_without_tax': '£12.84', 'price_with_tax': '£12.84', 'tax': '£0.00', 'availability': 'In stock (19 available)', 'number_of_reviews': '0', 'url': 'https://books.toscrape.com/catalogue/in-her-wake_980/index.html'}
2021-10-01 00:58:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html> (referer: https://books.toscrape.com/catalogue/page-1.html)
2021-10-01 00:58:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html> (referer: https://books.toscrape.com/catalogue/page-1.html)
2021-10-01 00:58:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html>
{'title': "It's Only the Himalayas", 'price': '£45.17', 'image_url': 'http://books.toscrape.com//media/cache/6d/41/6d418a73cc7d4ecfd75ca11d854041db.jpg', 'rating': 'Two', 'upc': 'a22124811bfa8350', 'product_type': 'Books', 'price_without_tax': '£45.17', 'price_with_tax': '£45.17', 'tax': '£0.00', 'availability': 'In stock (19 available)', 'number_of_reviews': '0', 'url': 'https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html'}
2021-10-01 00:58:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html>
{'title': 'Libertarianism for Beginners', 'price': '£51.33', 'image_url': 'http://books.toscrape.com//media/cache/91/a4/91a46253e165d144ef5938f2d456b88f.jpg', 'rating': 'Two', 'upc': 'a18a4f574854aced', 'product_type': 'Books', 'price_without_tax': '£51.33', 'price_with_tax': '£51.33', 'tax': '£0.00', 'availability': 'In stock (19 available)', 'number_of_reviews': '0', 
'url': 'https://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html'}   
2021-10-01 00:58:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html> (referer: https://books.toscrape.com/catalogue/page-1.html)
2021-10-01 00:58:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/foolproof-preserving-a-guide-to-small-batch-jams-jellies-pickles-condiments-and-more-a-foolproof-guide-to-making-small-batch-jams-jellies-pickles-condiments-and-more_978/index.html> (referer: https://books.toscrape.com/catalogue/page-2.html)
2021-10-01 00:58:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/olio_984/index.html> (referer: https://books.toscrape.com/catalogue/page-1.html)  
2021-10-01 00:58:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/aladdin-and-his-wonderful-lamp_973/index.html> (referer: https://books.toscrape.com/catalogue/page-2.html)
2021-10-01 00:58:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/black-dust_976/index.html> (referer: https://books.toscrape.com/catalogue/page-2.html)
2021-10-01 00:58:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/chase-me-paris-nights-2_977/index.html> (referer: https://books.toscrape.com/catalogue/page-2.html)
2021-10-01 00:58:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/birdsong-a-story-in-pictures_975/index.html> (referer: https://books.toscrape.com/catalogue/page-2.html)
2021-10-01 00:58:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/americas-cradle-of-quarterbacks-western-pennsylvanias-football-factory-from-johnny-unitas-to-joe-montana_974/index.html> (referer: https://books.toscrape.com/catalogue/page-2.html)
2021-10-01 00:58:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html>
{'title': 'Mesaerion: The Best Science Fiction Stories 1800-1849', 'price': '£37.59', 'image_url': 'http://books.toscrape.com//media/cache/e8/1f/e81f850db9b9622c65619c9f15748de7.jpg', 'rating': 'One', 'upc': 'e30f54cea9b38190', 'product_type': 'Books', 'price_without_tax': '£37.59', 'price_with_tax': '£37.59', 'tax': '£0.00', 'availability': 'In stock (19 available)', 'number_of_reviews': '0', 'url': 'https://books.toscrape.com/catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html'}
2021-10-01 00:58:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/catalogue/foolproof-preserving-a-guide-to-small-batch-jams-jellies-pickles-condiments-and-more-a-foolproof-guide-to-making-small-batch-jams-jellies-pickles-condiments-and-more_978/index.html>
{'title': 'Foolproof Preserving: A Guide to Small Batch Jams, Jellies, Pickles, Condiments, and More: A Foolproof Guide to Making Small Batch Jams, Jellies, Pickles, Condiments, and More', 'price': '£30.52', 'image_url': 'http://books.toscrape.com//media/cache/9f/58/9f58d3ff6d58589eaf325b1a33c303a0.jpg', 'rating': 'Three', 'upc': '5674a18a29a43ced', 'product_type': 'Books', 'price_without_tax': '£30.52', 'price_with_tax': '£30.52', 'tax': '£0.00', 'availability': 'In stock (19 available)', 'number_of_reviews': '0', 'url': 'https://books.toscrape.com/catalogue/foolproof-preserving-a-guide-to-small-batch-jams-jellies-pickles-condiments-and-more-a-foolproof-guide-to-making-small-batch-jams-jellies-pickles-condiments-and-more_978/index.html'}
2021-10-01 00:58:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/catalogue/olio_984/index.html>
{'title': 'Olio', 'price': '£23.88', 'image_url': 'http://books.toscrape.com//media/cache/b1/0e/b10eabab1e1c811a6d47969904fd5755.jpg', 'rating': 'One', 'upc': 'feb7cc7701ecf901', 'product_type': 'Books', 'price_without_tax': '£23.88', 'price_with_tax': '£23.88', 'tax': '£0.00', 
'availability': 'In stock (19 available)', 'number_of_reviews': '0', 'url': 'https://books.toscrape.com/catalogue/olio_984/index.html'}
2021-10-01 00:58:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/catalogue/aladdin-and-his-wonderful-lamp_973/index.html>
{'title': 'Aladdin and His Wonderful Lamp', 'price': '£53.13', 'image_url': 'http://books.toscrape.com//media/cache/a8/3c/a83c460fab82f35a37c0846729485547.jpg', 'rating': 'Three', 'upc': '904208d6aa64b655', 'product_type': 'Books', 'price_without_tax': '£53.13', 'price_with_tax': '£53.13', 'tax': '£0.00', 'availability': 'In stock (19 available)', 'number_of_reviews': '0', 'url': 'https://books.toscrape.com/catalogue/aladdin-and-his-wonderful-lamp_973/index.html'}
2021-10-01 00:58:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/catalogue/black-dust_976/index.html>
{'title': 'Black Dust', 'price': '£34.53', 'image_url': 'http://books.toscrape.com//media/cache/a4/0a/a40af95beab828af1a4757ad1ee17da3.jpg', 'rating': 'Five', 'upc': '00bfed9e18bb36f3', 
'product_type': 'Books', 'price_without_tax': '£34.53', 'price_with_tax': '£34.53', 'tax': '£0.00', 'availability': 'In stock (19 available)', 'number_of_reviews': '0', 'url': 'https://books.toscrape.com/catalogue/black-dust_976/index.html'}
2021-10-01 00:58:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/catalogue/chase-me-paris-nights-2_977/index.html>
{'title': 'Chase Me (Paris Nights #2)', 'price': '£25.27', 'image_url': 'http://books.toscrape.com//media/cache/6c/84/6c84fcf7a53b02b6e763de7272934842.jpg', 'rating': 'Five', 'upc': 'c2e46a2ee3b4a322', 'product_type': 'Books', 'price_without_tax': '£25.27', 'price_with_tax': '£25.27', 'tax': '£0.00', 'availability': 'In stock (19 available)', 'number_of_reviews': '0', 'url': 'https://books.toscrape.com/catalogue/chase-me-paris-nights-2_977/index.html'}
2021-10-01 00:58:03 [scrapy.core.scraper] DEBUG: Scraped from <200 https://books.toscrape.com/catalogue/birdsong-a-story-in-pictures_975/index.html>

... so on
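The scraped items hold prices and ratings as strings. If numeric values are wanted, for example before loading the items into a DataFrame, a small post-processing step could convert them. This is only a sketch: it assumes the '£44.18' price format and the rating words 'One' to 'Five' seen in the output above, and clean_item is a hypothetical helper name.

```python
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def clean_item(item):
    """Return a copy of item with a float price and an integer rating."""
    cleaned = dict(item)
    cleaned['price'] = float(item['price'].lstrip('£'))
    cleaned['rating'] = RATING_WORDS.get(item['rating'])  # None if unknown
    return cleaned


item = {'title': 'Wall and Piece', 'price': '£44.18', 'rating': 'Four'}
cleaned = clean_item(item)
print(cleaned)
```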
