Scrapy - scraping overview-page and detail-page?

Time:11-07

I am trying to scrape the following site using Scrapy -

It worked fine when I only scraped the information from the overview page (name, price, link); it returned 1535 rows.

import scrapy

class WhiskeySpider(scrapy.Spider):
  name = "whisky"
  allowed_domains = ["whiskyshop.com"]
  start_urls = ["https://www.whiskyshop.com/scotch-whisky"]

  def parse(self, response):
    for products in response.css("div.product-item-info"):
      tmpPrice = products.css("span.price::text").get()
      if tmpPrice == None:
        tmpPrice = "Sold Out"
      else:
        tmpPrice = tmpPrice.replace("\u00a3",""),
      yield {
        "name": products.css("a.product-item-link::text").get(),
        "price": tmpPrice,
        "link": products.css("a.product-item-link").attrib["href"],
      }
    
    nextPage = response.css("a.action.next").attrib["href"]
    if nextPage != None:
      nextPage = response.urljoin(nextPage)
      yield response.follow(nextPage, callback=self.parse)

Now I also want to scrape some additional detail information for every item (litre, percent, area), and I would like each row to contain the 3 main fields together with the 3 detail fields.

I tried it with the following code, but it doesn't work well:

import scrapy

class WhiskeySpider(scrapy.Spider):
  name = "whiskyDetail"
  allowed_domains = ["whiskyshop.com"]
  start_urls = ["https://www.whiskyshop.com/scotch-whisky"]

  def parse(self, response):
    for products in response.css("div.product-item-info"):
      tmpPrice = products.css("span.price::text").get()      
      tmpLink = products.css("a.product-item-link").attrib["href"]
      tmpLink = response.urljoin(tmpLink)
      
      if tmpPrice == None:
        tmpPrice = "Sold Out"
      else:
        tmpPrice = tmpPrice.replace("\u00a3",""),
      yield {
        "name": products.css("a.product-item-link::text").get(),
        "price": tmpPrice,
        "link": tmpLink,
      }

      yield scrapy.Request(url=tmpLink, callback=self.parseDetails)                    
    
    nextPage = response.css("a.action.next").attrib["href"]
    if nextPage != None:
      nextPage = response.urljoin(nextPage)
      yield response.follow(nextPage, callback=self.parse)
  
  def parseDetails(self, response):
    tmpDetails = response.css("p.product-info-size-abv span::text").getall()
    yield {
      "litre": tmpDetails[0],
      "percent": tmpDetails[1],
      "area": tmpDetails[2]
    }

The code seems to run in an endless loop. In the log I see that it sometimes retries with a 429 Unknown Status:

2021-11-05 22:24:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.whiskyshop.com/benrinnes-10-year-old-that-boutique-y-whisky-company>
{'litre': '50cl', 'percent': '56.8% abv', 'area': 'Speyside'}
2021-11-05 22:24:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.whiskyshop.com/bruichladdich-28-year-old-batch-19-that-boutique-y-whisky-company> (referer: https://www.whiskyshop.com/scotch-whisky)
2021-11-05 22:24:17 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.whiskyshop.com/bruichladdich-28-year-old-batch-19-that-boutique-y-whisky-company>
{'litre': '50cl', 'percent': '48.5%% abv', 'area': 'Islay'}
2021-11-05 22:24:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.whiskyshop.com/westport-21-year-old-batch-1-that-boutique-y-whisky-company> (failed 1 times): 429 Unknown Status
2021-11-05 22:24:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.whiskyshop.com/strathmill-22-year-old-batch-7-that-boutique-y-whisky-company> (referer: https://www.whiskyshop.com/scotch-whisky)
2021-11-05 22:24:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.whiskyshop.com/strathmill-22-year-old-batch-7-that-boutique-y-whisky-company>
{'litre': '50cl', 'percent': '49.6% abv', 'area': 'Speyside'}
2021-11-05 22:24:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.whiskyshop.com/benromach-40-year-old> (failed 1 times): 429 Unknown Status
2021-11-05 22:24:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.whiskyshop.com/monkey-shoulder-fever-tree-gift-pack> (failed 1 times): 429 Unknown Status
2021-11-05 22:24:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.whiskyshop.com/catalog/product/view/id/21965/s/nc-nean-organic-single-malt/category/246/> (failed 1 times): 429 Unknown Status

In the JSON output, the main and detail information are not in one row:

{"name": "Port Charlotte Islay Barley 2013 ", "price": ["65.00"], "link": "https://www.whiskyshop.com/port-charlotte-islay-barley-2013"},
{"name": "Bruichladdich Bere Barley 2011 ", "price": ["70.00"], "link": "https://www.whiskyshop.com/bruichladdich-bere-barley-2011"},
{"name": "Glen Grant 1950 68 Year Old ", "price": ["4,999.99"], "link": "https://www.whiskyshop.com/glen-grant-1950-68-year-old"},
{"name": "Linkwood 1981 Private Collection ", "price": ["1,250.00"], "link": "https://www.whiskyshop.com/linkwood-1981-private-collection"},
{"name": "Linkwood 1980 40 Year Old Private Collection ", "price": ["999.99"], "link": "https://www.whiskyshop.com/linkwood-1980-40-year-old-private-collection"},
{"name": "Dimensions Linkwood 2009 12 Year Old", "price": ["89.99"], "link": "https://www.whiskyshop.com/dimensions-linkwood-2009-12-year-old"},
{"name": "Dimensions Highland Park 2007 13 Year Old", "price": ["114.00"], "link": "https://www.whiskyshop.com/dimensions-highland-park-2007-13-year-old"},
{"litre": "70cl", "percent": "54.9% abv", "area": "Highland"},
{"litre": "70cl", "percent": "54.7% abv", "area": "Islay"},
{"litre": "70cl", "percent": "46% abv", "area": "Highland"},
{"litre": "70cl", "percent": "52.1% abv", "area": "Islay"},
{"litre": "70cl", "percent": "43% abv", "area": "Speyside"},
{"litre": "70cl", "percent": "43% abv", "area": "Highland"},

What am I doing wrong, and how can I get the main and detail information in one row (and without the retry errors)?

CodePudding user response:

You have to set a download delay; otherwise you get blocked with a 429 status code, retries, connections lost by the other side, and so on. My settings.py file:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 4
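
If a fixed delay alone is not enough, Scrapy's AutoThrottle extension and a lower per-domain concurrency can be added on top of it. The following is only a sketch: the setting names are standard Scrapy settings, but the values are illustrative assumptions and have not been tuned against this site.

```python
# settings.py (illustrative values, not tuned for this site)
DOWNLOAD_DELAY = 4                   # base delay between requests, as above
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # only one request at a time to the shop
AUTOTHROTTLE_ENABLED = True          # adapt the delay to the server's response times
AUTOTHROTTLE_START_DELAY = 4         # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60          # upper bound for the delay when the server is slow
RETRY_TIMES = 5                      # retries per failed request (429 is retried by default)
```

AutoThrottle never lowers the delay below DOWNLOAD_DELAY, so the 4 seconds stay the minimum.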

An alternative, and the easiest way to grab data from both the overview page and the detail pages, is to use CrawlSpider. Every field, including name and price, is read from the detail page, so each item comes out as a single row. I've put the pagination in start_urls, and you can increase or decrease the range of page numbers as needed. Here each page contains 100 items.

Code:

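from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ShopSpider(CrawlSpider):
    name = 'shop'
    # Overview pages 1-4; widen the range to crawl more pages.
    start_urls = ['https://www.whiskyshop.com/scotch-whisky?p=' + str(x) for x in range(1, 5)]

    # NOTE: the attribute predicates of the XPaths were lost from the original post.
    # The class names below are assumptions taken from the CSS selectors in the question.
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="product-item-link"]'), callback='parse_item', follow=False),)

    # Not named "parse": the Scrapy docs advise against using parse as a CrawlSpider rule callback.
    def parse_item(self, response):
        yield {
            # Assumption: the original h1 predicate is unknown, so the first h1 on the page is used.
            'Name': response.xpath('normalize-space(//h1)').get(),
            'Price': response.xpath('(//span[@class="price"])[1]/text()').get(),
            'Litre': response.xpath('(//*[@class="product-info-size-abv"]/span)[1]/text()').get(),
            'Percent': response.xpath('(//*[@class="product-info-size-abv"]/span)[2]/text()').get(),
            'Area': response.xpath('(//*[@class="product-info-size-abv"]/span)[3]/text()').get(),
            'LINK': response.url}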