I am trying to scrape the following site using Scrapy.
It worked fine while I only scraped the information from the overview page (name, price, link); it returned 1535 rows.
import scrapy

class WhiskeySpider(scrapy.Spider):
    name = "whisky"
    allowed_domains = ["whiskyshop.com"]
    start_urls = ["https://www.whiskyshop.com/scotch-whisky"]

    def parse(self, response):
        for products in response.css("div.product-item-info"):
            tmpPrice = products.css("span.price::text").get()
            if tmpPrice is None:
                tmpPrice = "Sold Out"
            else:
                tmpPrice = tmpPrice.replace("\u00a3",""),
            yield {
                "name": products.css("a.product-item-link::text").get(),
                "price": tmpPrice,
                "link": products.css("a.product-item-link").attrib["href"],
            }
        nextPage = response.css("a.action.next").attrib["href"]
        if nextPage is not None:
            nextPage = response.urljoin(nextPage)
            yield response.follow(nextPage, callback=self.parse)
Now I also want to scrape some additional detail information for every item (litre, percent, area), and I would like to have one row combining the 3 main infos and the 3 detail infos.
I tried it with the following code, but it doesn't work well:
import scrapy

class WhiskeySpider(scrapy.Spider):
    name = "whiskyDetail"
    allowed_domains = ["whiskyshop.com"]
    start_urls = ["https://www.whiskyshop.com/scotch-whisky"]

    def parse(self, response):
        for products in response.css("div.product-item-info"):
            tmpPrice = products.css("span.price::text").get()
            tmpLink = products.css("a.product-item-link").attrib["href"]
            tmpLink = response.urljoin(tmpLink)
            if tmpPrice is None:
                tmpPrice = "Sold Out"
            else:
                tmpPrice = tmpPrice.replace("\u00a3",""),
            yield {
                "name": products.css("a.product-item-link::text").get(),
                "price": tmpPrice,
                "link": tmpLink,
            }
            yield scrapy.Request(url=tmpLink, callback=self.parseDetails)
        nextPage = response.css("a.action.next").attrib["href"]
        if nextPage is not None:
            nextPage = response.urljoin(nextPage)
            yield response.follow(nextPage, callback=self.parse)

    def parseDetails(self, response):
        tmpDetails = response.css("p.product-info-size-abv span::text").getall()
        yield {
            "litre": tmpDetails[0],
            "percent": tmpDetails[1],
            "area": tmpDetails[2]
        }
The code seems to run in an endless loop. In the log I can see that it keeps retrying some requests with status 429:
2021-11-05 22:24:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.whiskyshop.com/benrinnes-10-year-old-that-boutique-y-whisky-company>
{'litre': '50cl', 'percent': '56.8% abv', 'area': 'Speyside'}
2021-11-05 22:24:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.whiskyshop.com/bruichladdich-28-year-old-batch-19-that-boutique-y-whisky-company> (referer: https://www.whiskyshop.com/scotch-whisky)
2021-11-05 22:24:17 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.whiskyshop.com/bruichladdich-28-year-old-batch-19-that-boutique-y-whisky-company>
{'litre': '50cl', 'percent': '48.5%% abv', 'area': 'Islay'}
2021-11-05 22:24:17 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.whiskyshop.com/westport-21-year-old-batch-1-that-boutique-y-whisky-company> (failed 1 times): 429 Unknown Status
2021-11-05 22:24:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.whiskyshop.com/strathmill-22-year-old-batch-7-that-boutique-y-whisky-company> (referer: https://www.whiskyshop.com/scotch-whisky)
2021-11-05 22:24:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.whiskyshop.com/strathmill-22-year-old-batch-7-that-boutique-y-whisky-company>
{'litre': '50cl', 'percent': '49.6% abv', 'area': 'Speyside'}
2021-11-05 22:24:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.whiskyshop.com/benromach-40-year-old> (failed 1 times): 429 Unknown Status
2021-11-05 22:24:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.whiskyshop.com/monkey-shoulder-fever-tree-gift-pack> (failed 1 times): 429 Unknown Status
2021-11-05 22:24:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.whiskyshop.com/catalog/product/view/id/21965/s/nc-nean-organic-single-malt/category/246/> (failed 1 times): 429 Unknown Status
In the JSON output, the main and detail information do not end up in one row:
{"name": "Port Charlotte Islay Barley 2013 ", "price": ["65.00"], "link": "https://www.whiskyshop.com/port-charlotte-islay-barley-2013"},
{"name": "Bruichladdich Bere Barley 2011 ", "price": ["70.00"], "link": "https://www.whiskyshop.com/bruichladdich-bere-barley-2011"},
{"name": "Glen Grant 1950 68 Year Old ", "price": ["4,999.99"], "link": "https://www.whiskyshop.com/glen-grant-1950-68-year-old"},
{"name": "Linkwood 1981 Private Collection ", "price": ["1,250.00"], "link": "https://www.whiskyshop.com/linkwood-1981-private-collection"},
{"name": "Linkwood 1980 40 Year Old Private Collection ", "price": ["999.99"], "link": "https://www.whiskyshop.com/linkwood-1980-40-year-old-private-collection"},
{"name": "Dimensions Linkwood 2009 12 Year Old", "price": ["89.99"], "link": "https://www.whiskyshop.com/dimensions-linkwood-2009-12-year-old"},
{"name": "Dimensions Highland Park 2007 13 Year Old", "price": ["114.00"], "link": "https://www.whiskyshop.com/dimensions-highland-park-2007-13-year-old"},
{"litre": "70cl", "percent": "54.9% abv", "area": "Highland"},
{"litre": "70cl", "percent": "54.7% abv", "area": "Islay"},
{"litre": "70cl", "percent": "46% abv", "area": "Highland"},
{"litre": "70cl", "percent": "52.1% abv", "area": "Islay"},
{"litre": "70cl", "percent": "43% abv", "area": "Speyside"},
{"litre": "70cl", "percent": "43% abv", "area": "Highland"},
What am I doing wrong, and how can I get the main and detail information into one row (and without the retry errors)?
CodePudding user response:
You have to set a download delay; otherwise you get blocked (429 status code, retries, connection dropped by the other side, and so on). My settings.py file:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 4
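If a fixed delay is too slow, Scrapy's built-in AutoThrottle extension can adjust the delay dynamically based on server load instead. A possible settings.py excerpt (the setting names are Scrapy's; the exact values are just a starting point, not tuned for this site):

```python
# settings.py (excerpt): let AutoThrottle adapt the delay instead of a fixed value
AUTOTHROTTLE_ENABLED = True          # turn the extension on
AUTOTHROTTLE_START_DELAY = 1         # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10          # ceiling for the delay when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote server
```

With this enabled, Scrapy raises the delay when responses slow down (or start failing) and lowers it again when the server keeps up.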
An alternative, and the easiest solution for grabbing data from both the overview page and the detail page, is to use a CrawlSpider.
I've also put the pagination into start_urls; you can increase or decrease the page-number range as needed. Each page contains 100 items.
Code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ShopSpider(CrawlSpider):
    name = 'shop'
    start_urls = ['https://www.whiskyshop.com/scotch-whisky?p=' + str(x) for x in range(1, 5)]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="product-item-link"]'),
             callback='parse', follow=False),
    )

    def parse(self, response):
        yield {
            'Name': response.xpath('//h1[@class="page-title"]/text()').get().strip(),
            'Price': response.xpath('(//span[@class="price"])[1]/text()').get(),
            'Litre': response.xpath('(//*[@class="product-info-size-abv"]/span)[1]/text()').get(),
            'Percent': response.xpath('(//*[@class="product-info-size-abv"]/span)[2]/text()').get(),
            'Area': response.xpath('(//*[@class="product-info-size-abv"]/span)[3]/text()').get(),
            'LINK': response.url,
        }