I'm trying to scrape an Amazon product page but scrapy is giving me inconsistent results (sometimes it returns what I want and sometimes it returns None). I have no idea as to why the same code give different results. I created a loop that yield the same request 10 times and it was giving me different results. Can anyone help me?
import scrapy
from scrapy import Request
class AmzsingleSpider(scrapy.Spider):
name = 'amzsingle'
def start_requests(self):
for i in range(10):
yield Request(url="https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929", callback=self.parse, dont_filter=True)
def parse(self, response):
yield {
'title': response.xpath('//span[@id="productTitle"]/text()').get()
}
and this is the log that I get in the terminal. This attempt gave 9 None and 1 found (some other time it was returning 7 None and 3 found):
2021-11-27 22:08:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/robots.txt> (referer: None)
2021-11-27 22:08:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': '\n¡Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n'}
2021-11-27 22:08:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:45 [scrapy.core.engine] INFO: Closing spider (finished)
2021-11-27 22:08:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4664,
'downloader/request_count': 11,
'downloader/request_method_count/GET': 11,
'downloader/response_bytes': 1508328,
'downloader/response_count': 11,
'downloader/response_status_count/200': 11,
'elapsed_time_seconds': 20.82323,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 11, 27, 15, 8, 45, 324091),
'httpcompression/response_bytes': 7323320,
'httpcompression/response_count': 11,
'item_scraped_count': 10,
'log_count/DEBUG': 22,
'log_count/INFO': 11,
'memusage/max': 53161984,
'memusage/startup': 53161984,
'proxies/good': 1,
'proxies/mean_backoff': 0.0,
'proxies/reanimated': 0,
'proxies/unchecked': 0,
'response_received_count': 11,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 10,
'scheduler/dequeued/memory': 10,
'scheduler/enqueued': 10,
'scheduler/enqueued/memory': 10,
'start_time': datetime.datetime(2021, 11, 27, 15, 8, 24, 500861)}
2021-11-27 22:08:45 [scrapy.core.engine] INFO: Spider closed (finished)
CodePudding user response:
This is what I get for using the CSS selector: response.css('#productTitle ::text').get()
as shown by Ikram Khan Niazi. I still get inconsistent scrape result with this CSS selector as seen in the log below:
2021-11-28 16:20:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-28 16:20:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-28 16:20:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-28 16:20:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-28 16:20:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-28 16:20:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': '\n¡Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n'}
2021-11-28 16:20:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-28 16:20:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-28 16:20:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:59 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-28 16:21:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:21:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': '\n¡Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n'}
2021-11-28 16:21:02 [scrapy.core.engine] INFO: Closing spider (finished)
2021-11-28 16:21:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4664,
'downloader/request_count': 11,
'downloader/request_method_count/GET': 11,
'downloader/response_bytes': 417628,
'downloader/response_count': 11,
'downloader/response_status_count/200': 11,
'elapsed_time_seconds': 22.976533,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 11, 28, 9, 21, 2, 294028),
'httpcompression/response_bytes': 1890283,
'httpcompression/response_count': 11,
'item_scraped_count': 10,
'log_count/DEBUG': 22,
'log_count/INFO': 11,
'memusage/max': 53284864,
'memusage/startup': 53284864,
'proxies/good': 1,
'proxies/mean_backoff': 0.0,
'proxies/reanimated': 0,
'proxies/unchecked': 0,
'response_received_count': 11,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 10,
'scheduler/dequeued/memory': 10,
'scheduler/enqueued': 10,
'scheduler/enqueued/memory': 10,
'start_time': datetime.datetime(2021, 11, 28, 9, 20, 39, 317495)}
2021-11-28 16:21:02 [scrapy.core.engine] INFO: Spider closed (finished)
CodePudding user response:
You can use a CSS selector.
import scrapy
from scrapy import Request
class AmzsingleSpider(scrapy.Spider):
name = 'amzsingle-parse'
def start_requests(self):
for i in range(10):
yield Request(url="https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929", callback=self.parse, dont_filter=True)
def parse(self, response):
yield {
'title': response.css('#productTitle ::text').get()
}
Output
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}