Scrapy Returns Inconsistent Results

Time:11-28

I'm trying to scrape an Amazon product page, but Scrapy gives me inconsistent results: sometimes it returns the title I want and sometimes it returns None. I have no idea why the same code gives different results. To test this, I wrote a loop that yields the same request 10 times, and the results differed from request to request. Can anyone help me?

import scrapy
from scrapy import Request

class AmzsingleSpider(scrapy.Spider):
    name = 'amzsingle'

    def start_requests(self):
        for i in range(10):
            yield Request(url="https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929", callback=self.parse, dont_filter=True)

    def parse(self, response):
        yield {
            'title': response.xpath('//span[@id="productTitle"]/text()').get()
        }

This is the log I get in the terminal. This run produced 9 None results and 1 title (another run returned 7 None and 3 titles):

2021-11-27 22:08:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/robots.txt> (referer: None)
2021-11-27 22:08:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': '\n¡Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n'}
2021-11-27 22:08:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:45 [scrapy.core.engine] INFO: Closing spider (finished)
2021-11-27 22:08:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4664,
 'downloader/request_count': 11,
 'downloader/request_method_count/GET': 11,
 'downloader/response_bytes': 1508328,
 'downloader/response_count': 11,
 'downloader/response_status_count/200': 11,
 'elapsed_time_seconds': 20.82323,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 11, 27, 15, 8, 45, 324091),
 'httpcompression/response_bytes': 7323320,
 'httpcompression/response_count': 11,
 'item_scraped_count': 10,
 'log_count/DEBUG': 22,
 'log_count/INFO': 11,
 'memusage/max': 53161984,
 'memusage/startup': 53161984,
 'proxies/good': 1,
 'proxies/mean_backoff': 0.0,
 'proxies/reanimated': 0,
 'proxies/unchecked': 0,
 'response_received_count': 11,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 10,
 'scheduler/dequeued/memory': 10,
 'scheduler/enqueued': 10,
 'scheduler/enqueued/memory': 10,
 'start_time': datetime.datetime(2021, 11, 27, 15, 8, 24, 500861)}
2021-11-27 22:08:45 [scrapy.core.engine] INFO: Spider closed (finished)

CodePudding user response:

This is what I get with the CSS selector response.css('#productTitle ::text').get(), as suggested by Ikram Khan Niazi. The results are still inconsistent with this selector, as the log below shows:

2021-11-28 16:20:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-28 16:20:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-28 16:20:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-28 16:20:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-28 16:20:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-28 16:20:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': '\n¡Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n'}
2021-11-28 16:20:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-28 16:20:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-28 16:20:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:20:59 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-28 16:21:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-28 16:21:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': '\n¡Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n'}
2021-11-28 16:21:02 [scrapy.core.engine] INFO: Closing spider (finished)
2021-11-28 16:21:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4664,
 'downloader/request_count': 11,
 'downloader/request_method_count/GET': 11,
 'downloader/response_bytes': 417628,
 'downloader/response_count': 11,
 'downloader/response_status_count/200': 11,
 'elapsed_time_seconds': 22.976533,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 11, 28, 9, 21, 2, 294028),
 'httpcompression/response_bytes': 1890283,
 'httpcompression/response_count': 11,
 'item_scraped_count': 10,
 'log_count/DEBUG': 22,
 'log_count/INFO': 11,
 'memusage/max': 53284864,
 'memusage/startup': 53284864,
 'proxies/good': 1,
 'proxies/mean_backoff': 0.0,
 'proxies/reanimated': 0,
 'proxies/unchecked': 0,
 'response_received_count': 11,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 10,
 'scheduler/dequeued/memory': 10,
 'scheduler/enqueued': 10,
 'scheduler/enqueued/memory': 10,
 'start_time': datetime.datetime(2021, 11, 28, 9, 20, 39, 317495)}
2021-11-28 16:21:02 [scrapy.core.engine] INFO: Spider closed (finished)
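
The thread does not establish why identical requests produce different items. Two details in the logs hint that the responses themselves may differ: every request returns 200, yet 'downloader/response_bytes' is 1508328 in the first run and 417628 in the second for the same 11 requests, and the 'proxies/good' stat suggests a proxy middleware is active. As a minimal sketch for narrowing this down (not a confirmed diagnosis), the helper below reports the size of a body and whether it contains the title element at all. The function name describe_response and the marker string are assumptions made for illustration.

```python
def describe_response(body: str, marker: str = 'id="productTitle"') -> dict:
    """Summarise a response body so that differing pages stand out in the log.

    `marker` is a plain substring check, not a selector; it only tells us
    whether the element the spider looks for is present in the raw HTML.
    """
    return {
        "length": len(body),          # pages of very different size are likely different pages
        "has_marker": marker in body,  # False means no selector could have found the title
    }


if __name__ == "__main__":
    print(describe_response('<span id="productTitle">x</span>'))
    print(describe_response("<html></html>"))
```

Inside the spider this could be called from parse as self.logger.info(describe_response(response.text)). If the None items line up with has_marker being False, the selector is not at fault and the server is returning a different page for those requests.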

CodePudding user response:

You can use a CSS selector instead of the XPath expression.

import scrapy
from scrapy import Request

class AmzsingleSpider(scrapy.Spider):
    name = 'amzsingle-parse'

    def start_requests(self):
        for i in range(10):
            yield Request(url="https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929", callback=self.parse, dont_filter=True)

    def parse(self, response):
        yield {
            'title': response.css('#productTitle ::text').get()
        }

Output

{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/¡Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
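
Both selectors return the title with surrounding newlines, and the question's runs also return None. A small helper, sketched here under the assumption that you want one clean string or None, can normalise both cases before the item is yielded. The name clean_title is hypothetical and not part of Scrapy.

```python
from typing import Optional


def clean_title(raw: Optional[str]) -> Optional[str]:
    """Collapse whitespace in a scraped title; pass None through unchanged."""
    if raw is None:
        return None
    # str.split() with no arguments splits on any whitespace run,
    # so leading/trailing newlines and repeated spaces are removed.
    cleaned = " ".join(raw.split())
    # Treat a whitespace-only result the same as a missing title.
    return cleaned or None


if __name__ == "__main__":
    print(clean_title("\n¡Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"))
    print(clean_title(None))
```

In parse this would wrap the selector call, for example 'title': clean_title(response.css('#productTitle ::text').get()).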