Why can't my scraper extract data from next page

Time:12-21

Hi there, I built a scraper using the Scrapy framework. It works perfectly on the first page but fails to get the same data from the next pages, even though I wrote code to crawl to the next page. What am I getting wrong in my code? My items.py file is working fine too.

Here's my code

import scrapy
from amazonscraper.items import AmazonscraperItem
from scrapy.loader import ItemLoader


class AmazonspiderSpider(scrapy.Spider):
    name = 'amazonspider'
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/s?i=fashion-womens-intl-ship&bbn=16225018011&rh=n:16225018011,n:1040660,n:1045024&pd_rd_r=2da30763-bfe6-4a38-b17a-77236fa718c5&pd_rd_w=JtaUW&pd_rd_wg=BtgRm&pf_rd_p=6a92dcea-e071-4bb9-866a-369bc067390d&pf_rd_r=86NBFKV4TA7CCSEVNBM7&qid=1671522114&rnid=1040660&ref=sr_pg_1']

    def parse(self, response):
        products = response.css('div.sg-col-4-of-12')
        for product in products:
            l = ItemLoader(item=AmazonscraperItem(), selector=product)

            l.add_css('name', 'a.a-link-normal span.a-size-base-plus')
            l.add_css('price', 'span.a-price span.a-offscreen')
            l.add_css('review', 'i.a-icon span.a-icon-alt')

            yield l.load_item()

        next_page = response.xpath('//*[@id="search"]/div[1]/div[1]/div/span[1]/div[1]/div[52]/div/div/span/a/@href').get()
        if next_page is not None:
            next_page_url = 'https://www.amazon.com' + next_page
            yield response.follow(next_page_url, callback=self.parse)
    

Here's my AmazonScraperItem

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose
from w3lib.html import remove_tags


class AmazonscraperItem(scrapy.Item):

    name = scrapy.Field(input_processor=MapCompose(remove_tags), output_processor=TakeFirst())
    price = scrapy.Field(input_processor=MapCompose(remove_tags), output_processor=TakeFirst())
    review = scrapy.Field(input_processor=MapCompose(remove_tags), output_processor=TakeFirst())
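To make the processor pipeline concrete: `MapCompose(remove_tags)` runs `remove_tags` over every extracted value, and `TakeFirst()` then keeps the first non-empty result. Here is a minimal stdlib-only sketch of that behavior (a simplified re-implementation for illustration, not the real `itemloaders` code):

```python
import re

def remove_tags_sketch(text):
    # Strip HTML tags, roughly what w3lib.html.remove_tags does
    return re.sub(r'<[^>]+>', '', text)

def map_compose(*funcs):
    # Like MapCompose: apply each function to every value in the list
    def processor(values):
        for f in funcs:
            values = [f(v) for v in values]
        return values
    return processor

def take_first(values):
    # Like TakeFirst: return the first non-empty value
    for v in values:
        if v is not None and v != '':
            return v

input_proc = map_compose(remove_tags_sketch)
cleaned = input_proc(['<span class="a-offscreen">$19.99</span>'])
print(take_first(cleaned))  # $19.99
```

This is why each yielded item ends up with a single clean string per field instead of a list of raw HTML fragments.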

    

CodePudding user response:

I have fixed the issue. There were two problems with the code. First, I updated the next-page selector to get the correct URL. Second, you don't need to prepend the base URL when sending the request, because you are using response.follow: response.follow automatically converts a relative URL into an absolute one. The code below works across multiple pages (all pagination).

import scrapy
from amazonscraper.items import AmazonscraperItem
from scrapy.loader import ItemLoader


class AmazonspiderSpider(scrapy.Spider):
    name = 'amazonspider'
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/s?i=fashion-womens-intl-ship&bbn=16225018011&rh=n:16225018011,n:1040660,n:1045024&pd_rd_r=2da30763-bfe6-4a38-b17a-77236fa718c5&pd_rd_w=JtaUW&pd_rd_wg=BtgRm&pf_rd_p=6a92dcea-e071-4bb9-866a-369bc067390d&pf_rd_r=86NBFKV4TA7CCSEVNBM7&qid=1671522114&rnid=1040660&ref=sr_pg_1']

    def parse(self, response):
        products = response.css('div.sg-col-4-of-12')
        for product in products:
            l = ItemLoader(item=AmazonscraperItem(), selector=product)

            l.add_css('name', 'a.a-link-normal span.a-size-base-plus')
            l.add_css('price', 'span.a-price span.a-offscreen')
            l.add_css('review', 'i.a-icon span.a-icon-alt')

            yield l.load_item()

        next_page = response.css('.s-pagination-next ::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
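The URL joining that response.follow performs is roughly what `urllib.parse.urljoin` does against the current response's URL, which is why manually concatenating `'https://www.amazon.com'` with the href is unnecessary (the URLs below are example values for illustration):

```python
from urllib.parse import urljoin

# Hypothetical current page and the relative href from the "Next" link
current_url = 'https://www.amazon.com/s?i=fashion-womens-intl-ship'
next_href = '/s?i=fashion-womens-intl-ship&page=2'

# response.follow resolves the relative href against the page URL,
# roughly like this:
print(urljoin(current_url, next_href))
# https://www.amazon.com/s?i=fashion-womens-intl-ship&page=2
```

An absolute href passed to response.follow is left unchanged, so it is safe regardless of which form the site returns.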