Interpreting callbacks and cb_kwargs with scrapy


I'm within reach of a personal milestone with Scrapy: properly understanding callback and cb_kwargs. I've read the documentation countless times, but I learn best from working code, practice and an explanation.

I have an example scraper whose aim is to grab each book's name and price, then follow into each book's page and extract a single piece of information. I'm also trying to understand how to properly get information from the next few pages, which I know depends on understanding how callbacks work.

When I run my script, it returns results only for the first page. How do I get the additional pages?

Here's my scraper:

class BooksItem(scrapy.Item):
    items = Field(output_processor = TakeFirst())
    price = Field(output_processor = TakeFirst())
    availability = Field(output_processor = TakeFirst())

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ['https://books.toscrape.com']

    def start_request(self):
        for url in self.start_url:
            yield scrapy.Request(
                url, 
                callback = self.parse)

    def parse(self, response):
        data = response.xpath('//div[@class = "col-sm-8 col-md-9"]')
        for books in data:
            loader = ItemLoader(BooksItem(), selector = books)
            loader.add_xpath('items','.//article[@class="product_pod"]/h3/a//text()')
            loader.add_xpath('price','.//p[@class="price_color"]//text()')
            
            for url in [books.xpath('.//a//@href').get()]:
                yield scrapy.Request(
                    response.urljoin(url),
                    callback = self.parse_book,
                    cb_kwargs = {'loader':loader})

        for next_page in [response.xpath('.//div/ul[@class="pager"]/li[@class="next"]/a//@href').get()]:
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)


    def parse_book(self, response, loader):
        book_quote = response.xpath('//p[@class="instock availability"]//text()').get()
        

        loader.add_value('availability', book_quote)
        yield loader.load_item()

I believe the issue is with the part where I try to grab the next few pages. I have tried an alternative approach using the following:

def start_request(self):
    for url in self.start_url:
        yield scrapy.Request(
            url,
            callback = self.parse,
            cb_kwargs = {'page_count':0})

def parse(self, response, page_count):
    if page_count > 3:
        return
...
...
    page_count += 1
    for next_page in [response.xpath('.//div/ul[@class="pager"]/li[@class="next"]/a//@href').get()]:
        yield response.follow(next_page, callback=self.parse, cb_kwargs = {'page_count': page_count})

However, I get the following error with this approach:

TypeError: parse() missing 1 required positional argument: 'page_count'

CodePudding user response:

  1. It should be start_requests, and self.start_urls (inside the function). This is also the cause of the TypeError in your second attempt (see the sketch after this list).

  2. get() will return only the first result; what you want is getall(), which returns a list.

  3. There is no need for a for loop in the "next_page" part; it's not a mistake, just unnecessary.

  4. In the line for url in books.xpath(...) you're getting every url twice (the cover image and the title link to the same page), again not a mistake but still...

  5. Here data = response.xpath('//div[@class = "col-sm-8 col-md-9"]') doesn't select the books one by one; it selects the whole books container. You can check that len(data.getall()) == 1.

  6. book_quote = response.xpath('//p[@class="instock availability"]//text()').get() will return \n; look at the source and try to find out why (hint: the 'i' tag).
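
About that TypeError: because start_request is missing the trailing "s", Scrapy never calls it and falls back to its built-in start_requests, which schedules parse without any cb_kwargs, so the extra argument is missing when parse runs. Below is a minimal sketch of page counting with cb_kwargs, assuming the dict key and the callback's parameter name are kept identical (the spider name here is made up for illustration):

import scrapy

class PagedBookSpider(scrapy.Spider):
    name = "books_paged"
    start_urls = ['https://books.toscrape.com']

    def start_requests(self):
        for url in self.start_urls:
            # the dict key must match the parameter name of parse() exactly
            yield scrapy.Request(url, callback=self.parse,
                                 cb_kwargs={'page_count': 1})

    def parse(self, response, page_count):
        if page_count > 3:  # stop after the third page
            return
        # ... extract and yield items here, as in the full spider below ...
        next_page = response.xpath('.//div/ul[@class="pager"]/li[@class="next"]/a//@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse,
                                  cb_kwargs={'page_count': page_count + 1})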

Compare your code to this and see what I changed:

import scrapy
from scrapy import Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class BooksItem(scrapy.Item):
    items = Field(output_processor=TakeFirst())
    price = Field(output_processor=TakeFirst())
    availability = Field(output_processor=TakeFirst())


class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ['https://books.toscrape.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse)

    def parse(self, response):
        data = response.xpath('//div[@class = "col-sm-8 col-md-9"]//li')
        for books in data:
            loader = ItemLoader(BooksItem(), selector=books)
            loader.add_xpath('items', './/article[@class="product_pod"]/h3/a//text()')
            loader.add_xpath('price', './/p[@class="price_color"]//text()')

            for url in books.xpath('.//h3/a//@href').getall():
                yield scrapy.Request(
                    response.urljoin(url),
                    callback=self.parse_book,
                    cb_kwargs={'loader': loader})

        next_page = response.xpath('.//div/ul[@class="pager"]/li[@class="next"]/a//@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response, loader):
        # option 1:
        book_quote = response.xpath('//p[@class="instock availability"]/i/following-sibling::text()').get().strip()

        # option 2:
        # book_quote = ''.join(response.xpath('//div[contains(@class, "product_main")]//p[@class="instock availability"]//text()').getall()).strip()
        loader.add_value('availability', book_quote)
        yield loader.load_item()
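
Note how cb_kwargs carries state between the callbacks here: parse builds a half-filled loader for each book and hands it to parse_book as the keyword argument loader, where the availability field is added before the item is finally yielded. If you want to run the spider as a standalone script instead of inside a project, here is a minimal sketch using CrawlerProcess; the books.json output path is just an example, and on recent Scrapy versions TakeFirst is imported from itemloaders.processors rather than scrapy.loader.processors:

from scrapy.crawler import CrawlerProcess

# hypothetical standalone runner; assumes BookSpider is defined above
process = CrawlerProcess(settings={
    'FEEDS': {'books.json': {'format': 'json'}},  # export scraped items as JSON
})
process.crawl(BookSpider)
process.start()  # blocks until the crawl finishes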