Scrapy: How to efficiently follow nested links with similar css selectors?

I have something similar to the following code. I know that in this example it would be possible to navigate directly to the yourself tag page, but in my real application I need to visit page 1 to get the links to page 2, and I need the links from page 2 to get to page 3, and so on (i.e. the URLs don't follow a predictable pattern).

import scrapy


class SampleSpider(scrapy.Spider):
    name = "sample"
    start_urls = [
        "https://quotes.toscrape.com/",
    ]

    def parse(self, response):
        links = response.css(
            'a[href*=inspirational]::attr(href)'
        ).extract()
        for link in links:
            yield response.follow(link, self.parse_inspirational)

    def parse_inspirational(self, response):
        links = response.css('a[href*=life]::attr(href)').extract()
        for link in links:
            yield response.follow(link, self.parse_life)

    def parse_life(self, response):
        links = response.css('a[href*=yourself]::attr(href)').extract()
        for link in links:
            yield response.follow(link, self.parse_yourself)

    def parse_yourself(self, response):
        for resp in response.css('span[itemprop="text"]::text').extract():
            print(resp)

Since the same pattern of following a link and then looking for a new CSS selector is repeated three times, I want to write a function that iterates over a list of CSS strings and recursively yields the responses. This is what I came up with, but it doesn't work. I expect it to print the same output as the original, long-form code:

def parse_recurse(self, response, css_str=None):
    links = response.css(css_str.pop(0)).extract()
    for link in links:
        yield response.follow(link, callback=self.parse_recurse, cb_kwargs={"css_str":css_str})
        
def parse(self, response):
    css = ['a[href*=inspirational]::attr(href)',
           'a[href*=life]::attr(href)',
           'a[href*=yourself]::attr(href)']
    response = self.parse_recurse(response, css_str=css)
    for resp in response.css('span[itemprop="text"]::text').extract():
        print(resp)

CodePudding user response:

You can't do response = self.parse_recurse(...) because parse_recurse is a generator that yields Request objects, not a Response.

Normally a callback yields a Request, and Scrapy catches it and hands it to the engine, which later sends the request to the server, gets the response back, and executes the callback with that response.

See the documentation for details: Architecture overview
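
A quick way to see this without Scrapy at all: calling a generator function only creates a generator object, so the assignment hands you something with no .css() method. This is a minimal stand-alone sketch, not Scrapy code:

def parse_recurse():
    yield "Request(page2)"    # stands in for a yielded scrapy.Request

response = parse_recurse()    # a generator object, not a Response
print(type(response))         # <class 'generator'>
# response.css(...)           # AttributeError: 'generator' object has no attribute 'css'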

You have to use start_requests to run parse_recurse with the full list of selectors, and parse_recurse should check whether that list is empty. If it is not empty, it yields requests with callback parse_recurse and a shorter list (so the recursion continues). If it is empty, it yields requests with callback parse, which extracts the text.

import scrapy

class SampleSpider(scrapy.Spider):
    name = "sample"
    
    start_urls = ["https://quotes.toscrape.com/"]

    road = [
        'a[href*=inspirational]::attr(href)',
        'a[href*=life]::attr(href)',
        'a[href*=yourself]::attr(href)',
    ]
    
    def start_requests(self):
        """Run starting URL with full road."""
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_recurse, cb_kwargs={"road": self.road})
        
    def parse_recurse(self, response, road):
        """If road is not empty then send to parse_recurse with smaller road.
           If road is empty then send to parse."""

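        # slice instead of pop: each branch of the recursion gets its own
        # copy of the remaining road, so sibling requests don't interfere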
        first = road[0]
        rest  = road[1:]
        
        links = response.css(first).extract()
        
        if rest:
            # repeat recursion
            for link in links:
                yield response.follow(link, callback=self.parse_recurse, cb_kwargs={"road": rest})
        else:
            # exit recursion
            for link in links:
                yield response.follow(link, callback=self.parse)
            
    def parse(self, response):
        for text in response.css('span[itemprop="text"]::text').extract():
            print(text)
            yield {'text': text}  # yield an item so FEEDS can write it to output.csv
            
# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(SampleSpider)
c.start() 
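
To try it, save everything above in a single file (e.g. sample_spider.py, the name is arbitrary) and run it with plain Python, no Scrapy project needed:

python sample_spider.py

The quote texts are printed to the console and, because parse also yields items, written to output.csv by the FEEDS setting.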