I have the following spider, and I am trying to combine pagination and Rules for visiting the links on each page.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Paging(CrawlSpider):
    name = "paging"
    start_urls = ['https://ausschreibungen-deutschland.de/1/']

    # Visit all 10 links (it recognizes all 10 sublinks starting with a number and a _)
    rules = (
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse', follow=True),
    )

    def parse(self, response):
        # just get all the text
        all_text = response.xpath("//text()").getall()
        yield {
            "text": " ".join(all_text),
            "url": response.url
        }

        # visit next page
        # next_page_url = response.xpath('//a[@]').extract_first()
        # if next_page_url is not None:
        #     yield scrapy.Request(response.urljoin(next_page_url))
I would like to implement the following behavior:

1. Start with page 1, https://ausschreibungen-deutschland.de/1/, visit all 10 links and get the text. (already implemented)
2. Go to page 2, https://ausschreibungen-deutschland.de/2/, visit all 10 links and get the text.
3. Go to page 3, https://ausschreibungen-deutschland.de/3/, visit all 10 links and get the text.
4. Go to page 4 ...

How would I combine these two concepts?
CodePudding user response:
I've done the pagination in start_urls, so you can increase or decrease the page range as needed. Note that the rule callback is named parse_item rather than parse: CrawlSpider uses parse internally to apply the rules, so overriding it breaks the link extraction.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Paging(CrawlSpider):
    name = "paging"
    # Pagination: list the overview pages /1/ through /10/ up front
    start_urls = ['https://ausschreibungen-deutschland.de/' + str(x) + '/' for x in range(1, 11)]

    # Visit all 10 links (it recognizes all 10 sublinks starting with a number and a _)
    rules = (
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # just get all the text
        # all_text = response.xpath("//text()").getall()
        yield {
            # "text": " ".join(all_text),
            'title': response.xpath('//*[@]/h2//text()').get(),
            "url": response.url
        }
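
An alternative, in case the total number of pages is not known in advance, is to express the pagination itself as a second Rule instead of enumerating pages in start_urls. The sketch below assumes, as in the question, that the overview pages look like /2/, /3/, ... and the detail pages start with a number followed by an underscore; the class name PagingRules and the callback name parse_item are placeholders:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PagingRules(CrawlSpider):
    name = "paging_rules"
    start_urls = ['https://ausschreibungen-deutschland.de/1/']

    rules = (
        # Follow the pagination links (/2/, /3/, ...) without a callback
        Rule(LinkExtractor(allow=r"/[0-9]+/$"), follow=True),
        # Scrape each detail link (number followed by an underscore)
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # just get all the text, as in the question
        all_text = response.xpath("//text()").getall()
        yield {
            "text": " ".join(all_text),
            "url": response.url
        }

Because the pagination rule has follow=True and no callback, the spider keeps discovering further overview pages from each page it visits, and Scrapy's built-in duplicate filter prevents it from revisiting URLs. Either version can be run with the standard scrapy runspider paging.py -o items.json.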