How to use Scrapy to do pagination and visit all links found on each page

I have the following spider, and I am trying to combine pagination with Rules so that it visits the links on each page.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Paging(CrawlSpider):
    name = "paging"
    start_urls = ['https://ausschreibungen-deutschland.de/1/']

    # Visit all 10 links (the extractor matches the 10 sub-links that start with a number followed by an underscore)
    rules = (
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse', follow=True),
    )

    def parse(self, response):
        
        # just get all the text 
        all_text = response.xpath("//text()").getall()

        yield {
            "text": " ".join(all_text),
            "url": response.url
        }
        
        # visit next page 
        # next_page_url = response.xpath('//a[@]').extract_first()

        # if next_page_url is not None:
            # yield scrapy.Request(response.urljoin(next_page_url))

I would like to implement the following behavior:

Start with page 1 https://ausschreibungen-deutschland.de/1/, visit all 10 links and get the text. (already implemented)

Go to page 2 https://ausschreibungen-deutschland.de/2/, visit all 10 links and get the text.

Go to page 3 https://ausschreibungen-deutschland.de/3/, visit all 10 links and get the text.

Go to page 4 ...

How would I combine these two concepts?

CodePudding user response:

I've done the pagination in start_urls, so you can increase or decrease the page range as you need.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Paging(CrawlSpider):
    name = "paging"
    start_urls = ['https://ausschreibungen-deutschland.de/' + str(x) + '/' for x in range(1, 11)]

    # Visit all 10 links (the extractor matches the 10 sub-links that start with a number followed by an underscore)
    rules = (
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse', follow=False),
    )

    def parse(self, response):
        
        # just get all the text 
        #all_text = response.xpath("//text()").getall()

        yield {
            # "text": " ".join(all_text),
            # the attribute filter below is a placeholder -- use the class of the element that wraps the title on the page
            'title': response.xpath('//*[@class="..."]/h2//text()').get(),
            "url": response.url
        }
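
If you'd rather not hard-code the page range in start_urls, the pagination itself can also be expressed as a Rule, so the crawler discovers page 2, 3, 4, ... on its own. This is only a minimal sketch, assuming the pagination links on each listing page also match a /<number>/ pattern (check the site's markup); the spider name and the parse_item callback are placeholders:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PagingFollow(CrawlSpider):
    name = "paging_follow"
    start_urls = ['https://ausschreibungen-deutschland.de/1/']

    rules = (
        # Follow the numbered pagination pages (/2/, /3/, ...); no callback needed,
        # these pages are only used to discover more links.
        Rule(LinkExtractor(allow=r"/[0-9]+/$"), follow=True),
        # Scrape every detail link that starts with a number and an underscore.
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # grab all visible text of the detail page, as in the question
        all_text = response.xpath("//text()").getall()
        yield {
            "text": " ".join(all_text),
            "url": response.url,
        }

The callback is named parse_item rather than parse because the Scrapy docs recommend not using parse as a rule callback in a CrawlSpider. Either spider can be run with e.g. scrapy runspider spider.py -o results.json.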
        
     