Scrapy, Selenium, Python - problem with pagination (missing pages)

I have a problem with running Scrapy. It seems like Scrapy is skipping the last pages. For example, I've set 20 pages to scrape, but Scrapy is missing the last 7 to 10 pages. It has no problem when I set a single page, "for page in range(6, 7)". The terminal shows that it is scraping all pages from 1 to 100, but the output in my database ends at random pages. Any ideas why that is happening?

Maybe there is a way to run Scrapy synchronously: scrape every item on the first page, then the second page, then the third, and so on.

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

from websitescrapper.items import WebsitescrapperItem  # adjust to your project's items module


class SomeSpider(scrapy.Spider):
    name = 'default'
    urls = [f'https://www.somewebsite.com/pl/c/cat?page={page}' for page in range(1, 101)]

    # One headless Chrome instance shared by every callback of this spider
    service = Service(ChromeDriverManager().install())
    options = Options()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument("--headless")
    options.add_argument("--allow-running-insecure-content")
    options.add_argument("--enable-crash-reporter")
    options.add_argument("--disable-popup-blocking")
    options.add_argument("--disable-default-apps")
    options.add_argument("--incognito")

    driver = webdriver.Chrome(service=service, options=options)

    def start_requests(self):
        # Schedule every listing page up front; Scrapy fetches them concurrently
        for url in self.urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        for videos in response.css('div.card-img'):
            item = WebsitescrapperItem()

            # Load the item link in Selenium to resolve the final URL
            link = f'https://www.somewebsite.com{videos.css("a.item-link").attrib["href"]}'
            SomeSpider.driver.get(link)
            domain_name = SomeSpider.driver.current_url
            SomeSpider.driver.back()

            item['name'] = videos.css('span.title::text').get().strip()
            item['duration'] = videos.css('span.duration::text').get().strip()
            item['image'] = videos.css('img.thumb::attr(src)').get()
            item['url'] = domain_name
            item['hd'] = videos.css('span.hd-icon').get()

            yield item

CodePudding user response:

Try running the code using this calling pattern, where parse() yields the page requests itself:

def parse(self, response):
    # do some stuff
    for page in range(self.total_pages):
        yield scrapy.Request(f'https://example.com/search?page={page}',
                             callback=self.parse)
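Fleshed out, that suggestion might look like the sketch below (the spider name, URL, and total_pages are placeholders, not values from the question; requests re-yielded on later pages are simply dropped by Scrapy's built-in duplicate filter):

import scrapy

class SearchSpider(scrapy.Spider):
    name = 'search'
    total_pages = 100  # placeholder page count
    start_urls = ['https://example.com/search?page=0']

    def parse(self, response):
        # ... extract and yield items from this page ...
        for page in range(self.total_pages):
            # Re-yielding URLs that were already scheduled is harmless:
            # Scrapy's dupefilter silently drops duplicate requests.
            yield scrapy.Request(f'https://example.com/search?page={page}',
                                 callback=self.parse)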

Also, if you yield multiple requests from start_requests, or have multiple URLs in start_urls, those will be handled asynchronously, according to your concurrency settings (Scrapy's default is 8 concurrent requests per domain, 16 total). Make sure you configure them accordingly in settings.py.

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
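
If the goal is to force Scrapy to process one request at a time, those same settings can be tightened instead of raised; a minimal settings.py fragment, with illustrative values:

# Process at most one request at a time (illustrative values)
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 0.25  # optional pause between consecutive requests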

CodePudding user response:

If you want to run it synchronously, you would do it like so:

def parse(self, response, current_page):
    url = 'https://www.somewebsite.com/pl/c/cat?page={}'
    # do some stuff
    current_page += 1
    yield scrapy.Request(url.format(current_page),
                         callback=self.parse,
                         cb_kwargs={'current_page': current_page})
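
A self-contained version of that idea, using Scrapy's cb_kwargs to pass the page counter to the callback, plus a stopping condition so the chain ends (the URL and total_pages are placeholders):

import scrapy

class SequentialSpider(scrapy.Spider):
    name = 'sequential'
    url = 'https://www.somewebsite.com/pl/c/cat?page={}'
    total_pages = 100  # placeholder upper bound

    def start_requests(self):
        # Start the chain at page 1
        yield scrapy.Request(self.url.format(1),
                             callback=self.parse,
                             cb_kwargs={'current_page': 1})

    def parse(self, response, current_page):
        # ... extract and yield items from this page ...
        if current_page < self.total_pages:
            # The next page is requested only after this one is parsed,
            # so pages are processed strictly in order.
            yield scrapy.Request(self.url.format(current_page + 1),
                                 callback=self.parse,
                                 cb_kwargs={'current_page': current_page + 1})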