Web scraping with pagination doesn't return all results

Time:02-16

I am trying to scrape Indeed.com but am having a problem with pagination. Here is my code:

import scrapy
class JobsNySpider(scrapy.Spider):
    name = 'jobs_ny'
    allowed_domains = ['www.indeed.com']
    start_urls = ['https://www.indeed.com/jobs?q=analytics&l=New York, NY&vjk=7b2f6385304ffc78']

    def parse(self, response):
        jobs = response.xpath("//td[@id='resultsCol']")
        for job in jobs:
            yield {
                'Job_title': job.xpath(".//td[@class='resultContent']/div/h2/span/text()").get(),
                'Company_name': job.xpath(".//span[@class='companyName']/a/text()").get(),
                'Company_rating': job.xpath(".//span[@class='ratingNumber']/span/text()").get(),
                'Company_location': job.xpath(".//div[@class='companyLocation']/text()").get(),
                'Estimated_salary': job.xpath(".//span[@class='estimated-salary']/span/text()").get()
            }

        next_page = response.urljoin(response.xpath("//a[@aria-label='Next']/@href").get())

        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)

The problem is that, according to Indeed, 28,789 jobs match my query, yet when I save the scraped results to a CSV file there are only 76 rows. I also tried next_page = response.urljoin(response.xpath("//ul[@class='pagination-list']/li[position() = last()]/a/@href").get()), but the result was similar. What am I doing wrong in handling the pagination?

CodePudding user response:

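  1. The problem is not with the pagination; it's that you only get one job from each page. //td[@id='resultsCol'] matches the single results container, so the loop runs once per page and each .get() returns only the first match inside it. Select the individual job cards instead.
  2. It's better to do the urljoin after the if statement. Calling response.urljoin() on a missing href (None) returns the page's base URL (normally the current URL) rather than None, so the if check always passes, even on the last page where there is no "Next" link.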
import scrapy


class JobsNySpider(scrapy.Spider):
    name = 'jobs_ny'
    allowed_domains = ['www.indeed.com']
    start_urls = ['https://www.indeed.com/jobs?q=analytics&l=New York, NY&vjk=7b2f6385304ffc78']

    def parse(self, response):
        # Select each job card (one <a> per posting), not the single results container
        jobs = response.xpath('//div[@id="mosaic-provider-jobcards"]/a')
        for job in jobs:
            yield {
                'Job_title': job.xpath(".//td[@class='resultContent']/div/h2/span/text()").get(),
                'Company_name': job.xpath(".//span[@class='companyName']/a/text()").get(),
                'Company_rating': job.xpath(".//span[@class='ratingNumber']/span/text()").get(),
                'Company_location': job.xpath(".//div[@class='companyLocation']/text()").get(),
                'Estimated_salary': job.xpath(".//span[@class='estimated-salary']/span/text()").get()
            }

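        # .get() returns None when there is no "Next" link (last page)
        next_page = response.xpath("//a[@aria-label='Next']/@href").get()

        if next_page:
            # Build the absolute URL only after confirming the link exists
            next_page = response.urljoin(next_page)
            yield scrapy.Request(url=next_page, callback=self.parse)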
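
If it helps to see both points in isolation, here is a rough sketch that runs without Scrapy. It uses the standard library's ElementTree and urllib.parse.urljoin as stand-ins for Scrapy's selectors and response.urljoin, so it only approximates what the spider does. The PAGE markup, the "cards" id, the example.com URL and the names_per_match helper are all made up for illustration and are not Indeed's real structure.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one results page (not Indeed's real markup).
PAGE = """
<html><body>
  <td id="resultsCol">
    <div id="cards">
      <a><span class="companyName">Acme</span></a>
      <a><span class="companyName">Globex</span></a>
      <a><span class="companyName">Initech</span></a>
    </div>
  </td>
</body></html>
"""

root = ET.fromstring(PAGE)


def names_per_match(xpath):
    """Mimic the spider's loop: one item per matched node, first company name only."""
    items = []
    for node in root.findall(xpath):
        first = node.find(".//span[@class='companyName']")
        items.append(first.text if first is not None else None)
    return items


# Point 1: looping over the single container yields one item per page,
# while looping over the individual cards yields one item per job.
container_items = names_per_match(".//td[@id='resultsCol']")
card_items = names_per_match(".//div[@id='cards']/a")
print(container_items)  # ['Acme']
print(card_items)       # ['Acme', 'Globex', 'Initech']

# Point 2: joining a missing href (None) gives back the base URL, which is
# truthy, so an "if next_page:" check placed after the join never fails.
current_url = "https://example.com/jobs?q=analytics"  # hypothetical URL
joined_missing = urljoin(current_url, None)
joined_present = urljoin(current_url, "/jobs?q=analytics&start=10")
print(joined_missing)   # https://example.com/jobs?q=analytics
print(joined_present)   # https://example.com/jobs?q=analytics&start=10
```

The same reasoning applies to the spider: iterate over the per-job nodes, and test the raw href for None before joining it.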