Can't scrape title; Python dictionary returns 'None'

Time:02-12

I am trying to scrape a job title from Indeed.com.

Here is my code:

import scrapy

class JobsNySpider(scrapy.Spider):
    name = 'jobs_ny'
    allowed_domains = ['www.indeed.com']
    start_urls = ['https://www.indeed.com/jobs?q=analyst&l=New York, NY&vjk=b588911bd50d7ab1']

    def parse(self, response):
        jobs = response.xpath("//td[@class='resultContent']")
        for job in jobs:
            yield {
                'title': job.xpath(".//h2[@class='jobTitle']/span/text()").get()
            }

        # Read the raw href first: response.urljoin(None) returns the current
        # URL, so testing the joined value would always be truthy.
        next_href = response.xpath("//ul[@class='pagination-list']/li[position() = last()]/a/@href").get()
        if next_href:
            yield scrapy.Request(url=response.urljoin(next_href), callback=self.parse)

For some reason, the yielded dictionary is {'title': None}. I disabled JavaScript to make sure I am scraping the plain HTML markup.

CodePudding user response:

Your XPath selector for the title is incorrect. The predicate @class='jobTitle' matches only when the entire class attribute is exactly 'jobTitle', but the h2 element has multiple classes on it, so the selector finds no matching element and .get() returns None. Try using the contains function as shown below.

yield {
   'title': job.xpath(".//h2[contains(@class,'jobTitle')]/span/text()").get()
}

Alternatively, if you want to select using @class=..., make sure you include every class on the element, exactly as it appears in the markup. This can be unstable, since some of the classes may change from time to time. See the sample below.

yield {
   'title': job.xpath(".//h2[@class='jobTitle jobTitle-color-purple']/span/text()").get()
}
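
The difference between the two exact-match forms can be checked offline. The following is only a minimal sketch: it uses the standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors (an assumption, so that it runs without Scrapy), and the markup is a hypothetical fragment built from the class names quoted above, not a copy of the real page. ElementTree supports only a subset of XPath and has no contains function, so only the @class='...' comparisons are shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real Indeed page is far larger
# and its class names may differ or change.
SAMPLE = """
<table><tr><td class="resultContent">
  <h2 class="jobTitle jobTitle-color-purple"><span>Data Analyst</span></h2>
</td></tr></table>
"""

root = ET.fromstring(SAMPLE)


def first_title(path):
    """Return the text of the first node matching path, or None if no match."""
    node = root.find(path)
    return None if node is None else node.text


# Compares the whole attribute string against 'jobTitle': no match.
exact_one_class = first_title(".//h2[@class='jobTitle']/span")

# Compares against the full attribute string: matches.
exact_all_classes = first_title(
    ".//h2[@class='jobTitle jobTitle-color-purple']/span"
)

print(exact_one_class)
print(exact_all_classes)
```

The first lookup finds nothing because the attribute value is the whole string "jobTitle jobTitle-color-purple", which is the same reason .get() returns None in the spider.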

I recommend using the contains function with a class name that is common to all the elements you want to select.
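
One caveat when picking that class name: contains performs a plain substring test, so it can also match elements whose class merely includes the text. The sketch below mimics that behaviour in plain Python rather than running real XPath; the helper names and the second class string are hypothetical and used only for illustration.

```python
def xpath_contains(class_attr: str, needle: str) -> bool:
    """Mimic XPath contains(@class, needle): a plain substring test."""
    return needle in class_attr


def has_class_token(class_attr: str, name: str) -> bool:
    """Stricter check: name must be one whitespace-separated class token."""
    return name in class_attr.split()


samples = [
    "jobTitle jobTitle-color-purple",  # class attribute quoted in this answer
    "jobTitleWrapper",                 # hypothetical unrelated element
]

# Each row: (class attribute, substring match?, whole-token match?)
results = [
    (s, xpath_contains(s, "jobTitle"), has_class_token(s, "jobTitle"))
    for s in samples
]

for row in results:
    print(row)
```

If the substring match turns out to be too loose on a given page, a common XPath 1.0 idiom for a whole-token match is contains(concat(' ', normalize-space(@class), ' '), ' jobTitle ').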
