I am trying to scrape a job title from Indeed.com.
Here is my code:
import scrapy


class JobsNySpider(scrapy.Spider):
    name = 'jobs_ny'
    allowed_domains = ['www.indeed.com']
    start_urls = ['https://www.indeed.com/jobs?q=analyst&l=New York, NY&vjk=b588911bd50d7ab1']

    def parse(self, response):
        jobs = response.xpath("//td[@class='resultContent']")
        for job in jobs:
            yield {
                'title': job.xpath(".//h2[@class='jobTitle']/span/text()").get()
            }
        next_page = response.urljoin(response.xpath("//ul[@class='pagination-list']/li[position() = last()]/a/@href").get())
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)
For some reason, every yielded dictionary is {'title': None}. I disabled JavaScript to make sure I am scraping the raw HTML markup.
CodePudding user response:
Your XPath selector for the title is incorrect. The predicate @class='jobTitle' matches only when the entire class attribute is exactly jobTitle, but the h2 element has multiple classes on it, so the selector finds no matching element. Try using the contains function as shown below.
yield {
    'title': job.xpath(".//h2[contains(@class,'jobTitle')]/span/text()").get()
}
Alternatively, capture all the classes on the element if you want to select using @class=... This may be unstable, since some of the classes can change from time to time. See the sample below:
yield {
    'title': job.xpath(".//h2[@class='jobTitle jobTitle-color-purple']/span/text()").get()
}
I recommend using the contains function with a class name that is common to all the elements you want to select.