I want to scrape three div tags that share the same class from this website: https://www.riotgames.com/en/work-with-us/jobs. Here are the tags:
<div >Data Science Intern (PhD) - Technology Research</div>
<div ></div>
<div >Riot Operations & Support</div>
<div >Los Angeles, USA</div>
As you can see, the second div tag has no text in it. I want to catch that and replace it with a placeholder such as N/A. Here is my code:
class RiotscraperSpider(scrapy.Spider):
    name = 'riotscraper'
    allowed_domains = ['www.riotgames.com']
    start_urls = ['https://www.riotgames.com/en/work-with-us/jobs']

    def parse(self, response):
        jobs = response.css('li.job-row.job-row--body')
        for job in jobs:
            for i in job.css('a.job-row__inner.js-job-url'):
                yield {
                    'job_name': i.css('div.job-row__col.job-row__col--primary::text').get(),
                    'Craft_name': i.css('div.job-row__col.job-row__col--secondary::text').getall()[0],
                    'Team_name': i.css('div.job-row__col.job-row__col--secondary::text').getall()[1],
                    'Office': i.css('div.job-row__col.job-row__col--secondary::text').getall()[2]
                }
As you can see, I'm not very good at this and really can't think of how to catch the missing text. I'm using Scrapy.
CodePudding user response:
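Instead of indexing into getall(), you can select each div on its own and use get(default='N/A'). An empty div has no text node, so getall() returns fewer strings than there are divs and the positions shift. Selecting each column with :nth-of-type() keeps the positions fixed (the primary div is the first div, so the secondary ones are the 2nd, 3rd and 4th), and get() falls back to the default when a column is empty.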
import scrapy


class RiotscraperSpider(scrapy.Spider):
    name = 'riotscraper'
    allowed_domains = ['www.riotgames.com']
    start_urls = ['https://www.riotgames.com/en/work-with-us/jobs']

    def parse(self, response):
        jobs = response.css('li.job-row.job-row--body')
        for job in jobs:
            for i in job.css('a.job-row__inner.js-job-url'):
                yield {
                    'job_name': i.css('div.job-row__col.job-row__col--primary::text').get(default='N/A'),
                    'Craft_name': i.css('div.job-row__col.job-row__col--secondary:nth-of-type(2)::text').get(default='N/A'),
                    'Team_name': i.css('div.job-row__col.job-row__col--secondary:nth-of-type(3)::text').get(default='N/A'),
                    'Office': i.css('div.job-row__col.job-row__col--secondary:nth-of-type(4)::text').get(default='N/A'),
                }
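One caveat: get(default='N/A') only applies when the selector matches nothing at all. If a div contained only whitespace (the question shows a truly empty div, so this is an assumption about how the page might render in other cases), get() would return that whitespace rather than the default. A small helper can cover both cases. The name text_or_default is hypothetical and not part of Scrapy.

```python
from typing import Optional


def text_or_default(value: Optional[str], default: str = "N/A") -> str:
    """Return the stripped text, or the default if it is missing or blank."""
    if value is None:
        return default
    cleaned = value.strip()
    return cleaned if cleaned else default


# Intended use inside parse(), wrapping a plain .get() call:
#   'Office': text_or_default(
#       i.css('div.job-row__col.job-row__col--secondary:nth-of-type(4)::text').get()
#   ),
print(text_or_default(None))
print(text_or_default("  Los Angeles, USA "))
```

This also trims stray leading and trailing whitespace from the values that are present, which may or may not be wanted depending on how the data is used later.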