Scraping and accounting for missing text inside div tags

Time:01-05

I want to scrape three div tags that share the same class from this website: https://www.riotgames.com/en/work-with-us/jobs. Here are the tags:

<div >Data Science Intern (PhD) - Technology Research</div>
<div ></div>
<div >Riot Operations &amp; Support</div>
<div >Los Angeles, USA</div>

As you can see, the second div tag has no text in it. I want to catch that and replace it with N/A, for example. Here is my code:

class RiotscraperSpider(scrapy.Spider):
    name = 'riotscraper'
    allowed_domains = ['www.riotgames.com']
    start_urls = ['https://www.riotgames.com/en/work-with-us/jobs']

    def parse(self, response):
        jobs = response.css('li.job-row.job-row--body')
        for job in jobs:
            for i in job.css('a.job-row__inner.js-job-url'):
                yield {
                    'job_name': i.css('div.job-row__col.job-row__col--primary::text').get(),
                    'Craft_name': i.css('div.job-row__col.job-row__col--secondary::text').getall()[0],
                    'Team_name': i.css('div.job-row__col.job-row__col--secondary::text').getall()[1],
                    'Office': i.css('div.job-row__col.job-row__col--secondary::text').getall()[2]
                }

As you can see, I'm not very good at this and can't figure out how to catch the missing text. I'm using Scrapy.

CodePudding user response:

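The ::text selector only matches text nodes, so an empty div contributes nothing to getall(). For the row shown, the list has two items instead of three, the values shift to the wrong keys, and index [2] raises an IndexError. Instead of indexing into getall(), select each column by position with :nth-of-type() and use get(default='N/A'), which returns the default when nothing matches.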

import scrapy


class RiotscraperSpider(scrapy.Spider):
    name = 'riotscraper'
    allowed_domains = ['www.riotgames.com']
    start_urls = ['https://www.riotgames.com/en/work-with-us/jobs']

    def parse(self, response):
        jobs = response.css('li.job-row.job-row--body')
        for job in jobs :
            for i in job.css('a.job-row__inner.js-job-url') :
                yield {
                    'job_name': i.css('div.job-row__col.job-row__col--primary::text').get(default='N/A'),
                    'Craft_name': i.css('div.job-row__col.job-row__col--secondary:nth-of-type(2)::text').get(default='N/A'),
                    'Team_name': i.css('div.job-row__col.job-row__col--secondary:nth-of-type(3)::text').get(default='N/A'),
                    'Office': i.css('div.job-row__col.job-row__col--secondary:nth-of-type(4)::text').get(default='N/A'),
                }
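To see why the positional lookup helps, here is a minimal offline sketch of the same idea. It uses the standard library's xml.etree.ElementTree rather than Scrapy selectors, purely so it runs without a network request. The sample row, its class attributes (inferred from the selectors in the question, not copied from the live page), and the helper names text_nodes_only and parse_row are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample row modelled on the question's markup. The class names are assumed
# from the spider's selectors; the real page may differ.
ROW = """
<a class="job-row__inner js-job-url">
  <div class="job-row__col job-row__col--primary">Data Science Intern (PhD) - Technology Research</div>
  <div class="job-row__col job-row__col--secondary"></div>
  <div class="job-row__col job-row__col--secondary">Riot Operations &amp; Support</div>
  <div class="job-row__col job-row__col--secondary">Los Angeles, USA</div>
</a>
""".strip()

FIELDS = ("job_name", "Craft_name", "Team_name", "Office")


def text_nodes_only(row_xml):
    """Roughly what '::text' plus getall() sees: empty divs yield nothing."""
    root = ET.fromstring(row_xml)
    return [
        div.text.strip()
        for div in root.iter("div")
        if div.text and div.text.strip()
    ]


def parse_row(row_xml, default="N/A"):
    """Look at each div element in order and fall back to a default when empty."""
    root = ET.fromstring(row_xml)
    values = [(div.text or "").strip() or default for div in root.iter("div")]
    return dict(zip(FIELDS, values))


if __name__ == "__main__":
    print(text_nodes_only(ROW))  # the empty column is silently dropped
    print(parse_row(ROW))        # every column keeps its position
```

In a Scrapy spider the equivalent per-element pattern would be to loop over the div selectors of a row and call .css('::text').get(default='N/A') on each one, which keeps the columns aligned even if the number of columns changes.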