Using scrapy and xpath to parse data

Time:11-16

I have been trying to scrape some data but keep getting a blank value or None. I tried selecting the next sibling and that failed too (I probably did it wrong). Any help is greatly appreciated. Thank you in advance.

Website to scrape (final): https://www.unegui.mn/azhild-avna/ulan-bator/

Website to test (current, has less listings): https://www.unegui.mn/azhild-avna/mt-hariltsaa-holboo/slzhee-tehnik-hangamzh/ulan-bator/

Code Snippet:

def parse(self, response, **kwargs):
    cards = response.xpath("//li[contains(@class,'announcement-container')]")
    # parse details
    for card in cards: 
    company = card.xpath(".//*[@class='announcement-block__company-name']/text()").extract_first()
    date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first().split(',')
    date = date_block[0]
    city = date_block[1]

    item = {'date': date,
           'city': city,
           'company': company
           }

HTML Snippet:

<div class="announcement-block__date">
<span class="announcement-block__company-name">Электро экспресс ХХК</span>
,          Өчигдөр 13:05,                  Улаанбаатар</div>
<iframe name="sif1" sandbox="allow-forms allow-modals allow-scripts" frameborder="0"></iframe>

Expected Output:

date = Өчигдөр 13:05
city = Улаанбаатар

Extra:

If anyone can tell me, or point me to a reference on, how to set up my pipeline file, it would be greatly appreciated. Is a pipeline the right place for this, or should I use items.py? Currently I have 3 spiders in the same project folder: apartments, jobs, cars. I need to clean and transform my data. For example, for the jobs spider shown above, I want to apply the following transformations:

  • if salary is < 1000, then replace with string 'Negotiable'
  • if date contains the text "Өчигдөр" then replace with 'Yesterday' without deleting the time
  • if employer contains value 'Хувь хүн' then change company value to 'Хувь хүн'

my pipelines.py file:

from itemadapter import ItemAdapter


class ScrapebooksPipeline:
    def process_item(self, item, spider):
        return item

my items.py file:

import scrapy


class ScrapebooksItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

CodePudding user response:

It looks like the body of your for loop is missing indentation. Instead of:

def parse(self, response, **kwargs):
    cards = response.xpath("//li[contains(@class,'announcement-container')]")
    # parse details
    for card in cards:
    date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first().split(',')
    date = date_block[0]
    city = date_block[1]

Try this:

def parse(self, response, **kwargs):
    cards = response.xpath("//li[contains(@class,'announcement-container')]")
    # parse details
    for card in cards:
        date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first().split(',')
        date = date_block[0]
        city = date_block[1]
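
Note that indentation may not be the only problem. In XPath 1.0, normalize-space() applied to a node-set uses only the first node, and in the HTML snippet from the question the first text node of the div is the whitespace before the <span>, so the expression can come back blank even with the loop fixed. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree rather than Scrapy selectors, on a copy of that snippet, to show where the date and city text actually sits.

```python
import xml.etree.ElementTree as ET

# Copy of the HTML snippet from the question (well-formed, so ElementTree can parse it).
SNIPPET = (
    '<div class="announcement-block__date">\n'
    '<span class="announcement-block__company-name">Электро экспресс ХХК</span>\n'
    ',          Өчигдөр 13:05,                  Улаанбаатар</div>'
)

div = ET.fromstring(SNIPPET)
span = div.find("span")

# Text before the <span>: whitespace only. This is the first text node,
# which is the one normalize-space(.../text()) ends up using.
first_text = div.text

# Text after the <span>: this is where the date and the city live.
second_text = span.tail

# Split on commas and trim; the leading comma produces an empty first part.
parts = [part.strip() for part in second_text.split(",")]
company = span.text
date, city = parts[1], parts[2]

print(repr(first_text.strip()), company, date, city)
```

In Scrapy, selecting all text nodes of the div with getall() instead of only the first one gives access to that second node.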

CodePudding user response:

  1. I narrowed the scope of your XPath.
  2. extract_first() returns only the first match, so use getall() to get every text node instead.
  3. To get the date I had to use a regex (most of the results have a time but no date, so a blank date for those is perfectly fine).
  4. I can't read the language, so I had to guess (kind of) which part is the city, but even if it's wrong you can get the point.

import scrapy
import re


class TempSpider(scrapy.Spider):
    name = 'temp_spider'
    allowed_domains = ['unegui.mn']
    start_urls = ['https://www.unegui.mn/azhild-avna/ulan-bator/']

    def parse(self, response, **kwargs):
        cards = response.xpath('//div[@class="announcement-block__date"]')

        # parse details
        for card in cards:
            company = card.xpath('.//span/text()').get()

            date_block = card.xpath('./text()').getall()

            date = date_block[1].strip()
            date = re.findall(r'(\d+-\d+-\d+)', date)
            if date:
                date = date[0]
            else:
                date = ''

            city = date_block[1].split(',')[2].strip()

            item = {'date': date,
                    'city': city,
                    'company': company
                    }
            yield item

Output:

[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.unegui.mn/azhild-avna/ulan-bator/>
{'date': '2021-11-07', 'city': 'Улаанбаатар', 'company': 'Arirang'}
[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.unegui.mn/azhild-avna/ulan-bator/>
{'date': '2021-11-11', 'city': 'Улаанбаатар', 'company': 'Altangadas'}
[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.unegui.mn/azhild-avna/ulan-bator/>
...
...
...
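
On the extra question: items.py is where you declare the fields an item has, while pipelines.py is the usual place to clean and transform items after a spider yields them, so the three rules would normally go in a pipeline. A minimal sketch, assuming the spider yields plain dicts, that the jobs spider is named 'jobs', and that the items carry 'salary' and 'employer' keys (none of these names appear in the code above, so adjust them to your project):

```python
from types import SimpleNamespace


class JobsCleaningPipeline:
    """Sketch of an item pipeline applying the three clean-up rules from the question."""

    def process_item(self, item, spider):
        # Assumption: the jobs spider is named 'jobs'; items from other spiders pass through.
        if getattr(spider, "name", None) != "jobs":
            return item

        # Rule 1: a salary below 1000 becomes the string 'Negotiable'.
        # 'salary' is a hypothetical field; non-numeric values are left untouched.
        salary = item.get("salary")
        try:
            if salary is not None and float(salary) < 1000:
                item["salary"] = "Negotiable"
        except (TypeError, ValueError):
            pass

        # Rule 2: replace 'Өчигдөр' with 'Yesterday' and keep the time.
        date = item.get("date")
        if date and "Өчигдөр" in date:
            item["date"] = date.replace("Өчигдөр", "Yesterday")

        # Rule 3: if the (hypothetical) 'employer' field mentions 'Хувь хүн',
        # set the company to 'Хувь хүн'.
        employer = item.get("employer")
        if employer and "Хувь хүн" in employer:
            item["company"] = "Хувь хүн"

        return item


# Quick local check; SimpleNamespace only stands in for a real spider object.
cleaned = JobsCleaningPipeline().process_item(
    {"date": "Өчигдөр 13:05", "salary": "500",
     "employer": "Хувь хүн", "company": "Электро экспресс ХХК"},
    SimpleNamespace(name="jobs"),
)
print(cleaned)
```

To enable it, add the class to ITEM_PIPELINES in settings.py, for example {'scrapebooks.pipelines.JobsCleaningPipeline': 300}; the module path is a guess based on the ScrapebooksPipeline class name, so use your own project's path.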