unable to scrape elements using link extractor rule using scrapy

Time: 09-28

I am trying to scrape this website. I want the address and contact details, but I don't know why I am getting None as output. The data I want is present in the response, but I can't scrape it. Please tell me where I am going wrong; I have wasted plenty of time and am just stuck.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MobilesSpider(CrawlSpider):
    name = 'mobiles'
    allowed_domains = ['www.vcsdata.com']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'

    # Follow company detail links and send each request with the custom User-Agent.
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div/a[@class="text-dark"]'),
             callback='parse_item', follow=True,
             process_request='set_user_agent'),
    )

    def set_user_agent(self, request, response):
        request.headers['User-Agent'] = self.user_agent
        return request

    def start_requests(self):
        # Note: the header key must be 'User-Agent' (with a hyphen),
        # not 'User_Agent' -- the misspelled key is silently ignored.
        yield scrapy.Request(url='https://www.vcsdata.com/companies_gurgaon.html',
                             headers={'User-Agent': self.user_agent})

    def parse_item(self, response):
        print(response.url)
        # This absolute XPath from the document root is what returns None.
        address = response.xpath('/html/body/div/section[2]/div/div/div[1]/div[2]/div[2]/div/div/div[1]/h6/text()').get()
        print(address)

CodePudding user response:

You might have a mistake in your XPath selector. However, I would advise you to avoid using a complete XPath from the document root. Although it can work, it is quite fragile: even a minor change in the HTML will break your parsing. By using // instead, you get a shorter and more reliable selector, e.g.

response.xpath('//h6[contains(., "Address")]/text()').get()
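To see why the relative form is more robust, here is a small self-contained sketch. It uses the standard library's ElementTree (which supports only a limited XPath subset, so no contains() here) purely as a stand-in for Scrapy's selector, and a hypothetical simplified page layout:

```python
import xml.etree.ElementTree as ET

# Hypothetical, much-simplified markup standing in for the real page.
html = """
<body>
  <section>
    <div>
      <div><h6>Address: 123 Example Road, Gurgaon</h6></div>
    </div>
  </section>
</body>
"""

root = ET.fromstring(html)

# Absolute path: must name every intermediate element, so it breaks
# as soon as a wrapper div is added or removed in the page layout.
absolute = root.find('./section/div/div/h6')

# Relative search: finds the h6 wherever it sits in the tree.
relative = root.find('.//h6')

print(absolute.text)
print(relative.text)
```

Both lookups find the node in this layout, but only the `.//h6` form keeps working if the site reshuffles its wrapper divs, which is exactly what makes a long absolute path from `/html/body/...` so fragile.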

Also, instead of having a set_user_agent method, you could define the User-Agent in the Scrapy settings (e.g. in the settings.py file or via the custom_settings property):

USER_AGENT='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
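The per-spider alternative looks like this. This is a sketch only: the scrapy import is omitted so the snippet stands alone, and in the real project the class subclasses CrawlSpider as in the question:

```python
# Sketch: custom_settings as a class attribute overrides the
# project-wide settings.py values for this spider only.
class MobilesSpider:  # subclasses CrawlSpider in the real project
    name = 'mobiles'
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/92.0.4515.107 Safari/537.36',
    }
```

With this in place, every request the spider makes carries that User-Agent, and the set_user_agent / process_request plumbing can be dropped entirely.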