I am trying to scrape this website. I want the address and contact details, but I keep getting None as output. The data I want is present in the response, yet I can't extract it. Please tell me what I am doing wrong. I have spent a lot of time on this and I am stuck.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MobilesSpider(CrawlSpider):
    name = 'mobiles'
    allowed_domains = ['www.vcsdata.com']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'

    def set_user_agent(self, request, response):
        # Called for every request produced by the rule below.
        request.headers['User-Agent'] = self.user_agent
        return request

    def start_requests(self):
        yield scrapy.Request(url='https://www.vcsdata.com/companies_gurgaon.html',
                             headers={
                                 # The header name uses a hyphen, not an underscore.
                                 'User-Agent': self.user_agent
                             })

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//div/a[@class="text-dark"]')), callback='parse_item', follow=True, process_request='set_user_agent'),
    )

    def parse_item(self, response):
        data = response.url
        print(data)
        # Absolute XPath from the document root; this is the call that returns None.
        address = response.xpath('/html/body/div/section[2]/div/div/div[1]/div[2]/div[2]/div/div/div[1]/h6/text()').get()
        print(address)
Answer:
Your XPath selector probably does not match the structure of the page. In any case, I would advise against using an absolute XPath from the document root. Even when it works, it is fragile, because a minor change in the HTML breaks your parsing. Starting the expression with // instead gives you a shorter and more reliable selector, e.g.:
response.xpath('//h6[contains(., "Address")]/text()').get()
Also, instead of having a set_user_agent method, you could define the User-Agent in the Scrapy settings (e.g. in the settings.py file or through the custom_settings class attribute):
USER_AGENT='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'