I am crawling TripAdvisor with a Scrapy CrawlSpider and a LinkExtractor. As far as I can tell my XPath expressions are correct, but the crawl seems to run indefinitely and prints what looks like HTML source instead of the name and address of each restaurant. I suspect the problem is in my restrict_xpaths expression, but I can't figure out what it is.
Code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TripadSpider(CrawlSpider):
    name = 'tripad'
    allowed_domains = ['www.tripadvisor.in']
    start_urls = ['https://www.tripadvisor.in/Restaurants-g304551-New_Delhi_National_Capital_Territory_of_Delhi.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@]//a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'title': response.xpath('//h1[@]/text()').get(),
            'Address': response.xpath('(//a[@])[2]').get()
        }
CodePudding user response:
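The spider is crawling; try changing your user agent (the USER_AGENT setting). You also forgot to add /text() to the address XPath. Without it, .get() returns the whole <a> element as HTML, which is the "source code" you are seeing. The version below also drops follow=True from the item rule and adds a separate rule for pagination.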
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TripadSpider(CrawlSpider):
    name = 'tripad'
    allowed_domains = ['tripadvisor.in']
    start_urls = ['https://www.tripadvisor.in/Restaurants-g304551-New_Delhi_National_Capital_Territory_of_Delhi.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@]//a'), callback='parse_item'),
        Rule(LinkExtractor(restrict_xpaths='//a[contains(@class, "next")]')),  # pagination
    )

    def parse_item(self, response):
        yield {
            'title': response.xpath('//h1[@]/text()').get(),
            'Address': response.xpath('(//a[@])[2]/text()').get()
        }
Output:
{'title': 'Mosaic', 'Address': 'Sector 10 Lobby Level Crowne Plaza Twin District Centre, Rohini, New Delhi 110085 India'}
{'title': 'Spring', 'Address': 'Plot 4, Dwarka City Centre Radisson Blu, Sector 13, New Delhi 110075 India'}
{'title': 'Dilli 32', 'Address': 'Maharaja Surajmal Road The Leela Ambience Convention Hotel, Near Yamuna Sports Complex, Vivek Vihar, New Delhi 110002 India'}
{'title': 'Viva - All Day Dining', 'Address': 'Hospitality District Asset Area 12 Gurgoan sector 28, New Delhi 110037 India'}
...
...
...
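
For the user agent suggestion, a minimal sketch of how it could be set in the project's settings.py. USER_AGENT is a standard Scrapy setting, but the string shown here is only an illustrative browser-like value, not one taken from the answer, and whether the site accepts it is not guaranteed.

```python
# settings.py (project-level Scrapy settings)
# Assumption: the user-agent string below is an arbitrary example value;
# substitute whichever browser string you want the spider to send.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)
```

The same value can instead be set for a single spider by adding custom_settings = {'USER_AGENT': ...} as a class attribute on TripadSpider, which leaves other spiders in the project unaffected.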