scrapy is not crawling through the links

Time:12-15

I am crawling with Scrapy's LinkExtractor. As far as I can tell my XPath expressions are correct, but the spider seems to run indefinitely and prints what looks like page source instead of the restaurant's name and address. I suspect the error is in my restrict_xpaths expression, but I can't figure out what it is.

Code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TripadSpider(CrawlSpider):
    name = 'tripad'
    allowed_domains = ['www.tripadvisor.in']
    start_urls = ['https://www.tripadvisor.in/Restaurants-g304551-New_Delhi_National_Capital_Territory_of_Delhi.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@]//a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'title': response.xpath('//h1[@]/text()').get(),
            'Address': response.xpath('(//a[@])[2]').get()
        }

CodePudding user response:

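It is crawling; try changing your user agent (the USER_AGENT setting). You also forgot to add /text() to the Address XPath, so .get() returns the element's markup rather than its text, which is the "source code" you are seeing. The code below also drops follow=True from the restaurant rule and adds a second rule that follows the pagination links.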

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TripadSpider(CrawlSpider):
    name = 'tripad'
    allowed_domains = ['tripadvisor.in']
    start_urls = ['https://www.tripadvisor.in/Restaurants-g304551-New_Delhi_National_Capital_Territory_of_Delhi.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@]//a'), callback='parse_item'),
        Rule(LinkExtractor(restrict_xpaths='//a[contains(@class, "next")]')),   # pagination
    )

    def parse_item(self, response):
        yield {
            'title': response.xpath('//h1[@]/text()').get(),
            'Address': response.xpath('(//a[@])[2]/text()').get()
        }

Output:

{'title': 'Mosaic', 'Address': 'Sector 10 Lobby Level Crowne Plaza Twin District Centre, Rohini, New Delhi 110085 India'}
{'title': 'Spring', 'Address': 'Plot 4, Dwarka City Centre Radisson Blu, Sector 13, New Delhi 110075 India'}
{'title': 'Dilli 32', 'Address': 'Maharaja Surajmal Road The Leela Ambience Convention Hotel, Near Yamuna Sports Complex, Vivek Vihar, New Delhi 110002 India'}
{'title': 'Viva - All Day Dining', 'Address': 'Hospitality District Asset Area 12 Gurgoan sector 28, New Delhi 110037 India'}
...
...
...
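
For the user agent suggestion, one way to do it is through Scrapy's USER_AGENT setting in the project's settings.py. This is only a sketch: the browser string is an arbitrary placeholder, not a value taken from the answer, and whether the site responds differently to it is an assumption.

```python
# settings.py
# Placeholder browser-like user agent; substitute whatever string suits your case.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
```

The same key can instead be set for a single spider by adding a custom_settings = {'USER_AGENT': ...} class attribute to TripadSpider.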