Not getting any data scraped when running the following code using Scrapy on Python

Time:03-24

This is the spider I am using to scrape the names and email addresses of restaurants from Tripadvisor:

import scrapy

class RestaurantSpider(scrapy.Spider):
    name = 'tripadvisorbot'

    start_urls = [
        'https://www.tripadvisor.com/Restaurants-g188633-The_Hague_South_Holland_Province.html#EATERY_OVERVIEW_BOX'
    ]
  
    def parse(self, response):
        for listing in response.xpath('//div[contains(@class,"__cellContainer--")]'):
            link = listing.xpath('.//a[contains(@class,"__restaurantName--")]/@href').get()
            text = listing.xpath('.//a[contains(@class,"__restaurantName--")]/text()').get()
            complete_url = response.urljoin(link)
            yield scrapy.Request(
                url=complete_url,
                callback=self.parse_listing,
                meta={'link': complete_url,'text': text}
            )

        next_url = response.xpath('//*[contains(@class,"pagination")]/*[contains(@class,"next")]/@href').get()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)

    def parse_listing(self, response):
        link = response.meta['link']
        text = response.meta['text']
        email = response.xpath('//a[contains(@href, "mailto:")]/@href').get()
        yield {'Link': link,'Text': text,'Email': email}

I run the following command in the Anaconda prompt to run the spider above and save the output as a JSON file:

scrapy crawl tripadvisorbot -O tripadvisor.json

No data gets scraped: a JSON file is created, but it is empty.

I am not sure what the problem is. I am quite new to web scraping and to Python in general. Any help would be much appreciated.

Thanks

CodePudding user response:

On my computer the HTML contains no classes matching __cellContainer-- or __restaurantName--.
The page uses random characters as class names, so the loop in parse() finds no listings.

But every item is a div directly inside <div data-test-target="restaurants-list">, so I use that to get all the items.

Then I take the first <a> in each item (it holds an image instead of the name), so I skip text and complete_url and pass the link directly to response.follow(link).

When I get the details page, I use response.url as complete_url and the h1 as text.
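
To see the idea of selecting by a stable attribute instead of generated class names, here is a minimal sketch that uses only the standard library (xml.etree.ElementTree, not Scrapy's selectors). The markup, class names and hrefs in SAMPLE are made up for illustration; only the data-test-target attribute comes from the page described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the class names mimic generated ones
# and the hrefs are invented; only data-test-target matches the real page.
SAMPLE = """
<html><body>
  <div data-test-target="restaurants-list">
    <div class="YtrWs"><a href="/Restaurant_Review-g188633-d1-Reviews-Example_One.html"><img/></a></div>
    <div class="bHGqj"><a href="/Restaurant_Review-g188633-d2-Reviews-Example_Two.html"><img/></a></div>
    <div class="zzAd">block without a link</div>
  </div>
</body></html>
"""


def listing_links(root):
    """Collect the first link of every direct child div of the list container."""
    links = []
    for item in root.findall(".//div[@data-test-target='restaurants-list']/div"):
        anchor = item.find(".//a")  # first <a> inside the item, if any
        if anchor is not None and anchor.get("href"):
            links.append(anchor.get("href"))
    return links


root = ET.fromstring(SAMPLE)
links = listing_links(root)
print(links)
```

Items without a link are skipped, which is the same job the `if url:` check does in the spider.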


You can put all the code in one file and run it with python script.py, without creating a project.

import scrapy

class RestaurantSpider(scrapy.Spider):
    name = 'tripadvisorbot'

    start_urls = [
        'https://www.tripadvisor.com/Restaurants-g188633-The_Hague_South_Holland_Province.html#EATERY_OVERVIEW_BOX'
    ]
  
    def parse(self, response):
        for listing in response.xpath('//div[@data-test-target="restaurants-list"]/div'):
            url = listing.xpath('.//a/@href').get()
            print('link:', url)
            if url:
                yield response.follow(url, callback=self.parse_listing)

        next_url = response.xpath('//*[contains(@class,"pagination")]/*[contains(@class,"next")]/@href').get()
        if next_url:
            yield response.follow(next_url)

    def parse_listing(self, response):
        print('url:', response.url)
        
        link = response.url
        text = response.xpath('//h1[@data-test-target]/text()').get()
        email = response.xpath('//a[contains(@href, "mailto:")]/@href').get()
        
        yield {'Link': link, 'Text': text, 'Email': email}

# --- run without project and save data in `output.json` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEEDS': {'output.json': {'format': 'json'}},  # new in 2.1
})
c.crawl(RestaurantSpider)
c.start()

Part of result:

{"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d4766834-Reviews-Bab_mansour-The_Hague_South_Holland_Province.html", "Text": "Bab mansour", "Email": null},
{"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d3935897-Reviews-Milos-The_Hague_South_Holland_Province.html", "Text": "Milos", "Email": null},
{"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d10902380-Reviews-Nefeli_deli-The_Hague_South_Holland_Province.html", "Text": "Nefeli deli", "Email": "mailto:[email protected]?subject=?"},
{"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d8500914-Reviews-Waterkant-The_Hague_South_Holland_Province.html", "Text": "Waterkant", "Email": "mailto:[email protected]?subject=?"},
{"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d4481254-Reviews-Salero_Minang-The_Hague_South_Holland_Province.html", "Text": "Salero Minang", "Email": null},
{"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d6451334-Reviews-Du_Passage-The_Hague_South_Holland_Province.html", "Text": "Du Passage", "Email": "mailto:[email protected]?subject=?"},
{"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d4451714-Reviews-Lee_s_Garden-The_Hague_South_Holland_Province.html", "Text": "Lee's Garden", "Email": null},
{"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d2181693-Reviews-Warunee-The_Hague_South_Holland_Province.html", "Text": "Warunee", "Email": "mailto:[email protected]?subject=?"},
{"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d8064876-Reviews-Sallo_s-The_Hague_South_Holland_Province.html", "Text": "Sallo's", "Email": "mailto:[email protected]?subject=?"},
{"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d16841532-Reviews-Saravanaa_Bhavan_Den_Haag-The_Hague_South_Holland_Province.html", "Text": "Saravanaa Bhavan Den Haag", "Email": "mailto:[email protected]?subject=?"},
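
The Email values above still carry the mailto: prefix and a ?subject=? suffix. If you only want the bare address, one possible cleanup is sketched below with the standard library. The helper name email_from_mailto and the address info@example.com are hypothetical, not part of the spider or the scraped data.

```python
from urllib.parse import unquote, urlsplit


def email_from_mailto(href):
    """Return the bare address from a mailto: href, or None if there is none."""
    if not href:
        return None
    parts = urlsplit(href)
    if parts.scheme != "mailto":
        return None
    # The address ends up in parts.path; "?subject=..." goes to parts.query.
    address = unquote(parts.path).strip()
    return address or None


# Hypothetical example value in the same shape as the scraped hrefs.
print(email_from_mailto("mailto:info@example.com?subject=?"))
print(email_from_mailto(None))
```

You could call such a helper in parse_listing before yielding the item, so that the JSON file stores plain addresses.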