Scrapy - Filtering offsite request but in allowed domains?

I have the following code and would like to step through the site's result pages one by one:

import scrapy

class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['https://www.tripadvisor.co.uk']
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html']

    def parse(self, response):
        tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            yield {
                "link": response.urljoin(elem.xpath(".//a/@href").get())
            }

        nextPage = response.xpath("//a[@aria-label='Next page']/@href").get()
        if nextPage is not None:
            nextPage = response.urljoin(nextPage)
            yield scrapy.Request(nextPage, callback=self.parse)

But when I run this code, only the very first page is scraped, and I get this error message:

2021-11-17 12:52:20 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.tripadvisor.co.uk': <GET https://www.tripadvisor.co.uk/ClientLink?value=NVB5Xy9BdHRyYWN0aW9ucy1nMTg2MjE2LUFjdGl2aXRpZXMtYzQ4LWFfYWxsQXR0cmFjdGlvbnMudHJ1ZS1vYTMwLVVuaXRlZF9LaW5nZG9tLmh0bWxfQ3Yx>

Only when I delete this line do I get all the results:

allowed_domains = ['https://www.tripadvisor.co.uk']

Why is that? The link to the next page belongs to the allowed domain.

CodePudding user response:

allowed_domains isn't mandatory in a Scrapy spider, so the simplest fix is to remove it entirely. If you keep it, the Scrapy documentation says each entry must be a bare domain name without the URL scheme: www.tripadvisor.co.uk is valid, https://www.tripadvisor.co.uk is not. Because your entry includes https://, the offsite spider middleware can never match the request host www.tripadvisor.co.uk against it, so every request yielded from parse is dropped as offsite (recent Scrapy versions should also log a warning that allowed_domains accepts only domains, not URLs). Requests from start_urls are not checked by this middleware, which is why the first page is still scraped.
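You can see the mismatch directly with Scrapy's url_is_from_any_domain helper, which applies the same host-matching rule as the offsite filter (a quick illustration, not part of the original answer):

from scrapy.utils.url import url_is_from_any_domain

url = "https://www.tripadvisor.co.uk/ClientLink?value=NVB5Xy9BdHRyYWN0aW9ucy1nMTg2MjE2LUFjdGl2aXRpZXMtYzQ4LWFfYWxsQXR0cmFjdGlvbnMudHJ1ZS1vYTMwLVVuaXRlZF9LaW5nZG9tLmh0bWxfQ3Yx"

# A URL-shaped entry never matches the request's host name, so the request counts as offsite.
print(url_is_from_any_domain(url, ["https://www.tripadvisor.co.uk"]))  # False

# A bare domain matches, so the request passes the filter.
print(url_is_from_any_domain(url, ["www.tripadvisor.co.uk"]))  # True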

The correct way is as follows:

allowed_domains = ['www.tripadvisor.co.uk']
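If you prefer not to hard-code the host, one option (a minimal sketch, not from the original answer) is to derive it from start_urls with the standard library, so a scheme can never slip in:

from urllib.parse import urlparse

import scrapy

class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html']
    # urlparse(...).netloc is just the host, e.g. 'www.tripadvisor.co.uk', never the scheme.
    allowed_domains = [urlparse(url).netloc for url in start_urls]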