I have the following code and would like to go step by step through the next pages of the site:
import scrapy


class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['https://www.tripadvisor.co.uk']
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html']

    def parse(self, response):
        tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            yield {
                "link": response.urljoin(elem.xpath(".//a/@href").get())
            }

        nextPage = response.xpath("//a[@aria-label='Next page']/@href").get()
        if nextPage != None:
            nextPage = response.urljoin(nextPage)
            yield scrapy.Request(nextPage, callback=self.parse)
But when I run this code, only the very first page is scraped and I get this error message:
2021-11-17 12:52:20 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.tripadvisor.co.uk': <GET https://www.tripadvisor.co.uk/ClientLink?value=NVB5Xy9BdHRyYWN0aW9ucy1nMTg2MjE2LUFjdGl2aXRpZXMtYzQ4LWFfYWxsQXR0cmFjdGlvbnMudHJ1ZS1vYTMwLVVuaXRlZF9LaW5nZG9tLmh0bWxfQ3Yx>
Only when I delete this line do I get all the results:

allowed_domains = ['https://www.tripadvisor.co.uk']

Why is that? The link to the following page is on the allowed domain.
CodePudding user response:
allowed_domains is not mandatory in a spider, and it is often simpler to leave it out to avoid errors like this one. If you do keep it, the entries must be bare domain names, not URLs: according to the Scrapy docs, allowed_domains should contain values like www.tripadvisor.co.uk, without the https:// scheme. Because your entry includes the https:// portion, it never matches the host of the follow-up requests, so the offsite middleware filters them out.
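You can reproduce the comparison with Scrapy's url_is_from_any_domain helper, which performs the same kind of host-based check that the offsite filter relies on. A small sketch, using the start URL from your spider as the example request:

from scrapy.utils.url import url_is_from_any_domain

url = "https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html"

# The entry with the scheme never matches the request's host ...
print(url_is_from_any_domain(url, ["https://www.tripadvisor.co.uk"]))  # False -> filtered as offsite

# ... while the bare domain name does.
print(url_is_from_any_domain(url, ["www.tripadvisor.co.uk"]))  # True -> request allowed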
The correct way is as follows:
allowed_domains = ['www.tripadvisor.co.uk']
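Putting it together, here is your spider with only that line changed (a sketch, untested against the live site; the selectors are kept exactly as in your question):

import scrapy


class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    # Bare domain name only, no scheme, so the offsite filter accepts follow-up requests
    allowed_domains = ['www.tripadvisor.co.uk']
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html']

    def parse(self, response):
        # One card section per attraction on the listing page
        for elem in response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']"):
            yield {
                "link": response.urljoin(elem.xpath(".//a/@href").get()),
            }

        # Follow the "Next page" link, if present, with the same callback
        nextPage = response.xpath("//a[@aria-label='Next page']/@href").get()
        if nextPage is not None:
            yield scrapy.Request(response.urljoin(nextPage), callback=self.parse)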