I have this code:
import scrapy


class AstroSpider(scrapy.Spider):
    name = "Astro"
    allowed_domains = ['www.astrolighting.com']
    start_urls = ['https://www.astrolighting.com/products']

    def parse(self, response, **kwargs):
        for link in response.css('article.product-listing-item a::attr(href)'):
            yield response.follow(link.get(), callback=self.parse_items)

    def parse_items(self, response):
        for link in response.css('div.variants.variants--large a::attr(href)'):
            yield response.follow(link.get(), callback=self.parse_item)

    def parse_item(self, response):
        print(f"!!!!!!!!!!!!!!!!!!!!!!!!!!!")
        yield {
            'name': response.css('div.detail__right h1::text').get(),
            'material': response.css('div.detail__right p span::text').getall()[0],
            'id': response.css('div.detail__right p span::text').getall()[1].strip()
        }
The result of the crawl is just empty. Why? It seems like the parse_item function is never evaluated.
CodePudding user response:
I didn't test your code, but as @alexpdev pointed out, you need to set dont_filter=True:
class AstroSpider(scrapy.Spider):
    name = "Astro"
    allowed_domains = ['www.astrolighting.com']
    start_urls = ['https://www.astrolighting.com/products']

    def parse(self, response, **kwargs):
        for link in response.css('article.product-listing-item a::attr(href)'):
            yield response.follow(link.get(), callback=self.parse_items, dont_filter=True)

    def parse_items(self, response):
        for link in response.css('div.variants.variants--large a::attr(href)'):
            yield response.follow(link.get(), callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        print(f"!!!!!!!!!!!!!!!!!!!!!!!!!!!")
        yield {
            'name': response.css('div.detail__right h1::text').get(),
            'material': response.css('div.detail__right p span::text').getall()[0],
            'id': response.css('div.detail__right p span::text').getall()[1].strip()
        }
but I also suggest checking which links parse_items actually extracts with
print(response.css('div.variants.variants--large a::attr(href)').getall())
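If it's easier, you can do the same check interactively in the Scrapy shell; this is just a quick sketch using the start URL and selectors from the question:

scrapy shell 'https://www.astrolighting.com/products'
>>> links = response.css('article.product-listing-item a::attr(href)').getall()
>>> links[:5]
>>> fetch(response.urljoin(links[0]))  # open the first product page in the shell
>>> response.css('div.variants.variants--large a::attr(href)').getall()

Comparing the hrefs printed in the last step with the URL you just fetched should show whether the variant links point anywhere new.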
CodePudding user response:
It is because the parse_item method is never evaluated. After walking through each step of your code and the URLs, I discovered that the link URL extracted in the parse method is an identical match for the link URL extracted in the parse_items method. Scrapy by default filters URLs it has already visited, so when it encounters the same URL in the request with the parse_item callback, it ignores it as a duplicate.
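A minimal sketch of one way around this, assuming the variant link really does resolve to the page you are already on: skip the second request in parse_items and extract the item from the response you already have. The selectors are copied from the question and this is untested against the live site:

import scrapy


class AstroSpider(scrapy.Spider):
    name = "Astro"
    allowed_domains = ['www.astrolighting.com']
    start_urls = ['https://www.astrolighting.com/products']

    def parse(self, response, **kwargs):
        for link in response.css('article.product-listing-item a::attr(href)'):
            yield response.follow(link.get(), callback=self.parse_items)

    def parse_items(self, response):
        # The variant link resolves to the URL of the page we are already on,
        # so a second request would be dropped by Scrapy's dupefilter.
        # Reuse the current response instead of following the link again.
        yield from self.parse_item(response)

    def parse_item(self, response):
        spans = response.css('div.detail__right p span::text').getall()
        yield {
            'name': response.css('div.detail__right h1::text').get(),
            'material': spans[0] if spans else None,
            'id': spans[1].strip() if len(spans) > 1 else None,
        }

Alternatively, keep the second request and pass dont_filter=True as in the other answer. If you want to confirm which requests the dupefilter is dropping, setting DUPEFILTER_DEBUG = True in your project settings should make Scrapy log every filtered request.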