I am trying to store the trail of URLs my spider visits on its way to each target page, but I am having trouble reading the starting URL and ending URL for each request. I have gone through the documentation, and this is as far as I can get using the examples there.
Here is my Spider class:

class MinistryProductsSpider(CrawlSpider):
    name = "ministryproducts"
    allowed_domains = ["www.ministryofsupply.com"]
    start_urls = ["https://www.ministryofsupply.com/"]
    base_url = "https://www.ministryofsupply.com/"

    rules = [
        Rule(
            LinkExtractor(allow="products/"),
            callback="parse_products",
            follow=True,
            process_request="main",
        )
    ]
I have a separate function for the callback, which parses data on every product page. The documentation doesn't specify whether I can use callback and process_request in the same Rule.
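If they can be combined, then from my reading of the Rule docs (Scrapy >= 2.0, where process_request receives both the extracted request and the response that produced it, and must return a request or None), I would expect an untested sketch like this to work, mutating the request's meta and letting the rule's own callback handle the page:

def main(self, request, response):
    # Append the current page to the trail carried in meta; returning
    # the request keeps it scheduled, returning None would drop it.
    request.meta["trail"] = response.meta.get("trail", []) + [response.url]
    return request

My actual attempt looks like this: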
def main(self, request, response):
    trail = [link for link in response.url]
    return Request(response.url, callback=self.parse_products, meta=dict(trail))

def parse_products(self, response, trail):
    self.logger.info("Hi this is a product page %s", response.url)
    parser = Parser()
    item = parser.parse_product(response, trail)
    yield item
I have been stuck at this point for the past 4 hours. My Parser class is running absolutely fine. I am also looking for an explanation of best practices in this case.
CodePudding user response:
I solved the problem by creating a new scrapy.Request object for each product page, iterating over the href values of the a tags on the catalogue page.
# Module-level imports needed by this snippet
import copy

from scrapy import Request

parser = Parser()

def main(self, response):
    href_list = response.css("a.CardProduct__link::attr(href)").getall()
    for link in href_list:
        product_url = self.base_url + link
        request = Request(product_url, callback=self.parse_products)
        # link_text is empty for manually built requests, so this mostly
        # records the URL of the page we are about to visit
        visited_urls = [request.meta.get("link_text", "").strip(), request.url]
        # deepcopy so sibling requests don't share and mutate the same list
        trail = copy.deepcopy(response.meta.get("trail", [])) + visited_urls
        request.meta["trail"] = trail
        yield request

def parse_products(self, response):
    self.logger.info("Hi this is a product page %s", response.url)
    item = self.parser.parse_product(response)
    yield item
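For what it's worth, callback and process_request can apparently be combined in the same Rule: since Scrapy 2.0 the process_request callable receives both the extracted request and the response that produced it, and the rule's callback still handles the fetched page. So a variation that builds the trail without re-issuing requests manually should also be possible. A rough, untested sketch (add_trail is just an illustrative name):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MinistryProductsSpider(CrawlSpider):
    name = "ministryproducts"
    allowed_domains = ["www.ministryofsupply.com"]
    start_urls = ["https://www.ministryofsupply.com/"]

    rules = [
        Rule(
            LinkExtractor(allow="products/"),
            callback="parse_products",
            follow=True,
            process_request="add_trail",
        )
    ]

    def add_trail(self, request, response):
        # Runs for every request this rule extracts; the rule's callback
        # (parse_products) is still invoked for the response it fetches.
        request.meta["trail"] = response.meta.get("trail", []) + [response.url]
        return request

    def parse_products(self, response):
        trail = response.meta.get("trail", [])
        self.logger.info("Trail for %s: %s", response.url, trail)

Because the concatenation builds a fresh list for every request, no deepcopy is needed here: sibling requests never share the same trail object.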