Fetching the trail of URLs for every request using Scrapy

I am trying to store the trail of URLs my Spider visits on its way to each target page, but I am having trouble reading the starting and ending URL for each request. I have gone through the documentation, and this is as far as I can get using its examples.

Here is my Spider class:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MinistryProductsSpider(CrawlSpider):
    name = "ministryproducts"
    allowed_domains = ["www.ministryofsupply.com"]
    start_urls = ["https://www.ministryofsupply.com/"]
    base_url = "https://www.ministryofsupply.com/"
    rules = [
        Rule(
            LinkExtractor(allow="products/"),
            callback="parse_products",
            follow=True,
            process_request="main",  # every extracted request passes through main()
        )
    ]

I have a separate callback function that parses the data on every product page. The documentation doesn't specify whether I can use callback and process_request in the same Rule.

def main(self, request, response):
    trail = [link for link in response.url]
    return Request(response.url, callback=self.parse_products, meta=dict(trail))

def parse_products(self, response, trail):
    self.logger.info("Hi this is a product page %s", response.url)
    parser = Parser()
    item = parser.parse_product(response, trail)

    yield item

I have been stuck at this point for the past 4 hours. My Parser class is running absolutely fine. I am also looking for an explanation of best practices in this case.

CodePudding user response:

I solved the problem by creating a new scrapy.Request object for each product, iterating over the href values of the a tags on the catalogue page:

import copy

from scrapy import Request


class MinistryProductsSpider(CrawlSpider):
    parser = Parser()

    def main(self, response):
        href_list = response.css("a.CardProduct__link::attr(href)").getall()
        for link in href_list:
            product_url = self.base_url + link
            request = Request(product_url, callback=self.parse_products)
            # link_text only exists on LinkExtractor-built requests, so it is empty here
            visited_urls = [request.meta.get("link_text", "").strip(), request.url]
            # read the trail accumulated so far (stored under the same "trail" key below)
            trail = copy.deepcopy(response.meta.get("trail", [])) + visited_urls
            request.meta["trail"] = trail
            yield request

    def parse_products(self, response):
        self.logger.info("Hi this is a product page %s", response.url)
        item = self.parser.parse_product(response)

        yield item
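
Since meta set on a Request is copied onto the Response it produces, the trail is then available inside the callback as response.meta.get("trail"), ready to be logged or handed to the parser.

For what it's worth, the original Rule-based route should also work: callback and process_request can be combined in the same Rule, and since Scrapy 2.0 the process_request hook receives both the request and the response, so the trail can be attached without bypassing the LinkExtractor. A minimal sketch, assuming Scrapy >= 2.0 (untested):

    def main(self, request, response):
        # called once for every request the Rule extracts;
        # response is the page the link was found on
        request.meta["trail"] = response.meta.get("trail", []) + [response.url]
        return request  # returning the request keeps the Rule's callback intact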