Failed to retrieve product listings pages from few categories


From this webpage I am trying to collect the links to the pages where the different products are listed. There are six categories with "More info" buttons which, when I follow them recursively, usually lead me to the target pages. This is one such product listings page I wish to get.

Please note that some of these pages have both product listings and "More info" buttons, which is why I am failing to capture the product listings pages accurately.

My current spider looks like the following (it fails to grab lots of the product listings pages):

import scrapy

class NorgrenSpider(scrapy.Spider):
    name = 'norgren'
    start_urls = ['https://www.norgren.com/de/en/list']

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, callback=self.parse)

    def parse(self, response):
        link_list = []
        for item in response.css(".match-height a.more-info::attr(href)").getall():
            # follow only category links, skipping individual product detail pages
            if "/detail/" not in item:
                inner_page_link = response.urljoin(item)
                link_list.append(inner_page_link)
                yield {"target_url": inner_page_link}

        for new_link in link_list:
            yield scrapy.Request(new_link, callback=self.parse)

Expected output (a random sample):

https://www.norgren.com/de/en/list/directional-control-valves/in-line-and-manifold-valves
https://www.norgren.com/de/en/list/pressure-switches/electro-mechanical-pressure-switches
https://www.norgren.com/de/en/list/pressure-switches/electronic-pressure-switches
https://www.norgren.com/de/en/list/directional-control-valves/sub-base-valves
https://www.norgren.com/de/en/list/directional-control-valves/non-return-valves
https://www.norgren.com/de/en/list/directional-control-valves/valve-islands
https://www.norgren.com/de/en/list/air-preparation/combination-units-frl

How can I get all the product listings pages from the six categories?

CodePudding user response:

Maybe filter for only those pages that have at least one link to a detail page? Here is an example of how to identify whether a page meets the criterion you are searching for:

import scrapy


class NorgrenSpider(scrapy.Spider):
    name = 'norgren'
    start_urls = ['https://www.norgren.com/de/en/list']

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, callback=self.parse)

    def parse(self, response):
        link_list = []

        more_info_items = response.css(
            ".match-height a.more-info::attr(href)").getall()

        # a page that links to at least one "/detail/" URL is a product listings page
        detail_items = [item for item in more_info_items if '/detail/' in item]
        if detail_items:
            print(f'This is a link you are searching for: {response.url}')

        for item in more_info_items:
            # keep crawling through category pages only
            if "/detail/" not in item:
                inner_page_link = response.urljoin(item)
                link_list.append(inner_page_link)
                yield {"target_url": inner_page_link}

        for new_link in link_list:
            yield scrapy.Request(new_link, callback=self.parse)

I only printed the link to the console, but you can figure out how to log it wherever you need.
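
For example, here is a minimal sketch of the same detection that uses Scrapy's built-in per-spider logger instead of print (the message and log level are just illustrative choices):

import scrapy


class NorgrenSpider(scrapy.Spider):
    name = 'norgren'
    start_urls = ['https://www.norgren.com/de/en/list']

    def parse(self, response):
        more_info_items = response.css(
            ".match-height a.more-info::attr(href)").getall()

        # self.logger writes to Scrapy's normal log output - the console,
        # or a file when the LOG_FILE setting is configured
        if any('/detail/' in item for item in more_info_items):
            self.logger.info('This is a link you are searching for: %s', response.url)

        for item in more_info_items:
            if "/detail/" not in item:
                yield response.follow(item, callback=self.parse)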

CodePudding user response:

import scrapy


class NorgrenSpider(scrapy.Spider):
    name = 'norgren'
    start_urls = ['https://www.norgren.com/de/en/list']

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url)

    def parse(self, response):
        # check if there are product items on the page
        # (NOTE: a loosened selector; narrow it to match the actual item markup)
        if response.xpath('//div[contains(@class, "item-list")]//a/@href'):
            # re-request the current URL so get_links records it; dont_filter=True
            # keeps Scrapy's duplicate filter from dropping this second request
            yield scrapy.Request(url=response.url, callback=self.get_links, dont_filter=True)

        # follow "More info" buttons
        for url in response.xpath('//a[text()="More info"]/@href').getall():
            yield response.follow(url)

    def get_links(self, response):
        yield {"target_url": response.url}

        # follow pagination (NOTE: the class name is a guess; adjust the
        # selector to the site's actual "next page" link)
        next_page = response.xpath('//a[contains(@class, "next")]/@href').get()
        if next_page:
            yield response.follow(url=next_page, callback=self.get_links)

What to do next: check whether the same product can appear on more than one page. If it can, you'll get duplicates, so create an item and write an item duplicates filter.
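
A minimal sketch of such a filter as a Scrapy item pipeline, assuming the items keep the "target_url" key yielded above (the class and module names are hypothetical):

from scrapy.exceptions import DropItem


class DuplicateUrlPipeline:
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # drop any item whose target URL has already been collected
        url = item.get("target_url")
        if url in self.seen_urls:
            raise DropItem(f"Duplicate item found: {url}")
        self.seen_urls.add(url)
        return item

Enable it in settings.py with something like ITEM_PIPELINES = {"myproject.pipelines.DuplicateUrlPipeline": 300}.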
