Failed to retrieve product listing pages from a few categories

From this webpage I am trying to collect the links to the pages where the products are listed. There are 6 categories, each with a More info button; when I follow those buttons recursively, I usually reach the target pages. This is one such product listing page I wish to get.

Please note that some of these pages contain both product listings and More info buttons, which is why my spider fails to capture the product listing pages accurately.

My current spider looks like the following (it misses many of the product listing pages):

import scrapy

class NorgrenSpider(scrapy.Spider):
    name = 'norgren'
    start_urls = ['https://www.norgren.com/de/en/list']

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, callback=self.parse)

    def parse(self, response):
        link_list = []
        for item in response.css(".match-height a.more-info::attr(href)").getall():
            if "/detail/" not in item:
                inner_page_link = response.urljoin(item)
                link_list.append(inner_page_link)
                yield {"target_url": inner_page_link}

        for new_link in link_list:
            yield scrapy.Request(new_link, callback=self.parse)

Expected output (randomly taken):

https://www.norgren.com/de/en/list/directional-control-valves/in-line-and-manifold-valves
https://www.norgren.com/de/en/list/pressure-switches/electro-mechanical-pressure-switches
https://www.norgren.com/de/en/list/pressure-switches/electronic-pressure-switches
https://www.norgren.com/de/en/list/directional-control-valves/sub-base-valves
https://www.norgren.com/de/en/list/directional-control-valves/non-return-valves
https://www.norgren.com/de/en/list/directional-control-valves/valve-islands
https://www.norgren.com/de/en/list/air-preparation/combination-units-frl

How can I get all the product listing pages from the six categories?

CodePudding user response：

One option is to keep only the pages that contain at least one link to a detail page. Here is an example of how to check whether a page meets that criterion:

import scrapy


class NorgrenSpider(scrapy.Spider):
    name = 'norgren'
    start_urls = ['https://www.norgren.com/de/en/list']

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, callback=self.parse)

    def parse(self, response):
        link_list = []

        more_info_items = response.css(
            ".match-height a.more-info::attr(href)").getall()

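        # A page that links to at least one "/detail/" URL is a product listing page.
        detail_items = [item for item in more_info_items if '/detail/' in item]
        if detail_items:
            print(f'This is a link you are searching for: {response.url}')

        # Keep following the remaining "More info" links (sub-categories).
        for item in more_info_items:
            if "/detail/" not in item: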
                inner_page_link = response.urljoin(item)
                link_list.append(inner_page_link)
                yield {"target_url": inner_page_link}

        for new_link in link_list:
            yield scrapy.Request(new_link, callback=self.parse)

I only print the matching link to the console here; you can yield it as an item or log it wherever you need instead.
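The check described above can be pulled out into small helpers that are easy to test without running the spider. This is only a sketch: the helper names `is_listing_page` and `split_links` are made up for illustration, the sample hrefs are hypothetical, and it assumes (as the answer does) that product detail links always contain "/detail/".

```python
from urllib.parse import urljoin


def is_listing_page(hrefs):
    """Return True if at least one href points to a product detail page."""
    return any("/detail/" in href for href in hrefs)


def split_links(page_url, hrefs):
    """Split hrefs into (detail_links, category_links) as absolute URLs."""
    detail_links, category_links = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if "/detail/" in href:
            detail_links.append(absolute)
        else:
            category_links.append(absolute)
    return detail_links, category_links


# Hypothetical hrefs, as they might come back from
# response.css(".match-height a.more-info::attr(href)").getall()
page_url = "https://www.norgren.com/de/en/list/pressure-switches"
hrefs = [
    "/de/en/detail/product/example-item",
    "/de/en/list/pressure-switches/electronic-pressure-switches",
]
detail_links, category_links = split_links(page_url, hrefs)
```

Inside `parse`, the same idea would be to yield `{"target_url": response.url}` only when `is_listing_page(more_info_items)` is true, and to request every URL in `category_links` with `callback=self.parse`.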

CodePudding user response：

import scrapy


class NorgrenSpider(scrapy.Spider):
    name = 'norgren'
    start_urls = ['https://www.norgren.com/de/en/list']

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url)

    def parse(self, response):
        # check if there are items in the page
        if response.xpath('//div[contains(@class, "item-list")]//div[@]/div[@]/a/@href'):
            yield scrapy.Request(url=response.url, callback=self.get_links, dont_filter=True)

        # follow "more info" buttons
        for url in response.xpath('//a[text()="More info"]/@href').getall():
            yield response.follow(url)

    def get_links(self, response):
        yield {"target_url": response.url}

        next_page = response.xpath('//a[@]/@href').get()
        if next_page:
            yield response.follow(url=next_page, callback=self.get_links)

What to do next: check whether a product can appear on more than one page. If it can, you'll get duplicates, so create an item and write an item duplicates filter.
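A duplicates filter of that kind could look roughly like the item pipeline below. Treat it as a sketch under assumptions: the class name `DuplicateUrlPipeline` is made up, it deduplicates on the `target_url` field yielded by the spider above, and the fallback `DropItem` class exists only so the snippet can be run where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in used only when Scrapy is not installed."""


class DuplicateUrlPipeline:
    """Drop any item whose target_url has already been seen in this crawl."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item["target_url"]
        if url in self.seen_urls:
            # Raising DropItem tells Scrapy to discard the item.
            raise DropItem(f"Duplicate listing page: {url}")
        self.seen_urls.add(url)
        return item
```

To enable it, add the class to the `ITEM_PIPELINES` setting, for example `ITEM_PIPELINES = {"myproject.pipelines.DuplicateUrlPipeline": 300}`, where `myproject.pipelines` is a placeholder for wherever the class lives in your project.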