Scraping infinite scroll sites where secondary requests are dependent on an initial request using Scrapy


I'm scraping the website schwaebischealb.de: https://www.schwaebischealb.de/salb/ukv?searchtext=&date_from=28.07.2022&date_to=05.08.2022&numberOfRooms=2&number_adult[]=1&number_child[]=0&age_child1[]=&age_child2[]=&age_child3[]=&age_child4[]=&number_adult[]=1&number_child[]=0&age_child1[]=&age_child2[]=&age_child3[]=&age_child4[]=&number_adult[]=&number_child[]=0&age_child1[]=&age_child2[]=&age_child3[]=&age_child4[]=&doSearch=4&active_tab=

The page has an infinite scroll feature: when the user scrolls to the bottom (sometimes a click on "show more" is necessary), a GET request is sent to https://www.schwaebischealb.de/salb/ukv/result/?page=n, with the parameter page=n for n = 2, 3, ...

I want to scrape all the pages and parse the products. The code is below. The problem is that the subpages do not contain products when parsed by Scrapy; the initial page, however, works fine. When opening the subpages in an incognito tab, the same problem appears. I also tried to access them with Postman right after accessing the initial page, and that works fine: they contain products. The intended functionality is obviously that Scrapy should be able to send requests to the subpages and that the respective responses contain products, just like the normal workflow of the webpage.
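For illustration, the same dependence can be reproduced outside the browser with the requests library and a shared Session. The search URL below is a trimmed-down version of the full one, and which cookie the server actually relies on is an assumption, not something I verified in detail:

    import requests

    session = requests.Session()

    # Trimmed-down search URL; its only job here is to set up the server-side
    # search session whose results the paginated endpoint returns.
    search_url = ("https://www.schwaebischealb.de/salb/ukv?searchtext="
                  "&date_from=28.07.2022&date_to=05.08.2022&numberOfRooms=2"
                  "&number_adult[]=1&number_child[]=0&doSearch=1&active_tab=")

    # The Session stores the cookies set by the first response ...
    session.get(search_url)

    # ... and sends them with every following request, so page 2 contains
    # products. Without the first call, the same request comes back empty.
    page2 = session.get("https://www.schwaebischealb.de/salb/ukv/result/?page=2")
    print(len(page2.text))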

import scrapy

class AlbSpider(scrapy.Spider):
    name = 'alb'
    fromDate = "28.07.2022"  # dd.mm.yyyy
    toDate = "05.08.2022"
    numNights = 8
    numPersons = "2"
    numRooms = numPersons
    room1NumAdults = "1"  # number of adults in room 1
    room2NumAdults = "1"  # number of adults in room 2
    maxPrice = 800  # max price of the accommodation
    siteCounter = 1
    siteMaxCount = 25  # max count is 25
    start_urls = [(f'https://www.schwaebischealb.de/salb/ukv?searchtext=&date_from={fromDate}'
                   f'&date_to={toDate}&numberOfRooms={numRooms}&number_adult[]={room1NumAdults}&number_child[]=0'
                   f'&age_child1[]=&age_child2[]=&age_child3[]=&age_child4[]=&number_adult[]={room2NumAdults}'
                   f'&number_child[]=0&age_child1[]=&age_child2[]=&age_child3[]=&age_child4[]='
                   f'&number_adult[]=&number_child[]=0&age_child1[]=&age_child2[]=&age_child3[]='
                   f'&age_child4[]=&doSearch={siteCounter}&active_tab=')]

    def parse(self, response):
        # clear json file
        with open("alb.json", "w") as f:
            f.write("")
        self.parseSite(response.url)
        newSiteUrl = "https://www.schwaebischealb.de/salb/ukv/result/?page=##site##"
        url = newSiteUrl.replace("##site##", str(self.siteCounter))
        while self.pageValid(url):
            self.parseSite(url)
            self.siteCounter += 1
            url = newSiteUrl.replace("##site##", str(self.siteCounter))

    def pageValid(self, url):
        # ensures that the page is valid, which is the case for all pages up to siteMaxCount (25)
        if int(url.split("=")[-1]) <= self.siteMaxCount:
            return True
        return False

I did some searching on the web, but I only found basic "infinite scrolling" tutorials, none where the secondary requests depend on an initial request.

Is there functionality in Scrapy that can handle this kind of issue? Or maybe another library like Selenium?

CodePudding user response:

I just happened to fix it myself. This kind of functionality is included in Scrapy. The problem with my code was that I did not use the yield keyword but imported the requests library instead. Hence, the requests sent were independent and had no connection (no shared cookies/session) to the initial one, making them useless.
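In other words: Scrapy's cookies middleware keeps a cookiejar for the spider, so requests yielded from parse reuse the session that the start URL established, which is exactly what the subpages seem to need. If you want to see that happening, a small optional addition to the spider (a standard Scrapy setting, not part of my original code) logs the cookies being sent and received:

    # Optional: log the cookies Scrapy sends and receives, to confirm that the
    # session established by the start URL is reused for the page=n requests.
    custom_settings = {
        "COOKIES_DEBUG": True,
    }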

I'll leave this post online in case somebody else runs into the same problem.

The code that fixed the issue:

    def parse(self, response):
        # clear json file
        with open("alb.json", "w") as f:
            f.write("")
        yield scrapy.Request(response.url, callback=self.parseSite)
        newSiteUrl = "https://www.schwaebischealb.de/salb/ukv/result/?page=##site##"
        url = newSiteUrl.replace("##site##", str(self.siteCounter))
        while self.pageValid(url):
            yield scrapy.Request(url, callback=self.parseSite)
            self.siteCounter += 1
            url = newSiteUrl.replace("##site##", str(self.siteCounter))
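parseSite itself is not shown here. For completeness, a minimal sketch of what such a callback could look like, appending one JSON object per product to the alb.json file that parse clears; the CSS selectors and field names are made up and have to be adapted to the actual markup, and it assumes import json at the top of the file:

    def parseSite(self, response):
        # Hypothetical selectors -- the real element classes on
        # schwaebischealb.de will differ, so adapt them to the page's markup.
        for product in response.css("div.result-item"):
            item = {
                "name": product.css("h3::text").get(),
                "price": product.css(".price::text").get(),
                "url": response.urljoin(product.css("a::attr(href)").get() or ""),
            }
            # Append one JSON object per line to the file cleared in parse().
            with open("alb.json", "a") as f:
                f.write(json.dumps(item) + "\n")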