stuck scraping the same 2nd page with infinite scroll

Time:02-06

I'm trying to scrape game reviews from Steam. When I run the spider below, I get the first page with 10 reviews, then the second page with 10 reviews three times.

import logging

import scrapy
from bs4 import BeautifulSoup


class MySpider(scrapy.Spider):
    name = "MySpider"
    download_delay = 6
    page_number = 1
    start_urls = (
        'https://steamcommunity.com/app/1794680/reviews/',
    )

    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'LOG_ENABLED': False,
        'LOG_FILE': 'logging.txt',
        'LOG_FILE_APPEND': False,
        'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
        'FEEDS': {"items.json": {"format": "json", 'overwrite': True},},
    }

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        for review in soup.find_all('div', class_="apphub_UserReviewCardContent"):
            {...}

        if self.page_number < 4:
            self.page_number += 1
            yield scrapy.Request('https://steamcommunity.com/app/1794680/homecontent/?userreviewscursor=AoIIPwYYanu12fcD&userreviewsoffset={offset}&p={p}&workshopitemspage={p}&readytouseitemspage={p}&mtxitemspage={p}&itemspage={p}&screenshotspage={p}&videospage={p}&artpage={p}&allguidepage={p}&webguidepage={p}&integratedguidepage={p}&discussionspage={p}&numperpage=10&browsefilter=trendweek&browsefilter=trendweek&l=english&appHubSubSection=10&filterLanguage=default&searchText=&maxInappropriateScore=100'.format(offset=10*(self.page_number-1) ,p=self.page_number),method='GET', callback=self.parse)
            


I captured a few requests while scrolling through the reviews. I replaced every value that looked like a page number with {p}, and I also tried setting 'userreviewsoffset' to match the request format.

I noticed that 'userreviewscursor' has a different value on every request, but I don't know where it comes from.

CodePudding user response:

Your issue is with the userreviewscursor=AoIIPwYYanu12fcD part of the URL. That value changes on every call, and you can find the next one in the HTML response under:

<input type="hidden" name="userreviewscursor" value="AoIIPwYYanLi8vYD">

Read that value from each response and pass it as userreviewscursor in the next request, and you should be all right. (I didn't want to babysit you and write the full code, but if need be, let me know.)
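For anyone who wants a starting point, here is a minimal sketch of that idea using only the standard library. The helper names (CursorParser, extract_cursor, build_next_url) are made up for illustration, and the query string keeps only a subset of the parameters from the question; whether Steam accepts the shorter query is an assumption, so add the remaining parameters back if it does not.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Endpoint taken from the question's request URL.
BASE_URL = "https://steamcommunity.com/app/1794680/homecontent/"


class CursorParser(HTMLParser):
    """Hypothetical helper: remembers the value of the hidden
    <input name="userreviewscursor"> element, if the page has one."""

    def __init__(self):
        super().__init__()
        self.cursor = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name") == "userreviewscursor":
            self.cursor = attrs.get("value")


def extract_cursor(html):
    """Return the cursor found in the HTML, or None if it is missing."""
    parser = CursorParser()
    parser.feed(html)
    return parser.cursor


def build_next_url(cursor, page, per_page=10):
    """Build the URL for the next batch of reviews from a fresh cursor.

    Assumption: only a subset of the question's query parameters is sent.
    """
    params = {
        "userreviewscursor": cursor,
        "userreviewsoffset": per_page * (page - 1),
        "p": page,
        "numperpage": per_page,
        "browsefilter": "trendweek",
        "l": "english",
        "appHubSubSection": 10,
        "filterLanguage": "default",
    }
    return f"{BASE_URL}?{urlencode(params)}"
```

Inside the spider's parse method, the same value can probably be read with Scrapy's own selector, response.css('input[name="userreviewscursor"]::attr(value)').get(), and then passed to something like build_next_url before yielding the next scrapy.Request. If the cursor comes back as None, that is a reasonable point to stop paginating.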
