I'm trying to scrape game reviews from Steam. When I run the spider below, I get the first page with 10 reviews, then the second page with 10 reviews three times.
import logging

import scrapy
from bs4 import BeautifulSoup


class MySpider(scrapy.Spider):
    name = "MySpider"
    download_delay = 6
    page_number = 1
    start_urls = (
        'https://steamcommunity.com/app/1794680/reviews/',
    )
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'LOG_ENABLED': False,
        'LOG_FILE': 'logging.txt',
        'LOG_FILE_APPEND': False,
        'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
        'FEEDS': {"items.json": {"format": "json", 'overwrite': True}},
    }

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        for review in soup.find_all('div', class_="apphub_UserReviewCardContent"):
            # review fields are extracted here (elided)
            {...}
        # request the next page, up to page 4
        if self.page_number < 4:
            self.page_number += 1
            yield scrapy.Request(
                'https://steamcommunity.com/app/1794680/homecontent/?userreviewscursor=AoIIPwYYanu12fcD&userreviewsoffset={offset}&p={p}&workshopitemspage={p}&readytouseitemspage={p}&mtxitemspage={p}&itemspage={p}&screenshotspage={p}&videospage={p}&artpage={p}&allguidepage={p}&webguidepage={p}&integratedguidepage={p}&discussionspage={p}&numperpage=10&browsefilter=trendweek&browsefilter=trendweek&l=english&appHubSubSection=10&filterLanguage=default&searchText=&maxInappropriateScore=100'.format(offset=10 * (self.page_number - 1), p=self.page_number),
                method='GET',
                callback=self.parse,
            )
I captured a few requests while scrolling through the reviews. I replaced every value that looked like a page number with {p}, and I also tried adjusting 'userreviewsoffset' to match the request format.
I noticed that 'userreviewscursor' changes with every request, but I don't know where it comes from.
Answer:
The problem is the userreviewscursor=AoIIPwYYanu12fcD part of the URL. That value changes on every call, and each HTML response carries the cursor for the next request in a hidden input:
<input type="hidden" name="userreviewscursor" value="AoIIPwYYanLi8vYD">
Read that value and pass it as userreviewscursor in the next request, and you're all set. (I didn't want to write the full code for you, but let me know if you need it.)
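For reference, here is a minimal sketch of that step using only the standard library, so it can be tried without Scrapy. The helper names `CursorParser`, `extract_cursor` and `build_next_url` are made up for this example, and the shortened parameter list is an assumption: the original request carries many more `...page={p}` parameters, which may need to be added back if the endpoint requires them. Inside the spider, the same lookup can likely be done with `response.css('input[name="userreviewscursor"]::attr(value)').get()`.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

BASE_URL = "https://steamcommunity.com/app/1794680/homecontent/"


class CursorParser(HTMLParser):
    """Collects the value of <input name="userreviewscursor"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cursor = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name") == "userreviewscursor":
            self.cursor = attrs.get("value")


def extract_cursor(html):
    """Return the cursor found in the page, or None if there is none."""
    parser = CursorParser()
    parser.feed(html)
    return parser.cursor


def build_next_url(cursor, page, per_page=10):
    """Build the next-page URL from the cursor of the current response.

    Assumption: this reduced set of query parameters is enough for the
    endpoint; the remaining ones from the original URL are left out.
    """
    params = {
        "userreviewscursor": cursor,  # urlencode escapes any special characters
        "userreviewsoffset": per_page * (page - 1),
        "p": page,
        "numperpage": per_page,
        "browsefilter": "trendweek",
        "l": "english",
        "appHubSubSection": 10,
        "filterLanguage": "default",
    }
    return BASE_URL + "?" + urlencode(params)
```

In `parse`, this would replace the hard-coded URL: call `extract_cursor(response.text)` after the review loop, and if it returns a value and `self.page_number < 4`, increment `self.page_number` and yield `scrapy.Request(build_next_url(cursor, self.page_number), callback=self.parse)`. Stopping when no cursor is found also gives a natural end condition once the reviews run out.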