I am trying to scrape the review count from an Amazon product page and convert it to an integer. The same code works only about half the time. It seems that the object assigned to review_count is not always found, which raises an error when html2text runs.
I don't understand why the object is not found every time. Is there a better way of going about this? I appreciate any help.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import html2text
import re
url = 'https://www.amazon.com/product-reviews/B07V37GVY9/pageNumber=1'
s = HTMLSession()
r = s.get(url)
soup1 = BeautifulSoup(r.content, "html.parser")
review_count = soup1.find(string=re.compile("with reviews"))
review_txt = (html2text.html2text(review_count))
reviews_list = review_txt.split()
reviews = reviews_list.pop(3)
reviews = reviews.replace(",","")
reviews = int(reviews)
print(reviews)
CodePudding user response:
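You are most likely being rate limited, so you need to slow down your requests. When that happens, Amazon returns a CAPTCHA page instead of the reviews page, the "with reviews" string is missing, and find() returns None.
You can detect the rate-limit page in code and retry after a pause: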
import time
CAPTCHA_TEXT = "Sorry, we just need to make sure you're not a robot."
r = s.get(url)
# If we get rate limited
while CAPTCHA_TEXT in r.text:
    # Wait for a bit
    time.sleep(30)
    # And try again :)
    r = s.get(url)
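Building on the retry loop above, here is a rough sketch that caps the number of retries and returns None instead of crashing when the count is missing. The helper names (parse_review_count, fetch_page_text) are made up for illustration, and the regular expression assumes the page text looks like "1,234 total ratings, 567 with reviews", which may not match what Amazon currently serves.

```python
import re
import time

CAPTCHA_TEXT = "Sorry, we just need to make sure you're not a robot."
# Assumed wording: a number (with optional commas) right before "with reviews".
REVIEW_PATTERN = re.compile(r"(\d[\d,]*)\s+with reviews")


def parse_review_count(page_text):
    """Return the review count as an int, or None if the phrase is absent."""
    match = REVIEW_PATTERN.search(page_text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def fetch_page_text(get_text, url, max_attempts=5, delay=30, sleep=time.sleep):
    """Call get_text(url) until the CAPTCHA text is gone or attempts run out.

    get_text is any callable that takes a URL and returns the page text.
    Returns the page text, or None if every attempt hit the CAPTCHA page.
    """
    for attempt in range(max_attempts):
        text = get_text(url)
        if CAPTCHA_TEXT not in text:
            return text
        # Only wait if another attempt is coming.
        if attempt < max_attempts - 1:
            sleep(delay)
    return None
```

With the session from the question, this could be called as text = fetch_page_text(lambda u: s.get(u).text, url), followed by reviews = parse_review_count(text) if text is not None. Passing the fetch function in as an argument keeps the retry logic testable without hitting the network.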