For a project, I'm extracting hotel reviews using Python. I have a list of 100 hotels, and I'm extracting 1,500 reviews from each. The problem is that some of the hotels don't have that many reviews, so when 1,500 is never reached, the loop stops and raises an error.
Here is my code:
# The number of reviews to obtain per hotel
reviewsToGet = 1500
# Loop for all hotels
for index, row in hotelsToScrap.iterrows():
    # Present feedback on which hotel is being processed
    print("Processing hotel", index)
    # Reset counter per hotel
    reviewsExtracted = 0
    # Loop until it extracts the pre-defined number of reviews
    while reviewsExtracted < reviewsToGet:
        # Define URL to use based on the number of reviews extracted so far
        urlToUse = row['URL']
        if reviewsExtracted > 0:
            repText = "-Reviews-or" + str(reviewsExtracted) + "-"
            urlToUse = urlToUse.replace("-Reviews-", repText)
        # Open and read the web page content
        soup = openPageReadHTML(urlToUse)
        # Process web page
        hotelReviews = processPage(soup, index, hotelReviews)
        # Update counter
        reviewsExtracted = reviewsExtracted + 5
        # Present feedback on the number of extracted reviews
        print("Extracted", reviewsExtracted, "/", reviewsToGet)
# Save the extracted reviews data frame to an Excel file
hotelReviews.to_excel("ExtractedReviewsComplete.xlsx")
How do I keep the extraction going even when the 1,500 reviews are not reached?
CodePudding user response:
It's hard to tell without knowing the exact error, but I suppose that while the loop condition is true, that is, while reviewsExtracted is less than reviewsToGet, your code keeps trying to extract reviews even though there are actually no more reviews left. Here:

    repText = "-Reviews-or" + str(reviewsExtracted) + "-"
    urlToUse = urlToUse.replace("-Reviews-", repText)

you build the new URL urlToUse from the value of reviewsExtracted, which probably does not correspond to a page that exists on the website.

So I would add one more condition that first checks whether there are reviews left to extract, or use try and except to catch the exact error. I think it happens in this line:

    soup = openPageReadHTML(urlToUse)
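As a minimal sketch of the try/except option, reusing the names from the question (that openPageReadHTML raises an exception when the page does not exist is an assumption here, since its implementation isn't shown):

    while reviewsExtracted < reviewsToGet:
        urlToUse = row['URL']
        if reviewsExtracted > 0:
            urlToUse = urlToUse.replace("-Reviews-", "-Reviews-or" + str(reviewsExtracted) + "-")
        try:
            # Assumed to raise when there is no page behind the offset URL
            soup = openPageReadHTML(urlToUse)
        except Exception:
            break  # no more review pages for this hotel; go to the next one
        hotelReviews = processPage(soup, index, hotelReviews)
        reviewsExtracted = reviewsExtracted + 5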
CodePudding user response:
From your post I presume the 1,500 number is a ceiling, e.g. "this is the maximum number of reviews I want to get for each hotel. If a hotel has fewer than 1,500 reviews, get all of them, and then move on."
I would suggest switching your approach: instead of relying on reaching 1,500 reviews as the loop condition, detect when a hotel has no more reviews and use that as the signal to stop and move on to the next hotel, as sketched below.
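A rough sketch of that idea, assuming processPage returns the accumulated data frame (it appears to, from the question's code), so we can detect when a page contributes no new rows:

    while reviewsExtracted < reviewsToGet:
        urlToUse = row['URL']
        if reviewsExtracted > 0:
            urlToUse = urlToUse.replace("-Reviews-", "-Reviews-or" + str(reviewsExtracted) + "-")
        soup = openPageReadHTML(urlToUse)
        rowsBefore = len(hotelReviews)   # row count before processing this page
        hotelReviews = processPage(soup, index, hotelReviews)
        if len(hotelReviews) == rowsBefore:
            break  # the page yielded no new reviews: this hotel is exhausted
        reviewsExtracted = reviewsExtracted + 5

This way 1,500 acts purely as a ceiling, and hotels with fewer reviews end cleanly without raising an exception.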
However, if you'd just like to get your current implementation running, do as several of the other commenters have said: implement exception handling so that your process can continue when it encounters an error.
Example, using your code:
reviewsToGet = 1500
# Loop for all hotels
for index, row in hotelsToScrap.iterrows():
    print("Processing hotel", index)
    reviewsExtracted = 0
    try:
        # Loop until it extracts the pre-defined number of reviews
        while reviewsExtracted < reviewsToGet:
            # Define URL to use based on the number of reviews extracted so far
            urlToUse = row['URL']
            if reviewsExtracted > 0:
                repText = "-Reviews-or" + str(reviewsExtracted) + "-"
                urlToUse = urlToUse.replace("-Reviews-", repText)
            # Open and read the web page content
            soup = openPageReadHTML(urlToUse)
            # Process web page
            hotelReviews = processPage(soup, index, hotelReviews)
            # Update counter
            reviewsExtracted = reviewsExtracted + 5
            # Present feedback on the number of extracted reviews
            print("Extracted", reviewsExtracted, "/", reviewsToGet)
    except Exception as err:
        print("[+] Exception encountered:", err, "- most likely the hotel has too few reviews. Continuing anyway.")
# Save the extracted reviews data frame to an Excel file
hotelReviews.to_excel("ExtractedReviewsComplete.xlsx")
CodePudding user response:
The simplest approach would be to wrap the body of the while loop in a try/except block, as follows:
# The number of reviews to obtain per hotel
reviewsToGet = 1500
# Loop for all hotels
for index, row in hotelsToScrap.iterrows():
    # Present feedback on which hotel is being processed
    print("Processing hotel", index)
    # Reset counter per hotel
    reviewsExtracted = 0
    # Loop until it extracts the pre-defined number of reviews
    while reviewsExtracted < reviewsToGet:
        try:
            # Define URL to use based on the number of reviews extracted so far
            urlToUse = row['URL']
            if reviewsExtracted > 0:
                repText = "-Reviews-or" + str(reviewsExtracted) + "-"
                urlToUse = urlToUse.replace("-Reviews-", repText)
            # Open and read the web page content
            soup = openPageReadHTML(urlToUse)
            # Process web page
            hotelReviews = processPage(soup, index, hotelReviews)
            # Update counter
            reviewsExtracted = reviewsExtracted + 5
            # Present feedback on the number of extracted reviews
            print("Extracted", reviewsExtracted, "/", reviewsToGet)
        except Exception:
            # Break rather than continue: retrying the same failing URL
            # would loop forever, so move on to the next hotel instead
            break
# Save the extracted reviews data frame to an Excel file
hotelReviews.to_excel("ExtractedReviewsComplete.xlsx")
PS: Instead of catching the raw exception, you should catch only the specific exceptions you expect, so that real bugs still surface.
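For example, if openPageReadHTML fetches pages with urllib (an assumption, since the question doesn't show its implementation), the handler could target only HTTP errors:

    from urllib.error import HTTPError

    while reviewsExtracted < reviewsToGet:
        try:
            urlToUse = row['URL']
            if reviewsExtracted > 0:
                urlToUse = urlToUse.replace("-Reviews-", "-Reviews-or" + str(reviewsExtracted) + "-")
            soup = openPageReadHTML(urlToUse)
            hotelReviews = processPage(soup, index, hotelReviews)
            reviewsExtracted = reviewsExtracted + 5
        except HTTPError as err:
            # Only HTTP failures (e.g. a 404 past the last review page) end
            # this hotel; any other bug still surfaces as a normal traceback
            print("Stopping hotel", index, "after", reviewsExtracted, "reviews:", err)
            break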