For a project, I'm extracting hotel reviews using Python. I have a list of 100 hotels, and I'm extracting 1,500 reviews from each. The problem is that some of the hotels don't have that many reviews, so when 1,500 is never reached, the loop stops and raises an error.
Here is my code:
# The number of reviews to obtain per hotel
reviewsToGet = 1500
# Loop for all hotels
for index, row in hotelsToScrap.iterrows():
    # Present feedback on which hotel is being processed
    print("Processing hotel", index)
    # Reset counter per hotel
    reviewsExtracted = 0
    # Loop until it extracts the pre-defined number of reviews
    while reviewsExtracted < reviewsToGet:
        # Define URL to use based on the number of reviews extracted so far
        urlToUse = row['URL']
        if reviewsExtracted > 0:
            repText = "-Reviews-or" + str(reviewsExtracted) + "-"
            urlToUse = urlToUse.replace("-Reviews-", repText)
        # Open and read the web page content
        soup = openPageReadHTML(urlToUse)
        # Process web page
        hotelReviews = processPage(soup, index, hotelReviews)
        # Update counter
        reviewsExtracted = reviewsExtracted + 5
        # Present feedback on the number of extracted reviews
        print("Extracted", reviewsExtracted, "/", reviewsToGet)
# Save the extracted reviews data frame to an Excel file
hotelReviews.to_excel("ExtractedReviewsComplete.xlsx")
How do I keep the extraction going even when the 1,500 reviews are not reached?
CodePudding user response:
It's hard to tell without knowing the exact error, but I suppose that while the loop condition is true, that is, while reviewsExtracted is less than reviewsToGet, your code keeps trying to extract reviews even though there are actually no more reviews left. Here:

    repText = "-Reviews-or" + str(reviewsExtracted) + "-"
    urlToUse = urlToUse.replace("-Reviews-", repText)

you build the new URL urlToUse from the value of reviewsExtracted, which probably does not correspond to a page that exists on the website.

So I would add one more condition that first checks whether there are reviews left to extract, or use try and except to catch the exact error. I think it happens in this line:

    soup = openPageReadHTML(urlToUse)
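As a minimal sketch of the try/except option, reusing the names from the question (that openPageReadHTML raises an exception when the page does not exist is an assumption here, since its implementation isn't shown):

    while reviewsExtracted < reviewsToGet:
        urlToUse = row['URL']
        if reviewsExtracted > 0:
            urlToUse = urlToUse.replace("-Reviews-", "-Reviews-or" + str(reviewsExtracted) + "-")
        try:
            # Assumed to raise when there is no page behind the offset URL
            soup = openPageReadHTML(urlToUse)
        except Exception:
            break  # no more review pages for this hotel; go to the next one
        hotelReviews = processPage(soup, index, hotelReviews)
        reviewsExtracted = reviewsExtracted + 5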
CodePudding user response:
From your post I presume the 1,500 number is a ceiling, e.g. "this is the maximum number of reviews I want to get for each hotel. If a hotel has fewer than 1,500 reviews, get all of them, and then move on."
I would suggest switching your approach: instead of relying on reaching 1,500 reviews as the loop condition, detect when a hotel has no more reviews and use that as the signal to stop and move on to the next hotel, as sketched below.
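A rough sketch of that idea, assuming processPage returns the accumulated data frame (it appears to, from the question's code), so we can detect when a page contributes no new rows:

    while reviewsExtracted < reviewsToGet:
        urlToUse = row['URL']
        if reviewsExtracted > 0:
            urlToUse = urlToUse.replace("-Reviews-", "-Reviews-or" + str(reviewsExtracted) + "-")
        soup = openPageReadHTML(urlToUse)
        rowsBefore = len(hotelReviews)   # row count before processing this page
        hotelReviews = processPage(soup, index, hotelReviews)
        if len(hotelReviews) == rowsBefore:
            break  # the page yielded no new reviews: this hotel is exhausted
        reviewsExtracted = reviewsExtracted + 5

This way 1,500 acts purely as a ceiling, and hotels with fewer reviews end cleanly without raising an exception.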
However, if you'd just like to get your current implementation running, do as several of the other commenters have said: implement exception handling so that your process can continue when it encounters an error.
Example, using your code:
reviewsToGet = 1500
# Loop for all hotels
for index, row in hotelsToScrap.iterrows():
    print("Processing hotel", index)
    reviewsExtracted = 0
    try:
        # Loop until it extracts the pre-defined number of reviews
        while reviewsExtracted < reviewsToGet:
            # Define URL to use based on the number of reviews extracted so far
            urlToUse = row['URL']
            if reviewsExtracted > 0:
                repText = "-Reviews-or" + str(reviewsExtracted) + "-"
                urlToUse = urlToUse.replace("-Reviews-", repText)
            # Open and read the web page content
            soup = openPageReadHTML(urlToUse)
            # Process web page
            hotelReviews = processPage(soup, index, hotelReviews)
            # Update counter
            reviewsExtracted = reviewsExtracted + 5
            # Present feedback on the number of extracted reviews
            print("Extracted", reviewsExtracted, "/", reviewsToGet)
    except Exception as err:
        print("[+] Exception encountered:", err, "- most likely the hotel has too few reviews. Continuing anyway.")
# Save the extracted reviews data frame to an Excel file
hotelReviews.to_excel("ExtractedReviewsComplete.xlsx")
CodePudding user response:
The simplest approach would be to wrap the body of the while loop in a try/except block, as follows:
# The number of reviews to obtain per hotel
reviewsToGet = 1500
# Loop for all hotels
for index, row in hotelsToScrap.iterrows():
    # Present feedback on which hotel is being processed
    print("Processing hotel", index)
    # Reset counter per hotel
    reviewsExtracted = 0
    # Loop until it extracts the pre-defined number of reviews
    while reviewsExtracted < reviewsToGet:
        try:
            # Define URL to use based on the number of reviews extracted so far
            urlToUse = row['URL']
            if reviewsExtracted > 0:
                repText = "-Reviews-or" + str(reviewsExtracted) + "-"
                urlToUse = urlToUse.replace("-Reviews-", repText)
            # Open and read the web page content
            soup = openPageReadHTML(urlToUse)
            # Process web page
            hotelReviews = processPage(soup, index, hotelReviews)
            # Update counter
            reviewsExtracted = reviewsExtracted + 5
            # Present feedback on the number of extracted reviews
            print("Extracted", reviewsExtracted, "/", reviewsToGet)
        except Exception:
            # Break rather than continue: retrying the same failing URL
            # would loop forever, so move on to the next hotel instead
            break
# Save the extracted reviews data frame to an Excel file
hotelReviews.to_excel("ExtractedReviewsComplete.xlsx")
PS: Instead of catching the raw exception, you should catch only the specific exceptions you expect, so that real bugs still surface.
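For example, if openPageReadHTML fetches pages with urllib (an assumption, since the question doesn't show its implementation), the handler could target only HTTP errors:

    from urllib.error import HTTPError

    while reviewsExtracted < reviewsToGet:
        try:
            urlToUse = row['URL']
            if reviewsExtracted > 0:
                urlToUse = urlToUse.replace("-Reviews-", "-Reviews-or" + str(reviewsExtracted) + "-")
            soup = openPageReadHTML(urlToUse)
            hotelReviews = processPage(soup, index, hotelReviews)
            reviewsExtracted = reviewsExtracted + 5
        except HTTPError as err:
            # Only HTTP failures (e.g. a 404 past the last review page) end
            # this hotel; any other bug still surfaces as a normal traceback
            print("Stopping hotel", index, "after", reviewsExtracted, "reviews:", err)
            break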