Fix error in a For Loop used to extract reviews of a product given different urls

Time:09-17

I'm trying to extract the reviews of a product on Amazon. The reviews are spread across pages that share the same URL apart from the page number. Run manually, the script below works, but for every page I have to edit the page number in the URL and the name of the resulting data frame, then rerun it to get a separate table.

Since that is tedious for almost 70 pages, I tried to write a for loop to do the same thing. My attempt is shown below the manual version, but it gives me an error.

MANUAL 
```
library(tidyr)
library(rvest)

url_reviews <- "https://www.amazon.it/Philips-HD9260-90-Airfryer-plastica/product-reviews/B07WTHVQZH/ref=cm_cr_getr_d_paging_btm_next_16?ie=UTF8&reviewerType=all_reviews&pageNumber=16"
doc <- read_html(url_reviews) # Assign results to `doc`

# Review Title
doc %>% 
  html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
  html_text() -> review_title

# Review Text
doc %>% 
  html_nodes("[class='a-size-base review-text review-text-content']") %>%
  html_text() -> review_text

# Number of stars in review
doc %>%
  html_nodes("[data-hook='review-star-rating']") %>%
  html_text() -> review_star

# Return a tibble
page_16<-data.frame(review_title,
                review_text,
                review_star,
                page =16) 
```

FOR LOOP

```
range <- 12:82
url_max <- paste0("https://www.amazon.it/Philips-HD9260-90-Airfryer-plastica/product-reviews/B07WTHVQZH/ref=cm_cr_getr_d_paging_btm_next_", range ,"?ie=UTF8&reviewerType=all_reviews&pageNumber=",range)

for (i in 1:length(url_max)) {

  doc <- read_html(url_max[i]) # Assign results to `doc`

  # Review Title
  doc %>% 
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title

  # Review Text
  doc %>% 
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text

  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star

  paste0("page_", range)<-tibble(review_title,
                                 review_text,
                                 review_star,
                                 page = paste0("a", i)) 

}
```

CodePudding user response:

Here's another alternative that defines a function and then uses lapply() to run it once per page.

Wrapping the scraping in a function also makes it easy to repeat for different products. The function accepts two parameters: i, the page number, and product, the ID of the product whose reviews you are gathering. It builds the URL by interpolating both values into a template string.

While I used lapply(), the function below could also be passed to map_df() as in Ronak's answer, which would likely be faster than binding rows afterwards.

```
library(dplyr)
library(rvest)
library(stringr)

retrieve_reviews <- function(i, product) {

    urlstr <- "https://www.amazon.it/product-reviews/${product}/ref=cm_cr_getr_d_paging_btm_next_${i}?ie=UTF8&reviewerType=all_reviews&pageNumber=${i}"
    url <- str_interp(urlstr, list(product = product, i = i))
    doc <- read_html(url) # Assign results to `doc`
    
    # Review Title
    doc %>% 
        html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
        html_text() -> review_title
    
    # Review Text
    doc %>% 
        html_nodes("[class='a-size-base review-text review-text-content']") %>%
        html_text() -> review_text
    
    # Number of stars in review
    doc %>%
        html_nodes("[data-hook='review-star-rating']") %>%
        html_text() -> review_star
    
    return(tibble(
        title = review_title,
        text = review_text,
        star = review_star,
        page = paste0("a", i)
    ))
}

range <- 12:82
product <- "B07WTHVQZH"
reviews <- lapply(range, retrieve_reviews, product) %>%
    bind_rows()
```
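
As for the loop in the question, the line `paste0("page_", range) <- tibble(...)` is most likely what raises the error: R cannot assign to the string that paste0() returns as if it were a variable name. If you do want to keep a for loop, one option is to store each page in a named list. The following is only a sketch under that assumption; `fake_page()` is a hypothetical stand-in for the scraping step so the example runs without a network connection, and the short page range is just for illustration.

```r
# Hypothetical stand-in for the scraping step (not part of the original code).
# It returns a small data frame shaped like one page of reviews.
fake_page <- function(page) {
  data.frame(
    review_title = paste("title", 1:2),
    review_star  = c("5,0 su 5 stelle", "4,0 su 5 stelle"),
    page         = page
  )
}

page_numbers <- 12:14  # the question uses 12:82

# Pre-allocate a list with one named slot per page
pages <- vector("list", length(page_numbers))
names(pages) <- paste0("page_", page_numbers)

for (i in seq_along(page_numbers)) {
  # Assign into a list slot instead of trying to build a variable name
  pages[[i]] <- fake_page(page_numbers[i])
}

names(pages)  # "page_12" "page_13" "page_14"

# Combine all pages into one data frame
all_reviews <- do.call(rbind, pages)
nrow(all_reviews)  # 6
```

A single page is then available as `pages[["page_13"]]`. Calling `assign(paste0("page_", n), ...)` would also create separate variables, but a list is usually easier to loop over and combine afterwards.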

CodePudding user response:

You can use map_df() from purrr to loop over the page numbers and row-bind the results into a single data frame.

```
library(rvest)

page_numbers <- 12:82

purrr::map_df(page_numbers, ~{
  # Insert the current page number in both places where the URL uses it
  url_reviews <- paste0("https://www.amazon.it/Philips-HD9260-90-Airfryer-plastica/product-reviews/B07WTHVQZH/ref=cm_cr_getr_d_paging_btm_next_", .x, "?ie=UTF8&reviewerType=all_reviews&pageNumber=", .x)
  doc <- read_html(url_reviews) # Assign results to `doc`

  # Review Title
  doc %>% 
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title

  # Review Text
  doc %>% 
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text

  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star

  # Return a data frame for this page
  data.frame(review_title,
             review_text,
             review_star,
             page = .x)
}) -> result

result
```
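
One caveat with building the data frame from three separately extracted vectors: if a page has a review with no title or no text, the vectors can end up with different lengths and data.frame() will fail or misalign rows. A possible way around this, sketched below, is to select one node per review first and then look up each field inside it with html_node(), which returns a missing value where nothing matches. The HTML and the class names `review`, `review-title` and `review-text` are made up for this example. Whether Amazon wraps each review in a single container that you can select this way is an assumption you would need to check against the real page.

```r
library(rvest)

# Made-up HTML standing in for a review page; the second review has no title.
doc <- read_html('
  <html><body>
    <div class="review">
      <a class="review-title">Great</a>
      <span class="review-text">Works well</span>
      <i data-hook="review-star-rating">5 stars</i>
    </div>
    <div class="review">
      <span class="review-text">Stopped working</span>
      <i data-hook="review-star-rating">1 star</i>
    </div>
  </body></html>
')

# One node per review
reviews <- html_nodes(doc, "div.review")

# html_node() (singular) returns exactly one result per review,
# so every column has the same length and missing fields become NA
result <- data.frame(
  review_title = html_text(html_node(reviews, ".review-title")),
  review_text  = html_text(html_node(reviews, ".review-text")),
  review_star  = html_text(html_node(reviews, "[data-hook='review-star-rating']"))
)

result
```

Here the second row keeps its text and star rating, and its title is NA rather than shifting the other columns out of alignment.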