I'm trying to extract the reviews of a product on Amazon. The reviews are spread across pages that share the same URL and differ only in the page number. Running the script below manually works, but each time I have to change the page number in the URL and the name of the tibble by hand, then rerun it to get a different tibble.
Since that is tedious for almost 70 pages, I tried to write a for loop to do the same thing. My attempt is shown below the manual version, but it gives me an error.
MANUAL
```
library(tidyr)
library(rvest)
url_reviews <- "https://www.amazon.it/Philips-HD9260-90-Airfryer-plastica/product-reviews/B07WTHVQZH/ref=cm_cr_getr_d_paging_btm_next_16?ie=UTF8&reviewerType=all_reviews&pageNumber=16"
doc <- read_html(url_reviews) # Assign results to `doc`
# Review Title
doc %>%
  html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
  html_text() -> review_title
# Review Text
doc %>%
  html_nodes("[class='a-size-base review-text review-text-content']") %>%
  html_text() -> review_text
# Number of stars in review
doc %>%
  html_nodes("[data-hook='review-star-rating']") %>%
  html_text() -> review_star
# Return a tibble
page_16 <- data.frame(review_title,
                      review_text,
                      review_star,
                      page = 16)
```
FOR LOOP
```
range <- 12:82
url_max <- paste0("https://www.amazon.it/Philips-HD9260-90-Airfryer-plastica/product-reviews/B07WTHVQZH/ref=cm_cr_getr_d_paging_btm_next_", range ,"?ie=UTF8&reviewerType=all_reviews&pageNumber=",range)
for (i in 1:length(url_max)) {
  doc <- read_html(url_max[i]) # Assign results to `doc`
  # Review Title
  doc %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title
  # Review Text
  doc %>%
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text
  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star
  paste0("page_", range)<-tibble(review_title,
                                 review_text,
                                 review_star,
                                 page = paste0("a", i))
}
```
CodePudding user response:
Here's another alternative that defines a function and then uses `lapply()` to run it on each page in turn.
It may also be helpful if you need to repeat this for different products. The function accepts two parameters: the first, `i`, is the page number, and the second, `product`, is the product for which you are gathering reviews. The function constructs the URL by interpolating the product ID and the page number into a template string.
While I used `lapply()`, the function below could also be passed to `map_df()` as in Ronak's answer, which binds the rows itself, so the separate `bind_rows()` step would not be needed.
library(dplyr)
library(rvest)
library(stringr)
retrieve_reviews <- function(i, product) {
  # Build the URL for page `i` of the reviews of `product`
  urlstr <- "https://www.amazon.it/product-reviews/${product}/ref=cm_cr_getr_d_paging_btm_next_${i}?ie=UTF8&reviewerType=all_reviews&pageNumber=${i}"
  url <- str_interp(urlstr, list(product = product, i = i))
  doc <- read_html(url) # Assign results to `doc`
  # Review Title
  doc %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title
  # Review Text
  doc %>%
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text
  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star
  # Return one tibble per page
  return(tibble(
    title = review_title,
    text = review_text,
    star = review_star,
    page = paste0("a", i)
  ))
}
range <- 12:82
product <- "B07WTHVQZH"
# Scrape every page, then stack the per-page tibbles into one
reviews <- lapply(range, retrieve_reviews, product) %>%
  bind_rows()
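With around 70 requests, a single failed page would abort the whole `lapply()` call. The sketch below shows one way to guard against that, and it runs offline: `fake_reviews()` and `safe_reviews()` are hypothetical names that are not part of the answer above, the failure on page 13 is simulated, and base R `do.call(rbind, ...)` is used in place of `bind_rows()` so that no packages are needed.

```r
# Hypothetical stand-in for retrieve_reviews(): returns a small data frame
# without touching the network, and fails on page 13 to mimic a bad request.
fake_reviews <- function(i, product) {
  if (i == 13) stop("simulated download failure")
  data.frame(
    title = paste("title", i),
    text = paste("text for", product),
    page = i,
    stringsAsFactors = FALSE
  )
}

# Wrap the call so a failing page yields NULL instead of aborting the run.
safe_reviews <- function(i, product) {
  tryCatch(fake_reviews(i, product), error = function(e) NULL)
}

pages <- 12:15
results <- lapply(pages, safe_reviews, product = "B07WTHVQZH")

# Drop the failed pages, then stack the rest into one data frame.
results <- Filter(Negate(is.null), results)
reviews <- do.call(rbind, results)
```

In real use, `fake_reviews()` would be replaced by `retrieve_reviews()`, and the pages that came back as `NULL` could be retried afterwards.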
CodePudding user response:
You can use `map_df` from `purrr` to loop over the page numbers and combine the results into a single data frame.
library(rvest)
page_numbers <- 12:82
purrr::map_df(page_numbers, ~{
  # Insert the current page number (.x) in both places where the URL uses it
  url_reviews <- paste0("https://www.amazon.it/Philips-HD9260-90-Airfryer-plastica/product-reviews/B07WTHVQZH/ref=cm_cr_getr_d_paging_btm_next_", .x, "?ie=UTF8&reviewerType=all_reviews&pageNumber=", .x)
  doc <- read_html(url_reviews) # Assign results to `doc`
  # Review Title
  doc %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title
  # Review Text
  doc %>%
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text
  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star
  # Return a data frame for this page
  data.frame(review_title,
             review_text,
             review_star,
             page = .x)
}) -> result
result
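If a plain for loop is preferred, the same idea can be sketched with a named list. The loop in the question fails because `paste0("page_", range) <- ...` is not a valid assignment target: R cannot assign to the result of a function call that builds a string. Storing each page as a list element avoids creating one variable per page. In this sketch, `scrape_page()` is a hypothetical stub standing in for the scraping code so that the example runs offline, and base R `do.call(rbind, ...)` does the final binding.

```r
# Hypothetical stub standing in for the scraping code; no network access.
scrape_page <- function(page) {
  data.frame(
    review_title = paste("title", page),
    page = page,
    stringsAsFactors = FALSE
  )
}

page_numbers <- 12:14

# Pre-allocate one list slot per page and name the slots page_12, page_13, ...
pages <- vector("list", length(page_numbers))
names(pages) <- paste0("page_", page_numbers)

# Fill the list by position; each element is the data frame for one page.
for (i in seq_along(page_numbers)) {
  pages[[i]] <- scrape_page(page_numbers[i])
}

# Stack all pages into a single data frame.
result <- do.call(rbind, pages)
```

An individual page is still available by name, for example `pages[["page_13"]]`, which plays the role of the separate `page_16`-style objects in the manual version.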