Scraping reviews from multiple pages in R


I am struggling to scrape a web page. My task is to scrape the reviews from the website and run a sentiment analysis on them, but I have only managed to scrape the first page. How can I scrape all the reviews of the same movie when they are distributed across multiple pages?

This is my code:

library(rvest)

read_html("https://www.rottentomatoes.com/m/dune_2021/reviews") %>%
  html_elements(xpath = "//div[@class='the_review']") %>% 
  html_text2()

This only gets me the reviews from the first page, but I need the reviews from all the pages. Any help would be highly appreciated.

CodePudding user response:

You could avoid the expensive overhead of a browser and use httr2. The page uses a query-string GET request to grab the reviews in batches. For each batch, the startCursor and endCursor offset parameters can be picked up from the previous response, which also contains a hasNextPage flag that can be used to stop requesting additional batches. For the initial request, the title id needs to be scraped from the reviews page, and the offset parameters can be set to ''.
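To make the pagination mechanics concrete, here is a minimal sketch of a single batch request. The endpoint, query parameters, and pageInfo field names are taken from the full code below; title_id is a placeholder for the id that the full code scrapes from the reviews page.

library(tidyverse)
library(httr2)

title_id <- "some-title-id"  # placeholder: the full code below scrapes this from the reviews page

# First batch: both cursors start out empty
batch <- request(sprintf("https://www.rottentomatoes.com/napi/movie/%s/criticsReviews/all/:sort", title_id)) %>%
  req_url_query(f = "", direction = "next", endCursor = "", startCursor = "") %>%
  req_perform() %>%
  resp_body_json()

batch$reviews               # the reviews in this batch
batch$pageInfo$hasNextPage  # FALSE once the last batch has been served
batch$pageInfo$startCursor  # feed these two cursors into the next request
batch$pageInfo$endCursor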

After collecting all of the reviews (in a list, in my case), I apply a custom function that extracts some items of possible interest from each review and generates the final dataframe.


Acknowledgments: I took the idea of using repeat from @flodal here


library(tidyverse)
library(httr2)

get_reviews <- function(results, n) {
  # Request the reviews page once to pick up the internal title id
  r <- request("https://www.rottentomatoes.com/m/dune_2021/reviews") %>%
    req_headers("user-agent" = "mozilla/5.0") %>%
    req_perform() %>%
    resp_body_html() %>%
    toString()

  title_id <- str_match(r, '"titleId":"(.*?)"')[, 2]
  start_cursor <- ""
  end_cursor <- ""

  repeat {
    # Pull the next batch of reviews from the paginated JSON endpoint
    r <- request(sprintf("https://www.rottentomatoes.com/napi/movie/%s/criticsReviews/all/:sort", title_id)) %>%
      req_url_query(f = "", direction = "next", endCursor = end_cursor, startCursor = start_cursor) %>%
      req_perform() %>%
      resp_body_json()
    results[[n]] <- r$reviews
    nextPage <- r$pageInfo$hasNextPage

    if (!nextPage) break

    # Carry the cursors forward so the next request resumes where this one ended
    start_cursor <- r$pageInfo$startCursor
    end_cursor <- r$pageInfo$endCursor
    n <- n + 1
  }
  return(results)
}

n <- 1
results <- list()  
data <- get_reviews(results, n)

# Flatten the list of batches into one flat list of reviews,
# then build one dataframe row per review
df <- purrr::map_dfr(data %>% unlist(recursive = FALSE), ~
  data.frame(
    date = .x$creationDate,
    reviewer = .x$publication$name,
    url = .x$reviewUrl,
    quote = .x$quote,
    score = if (is.null(.x$scoreOri)) NA_character_ else .x$scoreOri,
    sentiment = .x$scoreSentiment
  ))
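Since the end goal was sentiment analysis, note that the API already labels each review via scoreSentiment, which is carried into the sentiment column above. As a minimal sketch (assuming the tidytext package and its bundled "bing" lexicon, which are not part of the code above), you could either tally those labels or score the quote text yourself:

# Tally the sentiment labels the API already provides
df %>% count(sentiment)

# Or score the review text yourself with a lexicon
library(tidytext)

df %>%
  select(review = quote) %>%                       # avoid clashing with the existing sentiment column
  unnest_tokens(word, review) %>%                  # one row per word
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment, sort = TRUE)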