I am trying to scrape episode data from IMDb, along with the reviews for each episode. I want to get all the episodes and store them in a data frame. However, I am having an issue: only one review is being scraped per episode. While I was testing, there was one instance where all the reviews were scraped, but it is no longer working. Does anyone know how I could scrape all the reviews and store them in a data frame?
Here is the code:
library(dplyr)
library(rvest)
library(tidyverse)
getReviewLink = function(episodeLink) {
  episodePage = read_html(episodeLink)
  container = episodePage %>%
    html_nodes(".Hero__WatchContainer__NoVideo-sc-kvkd64-9.cTdSBT")
  reviewLinks = episodePage %>%
    html_nodes(".Hero__WatchContainer__NoVideo-sc-kvkd64-9.cTdSBT > ul > li:nth-child(1) > a") %>%
    html_attr("href") %>%
    paste("https://www.imdb.com", ., sep="")
  print(reviewLinks)
  cleanedReviewLink = ifelse(reviewLinks == "https://www.imdb.com", NA, reviewLinks)
  print(cleanedReviewLink)
  get_reviews = ifelse(is.na(cleanedReviewLink), NA, read_html(cleanedReviewLink) %>% html_nodes(".show-more__control") %>%
    html_text() %>% str_trim())
  print(get_reviews)
  return(get_reviews)
}
episodes = data.frame()
for (page_result in seq(from = 1, to = 51, by = 50)){
  link = paste0("https://www.imdb.com/search/title/?title_type=tv_episode&num_votes=1000,&sort=user_rating,desc&start"
                ,page_result,"&ref_=adv_nxt")
  page = read_html(link)
  show_name = page %>% html_nodes(".lister-item-index a") %>% html_text() %>% str_trim
  episode_name = page %>% html_nodes("small a") %>% html_text()
  episode_links = page %>% html_nodes("small a") %>% html_attr("href") %>%
    paste("https://www.imdb.com", ., sep="")
  episodeReview = sapply(episode_links, FUN = getReviewLink, USE.NAMES = FALSE)
  print(episodeReview)
  episodes = rbind(episodes, data.frame(show_name, episode_name, episodeReview, stringsAsFactors = FALSE))
  print(paste("Page:", page_result))
}
Any help is appreciated.
CodePudding user response:
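I ran your code and there is a small mistake in your function getReviewLink.
The following part discards all but the first review. ifelse() returns a result the same length as its test, and here is.na(cleanedReviewLink) has length 1, so only the first element of the scraped vector is kept.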
get_reviews = ifelse(is.na(cleanedReviewLink), NA, read_html(cleanedReviewLink) %>% html_nodes(".show-more__control") %>%
html_text() %>% str_trim())
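The truncation can be reproduced without touching IMDb at all. This is only a minimal sketch: the reviews vector and the link are made-up stand-ins (nothing is fetched), and the if/else at the end is one possible way to keep the NA check, not necessarily the only one.

```r
# Made-up stand-in for the character vector scraped from one review page
reviews <- c("first review", "second review", "third review")
# Hypothetical link, used only for the is.na() test and never fetched
link <- "https://www.imdb.com/title/tt0000000/reviews"

# ifelse() is vectorised over its test; the test has length 1,
# so the result has length 1 as well
truncated <- ifelse(is.na(link), NA, reviews)
print(truncated)

# A plain if/else checks the single condition and returns the whole vector
full <- if (is.na(link)) NA_character_ else reviews
print(length(full))
```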
Replace it with:
get_reviews = read_html(cleanedReviewLink) %>% html_nodes(".show-more__control") %>%
html_text() %>% str_trim() %>% str_subset(". ")
[1] "I haven't seen every episode in the world, but this is as close to perfect as I have ever seen. Never thought I would say something could match the likes of Lord of the Rings. It's the most cinematic episode I've ever seen with maybe \"Battle of the Bastards\" being alongside it for obvious reasons, but for an animated episode to do this is even more shocking. It would be hard for someone to imagine an animated episode being as cinematic as an HBO production and this episode made it possible. This episode was also very emotional as two of my favorite characters' (Armin and Erwin) lives were on the line. I even cried when Armin was gonna make the sacrifice. It was also sad to see Erwin's unfortunate plan come to fruition, but he did it knowing it was for a greater good. I also loved the parallels of all the main characters sacrificing themselves in this episode."
[2] "I cant understand how anyone can rate something as incredible like this below 10. This episode is amazingly godly and will go down in history as one of the greats"
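Note that once each episode returns several reviews, the data.frame(show_name, episode_name, episodeReview) step in your loop will probably need adjusting, because the columns no longer have one value per episode. A rough sketch of one way to get one row per review, using made-up data in place of the scraped results (episode_reviews and episodes_long are names I chose for illustration, and I am assuming the reviews are collected as a list, e.g. with lapply() instead of sapply()):

```r
# Made-up stand-ins for what one search page would yield
show_name    <- c("Show A", "Show B")
episode_name <- c("Episode 1", "Episode 2")

# One character vector of reviews per episode (assumed to come from lapply())
episode_reviews <- list(
  c("Great episode.", "Loved it."),
  c("Not bad.")
)

# Repeat each episode's metadata once per review it has
n_reviews <- lengths(episode_reviews)
episodes_long <- data.frame(
  show_name    = rep(show_name, times = n_reviews),
  episode_name = rep(episode_name, times = n_reviews),
  review       = unlist(episode_reviews),
  stringsAsFactors = FALSE
)
print(episodes_long)
```

An episode whose review vector is empty contributes no rows in this layout, so if you want to keep such episodes you would have to substitute NA_character_ for the empty vector first.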
Further, you are not actually scraping all the reviews. For example, there are 951 reviews for the episode https://www.imdb.com/title/tt9906260/reviews?ref_=tt_ov_rt
But your code gets only the first 25. To get all the reviews, you need to keep clicking Load More until everything is displayed. This can be done with RSelenium, or maybe with imdbapi.