Home > Software engineering >  How can we scrape missing values from IMDB in R?
How can we scrape missing values from IMDB in R?

Time:12-25

library(rvest)

imdb_page <- read_html("https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv")
title <- imdb_page %>% html_nodes(".lister-item-header a") %>% html_text()
rating <- imdb_page %>% html_nodes(".ratings-imdb-rating strong") %>% html_text()
movies <- data.frame(title)
movies2 <- data.frame(rating)

Basically the code above is for scraping the titles and ratings of 50 movies. I want missing values also to be included as NAs.

However, it doesn't happen as IMDB hasn't included them in the HTML tag which only has actual values present (I have used SelectorGadget for getting the tags). So the observation count is 50 for titles and just 33 for ratings which is not what I want. I have tried using html_node() along with html_nodes() but R gives an error saying cannot use css and xpath together. I have also tried the trim=TRUE and replace(!nzchar(.), NA) but they don't work either.

Is there a way to solve this and ensure I get 50 ratings (including NAs or empty values)?

CodePudding user response:

You need to perform this parsing in 2 steps. First collect the parent nodes for all 50 of the movies with html_nodes(). Then you parse this collection of nodes with html_node() (without the s) to obtain a result for all 50 including the nodes missing the attribute.

library(rvest)
library(dplyr)

imdb_page <- read_html("https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv")

#get the parent node of the each movie
movies <- imdb_page %>% html_elements( "div.lister-item")

#now parse each movie node for the desired subnode
title <- movies %>% html_element(".lister-item-header a") %>% html_text()
rating <- movies %>% html_element(".ratings-imdb-rating strong") %>% html_text()

Note the update from html_node(s) to html_element(s) the current style in rvest 1.0

CodePudding user response:

We can use ratings-user-rating to get the whole list of ratings,

enter image description here

library(rvest)
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2019-12-31&countries=us&sort=alpha,asc&ref_=adv_prv" 

url %>% read_html() %>% html_nodes('.ratings-user-rating') %>% html_text2()

 [1] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.6/10 X "
 [4] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.1/10 X "
 [7] "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[10] "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.3/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[13] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.5/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5/10 X "  
[16] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.7/10 X "
[19] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[22] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.7/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[25] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.9/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.3/10 X "
[28] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[31] "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[34] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.1/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 8.3/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 7.1/10 X "
[37] "Rate this\n 1 2 3 4 5 6 7 8 9 10 5.8/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.3/10 X "
[40] "Rate this\n 1 2 3 4 5 6 7 8 9 10 3.2/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[43] "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.6/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "  
[46] "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.8/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 -/10 X "   "Rate this\n 1 2 3 4 5 6 7 8 9 10 6.6/10 X "
[49] "Rate this\n 1 2 3 4 5 6 7 8 9 10 4.4/10 X " "Rate this\n 1 2 3 4 5 6 7 8 9 10 1.5/10 X "

We further need to clean the data to get the ratings.

df %>% gsub(".*9 10", "", .) %>% str_sub(start=1, end=-7) %>% str_replace_all('-', replacement = NA_character_)

 [1] NA     NA     " 3.6" NA     " 4.9" " 4.1" " 7.4" " 4.6" NA     " 7.9" " 3.3" NA     " 6.5" " 6.6" " 5"   " 3.6" NA     " 4.7" " 3.1" " 5.4" NA     " 5.7"
[23] " 5.1" NA     NA     " 6.9" " 4.3" " 6.6" NA     NA     " 4.1" " 4.6" NA     " 5.1" " 8.3" " 7.1" " 5.8" " 3.4" " 3.3" " 3.2" NA     NA     NA     " 4.6"
[45] NA     " 6.8" NA     " 6.6" " 4.4" " 1.5"

Get the movie names,

movie = url %>% read_html() %>%  html_nodes(".lister-item-header a") %>% html_text()

data.frame(Movie = movie, ratings = df)
                                                    Movie Ratings
1                                              #1915House    <NA>
2                                              #Bodygoals    <NA>
3                                               #Followme     3.6
4                                             #FullMethod    <NA>
5                                                   #Like     4.9
6                                             #SquadGoals     4.1
  • Related