Web Scraping with R: problem with "data.frame" function and number of rows

Briefly, I want to scrape information about movies from this site. I used SelectorGadget to find the CSS selectors and wrote this code:

library(dplyr)
library(tidyverse)
library(rvest)
library(readr)
library(purrr)

link = "https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=adventure&sort=user_rating,desc"
page = read_html(link)

film_name = page %>% html_nodes(".lister-item-header a") %>% html_text()
year = page %>% html_nodes(".text-muted.unbold") %>% html_text()
rating = page %>% html_nodes(".ratings-imdb-rating strong") %>% html_text()
gross_income = page %>% html_nodes(".ghost~ .text-muted  span") %>% html_text()
duration = page %>% html_nodes(".runtime") %>% html_text()

IMDB_Adventure_Movies_Rank = data.frame(film_name, year, rating, duration, gross_income, stringsAsFactors = FALSE)

The R console gives the following error:

Error in data.frame(film_name, year, rating, duration, gross_income, stringsAsFactors = FALSE) : 
  arguments imply differing number of rows: 50, 44

The error occurs because, on the website, 6 of the 50 films have no income reported, so gross_income has only 44 elements.

I tried the following fix, but the values do not line up correctly: it only pads the end of the vector, so R assigns the wrong income to each film that comes after a missing value.

length(gross_income) = length(film_name)

My question is: how can I create a table where, if a film has no income reported, R returns NA or NULL instead of throwing an error? I saw that someone had the same problem, and the solution there was to use the possibly() function from the purrr package. However, I am new to R and I can't understand that answer or how to use possibly().

Thanks for your help, and forgive my inexperience.

CodePudding user response：

I would suggest that you consider using imdbapi, a package that facilitates access to the IMDb API. You will need to acquire an API key, but the cost of that is fairly insignificant.

library("imdbapi")
res_film <-
    find_by_title("Top Gun: Maverick", api_key = "YOUR_API_KEY")  # replace with your own key

When working with established data sources such as Eurostat, the World Bank, or IMDb for that matter, it is advisable to rely on maintained packages and available APIs. By scraping the site with rvest you will have to do a lot of unnecessary work and solve problems that the API and package creators have already solved.

There is also an alternative, the Open Movie Database, which gives you some free queries with a fairly high limit and offers a dedicated R package. You should likely be able to get the information you need that way at no cost.

CodePudding user response：

We can get the votes and income text for each movie with:

library(rvest)

link = "https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=adventure&sort=user_rating,desc"
# One <p> per film, containing the votes and, when reported, the gross income
df = read_html(link) %>%
  html_nodes('#main div div.lister.list.detail.sub-list div div.lister-item-content p.sort-num_votes-visible') %>%
  html_text()
 [1] "\n                Votes:\n                1,766,474\n    |                Gross:\n                $377.85M\n            \n        "
 [2] "\n                Votes:\n                1,788,217\n    |                Gross:\n                $315.54M\n            \n        "
 [3] "\n                Votes:\n                2,253,349\n    |                Gross:\n                $292.58M\n            \n        "
 [4] "\n                Votes:\n                1,595,898\n    |                Gross:\n                $342.55M\n            \n        "

We now have the votes and income text for each movie, one element per film. The income can be extracted with a regular expression; films with no reported income yield NA.

library(stringi)
stri_extract_first_regex(df, "(?<=\\$).*")
 [1] "377.85M" "315.54M" "292.58M" "342.55M" "6.10M"   "188.02M" "290.48M" "10.06M"  "210.61M" "322.74M" "678.82M" NA        "187.71M" "422.78M" "190.24M"
[16] "858.37M" "209.73M" "223.81M" "2.38M"   "85.16M"  "248.16M" "47.70M"  "293.00M" "415.00M" "120.54M" "191.80M" "197.17M" "309.13M" NA        "56.95M" 
[31] "44.82M"  "13.28M"  NA        NA        "1.43M"   "356.46M" "381.01M" "4.71M"   "380.84M" "402.45M" "1.23M"   "12.10M"  "44.91M"  NA        "5.01M"  
[46] "1.03M"   "5.45M"   "8.18M"   NA        "59.10M"
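
To get from there to the table asked for in the question, one option is to select each film's container first and then use html_node() (singular) inside it, since html_node() returns one result per input node and html_text() turns a missing node into NA. The sketch below is only an illustration under stated assumptions: it runs on a small made-up HTML snippet (the film names and figures are hypothetical) that mimics the class names used in this thread, so the selectors may need adjusting against the live page.

```r
library(rvest)
library(stringi)

# Hypothetical two-film snippet that mimics the page structure;
# the second film has no gross income reported.
html <- '
<div class="lister-item-content">
  <h3 class="lister-item-header"><a>Film A</a></h3>
  <p class="sort-num_votes-visible">Votes: 1,000 | Gross: $10.50M</p>
</div>
<div class="lister-item-content">
  <h3 class="lister-item-header"><a>Film B</a></h3>
  <p class="sort-num_votes-visible">Votes: 500</p>
</div>'

page <- read_html(html)

# One node per film, so every column below has the same length
items <- page %>% html_nodes(".lister-item-content")

# html_node() keeps one result per film, NA where nothing matches
film_name   <- items %>% html_node(".lister-item-header a") %>% html_text()
votes_gross <- items %>% html_node(".sort-num_votes-visible") %>% html_text()

# Text after the dollar sign; NA when the film has no gross figure
gross_income <- stri_extract_first_regex(votes_gross, "(?<=\\$)[0-9.]+M")

movies <- data.frame(film_name, gross_income, stringsAsFactors = FALSE)
print(movies)
```

Because every column is derived from the same list of film nodes, a missing income stays on the row of the film it belongs to, and possibly() is not needed for this case.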