I'm trying to scrape the web using R and have run into missing values when scraping gross revenue from the IMDb movie data. How can I automatically insert NA when a movie's gross revenue is unknown?
library(rvest)
webpage <- read_html('https://www.imdb.com/search/title/?count=100&release_date=2019,2019&title_type=feature')
# Select the gross revenue nodes, then extract their text
gross_data <- html_nodes(webpage, '.ghost~ .text-muted span')
gross <- html_text(gross_data)
CodePudding user response:
One way to get the desired result is to select the per-movie item nodes first and then extract the revenue and/or other information from each single node (thanks to @Dave2e for pointing out that using purrr::map_dfr is not necessary). Because html_node() returns exactly one result per item, movies without a gross value yield NA. If you want to extract multiple pieces of information, I would suggest putting everything in a data.frame:
library(rvest)
library(magrittr)
webpage <- read_html("https://www.imdb.com/search/title/?count=100&release_date=2019,2019&title_type=feature")
result <- data.frame(
name = webpage %>% html_nodes(".lister-item-content") %>% html_node("h3 a") %>% html_text(),
gross = webpage %>% html_nodes(".lister-item-content") %>% html_node(".ghost~ .text-muted span") %>% html_text()
)
head(result, 15)
#>                                    name    gross
#> 1    Knives Out: Mord ist Familiensache $165.36M
#> 2                              Parasite  $53.37M
#> 3                             Midsommar  $27.33M
#> 4                                 Joker $335.45M
#> 5      Once Upon a Time In... Hollywood $142.50M
#> 6                     Die Addams Family $100.04M
#> 7                     Avengers: Endgame $858.37M
#> 8                        Der Leuchtturm   $0.43M
#> 9  Stephen Kings Doctor Sleeps Erwachen     <NA>
#> 10                                 1917 $159.23M
#> 11                         Little Women $108.10M
#> 12                         Es Kapitel 2 $211.59M
#> 13                             The King     <NA>
#> 14                        The Gentlemen     <NA>
#> 15                       Captain Marvel $426.83M
Created on 2021-10-22 by the reprex package (v2.0.1)
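To see why selecting the items first matters, here is a minimal offline sketch. The HTML, the class names `item` and `gross`, and the movie titles are made up for illustration and are not IMDb's actual markup. It relies on html_node() (singular) returning one result per input node, with a missing node where nothing matches, which html_text() turns into NA.

```r
library(rvest)

# Hypothetical markup: the second item has no gross element
page <- minimal_html('
  <div class="item"><h3><a>Movie A</a></h3><span class="gross">$1.00M</span></div>
  <div class="item"><h3><a>Movie B</a></h3></div>
  <div class="item"><h3><a>Movie C</a></h3><span class="gross">$2.50M</span></div>
')

items <- html_nodes(page, ".item")

# html_node() (singular) keeps one entry per item, NA where the node is absent
per_item <- data.frame(
  name  = html_text(html_node(items, "h3 a")),
  gross = html_text(html_node(items, ".gross"))
)

# html_nodes() (plural) on the whole page drops the missing entry,
# so the result no longer lines up with the movie names
flat <- html_text(html_nodes(page, ".gross"))
```

Here `per_item` has three rows with NA for the second gross, while `flat` has only two elements, which is the misalignment the question ran into.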
CodePudding user response:
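Found by trial and error: select each movie's votes/gross block and extract the dollar amount with a regular expression. Movies without a gross value come back as NA.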
library(tidyverse)
library(rvest)
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
webpage <- read_html('https://www.imdb.com/search/title/?count=100&release_date=2019,2019&title_type=feature')
movies_link <- html_elements(webpage, r"(.sort-num_votes-visible)")
movies_link |> html_text2() |> str_extract("\\$([0-9,.]+)")
#> [1] "$165.36" "$53.37" "$27.33" "$335.45" "$142.50" "$100.04" "$858.37"
#> [8] "$0.43" NA "$159.23" "$108.10" "$211.59" NA NA
#> [15] "$426.83" "$26.74" "$0.35" "$515.20" "$390.53" "$12.14" "$316.83"
#> [22] "$96.37" "$140.37" "$57.01" "$171.02" "$175.08" "$96.85" "$62.74"
#> [29] "$7.00" "$85.71" NA "$173.96" NA NA "$26.80"
#> [36] "$0.40" "$117.62" "$73.29" "$29.21" "$6.56" "$22.96" "$7.74"
#> [43] "$355.56" NA NA NA "$35.40" "$62.25" "$20.55"
#> [50] NA "$69.06" "$22.68" NA "$45.37" "$2.00" "$110.50"
#> [57] "$543.64" NA NA "$80.00" NA NA "$111.05"
#> [64] NA "$65.85" NA NA "$3.76" NA NA
#> [71] NA NA NA "$477.37" "$80.55" "$67.16" NA
#> [78] NA "$434.04" "$21.90" NA NA NA "$113.93"
#> [85] "$5.33" "$13.12" "$39.01" NA NA "$74.15" NA
#> [92] NA NA NA "$45.73" NA NA "$61.70"
#> [99] NA NA
Created on 2021-10-22 by the reprex package (v2.0.1)
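If the extracted strings are needed as numbers, one possible follow-up (not part of the answer above) is to strip the dollar sign and any commas and then convert. The sketch below uses a hand-typed sample of the values shown above; `gross_chr` and `gross_num` are illustrative names. It assumes the figures are in millions of dollars, as the M suffix in the other answer's output suggests.

```r
# Hand-typed sample of the extracted strings, including a missing value
gross_chr <- c("$165.36", "$53.37", NA, "$0.43")

# Remove "$" and "," then convert; NA entries stay NA
gross_num <- as.numeric(gsub("[$,]", "", gross_chr))
```

The NA entries pass through unchanged, so the numeric vector stays aligned with the movies.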