Missing value addition when scraping in R

Time:10-23

I'm trying to scrape the web using R and have run into missing values when scraping gross revenue from the IMDb movie listings. How can I automatically insert NA when a movie's gross revenue is unknown?

library(rvest)

webpage <- read_html('https://www.imdb.com/search/title/?count=100&release_date=2019,2019&title_type=feature')

# Select the gross-revenue spans, then extract their text
gross_data <- html_nodes(webpage,'.ghost~ .text-muted  span')
gross <- html_text(gross_data)

CodePudding user response:

One way to achieve this is to select the item nodes first and then extract the revenue (and any other information) from each item individually (thanks to @Dave2e for pointing out that purrr::map_dfr is not necessary). Because html_node() returns exactly one result per item, items without a gross figure yield NA. If you want to extract multiple pieces of information, I would suggest putting everything in a data.frame:

library(rvest)
library(magrittr)

webpage <- read_html("https://www.imdb.com/search/title/?count=100&release_date=2019,2019&title_type=feature")

result <- data.frame(
  name = webpage %>% html_nodes(".lister-item-content") %>% html_node("h3 a") %>% html_text(),
  gross = webpage %>% html_nodes(".lister-item-content") %>% html_node(".ghost~ .text-muted  span") %>% html_text()
)

head(result, 15)
#>                                    name    gross
#> 1    Knives Out: Mord ist Familiensache $165.36M
#> 2                              Parasite  $53.37M
#> 3                             Midsommar  $27.33M
#> 4                                 Joker $335.45M
#> 5      Once Upon a Time In... Hollywood $142.50M
#> 6                     Die Addams Family $100.04M
#> 7                     Avengers: Endgame $858.37M
#> 8                        Der Leuchtturm   $0.43M
#> 9  Stephen Kings Doctor Sleeps Erwachen     <NA>
#> 10                                 1917 $159.23M
#> 11                         Little Women $108.10M
#> 12                         Es Kapitel 2 $211.59M
#> 13                             The King     <NA>
#> 14                        The Gentlemen     <NA>
#> 15                       Captain Marvel $426.83M

Created on 2021-10-22 by the reprex package (v2.0.1)
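
To see why the per-item approach keeps the rows aligned, here is a minimal offline sketch. It uses a made-up HTML fragment: the class names `item`, `title`, and `gross` are hypothetical and are not IMDb's markup. html_nodes() drops items that have no match, while html_node() applied to each item returns a missing node, which html_text() turns into NA.

```r
library(rvest)

# Hypothetical markup for illustration only: the second item has no gross figure
page <- read_html('
  <div class="item"><span class="title">A</span><span class="gross">$1.00M</span></div>
  <div class="item"><span class="title">B</span></div>
  <div class="item"><span class="title">C</span><span class="gross">$3.00M</span></div>
')

# Selecting the gross spans directly silently drops the missing one (length 2)
direct <- page %>% html_nodes(".gross") %>% html_text()

# Selecting one node per item keeps one entry per item (length 3, NA for item B)
items <- page %>% html_nodes(".item")
per_item <- items %>% html_node(".gross") %>% html_text()
```

Here `direct` has two elements and can no longer be matched to the three titles, whereas `per_item` has three elements with NA in the second position.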

CodePudding user response:

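Found by trial and error: select the block that holds the votes and gross figures for each movie, then pull out the dollar amount with a regular expression. str_extract() returns NA wherever there is no match.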

library(tidyverse)
library(rvest)
#> 
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#> 
#>     guess_encoding

webpage <- read_html('https://www.imdb.com/search/title/?count=100&release_date=2019,2019&title_type=feature')

movies_link <- html_elements(webpage, r"(.sort-num_votes-visible)")

movies_link |> html_text2() |> str_extract("\\$([0-9,.]+)")
#>   [1] "$165.36" "$53.37"  "$27.33"  "$335.45" "$142.50" "$100.04" "$858.37"
#>   [8] "$0.43"   NA        "$159.23" "$108.10" "$211.59" NA        NA       
#>  [15] "$426.83" "$26.74"  "$0.35"   "$515.20" "$390.53" "$12.14"  "$316.83"
#>  [22] "$96.37"  "$140.37" "$57.01"  "$171.02" "$175.08" "$96.85"  "$62.74" 
#>  [29] "$7.00"   "$85.71"  NA        "$173.96" NA        NA        "$26.80" 
#>  [36] "$0.40"   "$117.62" "$73.29"  "$29.21"  "$6.56"   "$22.96"  "$7.74"  
#>  [43] "$355.56" NA        NA        NA        "$35.40"  "$62.25"  "$20.55" 
#>  [50] NA        "$69.06"  "$22.68"  NA        "$45.37"  "$2.00"   "$110.50"
#>  [57] "$543.64" NA        NA        "$80.00"  NA        NA        "$111.05"
#>  [64] NA        "$65.85"  NA        NA        "$3.76"   NA        NA       
#>  [71] NA        NA        NA        "$477.37" "$80.55"  "$67.16"  NA       
#>  [78] NA        "$434.04" "$21.90"  NA        NA        NA        "$113.93"
#>  [85] "$5.33"   "$13.12"  "$39.01"  NA        NA        "$74.15"  NA       
#>  [92] NA        NA        NA        "$45.73"  NA        NA        "$61.70" 
#>  [99] NA        NA

Created on 2021-10-22 by the reprex package (v2.0.1)
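
If you need numbers rather than strings, the extracted values can be converted while keeping the NA entries. This is only a sketch on made-up sample strings (the text format and the variable names `txt`, `gross_chr`, and `gross_num` are assumptions for illustration, not real scraped data):

```r
library(stringr)

# Made-up sample text for illustration; the second entry has no gross figure
txt <- c("Votes: 1,000 | Gross: $165.36M", "Votes: 500", "Votes: 20 | Gross: $0.43M")

# str_extract() yields NA where the pattern does not match
gross_chr <- str_extract(txt, "\\$[0-9,.]+")

# Drop the dollar sign and any thousands separators, then convert; NA stays NA
gross_num <- as.numeric(str_remove_all(gross_chr, "[$,]"))
```

The resulting values are in millions of dollars, since the pattern stops before the trailing "M".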
