No data scraped with the rvest package?

Time:05-02

# Load Packages
pacman::p_load(tidyverse, rvest)

# Set URL
url <- "https://www.worldometers.info/coronavirus/"
website <- read_html(url)

# Scrape Cases Data
cases_html <- html_nodes(website, "td.sorting_1")
cases <- html_text(cases_html)

cases_html
cases

I am trying to scrape web data with rvest, but both of my variables ("cases_html" and "cases") come back empty when I check them. These are not errors as such, just empty results. The output for each, respectively, is:

> {xml_nodeset (0)}

> character(0)

I am not sure why no data is being scraped from this website. I have also tried the RSelenium package, as recommended in another post here, but that code failed with an unrelated error. I figure a solution should be possible within rvest itself, and I would like to understand exactly what is going wrong here.

Answer:

It's not clear what you are trying to scrape from the page, but you can get the main data table like this:

library(tidyverse)
library(rvest)

read_html("https://www.worldometers.info/coronavirus/") %>%
  html_nodes("#main_table_countries_today") %>%
  html_table() %>%
  pluck(1)
#> # A tibble: 244 x 22
#>      `#` `Country,Other` TotalCases  NewCases   TotalDeaths NewDeaths
#>    <int> <chr>           <chr>       <chr>      <chr>       <chr>    
#>  1    NA "North America" 98,313,200  " 23,167"  1,459,752   " 147"   
#>  2    NA "Asia"          147,921,193 " 130,458" 1,423,876   " 395"   
#>  3    NA "South America" 56,801,380  " 17,492"  1,294,318   " 31"    
#>  4    NA "Europe"        191,122,646 " 187,856" 1,817,850   " 587"   
#>  5    NA "Oceania"       7,156,060   " 46,381"  10,626      " 59"    
#>  6    NA "Africa"        11,902,057  " 6,666"   253,795     " 4"     
#>  7    NA ""              721         ""         15          ""       
#>  8    NA "World"         513,217,257 " 412,020" 6,260,232   " 1,223" 
#>  9     1 "USA"           83,055,836  " 18,777"  1,020,749   " 89"    
#> 10     2 "India"         43,079,157  " 3,293"   523,803     ""       
#> # ... with 234 more rows, and 16 more variables: TotalRecovered <chr>,
#> #   NewRecovered <chr>, ActiveCases <chr>, `Serious,Critical` <chr>,
#> #   `Tot Cases/1M pop` <chr>, `Deaths/1M pop` <chr>, TotalTests <chr>,
#> #   `Tests/1M pop` <chr>, Population <chr>, Continent <chr>,
#> #   `1 Caseevery X ppl` <chr>, `1 Deathevery X ppl` <chr>,
#> #   `1 Testevery X ppl` <int>, `New Cases/1M pop` <chr>,
#> #   `New Deaths/1M pop` <dbl>, `Active Cases/1M pop` <chr>

Created on 2022-04-30 by the reprex package (v2.0.1)
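
As for why the original selector matched nothing: a class like sorting_1 is the kind that the DataTables JavaScript library adds to table cells in the browser after the page has loaded, so it is most likely absent from the raw HTML that read_html() downloads. Here is a minimal offline sketch of that effect. The HTML string is a made-up stand-in for what the server sends, not the site's real markup; only the table id is taken from the code above.

```r
library(rvest)

# Hypothetical stand-in for the served HTML: no "sorting_1" class is present,
# because that class would only be added by JavaScript running in a browser.
static_html <- '<table id="main_table_countries_today">
  <tr><th>Country</th><th>TotalCases</th></tr>
  <tr><td>USA</td><td>83,055,836</td></tr>
  <tr><td>India</td><td>43,079,157</td></tr>
</table>'

page <- read_html(static_html)

# The class-based selector finds nothing in the static markup
by_class <- html_nodes(page, "td.sorting_1")
length(by_class)

# Selecting by the table id works, because the id is part of the static markup
by_id <- html_nodes(page, "#main_table_countries_today")
cases_table <- html_table(by_id)[[1]]
cases_table
```

If you inspect a page in the browser's developer tools you see the DOM after scripts have run, so it is worth checking selectors against "view source" (or the result of read_html()) instead.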

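Note that the count columns in the output above come back as character (<chr>) because of the thousands separators. If you need numbers, one option is readr::parse_number(). This is a rough sketch on a small hand-made tibble that copies two rows from the output above; sample_rows and cleaned are just illustrative names.

```r
library(dplyr)
library(readr)

# Hand-made sample mirroring two rows of the scraped table (values as strings)
sample_rows <- tibble(
  `Country,Other` = c("USA", "India"),
  TotalCases      = c("83,055,836", "43,079,157"),
  NewCases        = c(" 18,777", " 3,293")
)

# parse_number() drops the grouping commas and surrounding whitespace
cleaned <- sample_rows %>%
  mutate(across(c(TotalCases, NewCases), parse_number))

cleaned
```

On the real table you would apply the same mutate() to whichever columns you need, after the pluck(1) step.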