I am doing a web scraping exercise where I want to get the table below from this URL:
https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory
Updated April 7, 2022 COVID-19 cases, deaths, and rates by location[5]
I right-click in the browser, choose Inspect, and try to find the table ID/node that should replace the "??" in the code below, but I am not able to find this node.
library(tidyverse)
library(rvest)
# get the data
url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"
html_data <- read_html(url)
html_data %>%
  html_node("??") %>% # how do I get the node containing the table
  html_table() %>%
  as_tibble()
Thank you
CodePudding user response:
Use your browser's developer tools to copy the table's XPath, then use it in place of "??".
suppressPackageStartupMessages({
library(httr)
library(rvest)
library(dplyr)
})
url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"
xp <- "/html/body/div[3]/div[3]/div[5]/div[1]/div[15]/div[5]/table"
html_data <- read_html(url)
html_data %>%
  html_elements(xpath = xp) %>% # select the table by its XPath
  html_table() %>%              # returns a list with one tibble per match
  .[[1]] %>%                    # take the first (only) table
  select(-1)                    # drop the first column
#> # A tibble: 218 x 4
#> Country `Deaths / million` Deaths Cases
#> <chr> <chr> <chr> <chr>
#> 1 World[a] 783 6,166,510 495,130,920
#> 2 Peru 6,366 212,396 3,549,511
#> 3 Bulgaria 5,314 36,655 1,143,424
#> 4 Bosnia and Herzegovina 4,819 15,728 375,948
#> 5 Hungary 4,738 45,647 1,863,039
#> 6 North Macedonia 4,433 9,234 307,142
#> 7 Montenegro 4,308 2,706 233,523
#> 8 Georgia 4,212 16,765 1,650,384
#> 9 Croatia 3,833 15,646 1,105,315
#> 10 Czech Republic 3,712 39,816 3,850,902
#> # ... with 208 more rows
Created on 2022-04-08 by the reprex package (v2.0.1)
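To check an XPath without hitting the network, it can help to try it on a small inline document first. The sketch below is only an illustration: the HTML is a hypothetical stand-in (not the real Wikipedia markup), and the `stats` id is invented for the example. It compares an absolute path, like the one copied from the browser, with a shorter path anchored on an id.

```r
library(rvest)

# Hypothetical stand-in page: one unrelated table, plus the table we want
# inside a div with an (invented) id of "stats".
page <- minimal_html('
  <table><tr><th>Other</th></tr><tr><td>x</td></tr></table>
  <div id="stats">
    <table>
      <tr><th>Country</th><th>Deaths</th></tr>
      <tr><td>Peru</td><td>212,396</td></tr>
      <tr><td>Bulgaria</td><td>36,655</td></tr>
    </table>
  </div>
')

# Absolute path: depends on the exact nesting of every ancestor element.
by_abs <- page %>%
  html_element(xpath = "/html/body/div/table") %>%
  html_table()

# Path anchored on an id: keeps working if unrelated elements move around.
by_id <- page %>%
  html_element(xpath = "//div[@id='stats']//table") %>%
  html_table()

by_id
```

Both calls select the same table in this toy document. On a real page, a browser-copied absolute path can also contain elements the browser adds while rendering, so an id-anchored path may be the safer choice when an id is available.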
CodePudding user response:
Rather than a long and fragile XPath, I would suggest a more stable, faster, and more descriptive CSS selector list. You can combine a specific parent id (generally the fastest thing to match on) with a child table class (the second fastest):
library(magrittr)
library(rvest)
df <- read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory') %>%
  html_element('#covid-19-cases-deaths-and-rates-by-location .wikitable') %>%
  html_table()
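To see how the id-plus-class selector narrows the match, here is a minimal offline sketch. The inline HTML is a hypothetical stand-in that only reuses the id and class names from the selector above; the real page's markup is assumed, not reproduced.

```r
library(rvest)

# Hypothetical stand-in page: two tables share the "wikitable" class, but only
# one sits inside the element with the id used in the selector.
page <- minimal_html('
  <table class="wikitable"><tr><th>Other</th></tr><tr><td>x</td></tr></table>
  <div id="covid-19-cases-deaths-and-rates-by-location">
    <table class="wikitable">
      <tr><th>Country</th><th>Cases</th></tr>
      <tr><td>Peru</td><td>3,549,511</td></tr>
    </table>
  </div>
')

# The class alone matches every wikitable on the page.
n_all <- length(html_elements(page, ".wikitable"))

# "#id .class" (descendant combinator) keeps only tables inside that id.
df <- page %>%
  html_element("#covid-19-cases-deaths-and-rates-by-location .wikitable") %>%
  html_table()

n_all
df
```

The space between the id and the class is the descendant combinator, so the table does not have to be a direct child of the element carrying the id.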
Recommended reading:
https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
Practice: