Webscraping: find the node/table ID in R using inspect

Time:04-09

I am doing a web scraping exercise in which I want to get the table below from this URL:

https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory

Updated April 7, 2022 COVID-19 cases, deaths, and rates by location[5]

I right-click in the browser, choose Inspect, and try to find the table ID/node that should replace the "??" in the code below, but I am not able to find this node.

library(tidyverse)
library(rvest)

# get the data 

url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"

html_data <- read_html(url)

html_data %>%
  html_node("??") %>% # how do I get the node containing the table
  html_table() %>% 
  as_tibble()

Thank you

CodePudding user response:

In your browser's developer tools, right-click the table element and copy its XPath, then pass it through the xpath argument instead of using "??".

suppressPackageStartupMessages({
  library(rvest)  # read_html(), html_elements(), html_table()
  library(dplyr)  # select()
})

url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"
xp <- "/html/body/div[3]/div[3]/div[5]/div[1]/div[15]/div[5]/table"

html_data <- read_html(url)

html_data %>%
  html_elements(xpath = xp) %>% # select the table by its XPath
  html_table() %>%
  .[[1]] %>%
  select(-1)
#> # A tibble: 218 x 4
#>    Country                `Deaths / million` Deaths    Cases      
#>    <chr>                  <chr>              <chr>     <chr>      
#>  1 World[a]               783                6,166,510 495,130,920
#>  2 Peru                   6,366              212,396   3,549,511  
#>  3 Bulgaria               5,314              36,655    1,143,424  
#>  4 Bosnia and Herzegovina 4,819              15,728    375,948    
#>  5 Hungary                4,738              45,647    1,863,039  
#>  6 North Macedonia        4,433              9,234     307,142    
#>  7 Montenegro             4,308              2,706     233,523    
#>  8 Georgia                4,212              16,765    1,650,384  
#>  9 Croatia                3,833              15,646    1,105,315  
#> 10 Czech Republic         3,712              39,816    3,850,902  
#> # ... with 208 more rows

Created on 2022-04-08 by the reprex package (v2.0.1)
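
An absolute XPath like the one above depends on the exact position of every ancestor element, so it can stop matching when the page layout changes. As a minimal sketch of the difference, the snippet below builds a small made-up page with rvest's minimal_html() (the ids "banner" and "stats" and the table contents are hypothetical, not taken from the Wikipedia page) and selects the same table once by absolute position and once by a relative XPath anchored on an id:

```r
library(rvest)

# Hypothetical miniature page; ids and table contents are made up for illustration.
page <- minimal_html('
  <div id="banner">Notice</div>
  <div id="stats">
    <table class="wikitable">
      <tr><th>Country</th><th>Deaths</th></tr>
      <tr><td>Peru</td><td>212,396</td></tr>
    </table>
  </div>
')

# Absolute path: only works while the container is the second <div> in <body>.
abs_tbl <- page %>%
  html_element(xpath = "/html/body/div[2]/table") %>%
  html_table()

# Relative path: finds the table under the element with id "stats",
# wherever that element sits in the document.
rel_tbl <- page %>%
  html_element(xpath = "//div[@id='stats']//table") %>%
  html_table()

# Both selections return the same table on this page.
identical(abs_tbl, rel_tbl)
```

If another element were inserted before the "stats" div, the absolute path would point at the wrong div, while the id-anchored path would still match.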

CodePudding user response:

Rather than a long, fragile XPath, I would suggest a CSS selector list, which is more stable, faster, and more descriptive. Here you can combine a specific parent id (generally the fastest thing to match on) with a child table class (the second fastest):

library(magrittr)
library(rvest)

df <- read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory') %>%
  html_element('#covid-19-cases-deaths-and-rates-by-location .wikitable') %>%
  html_table()
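
To see how the "id, then descendant class" pattern behaves without hitting the network, here is a rough sketch on a made-up page built with minimal_html(). The id "cases-by-location" and the table contents are assumptions for illustration only. Note that html_element() returns just the first match, while html_elements() returns all of them:

```r
library(rvest)

# Hypothetical page with two tables sharing a class; only one sits under the id.
page <- minimal_html('
  <table class="wikitable">
    <tr><th>Other</th></tr>
    <tr><td>x</td></tr>
  </table>
  <div id="cases-by-location">
    <table class="wikitable sortable">
      <tr><th>Country</th><th>Cases</th></tr>
      <tr><td>Peru</td><td>3,549,511</td></tr>
      <tr><td>Bulgaria</td><td>1,143,424</td></tr>
    </table>
  </div>
')

# The class alone is ambiguous: it matches both tables.
all_tables <- page %>% html_elements(".wikitable")
length(all_tables)

# "#id .class" means: an element with that class anywhere inside the element with that id.
tbl <- page %>%
  html_element("#cases-by-location .wikitable") %>%
  html_table()

names(tbl)
```

The id narrows the search to one container, so the class only has to be unique inside that container rather than across the whole page.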

Recommended reading:

https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors

Practice:

https://flukeout.github.io/
