Web scrape UNICEF data with rvest

Time:09-09

I want to automate downloading some UNICEF data from https://data.unicef.org/indicator-profile/ using rvest or a similar package. I have noticed that there are indicator codes, but I am having trouble identifying the correct codes and actually downloading the data.

Upon inspecting the page, there is a data-inner-wrapper class that seems like it might be useful. You can access a download link by going to a page associated with an indicator and specifying a time period. For example, CME_TMY5T9 is the code for Deaths aged 5 to 9.

The data is available by going to https://data.unicef.org/resources/data_explorer/unicef_f/?ag=UNICEF&df=GLOBAL_DATAFLOW&ver=1.0&dq=.CME_TMY5T9..&startPeriod=2017&endPeriod=2022 and then clicking a download link.

If anyone could help me figure out how to get all the data, that would be fantastic. Thanks

library(rvest)
library(dplyr)
library(tidyverse)

page = "https://data.unicef.org/indicator-profile/"
df = read_html(page) %>%
  #html_nodes("div.data-inner-wrapper") 
  html_nodes(xpath = "//div[@class='data-inner-wrapper']")

EDIT: Alternatively, downloading all data for each country would be possible. I think that would just require getting the download link or getting at the data within the table (since country codes aren't much of an issue).

This shows all the data for Afghanistan. I just need to figure out a programmatic way of actually downloading the data.

https://data.unicef.org/resources/data_explorer/unicef_f/?ag=UNICEF&df=GLOBAL_DATAFLOW&ver=1.0&dq=AFG..&startPeriod=1970&endPeriod=2022

CodePudding user response:

You are on the right track! The page at https://data.unicef.org/indicator-profile/ does not directly contain the indicator codes, because these are loaded dynamically after the initial page load. You can use the network tab of your web browser's developer tools to look at the requests your browser makes to fully load the page. The one you are looking for, with all the indicator codes, is here: https://uni-drp-rdm-api.azurewebsites.net/api/indicators

library(httr)
library(jsonlite)
library(glue)
library(tidyverse)  # provides %>% and read_csv(), both used below

## this gets the indicator codes
indicators <- GET("https://uni-drp-rdm-api.azurewebsites.net/api/indicators") %>% 
  content(as = "text") %>% 
  jsonlite::fromJSON()

## try looking at it in your browser 
browseURL("https://uni-drp-rdm-api.azurewebsites.net/api/indicators")

You also correctly identified the URL that lets you view individual datasets in the data explorer. What remains is to find the request that fires when you actually download an Excel file, and then substitute in the different helix codes from the indicators one at a time. I have not tried applying this to all indicators; for some the URL might differ and you might get incomplete data or errors. But this should get you started.

GET(glue("https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0/.{indicators$helixCode[3]}..?startPeriod=2017&endPeriod=2022&format=csv&labels=name")) %>% 
  content(as = "text") %>% 
  read_csv()
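
To make the URL pattern in the request above easier to reuse, here is a minimal sketch of a URL builder. The function names unicef_csv_url and indicator_key are hypothetical helpers made up for illustration, and the default years simply mirror the example request; nothing here is an official UNICEF client.

```r
# Hypothetical helper (not from any package): build the CSV download URL
# for one SDMX key, following the pattern of the request shown above.
unicef_csv_url <- function(key, start = 2017, end = 2022) {
  base <- "https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0"
  paste0(base, "/", key,
         "?startPeriod=", start,
         "&endPeriod=", end,
         "&format=csv&labels=name")
}

# Hypothetical helper: key for a single indicator, using the ".<code>.." pattern
indicator_key <- function(code) paste0(".", code, "..")

url <- unicef_csv_url(indicator_key("CME_TMY5T9"))
print(url)
```

The resulting string can be passed to GET() exactly as in the snippet above. A country key such as "AFG.." (taken from the data explorer link in the question) could in principle be passed as key too, but I have not checked that this endpoint accepts it.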

This might be a good place to get started on how to mimic the requests that your browser executes: https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html
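
When mimicking browser requests, it can help to check the HTTP status before parsing, since a failed request would otherwise hand an error page to read_csv(). A minimal sketch, assuming httr is installed; fetch_csv_text is a hypothetical helper name, not part of httr.

```r
library(httr)

# Hypothetical helper: download a URL and return the response body as text,
# stopping with an R error if the server answers with an HTTP error status.
fetch_csv_text <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)  # turns 4xx/5xx responses into R errors
  content(resp, as = "text", encoding = "UTF-8")
}
```

The returned text can then be piped into read_csv() as in the snippet above, and the error raised by stop_for_status() can be caught with tryCatch() if you want to skip failing indicators.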

CodePudding user response:

Here is what I did based on the very helpful code from @Datapumpernickel

library(dplyr)
library(httr)
library(jsonlite)
library(glue)
library(tidyverse)
library(tictoc)

## this gets the indicator codes
indicators <- GET("https://uni-drp-rdm-api.azurewebsites.net/api/indicators") %>% 
  content(as = "text") %>% 
  jsonlite::fromJSON()

## try looking at it in your browser 
#browseURL("https://uni-drp-rdm-api.azurewebsites.net/api/indicators")

tic()
# De-duplicate once so the loop index and the lookups refer to the same vector
codes = unique(indicators$helixCode)
n_codes = length(codes)
FULL_DF = NULL
for(i in seq_along(codes)){
  # tryCatch lets the loop carry on when an indicator fails to download or parse
  tryCatch({
    print(paste0("Processing : ", i, " of ", n_codes, " ", codes[i]))
    TMP = GET(glue("https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0/.{codes[i]}..?startPeriod=2017&endPeriod=2022&format=csv&labels=name")) %>% 
      content(as = "text") %>% 
      read_csv(col_types = cols())
    
    # Basic formatting for variables I want
    TMP = TMP %>% 
      select(`Geographic area`, Indicator, Sex, TIME_PERIOD, OBS_VALUE) %>%
      mutate(description = codes[i]) %>%
      rename(country = `Geographic area`,
             variablename = Indicator,
             disaggregation = Sex,
             year = TIME_PERIOD,
             value = OBS_VALUE)
    
    # rbind each indicator to the full dataframe 
    FULL_DF = FULL_DF %>% rbind(TMP)
  
  },
  error = function(cond){
    cat("\n WARNING COULD NOT PROCESS : ", i, " of ", n_codes, " ", codes[i], "\n")
    message(cond)
    return(NA)
  }
  )
}
toc()
toc()

# Save the data 
rio::export(FULL_DF, "unicef-data.csv")
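
A possible variation on the loop above, sketched under a few assumptions: collecting each indicator's table in a list and combining once with dplyr::bind_rows() avoids re-copying the growing data frame on every rbind, and purrr::possibly() takes the place of tryCatch. collect_indicators and stub_fetch are hypothetical names made up for this sketch, and the demo uses a stub with made-up values instead of a real download so the pattern can be tried offline.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: apply `fetch_one` to every code, skip the ones that fail,
# and stack the results. `fetch_one` is any function(code) returning a data frame.
collect_indicators <- function(codes, fetch_one) {
  safe_fetch <- possibly(fetch_one, otherwise = NULL)  # NULL on error instead of stopping
  results <- map(set_names(codes), safe_fetch)

  failed <- names(results)[map_lgl(results, is.null)]
  if (length(failed) > 0) {
    message("Could not process: ", paste(failed, collapse = ", "))
  }

  # Drop the failed (NULL) entries, then bind; .id stores the code in `description`
  bind_rows(compact(results), .id = "description")
}

# Offline demo: a stub stands in for the real download (values are made up)
stub_fetch <- function(code) {
  if (code == "BAD_CODE") stop("simulated download failure")
  tibble(year = c(2017, 2018), value = c(1, 2))
}

demo <- collect_indicators(c("CME_TMY5T9", "BAD_CODE"), stub_fetch)
print(demo)
```

In real use, fetch_one would wrap the GET(...) %>% content(as = "text") %>% read_csv(col_types = cols()) steps from the loop above, plus whatever select/rename formatting you need.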