I want to automate downloading some UNICEF data from https://data.unicef.org/indicator-profile/ using rvest or a similar package. I have noticed that there are indicator codes, but I am having trouble identifying the correct codes and actually downloading the data.
Upon inspecting the page, I found a data-inner-wrapper class that seems like it might be useful. You can reach a download link by going to the page associated with an indicator and specifying a time period. For example, CME_TMY5T9 is the code for Deaths aged 5 to 9.
The data is available by going to https://data.unicef.org/resources/data_explorer/unicef_f/?ag=UNICEF&df=GLOBAL_DATAFLOW&ver=1.0&dq=.CME_TMY5T9..&startPeriod=2017&endPeriod=2022 and then clicking a download link.
If anyone could help me figure out how to get all the data, that would be fantastic. Thanks
library(rvest)
library(dplyr)
library(tidyverse)
page = "https://data.unicef.org/indicator-profile/"
df = read_html(page) %>%
#html_nodes("div.data-inner-wrapper")
html_nodes(xpath = "//div[@class='data-inner-wrapper']")
EDIT: Alternatively, downloading all data for each country would be possible. I think that would just require getting the download link or getting at the data within the table (since country codes aren't much of an issue).
This shows all the data for Afghanistan. I just need to figure out a programmatic way of actually downloading the data....
CodePudding user response:
You are on the right track! The page at https://data.unicef.org/indicator-profile/ does not directly contain the indicator codes, because these are loaded dynamically after the initial page load. You can use the "network analysis" function of your web browser to look at the different requests the browser makes to fully load a page. The request you are looking for, which returns all the indicator codes, is this one: https://uni-drp-rdm-api.azurewebsites.net/api/indicators
library(httr)
library(jsonlite)
library(glue)
library(dplyr) # provides the %>% pipe used below
library(readr) # provides read_csv() used below
## this gets the indicator codes
indicators <- GET("https://uni-drp-rdm-api.azurewebsites.net/api/indicators") %>%
content(as = "text") %>%
jsonlite::fromJSON()
## try looking at it in your browser
browseURL("https://uni-drp-rdm-api.azurewebsites.net/api/indicators")
You also correctly identified the URL that lets you download individual datasets in the data explorer. What remains is to find the request that fires when you actually download an Excel file, and then substitute in the different helix codes from the indicators one at a time. I have not tried applying this to all indicators; for some, the URL might differ and you might get incomplete data or errors. But this should get you started.
GET(glue("https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0/.{indicators$helixCode[3]}..?startPeriod=2017&endPeriod=2022&format=csv&labels=name")) %>%
content(as = "text") %>%
read_csv()
This might be a good place to start learning how to mimic the requests that your browser executes: https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html
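To reuse the request above for other indicators or time periods, the URL can be built by a small helper. This is only a minimal sketch: build_unicef_url is a hypothetical name, and it assumes the same URL pattern holds for every indicator, which has not been verified.

```r
# Hypothetical helper (not part of any package): build the SDMX CSV URL
# for one or more indicator codes and a time period.
# Assumption: every indicator lives in the GLOBAL_DATAFLOW dataflow.
build_unicef_url <- function(code, start = 2017, end = 2022) {
  base <- "https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0"
  sprintf("%s/.%s..?startPeriod=%d&endPeriod=%d&format=csv&labels=name",
          base, code, as.integer(start), as.integer(end))
}

# Build the URL for the example code from the question
url <- build_unicef_url("CME_TMY5T9")
```

The resulting string can be passed to GET() exactly like the glue() call above.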
CodePudding user response:
Here is what I did, based on the very helpful code from @Datapumpernickel:
library(dplyr)
library(httr)
library(jsonlite)
library(glue)
library(tidyverse)
library(tictoc)
## this gets the indicator codes
indicators <- GET("https://uni-drp-rdm-api.azurewebsites.net/api/indicators") %>%
content(as = "text") %>%
jsonlite::fromJSON()
## try looking at it in your browser
#browseURL("https://uni-drp-rdm-api.azurewebsites.net/api/indicators")
tic()
FULL_DF = NULL
# Deduplicate the codes once so the loop index and the progress counter agree
codes = unique(indicators$helixCode)
for(i in seq_along(codes)){
  # Wrap each download in tryCatch so one failing indicator does not stop the loop
  tryCatch({
    print(paste0("Processing : ", i, " of ", length(codes), " ", codes[i]))
    TMP = GET(glue("https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/UNICEF,GLOBAL_DATAFLOW,1.0/.{codes[i]}..?startPeriod=2017&endPeriod=2022&format=csv&labels=name")) %>%
      content(as = "text") %>%
      read_csv(col_types = cols())
    # Basic formatting for variables I want
    TMP = TMP %>%
      select(`Geographic area`, Indicator, Sex, TIME_PERIOD, OBS_VALUE) %>%
      mutate(description = codes[i]) %>%
      rename(country = `Geographic area`,
             variablename = Indicator,
             disaggregation = Sex,
             year = TIME_PERIOD,
             value = OBS_VALUE)
    # rbind each indicator to the full dataframe
    FULL_DF = FULL_DF %>% rbind(TMP)
  },
  error = function(cond){
    cat("\n WARNING COULD NOT PROCESS : ", i, " of ", length(codes), " ", codes[i], "\n")
    message(cond)
    return(NA)
  }
  )
}
toc()
# Save the data
rio::export(FULL_DF, "unicef-data.csv")
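The loop above grows FULL_DF with rbind() on every iteration, which can get slow with several hundred indicators. One possible alternative is to collect each result in a list and bind once at the end. This is a sketch using base R only; collect_indicators and fetch_one are hypothetical names, where fetch_one stands in for the GET() plus read_csv() step and is assumed to return a data frame with the same columns for every code.

```r
# Hypothetical sketch: fetch every code, skip the ones that fail,
# and bind the successful results once at the end.
collect_indicators <- function(codes, fetch_one) {
  results <- lapply(codes, function(code) {
    tryCatch(
      fetch_one(code),
      error = function(cond) {
        message("Could not process ", code, ": ", conditionMessage(cond))
        NULL  # mark the failure so it can be dropped below
      }
    )
  })
  # Drop failed downloads, then row-bind the remaining data frames
  results <- Filter(Negate(is.null), results)
  do.call(rbind, results)
}
```

Calling it with the unique helix codes and a function wrapping the download step should give a data frame comparable to FULL_DF, provided every indicator returns the same columns.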