I want to get the download links for each of the files on the website https://datahub.itu.int/indicators/ but am struggling to get what I need.
Each indicator seems to contain a direct link to download the data in the following format: https://api.datahub.itu.int/v2/data/download/byid/XXX/iscollection/YYY where XXX is some sort of number between 1 and 100,000 or so and YYY is either true or false.
Ideally, I would like to get the link to each indicator and the corresponding name/HTML text of the link in one big dataframe.
I have tried to get the links for the files using rvest and various combinations of html_nodes, html_attrs and XPaths, but have not had any luck. I really want to avoid running a loop and brute-forcing 100,000 download links, because that is horribly inefficient and will almost certainly cause issues for their servers.
I am not sure if there is a better way than using rvest, but any help would be most appreciated.
library(rvest)
library(httr)
library(tidyverse)
library(dplyr)
page = "https://datahub.itu.int/indicators/"
read_html(page) %>%
html_attr("href")
CodePudding user response:
If you look at the requests the page makes (e.g. in the browser devtools), you will find that there is a request to an API which retrieves all the links; from this you can build the URLs yourself. (The other solution would be to use RSelenium, but this would be much more complicated.)
library(httr)
library(tidyverse)
# Fetch the category dictionary that the page itself requests
GET("https://api.datahub.itu.int/v2/dictionaries/getcategories") %>%
  content() %>%
  # Flatten the nested category / sub-category / item lists
  map(as_tibble) %>%
  bind_rows() %>%
  unnest_wider(subCategory) %>%
  unnest(items) %>%
  unnest_wider(items) %>%
  # Build the download URL in the format given in the question
  mutate(url = paste0("https://api.datahub.itu.int/v2/data/download/byid/",
                      codeID,
                      "/iscollection/",
                      tolower(as.character(isCollection)))) %>%
  select(category, codeID, label, subCategory, isCollection, url)
#> # A tibble: 181 × 6
#> category codeID label subCategory isCollection url
#> <chr> <int> <chr> <chr> <lgl> <chr>
#> 1 Connectivity 8941 Households with a radio Access FALSE http…
#> 2 Connectivity 8965 Households with a TV Access FALSE http…
#> 3 Connectivity 100002 Households with multichan… Access TRUE http…
#> 4 Connectivity 8749 Households with telephone… Access FALSE http…
#> 5 Connectivity 20719 Individuals who own a mob… Access FALSE http…
#> 6 Connectivity 12046 Households with a computer Access FALSE http…
#> 7 Connectivity 12047 Households with Internet … Access FALSE http…
#> 8 Connectivity 100001 Households with access to… Access TRUE http…
#> 9 Connectivity 100000 Reasons for not having In… Access TRUE http…
#> 10 Connectivity 15 Fixed-telephone subscript… Access FALSE http…
#> # … with 171 more rows
Created on 2022-07-05 by the reprex package (v2.0.1)
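If you want to reuse the URL-building step on its own, it can be pulled out into a small helper. The following is only a sketch in base R: the function name build_download_url and the two-row indicators data frame are made up for illustration (the IDs and flags are copied from the output above), and the URL pattern is the one described in the question. It formats the ID with %d because as.character() on a double such as 100000 gives "1e+05", which would break the link if codeID ever arrived as a double rather than an integer.

```r
# Hypothetical helper: build a download URL from an indicator ID and its
# collection flag, following the URL pattern given in the question.
build_download_url <- function(code_id, is_collection) {
  sprintf(
    "https://api.datahub.itu.int/v2/data/download/byid/%d/iscollection/%s",
    as.integer(code_id),                   # avoids "1e+05" for 100000
    tolower(as.character(is_collection))   # TRUE/FALSE -> "true"/"false"
  )
}

# Illustrative rows only; the IDs and flags are taken from the output above.
indicators <- data.frame(
  codeID = c(8941L, 100000L),
  isCollection = c(FALSE, TRUE)
)

# sprintf() is vectorised, so whole columns can be passed at once.
indicators$url <- build_download_url(indicators$codeID, indicators$isCollection)
print(indicators$url)
```

The same function could replace the paste0() call inside mutate() in the pipeline above, if you prefer that.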