Scraping data from ITU download links with rvest

Time:07-05

I want to get the download links for each of the files on the website https://datahub.itu.int/indicators/ but am struggling to get what I need.

Each indicator seems to have a direct download link of the form https://api.datahub.itu.int/v2/data/download/byid/XXX/iscollection/YYY, where XXX is a number between 1 and roughly 100,000 and YYY is either true or false.

Ideally, I would like to get the link to each indicator and a corresponding name/html text of the link in one big dataframe.

I have tried to get the links for the files using rvest and various combinations of html_nodes, html_attrs and XPath expressions, but have not had any luck. I really want to avoid looping over 100,000 candidate download links by brute force, because that is horribly inefficient and will almost certainly cause issues for their servers.

I am not sure if there is a better way than using rvest, but any help would be most appreciated.

library(rvest)
library(httr)
library(tidyverse)
library(dplyr)

page = "https://datahub.itu.int/indicators/"
read_html(page) %>%
  html_attr("href")

CodePudding user response:

If you look at the requests the page makes (e.g. in the browser devtools), you will find a request to an API that retrieves all the links; from its response you can build the URLs yourself. (The other solution would be to use RSelenium, but this would be much more complicated.)

library(httr)
library(tidyverse)

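# Request the category dictionary that the indicators page itself loads
GET("https://api.datahub.itu.int/v2/dictionaries/getcategories") %>%
  content() %>%                 # parse the response body into a nested list
  map(as_tibble) %>%
  bind_rows() %>%
  # Unnest the sub-categories and their items into one row per indicator
  unnest_wider(subCategory) %>%
  unnest(items) %>%
  unnest_wider(items) %>%
  # Build the download URL from the indicator id and the collection flag
  mutate(url = paste0("https://api.datahub.itu.int/v2/data/download/byid/",
                      codeID,
                      "/iscollection/",
                      tolower(as.character(isCollection)))) %>%
  select(category, codeID, label, subCategory, isCollection, url)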
#> # A tibble: 181 × 6
#>    category     codeID label                      subCategory isCollection url  
#>    <chr>         <int> <chr>                      <chr>       <lgl>        <chr>
#>  1 Connectivity   8941 Households with a radio    Access      FALSE        http…
#>  2 Connectivity   8965 Households with a TV       Access      FALSE        http…
#>  3 Connectivity 100002 Households with multichan… Access      TRUE         http…
#>  4 Connectivity   8749 Households with telephone… Access      FALSE        http…
#>  5 Connectivity  20719 Individuals who own a mob… Access      FALSE        http…
#>  6 Connectivity  12046 Households with a computer Access      FALSE        http…
#>  7 Connectivity  12047 Households with Internet … Access      FALSE        http…
#>  8 Connectivity 100001 Households with access to… Access      TRUE         http…
#>  9 Connectivity 100000 Reasons for not having In… Access      TRUE         http…
#> 10 Connectivity     15 Fixed-telephone subscript… Access      FALSE        http…
#> # … with 171 more rows

Created on 2022-07-05 by the reprex package (v2.0.1)
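
If you want to reuse the URL-building step outside the pipeline, it can be pulled out into a small function. This is only a sketch in base R: build_download_url is a hypothetical helper name, not something provided by httr or by the ITU API, and the URL pattern is the one described in the question. The id is coerced to integer before pasting because a plain numeric id such as 100000 can otherwise be rendered in scientific notation.

```r
# Hypothetical helper (not part of httr or the ITU API): builds download URLs
# from indicator ids and their collection flags, following the pattern
# .../byid/<id>/iscollection/<true|false> described in the question.
build_download_url <- function(code_id, is_collection) {
  stopifnot(length(code_id) == length(is_collection))
  paste0(
    "https://api.datahub.itu.int/v2/data/download/byid/",
    as.integer(code_id),  # integer coercion avoids scientific notation
    "/iscollection/",
    tolower(as.character(as.logical(is_collection)))
  )
}

# Example with two ids and flags taken from the output above
example_urls <- build_download_url(c(8941, 100002), c(FALSE, TRUE))
print(example_urls)
```

Inside the pipeline this would replace the paste0() call, i.e. mutate(url = build_download_url(codeID, isCollection)).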
