Identify all excel files from a webpage with Rvest

Time:08-04

My problem is very similar to this one. I want to identify all Excel files on this website so that I can then download them using download.file. I have tried several variations without success, and I think the problem relates to my use of html_elements and html_attr. For some reason, when I try to select the specific links with the following code, excel_links is empty:

library(rvest)

url <- "https://www.portaltransparencia.cl/PortalPdT/directorio-de-organismos-regulados/?org=UN007"

read_html(url) |>
  html_elements("a") |>
  html_attr("href") -> excel_links

Any help will be greatly appreciated. Best, Maria

CodePudding user response:

Following the detail in the comments, it seems you want the Excel files from within "04. Personal y remuneraciones".

The folders housing the Excel files are public-facing, so you can simply start from the parent folder URL, extract the links to the child year folders, and then extract the Excel file links from each of those.

library(magrittr)
library(rvest)

year_folder <- "https://transparencia.uv.cl/documentos/personal/remuneraciones/contrata/"

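# Read a folder listing and return the full URLs of the links
# matched by the CSS selector.
extract_links <- function(parent_link, css_selector_list) {
  links <- read_html(parent_link) %>%
    html_elements(css_selector_list) %>%
    html_attr("href") %>%
    paste0(parent_link, .)  # assumes the hrefs are relative to parent_link
  return(links)
}

# Year folders: hrefs that end in "/" but do not start with "/"
folders_by_year <- extract_links(year_folder, 'li [href$="/"]:not([href^="/"])')

# One character vector of .xls links per year folder
excel_files <- lapply(folders_by_year, extract_links, 'li [href$=".xls"]')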
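
Since the goal in the question is to download the files with download.file, here is a minimal sketch of that last step. It uses base R only. The helpers keep_excel_links(), local_file_names() and download_excel_files(), and the "excel_files" destination folder, are hypothetical names made up for illustration; they are not part of rvest. It assumes each link looks like .../<year folder>/<file name>.

```r
# Sketch only: these helpers are illustrative, not part of rvest.

# Flatten a list of link vectors and keep only .xls/.xlsx links
# (case-insensitive), dropping NAs and duplicates.
keep_excel_links <- function(links) {
  links <- unlist(links, use.names = FALSE)
  links <- links[!is.na(links)]
  unique(links[grepl("\\.xlsx?$", links, ignore.case = TRUE)])
}

# Build a local file name as "<parent folder>_<file name>" so that files
# with the same name in different year folders do not overwrite each other.
local_file_names <- function(links) {
  paste(basename(dirname(links)), basename(links), sep = "_")
}

# Download every link into dest_dir and return the local paths invisibly.
download_excel_files <- function(links, dest_dir = "excel_files") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest_paths <- file.path(dest_dir, local_file_names(links))
  for (i in seq_along(links)) {
    # mode = "wb" writes the file as binary, which matters on Windows
    tryCatch(
      download.file(links[i], dest_paths[i], mode = "wb", quiet = TRUE),
      error = function(e) message("Failed to download: ", links[i])
    )
  }
  invisible(dest_paths)
}
```

With the list produced by the lapply() call, usage would be along the lines of download_excel_files(keep_excel_links(excel_files)). The extension filter is mostly a safety net here, because the CSS selector already restricts the links to .xls.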