I have a similar problem to this one. I want to download the tables for all years/months in this webpage. I have been able to download the tables that appear when opening the website using the following code:
library(rvest)

#######
# Pages
#######
yr.list <- seq(2012, 2020)
mes.list <- c("Enero", "Febrero", "Marzo", "Abril", "Mayo", "Junio", "Julio", "Agosto", "Septiembre", "Octubre", "Noviembre", "Diciembre")
c.list <- c("contrata", "planta")
################################################
## UTarapaca Scraping Loop PLANTA & CONTRATA
################################################
combined_df <- data.frame()
for (c in c.list) {
  for (yr in yr.list) {
    for (mes in mes.list) {
      # UTarapaca root URL
      root <- "https://www.uta.cl/transparencia/"
      # Full link
      url <- paste(root, c, "/", yr, "/", mes, "/", sep = "")
      # Parse HTML file
      file <- read_html(url)
      # Get the nodes where the tables live
      tables <- html_nodes(file, "table")
      # This is the relevant table
      table <- as.data.frame(html_table(tables[1], fill = TRUE))
    }
  }
}
Nonetheless, that code only fetches the 10 records on the first page (Registros por pagina = 10 in the upper right corner of the table), and what I want is to download all the records that the paginated table contains. I tried looping over the different "table pages" (the page links are in the lower right corner of the table), but the URL does not change when I change the page.
Any help on this would be greatly appreciated. Best, Maria
CodePudding user response:
Here is a way with rvest. First create all links outside any loop. Then lapply an anonymous function to read each page and extract the tables from those pages.
library(httr)
library(rvest)
library(dplyr)

root <- "https://www.uta.cl/transparencia/"
c.list <- c("contrata", "planta")
yr.list <- seq(2012, 2020)
mes.list <- c("Enero", "Febrero", "Marzo", "Abril", "Mayo", "Junio", "Julio", "Agosto", "Septiembre", "Octubre", "Noviembre", "Diciembre")

# One row per combination of contract type, year and month
df_links <- expand.grid(c.list, yr.list, mes.list)
head(df_links)

# Build one URL per row
links <- with(df_links, sprintf("%s%s/%s/%s", root, Var1, Var2, Var3))
length(links)

# Read each page and return its table
tables_list <- lapply(links, \(x) {
  page <- read_html(x)
  tbl_list <- page %>%
    html_elements("table") %>%
    html_children() %>%
    html_table()
  # Copy the column names from the first child onto the second, then return it
  names(tbl_list[[2]]) <- names(tbl_list[[1]])
  tbl_list[[2]]
})
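The question asks for a single combined data frame, while the code above returns a list with one table per link. A minimal sketch of one way to combine them, assuming the tables share compatible column names and types (not verified against the site). The two small data frames below are made-up stand-ins for the scraped tables, and combined_df is simply the name used in the question:

```r
library(dplyr)

# Hypothetical stand-ins for two scraped tables; the real tables_list
# would come from the lapply() call above.
links <- c(
  "https://www.uta.cl/transparencia/contrata/2012/Enero",
  "https://www.uta.cl/transparencia/planta/2012/Enero"
)
tables_list <- list(
  data.frame(Nombres = c("A", "B"), Estamento = c("TECNICO", "AUXILIAR")),
  data.frame(Nombres = "C", Estamento = "ACADEMICO")
)

# Name each element by its URL so the source survives the row-bind
names(tables_list) <- links

# Stack all tables; the list names go into a new "link" column
combined_df <- bind_rows(tables_list, .id = "link")
```

If the same column comes back with different types on different pages, bind_rows() will stop with an error, so the columns may need converting to a common type first.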
CodePudding user response:
Libraries and data
library(tidyverse)
library(magrittr)
library(rvest)

df <- expand.grid(
  yr.list = seq(2012, 2020),
  mes.list = c(
    "Enero",
    "Febrero",
    "Marzo",
    "Abril",
    "Mayo",
    "Junio",
    "Julio",
    "Agosto",
    "Septiembre",
    "Octubre",
    "Noviembre",
    "Diciembre"
  ),
  c.list = c("contrata", "planta")
) %>%
  mutate(links = paste0(
    "https://www.uta.cl/transparencia/",
    c.list,
    "/",
    yr.list,
    "/",
    mes.list
  )) %>%
  as_tibble()
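Define a function to get the table, then store each result in a nested data frame
# Read one page and return its first table with cleaned column names
get_data <- function(link) {
  link %>%
    read_html() %>%
    html_table() %>%
    getElement(1) %>%
    janitor::clean_names()
}

# Only the first five links are scraped here; drop slice() to scrape them all
final_df <- df %>%
  slice(1:5) %>%
  mutate(content = map(links, get_data))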
# A tibble: 5 × 5
yr.list mes.list c.list links content
<int> <fct> <fct> <chr> <list>
1 2012 Enero contrata https://www.uta.cl/transparencia/contrata/201… <tibble>
2 2013 Enero contrata https://www.uta.cl/transparencia/contrata/201… <tibble>
3 2014 Enero contrata https://www.uta.cl/transparencia/contrata/201… <tibble>
4 2015 Enero contrata https://www.uta.cl/transparencia/contrata/201… <tibble>
5 2016 Enero contrata https://www.uta.cl/transparencia/contrata/201… <tibble>
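When slice(1:5) is removed and every link is scraped, a single failing page would stop the whole map() call. A minimal sketch of one way to guard against that, assuming purrr's possibly() is acceptable here. fake_get_data and the link strings are hypothetical stand-ins so the example runs without network access; in practice get_data would be wrapped instead:

```r
library(purrr)

# Hypothetical stand-in for get_data(): fails for one made-up link
fake_get_data <- function(link) {
  if (grepl("missing", link)) stop("page not found")
  data.frame(link = link, n = 1L)
}

# Wrapped version returns NULL instead of raising an error
safe_get <- possibly(fake_get_data, otherwise = NULL)

links <- c("ok-1", "missing-2", "ok-3")
results <- map(links, safe_get)

# Drop the NULL entries left by failed links
results_ok <- compact(results)
```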
Unnest and view it
# A tibble: 1,614 × 25
yr.list mes.list c.list apellido_paterno apellido_materno nombres estamento
<int> <fct> <fct> <chr> <chr> <chr> <chr>
1 2012 Enero contrata ACEVEDO UBILLA CHARLIE E… TECNICO
2 2012 Enero contrata AGUAYO BURDILES CRISTIAN … AUXILIAR
3 2012 Enero contrata AGUIRRE POLLAROLO TERESA DE… ACADEMICO
4 2012 Enero contrata ALARCON HERRERA JUAN FRAN… AUXILIAR
5 2012 Enero contrata ALARCON MENESES LUIS MANU… PROFESIO…
6 2012 Enero contrata ALEGRE OSSANDON DANIEL AUXILIAR
7 2012 Enero contrata ALFONSO GAJARDO JORGE CRI… ADMINIST…
8 2012 Enero contrata ALFRED URIZAR MARIA CRI… ACADEMICO
9 2012 Enero contrata ALVAREZ FLORES KAREN PROFESIO…
10 2012 Enero contrata ALVEAL FUENTES CLAUDIA ADMINIST…
# … with 1,604 more rows, and 18 more variables: grado_erut <chr>, bienios <chr>,
# jerarquia_academica <chr>, jornada <chr>,
# calificacion_profesional_o_formacion <chr>, cargo_o_funcion <chr>,
# region <chr>, asignaciones_especiales <chr>, haberes_transitorios <chr>,
# remuneracion_bruta_segun_grado <chr>, renta_bruta_mensualizada <chr>,
# unidad_monetaria <chr>, horas_extraordinarias <chr>, fecha_inicio <chr>,
# fecha_termino <chr>, observaciones <chr>, …
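The call that produced the unnested output above is not shown in the answer. A minimal sketch of that step, assuming tidyr's unnest() was used and that the links column was dropped first (an inference from the printed columns, not something the answer states). The small final_df below is a made-up stand-in with the same shape as the nested tibble, and unnested_df is a hypothetical name:

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the nested final_df built earlier
final_df <- tibble(
  yr.list = c(2012L, 2013L),
  links = c("link-1", "link-2"),
  content = list(
    tibble(nombres = c("A", "B"), estamento = c("TECNICO", "AUXILIAR")),
    tibble(nombres = "C", estamento = "ACADEMICO")
  )
)

# Drop the URL column, then expand each nested table into rows
unnested_df <- final_df %>%
  select(-links) %>%
  unnest(content)
```

Each row of the nested tables is repeated alongside the year, month and contract type of the page it came from, which gives one flat table.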