Webscraping using Rvest for wrapped tables

Time:08-06

I have a similar problem to this one. I want to download the tables for all years/months in this webpage. I have been able to download the tables that appear when opening the website using the following code:

library(rvest)

#######
# Pages 
#######
yr.list <- seq(2012, 2020)
mes.list <- c("Enero", "Febrero", "Marzo", "Abril", "Mayo", "Junio", "Julio", "Agosto", "Septiembre", "Octubre", "Noviembre", "Diciembre")
c.list <- c("contrata", "planta")

################################################
## UTarapaca Scraping Loop PLANTA & CONTRATA
################################################

combined_df <- data.frame()
for (c in c.list) {
  for (yr in yr.list) {
    for (mes in mes.list) {
      # UTarapaca URL
      root <- "https://www.uta.cl/transparencia/"

      # Full link
      url <- paste(root, c, "/", yr, "/", mes, "/", sep = "")

      # Parse HTML file
      file <- read_html(url)

      # Get the nodes where the tables live
      tables <- html_nodes(file, "table")

      # This is the relevant table
      table <- as.data.frame(html_table(tables[1], fill = TRUE))
    }
  }
}

Nonetheless, that code only fetches the 10 records from the first page ("Registros por pagina" = 10 in the upper-right corner of the table), and what I want is to download all the records that the wrapped (paginated) table contains. I tried looping over the different table pages (see the page links in the lower-right corner of the table), but the URL does not change when I change the page.

Any help on this would be greatly appreciated. Bests, Maria
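Editor's sketch: one possible reason the URL never changes is that the pagination happens in the browser, with every row already present in the HTML that the server sends. That is an assumption about this site, not something the question confirms. The toy page below (built with rvest's minimal_html(), with invented column names and values) shows that html_table() returns every row found in the markup, regardless of how many rows a browser widget would display at once.

```r
library(rvest)

# Build a toy HTML table with 25 rows (hypothetical data, not from the site)
rows <- paste0("<tr><td>", 1:25, "</td><td>name", 1:25, "</td></tr>", collapse = "")
page <- minimal_html(paste0(
  "<table><thead><tr><th>id</th><th>nombre</th></tr></thead><tbody>",
  rows,
  "</tbody></table>"
))

# html_table() parses the whole <table>, not only the rows a browser would show
tbl <- page %>%
  html_element("table") %>%
  html_table()

nrow(tbl)
```

If the real page behaves the same way, counting the rows of the parsed table is a quick check of whether anything is actually missing.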

CodePudding user response:

Here is a way with rvest. First build all the links outside any loop. Then lapply an anonymous function over them to read each page and extract its table. (The \(x) shorthand for anonymous functions needs R 4.1 or later.)

library(httr)
library(rvest)
library(dplyr)

root <- "https://www.uta.cl/transparencia/"
c.list <- c("contrata","planta")
yr.list <- seq(2012, 2020)
mes.list <- c("Enero", "Febrero", "Marzo", "Abril", "Mayo", "Junio", "Julio", "Agosto", "Septiembre", "Octubre", "Noviembre", "Diciembre")

df_links <- expand.grid(c.list, yr.list, mes.list)
head(df_links)

links <- with(df_links, sprintf("%s%s/%s/%s", root, Var1, Var2, Var3))
length(links)

tables_list <- lapply(links, \(x) {
  page <- read_html(x)
  tbl_list <- page %>%
    html_elements("table") %>%
    html_children() %>%
    html_table()
  names(tbl_list[[2]]) <- names(tbl_list[[1]])
  tbl_list[[2]]
})
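The code above leaves one table per link in tables_list. A minimal sketch of stacking them into a single data frame, assuming all tables share the same columns, is shown here. The two small data frames are toy stand-ins for the scraped tables, and the column names in them are invented for illustration.

```r
library(dplyr)

# Toy stand-ins for `links` and `tables_list` (the real ones come from the scraping step)
links <- c(
  "https://www.uta.cl/transparencia/contrata/2012/Enero",
  "https://www.uta.cl/transparencia/planta/2012/Enero"
)
tables_list <- list(
  data.frame(Nombres = c("A", "B"), Estamento = c("TECNICO", "AUXILIAR")),
  data.frame(Nombres = "C", Estamento = "ACADEMICO")
)

# Name each table after its link so the source page survives the stacking
names(tables_list) <- links

# bind_rows() stacks the tables; .id stores the list names in a new "link" column
combined_df <- bind_rows(tables_list, .id = "link")
combined_df
```

Keeping the link as a column makes it possible to recover the contract type, year and month of each row afterwards.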

CodePudding user response:

Libraries and data

library(tidyverse)
library(magrittr)
library(rvest)

df <- expand.grid(
  yr.list = seq(2012, 2020),
  mes.list = c(
    "Enero",
    "Febrero",
    "Marzo",
    "Abril",
    "Mayo",
    "Junio",
    "Julio",
    "Agosto",
    "Septiembre",
    "Octubre",
    "Noviembre",
    "Diciembre"
  ),
  c.list = c("contrata", "planta")
) %>%
  mutate(links = paste0(
    "https://www.uta.cl/transparencia/",
    c.list,
    "/",
    yr.list,
    "/",
    mes.list
  )) %>% 
  as_tibble

Define a function to get the table and return it in a nested data frame

get_data <- function(link) {
  link %>%
    read_html() %>%
    html_table() %>%
    getElement(1) %>%
    janitor::clean_names()
}

final_df <- df %>%
  slice(1:5) %>% 
  mutate(content = map(links, get_data))

# A tibble: 5 × 5
  yr.list mes.list c.list   links                                          content 
    <int> <fct>    <fct>    <chr>                                          <list>  
1    2012 Enero    contrata https://www.uta.cl/transparencia/contrata/201… <tibble>
2    2013 Enero    contrata https://www.uta.cl/transparencia/contrata/201… <tibble>
3    2014 Enero    contrata https://www.uta.cl/transparencia/contrata/201… <tibble>
4    2015 Enero    contrata https://www.uta.cl/transparencia/contrata/201… <tibble>
5    2016 Enero    contrata https://www.uta.cl/transparencia/contrata/201… <tibble>
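The example above only reads the first five links. When running over all of them, a single failing request would stop map(). A hedged sketch of one way around that uses purrr::possibly(). It assumes that some year/month pages might be missing, which the original post does not state. get_data_demo() is a hypothetical stand-in for get_data() so the example runs without network access.

```r
library(purrr)

# Hypothetical stand-in for get_data(): fails for one link to mimic a missing page
get_data_demo <- function(link) {
  if (grepl("missing", link)) stop("page not found")
  data.frame(link = link, n = 1L)
}

# possibly() returns a wrapped function that yields `otherwise` instead of raising an error
safe_get_data <- possibly(get_data_demo, otherwise = NULL)

demo_links <- c("ok-1", "missing-2", "ok-3")
results <- map(demo_links, safe_get_data)

# compact() drops the NULL entries left by failed links
results <- compact(results)
length(results)
```

With the real function, the same wrapper would be possibly(get_data, otherwise = NULL).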

Unnest and view it

# A tibble: 1,614 × 25
   yr.list mes.list c.list   apellido_paterno apellido_materno nombres    estamento
     <int> <fct>    <fct>    <chr>            <chr>            <chr>      <chr>    
 1    2012 Enero    contrata ACEVEDO          UBILLA           CHARLIE E… TECNICO  
 2    2012 Enero    contrata AGUAYO           BURDILES         CRISTIAN … AUXILIAR 
 3    2012 Enero    contrata AGUIRRE          POLLAROLO        TERESA DE… ACADEMICO
 4    2012 Enero    contrata ALARCON          HERRERA          JUAN FRAN… AUXILIAR 
 5    2012 Enero    contrata ALARCON          MENESES          LUIS MANU… PROFESIO…
 6    2012 Enero    contrata ALEGRE           OSSANDON         DANIEL     AUXILIAR 
 7    2012 Enero    contrata ALFONSO          GAJARDO          JORGE CRI… ADMINIST…
 8    2012 Enero    contrata ALFRED           URIZAR           MARIA CRI… ACADEMICO
 9    2012 Enero    contrata ALVAREZ          FLORES           KAREN      PROFESIO…
10    2012 Enero    contrata ALVEAL           FUENTES          CLAUDIA    ADMINIST…
# … with 1,604 more rows, and 18 more variables: grado_erut <chr>, bienios <chr>,
#   jerarquia_academica <chr>, jornada <chr>,
#   calificacion_profesional_o_formacion <chr>, cargo_o_funcion <chr>,
#   region <chr>, asignaciones_especiales <chr>, haberes_transitorios <chr>,
#   remuneracion_bruta_segun_grado <chr>, renta_bruta_mensualizada <chr>,
#   unidad_monetaria <chr>, horas_extraordinarias <chr>, fecha_inicio <chr>,
#   fecha_termino <chr>, observaciones <chr>, …
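
The call that produced the unnested output is not shown in the answer. A minimal sketch of that step, assuming tidyr::unnest() was used and the links column was dropped first (it does not appear in the output), could look like this. The nested tibble below is a toy stand-in for final_df with invented values.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for final_df: one row per link, with a table in the list-column `content`
final_df <- tibble(
  yr.list = c(2012L, 2013L),
  links = c("link-1", "link-2"),
  content = list(
    tibble(nombres = c("A", "B")),
    tibble(nombres = "C")
  )
)

# Drop the link column, then expand each nested table into regular rows
unnested_df <- final_df %>%
  select(-links) %>%
  unnest(content)

unnested_df
```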