Scrape a table that continues in next page using R


I'm trying to scrape the entire table from this website: https://sineb.mineducacion.gov.co/bcol/app

I need all records for the filter Departamento = BOGOTÁ, D.C.

I'm able to get the table on the first page, but not the rest of the table on pages 2 to 20.

library(tidyverse)
library(rvest)

sineb <- html_session("https://sineb.mineducacion.gov.co/bcol/app")
my_form <- html_form(sineb)[[1]]
dept <- my_form$fields$departamento$options[-1]
bogota <- dept[grep("D.C", names(dept))]


my_form <- set_values(my_form, 'departamento' = bogota[1])
sineb <- submit_form(sineb, my_form, "consultar")

df_list <- html_table(sineb, T, T, T)

table <- as.data.frame(df_list[[4]]) 

Thanks!

CodePudding user response:

Let me first note that I used the updated rvest syntax (see "Functions renamed in rvest 1.0.0").
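For reference, the functions used below map onto the old ones like this:

# rvest < 1.0.0       ->   rvest >= 1.0.0
# html_session()      ->   session()
# set_values()        ->   html_form_set()
# submit_form()       ->   session_submit()
# follow_link()       ->   session_follow_link()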

Your approach is pretty good. With session_follow_link() you can complete the solution by looping through the remaining pages and selecting each page's link via XPath:

library(tidyverse)
library(rvest)

# Start a session on the search page and grab its form
sineb <- session("https://sineb.mineducacion.gov.co/bcol/app")
my_form <- html_form(sineb)[[1]]

# Options of the "departamento" drop-down (drop the empty first entry)
# and pick the one for Bogotá, D.C.
dept <- my_form$fields$departamento$options[-1]
bogota <- dept[grep("D.C", names(dept))]

# Fill in the department and submit the form via the "consultar" button
my_form <- html_form_set(my_form, 'departamento' = bogota[1])
sineb <- session_submit(sineb, my_form, "consultar")

# The results are in the fourth table of the first results page
df_list <- html_table(sineb, T, T, T)
results <- as.data.frame(df_list[[4]])

# Pages 2 to 20: follow the pager link whose text is the page number,
# parse its tables, and append the fourth one to the results
for (next_page in 2:20) {
  sineb <- session_follow_link(sineb, xpath = paste0("//a[text() = '", next_page, "']"))
  df_list <- html_table(sineb, T, T, T)
  results <- rbind(results, as.data.frame(df_list[[4]]))
}
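If you'd rather not hard-code the 20 pages, you could read the pager links on the first results page and take the largest page number before starting the loop. This is only a sketch and assumes the pager shows every page number as the text of an <a> element (the same assumption the XPath above relies on); if the pager only shows a window of pages, you would have to re-check it on each iteration instead.

# Sketch: derive the last page from the numeric pager links
# (run this right after the first submit, before the loop)
page_numbers <- sineb %>%
  html_elements("a") %>%
  html_text2() %>%
  str_subset("^[0-9]+$") %>%
  as.integer()
last_page <- max(page_numbers)

# then loop with: for (next_page in 2:last_page) { ... }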