I'm trying to scrape the entire table of this website: https://sineb.mineducacion.gov.co/bcol/app
I need all records for the filter Departamento = "BOGOTÁ, D.C.".
I can get the table on the first page, but not the tables on pages 2 through 20.
library(tidyverse)
library(rvest)
sineb <- html_session("https://sineb.mineducacion.gov.co/bcol/app")
my_form <- html_form(sineb)[[1]]
dept <- my_form$fields$departamento$options[-1]
bogota <- dept[grep("D.C", names(dept))]
my_form <- set_values(my_form, 'departamento' = bogota[1])
sineb <- submit_form(sineb, my_form, "consultar")
df_list <- html_table(sineb, T, T, T)
table <- as.data.frame(df_list[[4]])
Thanks!
CodePudding user response:
A quick note first: I'm using the updated rvest syntax (see "Functions renamed in rvest 1.0.0" in the rvest changelog): html_session() is now session(), set_values() is now html_form_set(), submit_form() is now session_submit(), and follow_link() is now session_follow_link().
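If you want the script to fail fast on an older rvest, a one-line guard with base R's packageVersion() is enough (purely optional, nothing site-specific):

# Stop early if the renamed rvest 1.0.0 functions are not available.
stopifnot(packageVersion("rvest") >= "1.0.0")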
Your approach is pretty good, and session_follow_link() completes the solution: loop through pages 2 to 20 and select each pagination link by its XPath:
library(tidyverse)
library(rvest)
sineb <- session("https://sineb.mineducacion.gov.co/bcol/app")

# Grab the search form and restrict it to Bogotá, D.C.
my_form <- html_form(sineb)[[1]]
dept <- my_form$fields$departamento$options[-1]  # drop the empty placeholder option
bogota <- dept[grep("D.C", names(dept))]
my_form <- html_form_set(my_form, departamento = bogota[1])

# Submit via the "consultar" button and read page 1 of the results.
sineb <- session_submit(sineb, my_form, submit = "consultar")
df_list <- html_table(sineb, header = TRUE, trim = TRUE)
results <- as.data.frame(df_list[[4]])

# The pager links use the page number as their link text, so each page
# can be reached by following the matching <a> and appending its table.
for (next_page in 2:20) {
  sineb <- session_follow_link(sineb, xpath = paste0("//a[text() = '", next_page, "']"))
  df_list <- html_table(sineb, header = TRUE, trim = TRUE)
  results <- rbind(results, as.data.frame(df_list[[4]]))
}
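One caveat: 2:20 hardcodes the number of result pages for this particular filter. If you'd rather detect it, here is a minimal sketch, run right after the initial session_submit() and before the loop; it assumes the pager renders every page as an <a> whose text is the bare page number (the same assumption the loop's XPath already makes):

# Hedged sketch: find the highest page number among the pager links.
# In XPath 1.0, number() of a non-numeric string is NaN, and NaN != NaN,
# so the predicate keeps only <a> elements with purely numeric text.
page_numbers <- sineb %>%
  html_elements(xpath = "//a[number(text()) = number(text())]") %>%
  html_text(trim = TRUE) %>%
  as.integer()
last_page <- max(page_numbers)  # then loop over 2:last_page

For 20 pages the repeated rbind() is fine; if the result set grows, collecting each page's table in a list and calling dplyr::bind_rows() once at the end avoids the repeated copying.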