I'm trying to webscrape some information from the following website:
https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=0001-D-2014.
I want to iterate over the bill numbers with the code below. The same code has worked well for previous years, but for this year the connection keeps breaking:
library(rvest)

summary2 <- data.frame(matrix(nrow=2, ncol=4))
colnames(summary2) <- c("billnum", "sum", "type", "name_dis_part")
# Zero-padded bill numbers: "0001", "0002", ..., "10048"
k <- sprintf('%0.4d', 1:10048)
for (i in k) {
  # Full-text page: bill number and summary
  webpage <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=", i, "-D-2014"))
  billno <- html_nodes(webpage, 'h1')
  billno_text <- html_text(billno)
  billsum <- html_nodes(webpage, '.interno')
  billsum_text <- html_text(billsum)
  billsum_text <- gsub("\n", "", billsum_text)
  billsum_text <- gsub("\t", "", billsum_text)
  billsum_text <- gsub(" ", "", billsum_text)
  # Details page: bill type and signatories table (this is the request that fails)
  link <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=", i, "-D-2014"))
  type <- html_nodes(link, 'h3')
  type_text <- html_text(type)
  table <- html_node(link, "table.table.table-bordered tbody")
  table_text <- html_text(table)
  table_text <- gsub("\n", "", table_text)
  table_text <- gsub("\t", "", table_text)
  table_text <- gsub("", "", table_text)
  summary2[i, 1] <- billno_text
  summary2[i, 2] <- billsum_text
  summary2[i, 3] <- type_text
  summary2[i, 4] <- table_text
}
The errors I am getting are the following:
Error in open.connection(x, "rb") : HTTP error 500.
In addition: Warning message:
In for (i in seq_along(cenv$extra)) { :
closing unused connection 3 (https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=0279-D-2014)
The code stops at certain bill links, even though those links seem to work when I open them individually in a browser. I'm not sure why this is breaking.
I tried breaking up the loop to skip the bill links that were failing, but this is not an ideal solution because a) it skips bills that fail in the code but actually have data that I want to collect, and b) it seems very inefficient.
CodePudding user response:
You could catch the error using tryCatch and add NAs to your table in those cases:
library(rvest)

summary2 <- data.frame(matrix(nrow=0, ncol=4))
colnames(summary2) <- c("billnum", "sum", "type", "name_dis_part")
k <- c("0278", "0279", "0280")
for (i in k) {
  webpage <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=", i, "-D-2014"))
  billno <- html_nodes(webpage, 'h1')
  billno_text <- html_text(billno)
  billsum <- html_nodes(webpage, '.interno')
  billsum_text <- html_text(billsum)
  billsum_text <- gsub("\n", "", billsum_text)
  billsum_text <- gsub("\t", "", billsum_text)
  billsum_text <- gsub(" ", "", billsum_text)
  # If the request fails (e.g. HTTP error 500), fall back to NULL instead of stopping the loop
  link <- tryCatch(read_html(paste0("https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=", i, "-D-2014")),
                   error = function(e) NULL)
  # is.null() is used rather than is.na(): on a successfully parsed page,
  # is.na() returns more than one value, which if() rejects in R >= 4.2
  if (is.null(link)) {
    type_text <- NA
    table_text <- NA
  } else {
    type <- html_nodes(link, 'h3')
    type_text <- html_text(type)
    table <- html_node(link, "table.table.table-bordered tbody")
    table_text <- html_text(table)
    table_text <- gsub("\n", "", table_text)
    table_text <- gsub("\t", "", table_text)
    table_text <- gsub("", "", table_text)
  }
  summary2[i, 1] <- billno_text
  summary2[i, 2] <- billsum_text
  summary2[i, 3] <- type_text
  summary2[i, 4] <- table_text
}
Output:
tibble::as_tibble(summary2)
# A tibble: 3 × 4
billnum sum type name_…¹
<chr> <chr> <chr> <chr>
1 0278-D-2014 "0278-D-2014 ProyectoSu beneplácito por el reconocimiento que la revista científica Nature realizara a un g… " PR… "ASSEF…
2 0279-D-2014 "0279-D-2014 ProyectoSu Benplacito al conmemorarase el natalicio de el Dr. Joaquin V. Gonzalezel 6 de m… NA NA
3 0280-D-2014 "0280-D-2014 ProyectoLA HONORABLE CAMARA DE DIPUTADOS EXPRESA SU ADHESIÓN AL CONMEMORARSE EL 07 DE MARZO \"… " PR… "GRANA…
# … with abbreviated variable name ¹name_dis_part
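Since the failing links reportedly open fine in a browser, the HTTP 500 responses may be intermittent. That is an assumption, not something the output above confirms. If it holds, retrying a failed request a few times before giving up could recover some of the rows that otherwise end up as NA. Below is a minimal sketch; with_retry, max_tries and wait are hypothetical names made up for this example and are not part of rvest. The demonstration uses a stand-in function rather than a real request, so it runs offline.

```r
# Retry a zero-argument function `f` up to `max_tries` times.
# Returns the value of f() on the first success, or NULL if every attempt fails.
# NOTE: with_retry, max_tries and wait are hypothetical names for this sketch.
with_retry <- function(f, max_tries = 3, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) NULL)
    if (!is.null(result)) {
      return(result)
    }
    if (attempt < max_tries) {
      Sys.sleep(wait * attempt)  # pause a little longer after each failure
    }
  }
  NULL
}

# Offline demonstration: a stand-in for read_html() that fails twice, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("HTTP error 500.")
  "page"
}
demo <- with_retry(flaky, max_tries = 5, wait = 0)
```

In the loop, the request would be wrapped as with_retry(function() read_html(url)), where url is the address built with paste0(). Because the helper returns NULL when all attempts fail, the check that follows should be is.null(link). A short pause between attempts also puts less load on the server than retrying immediately.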