Webscraping with 'rvest' and code keeps stopping

Time:08-31

I'm trying to webscrape some information from the following website:

https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=0001-D-2014.

I want to iterate over the bill numbers with the code below. I've run this code on previous years and it worked well, but for this year the connection keeps breaking:

library(rvest)

summary2 <- data.frame(matrix(nrow=2, ncol=4))
colnames(summary2) <- c("billnum", "sum", "type", "name_dis_part")
k <- sprintf('%0.4d', 1:10048)


for (i in k) {
  webpage <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=", i, "-D-2014"))
  billno <- html_nodes(webpage, 'h1')
  billno_text <- html_text(billno)
  
  billsum <- html_nodes(webpage, '.interno')
  billsum_text <- html_text(billsum)
  
  billsum_text <- gsub("\n", "", billsum_text)
  billsum_text <- gsub("\t", "", billsum_text)
  billsum_text <- gsub("    ", "", billsum_text)
  
  link <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=", i, "-D-2014"))
  type <- html_nodes(link, 'h3')
  type_text <- html_text(type)
  
  
  table <- html_node(link, "table.table.table-bordered tbody")
  
  table_text <- html_text(table)
  
  table_text <- gsub("\n", "", table_text)
  table_text <- gsub("\t", "", table_text)
  table_text <- gsub("", "", table_text)
  
  summary2[i, 1] <- billno_text
  summary2[i, 2] <- billsum_text
  summary2[i, 3] <- type_text
  summary2[i, 4] <- table_text
}

The errors I am getting are the following:

Error in open.connection(x, "rb") : HTTP error 500.
In addition: Warning message:
In for (i in seq_along(cenv$extra)) { :
  closing unused connection 3 (https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=0279-D-2014)

The code stops at certain bill links, even though those same links load fine when I open them individually in a browser. I'm not sure why this is breaking.

I tried splitting the loop to skip the bill links that fail, but this is not an ideal solution because a) it skips links that fail in the code but actually have data I want to collect, and b) it seems very inefficient.

CodePudding user response:

You can catch the error with tryCatch and fill in NA in your table for those cases:

library(rvest)

summary2 <- data.frame(matrix(nrow=0, ncol=4))
colnames(summary2) <- c("billnum", "sum", "type", "name_dis_part")
k <- c("0278", "0279", "0280")

for (i in k) {
  webpage <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=", i, "-D-2014"))
  billno <- html_nodes(webpage, 'h1')
  billno_text <- html_text(billno)
  
  billsum <- html_nodes(webpage, '.interno')
  billsum_text <- html_text(billsum)
  
  billsum_text <- gsub("\n", "", billsum_text)
  billsum_text <- gsub("\t", "", billsum_text)
  billsum_text <- gsub("    ", "", billsum_text)
  
  link <- tryCatch(read_html(paste0("https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=", i, "-D-2014")),
                   error = function(e) NA)
  
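  # Test the class rather than is.na(): is.na() on a parsed document returns
  # one value per list element, and `if` needs a single TRUE/FALSE
  # (a condition of length > 1 is an error since R 4.2.0)
  if (!inherits(link, "xml_document")) {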
    
    type_text <- NA
    table_text <- NA
    
  } else {
  
    type <- html_nodes(link, 'h3')
    type_text <- html_text(type)
    table <- html_node(link, "table.table.table-bordered tbody")
  
    table_text <- html_text(table)
  
    table_text <- gsub("\n", "", table_text)
    table_text <- gsub("\t", "", table_text)
    table_text <- gsub("", "", table_text)
    
  }
  
  summary2[i, 1] <- billno_text
  summary2[i, 2] <- billsum_text
  summary2[i, 3] <- type_text
  summary2[i, 4] <- table_text
}

Output:

tibble::as_tibble(summary2)
# A tibble: 3 × 4
  billnum     sum                                                                                                           type  name_…¹
  <chr>       <chr>                                                                                                         <chr> <chr>  
1 0278-D-2014 "0278-D-2014  ProyectoSu beneplácito por el reconocimiento que la revista científica Nature realizara a un g… " PR… "ASSEF…
2 0279-D-2014 "0279-D-2014  ProyectoSu Benplacito al conmemorarase  el  natalicio de el Dr.  Joaquin V.  Gonzalezel 6 de m…  NA    NA    
3 0280-D-2014 "0280-D-2014  ProyectoLA HONORABLE CAMARA DE DIPUTADOS EXPRESA SU ADHESIÓN AL CONMEMORARSE EL 07 DE MARZO \"… " PR… "GRANA…
# … with abbreviated variable name ¹​name_dis_part
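
Since the failing links reportedly load fine in a browser, the HTTP 500 may be transient, in which case retrying before falling back to NA could recover the missing rows. Below is a minimal sketch of that idea in base R. The helper name with_retry and its arguments max_tries and pause are hypothetical (they are not part of rvest), and flaky_fetch is only a stand-in for a request that fails twice before succeeding. Whether retries actually help with this site is an assumption, not something the error message confirms.

```r
# Hypothetical helper (not part of rvest): call `fetch()` up to `max_tries`
# times, pausing between attempts, and return NULL if every attempt fails.
with_retry <- function(fetch, max_tries = 3, pause = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(), error = function(e) NULL)
    if (!is.null(result)) {
      return(result)
    }
    if (attempt < max_tries) {
      Sys.sleep(pause * attempt)  # wait a little longer after each failure
    }
  }
  NULL
}

# Stand-in for a flaky request: fails twice, then succeeds.
calls <- 0
flaky_fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("HTTP error 500.")
  "page content"
}

page <- with_retry(flaky_fetch, max_tries = 5, pause = 0)
print(page)
print(calls)
```

In the loop above, this would look like link <- with_retry(function() read_html(url)), followed by if (is.null(link)) to fill in NA only after all attempts have failed. The same wrapper could be applied to the first read_html call, which is currently not protected by tryCatch at all.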