Read HTML table with rvest sometimes stuck and produce TimeOut Error

Time:11-11

I need to read the dollar-rates table for each bank from https://kursdollar.org, and I have tested this snippet several times:

library(stringr)
library(tidyverse)
library(rvest)
library(httr)
library(RCurl)
  
# note: curlSetOpt() configures an RCurl handle; it does not affect
# connections opened with url() or read_html() below
curlSetOpt(timeout = 200)
  
kurs_bi <- "https://kursdollar.org/bank/bi.php"
kurs_mandiri <- "https://kursdollar.org/bank/mandiri.php"
kurs_bca <- "https://kursdollar.org/bank/bca.php"
kurs_bni <- "https://kursdollar.org/bank/bni.php"
kurs_hsbc <- "https://kursdollar.org/bank/hsbc.php"
kurs_panin <- "https://kursdollar.org/bank/panin.php"
kurs_cimb <- "https://kursdollar.org/bank/cimb.php"
kurs_ocbc <- "https://kursdollar.org/bank/ocbc.php"
kurs_bri <- "https://kursdollar.org/bank/bri.php"
kurs_uob <- "https://kursdollar.org/bank/uob.php"
kurs_maybank <- 'https://kursdollar.org/bank/maybank.php'
kurs_permata <- "https://kursdollar.org/bank/permata.php"
kurs_mega <- "https://kursdollar.org/bank/mega.php"
kurs_danamon <- "https://kursdollar.org/bank/danamon.php"
kurs_btn <- "https://kursdollar.org/bank/btn.php"
kurs_mayapada <- "https://kursdollar.org/bank/mayapada.php"
kurs_muamalat <- "https://kursdollar.org/bank/muamalat.php"
kurs_bukopin <- "https://kursdollar.org/bank/bukopin.php"
  
link_kurs <- c(kurs_bi, kurs_mandiri, kurs_bca, kurs_bni, kurs_hsbc, kurs_panin, 
kurs_cimb, kurs_ocbc, kurs_bri, kurs_uob, kurs_maybank, kurs_permata, kurs_mega, 
kurs_danamon, kurs_btn, kurs_mayapada, kurs_muamalat, kurs_bukopin)

for(v in 1:length(link_kurs)){
    writeLines(paste0(v,') Read Table on ', link_kurs[v]))
    open_url <- url(link_kurs[v], "rb")
    extract_df <- read_html(open_url) 
    close(open_url)
    extract_df <- extract_df %>%
      html_nodes("table") %>% 
      html_table(fill = T) %>% as.data.frame()
    writeLines("Test Read Success!")
}

The result differs between runs. When the read succeeds it is fast, but sometimes it gets stuck on a certain link (the timeout set via RCurl didn't work) and throws:

Error in url(link_kurs[v], "rb") : cannot open the connection
In addition: Warning message:
In url(link_kurs[v], "rb") :
  InternetOpenUrl failed: 'The operation timed out'

Is there any way around this? Is there a way to read all of these tables consistently, even if it's a little slow?

CodePudding user response:

Try wrapping the read in tryCatch so one failing link doesn't abort the whole loop:

for(v in 1:length(link_kurs)){
  writeLines(paste0(v, ') Read Table on ', link_kurs[v]))
  tryCatch({
    # open the connection inside tryCatch: url() itself is what throws
    # "cannot open the connection" when the request times out
    open_url <- url(link_kurs[v], "rb")
    extract_df <- read_html(open_url)
    close(open_url)
    extract_df <- extract_df %>%
      html_nodes("table") %>%
      html_table(fill = T) %>%
      as.data.frame()
    writeLines("Test Read Success!")
  }, error = function(e) NULL)
}

Completed version with tryCatch and a loop that retries fetching the table indefinitely (OP edit):

extract_df_list <- list()  # initialize before the loop; results are appended here

for(v in 1:length(link_kurs)){
  writeLines(paste0(v, ') Read Table on ', link_kurs[v]))
  while(TRUE){
    tryCatch({
      open_url <- url(link_kurs[v], "rb")
      extract_df <- read_html(open_url)
      close(open_url)
      extract_df <- extract_df %>%
        html_nodes("table") %>%
        html_table(fill = T) %>%
        as.data.frame()
      extract_df_list <- c(extract_df_list, list(extract_df))
      writeLines("Test Read Success!")
      break
    }, error = function(e){
      message("Test Read Timeout")
      message("Retrying. .")
    })
  }
}
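As a side note (this helper is a sketch, not part of the original post): retrying forever means one permanently dead link will hang the loop. A small bounded-retry wrapper, written here with a hypothetical name with_retry, caps the number of attempts and pauses between them:

```r
# Bounded-retry wrapper around any flaky function. expr_fn is a zero-argument
# function; max_attempts caps retries so a dead URL cannot hang forever.
with_retry <- function(expr_fn, max_attempts = 5, wait = 1) {
  for (attempt in seq_len(max_attempts)) {
    # capture the error condition instead of letting it propagate
    result <- tryCatch(expr_fn(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message(sprintf("Attempt %d failed: %s", attempt, conditionMessage(result)))
    Sys.sleep(wait)
  }
  stop("All ", max_attempts, " attempts failed")
}
```

In the loop above, the fetch-and-parse step could then be wrapped as with_retry(function() read_html(link_kurs[v])), keeping the per-link retry logic in one place.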