If a link fails, try again or skip to the next link

Time:01-20

I am having trouble and need help.

I have a list of about 9000 links that I run through in a loop, doing some processing on each one.

The links look like this:

link1 link2 link3 link4 ..... link9000

Sometimes the 2nd link fails with a timeout; other times the 2nd link works and the 400th, or some other random link, times out. Is there any way to retry a failed link again and again? I have added:

status_c <- httr::GET(Links, config = httr::config(connecttimeout = 150))

but I still get timeouts. Please help, or share any suggestions. final_links_bind holds the full list of links. Some sample links:

https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789

for(i in 1:nrow(final_links_bind)) {
  Links <- final_links_bind[i,]
  BP_ID <- final_bp_bind[i,]
  #print(Links)
  status_c <- GET(Links,timeout(120))
  status <- status_code(status_c)
  if(status == "200"){
    url_parse<- read_html(Links)
    col_name<- url_parse %>%
      html_nodes("tr") %>%
      html_text()
    col_name <- stringr::str_remove_all(col_name, "\\\t|\\\n|\\\r")
    pattern_col_no <- grep("využití", col_name)
    col_name <- as.data.frame(col_name)
    method_selected <- col_name[pattern_col_no,]
    WRITE_CSV_DATA <- rbind(WRITE_CSV_DATA, data.frame(BP_ID = c(BP_ID), method_selected = c(method_selected), Links = c(Links)))
    #METHOD_OF_USE <- rbind(method_selected,METHOD_OF_USE)
    print(WRITE_CSV_DATA)
  }else{
    print("LINK NOT WORKING")
    no_Links <- sorted_link[i,]
    not_working_link <- rbind(not_working_link,no_Links)
  }
}

CodePudding user response:

It is not clear how you want the final output, but here is how to scrape the pages and skip links that are not working.

library(rvest)
library(httr2)
library(tidyverse)

Given this data frame of links, notice the third one is not working:

df <- tibble(
  links = c(
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789"
  )
)

# A tibble: 4 × 1
  links                                                
  <chr>                                                
1 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711
2 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703
3 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999
4 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789

Create a function that scrapes the second table on the page and returns the second column of its third row:

get_info <- function(link) {
  cat("Scraping", link, "\n")
  link %>%
    read_html() %>%
    html_table() %>%
    pluck(2) %>%
    slice(3) %>%
    pull(2) 
}

Then mutate() a new column with the info. Wrapping get_info in possibly() makes it return NA (NA_character_) for a link that is not working instead of stopping the code.

df %>% 
  mutate(
    info = map_chr(links, possibly(get_info, otherwise = NA_character_))
  )

# A tibble: 4 × 2
  links                                                 info       
  <chr>                                                 <chr>      
1 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711 rodinný dům
2 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703 rodinný dům
3 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999 NA         
4 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789 rodinný dům
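
The approach above skips a failing link but does not try it again, which the question also asks about. One possible way to add retries is a small wrapper around the scraping function. This is only a sketch: with_retry, max_tries, pause, and the flaky demo function are hypothetical names made up for illustration, not part of rvest, httr, or purrr.

```r
# Hypothetical helper: wrap f so that f(x) is attempted up to max_tries times.
# If every attempt raises an error, `otherwise` is returned instead.
with_retry <- function(f, max_tries = 3, pause = 1, otherwise = NA_character_) {
  function(x) {
    for (attempt in seq_len(max_tries)) {
      # Capture an error as a value so the loop can continue
      result <- tryCatch(f(x), error = function(e) e)
      if (!inherits(result, "error")) {
        return(result)
      }
      # Wait before the next attempt, but not after the last one
      if (attempt < max_tries) {
        Sys.sleep(pause)
      }
    }
    otherwise
  }
}

# Demo without network access: a made-up function that fails twice, then works
calls <- 0
flaky <- function(x) {
  calls <<- calls + 1
  if (calls < 3) stop("timeout")
  paste("ok", x)
}

with_retry(flaky, max_tries = 3, pause = 0)("link1")
```

Assuming get_info from above, the wrapper would take the place of possibly() in the pipeline, for example map_chr(links, with_retry(get_info)). Links that still fail after all attempts come back as NA, so they can be filtered out afterwards and run again in a second pass. For the httr-based loop in the question, httr::RETRY() offers a built-in take on the same idea.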