I am having trouble and need help.
I have a list of about 9000 links that I loop over, doing some processing on each one.
The links look like this:
link1 link2 link3 link4 ..... link9000
Sometimes the 2nd link fails with a timeout; other times the 2nd link works and the 400th, or some other random link, times out instead. Is there any way to retry a failed link again and again? I have added:
status_c <- httr::GET(Links, config = httr::config(connecttimeout = 150))
but I still get timeouts. Please help, or suggest anything that might work. final_links_bind holds the full list of links.
Some sample links:
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789
for(i in 1:nrow(final_links_bind)) {
  Links <- final_links_bind[i,]
  BP_ID <- final_bp_bind[i,]
  #print(Links)
  status_c <- GET(Links,timeout(120))
  status <- status_code(status_c)
  if(status == "200"){
    url_parse<- read_html(Links)
    col_name<- url_parse %>%
      html_nodes("tr") %>%
      html_text()
    col_name <- stringr::str_remove_all(col_name, "\\\t|\\\n|\\\r")
    pattern_col_no <- grep("využití", col_name)
    col_name <- as.data.frame(col_name)
    method_selected <- col_name[pattern_col_no,]
    WRITE_CSV_DATA <- rbind(WRITE_CSV_DATA, data.frame(BP_ID = c(BP_ID), method_selected = c(method_selected), Links = c(Links)))
    #METHOD_OF_USE <- rbind(method_selected,METHOD_OF_USE)
    print(WRITE_CSV_DATA)
  }else{
    print("LINK NOT WORKING")
    no_Links <- sorted_link[i,]
    not_working_link <- rbind(not_working_link,no_Links)
  }
}
Answer:
It is not clear what final output you want, but here is how to scrape the links and skip the ones that are not working:
library(rvest)
library(httr2)
library(tidyverse)
Given this data frame of links, notice the third one is not working:
df <- tibble(
  links = c(
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789"
  )
)
# A tibble: 4 × 1
links
<chr>
1 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711
2 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703
3 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999
4 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789
Create a function to scrape the table, specifically the third row:
get_info <- function(link) {
  cat("Scraping", link, "\n")
  link %>%
    read_html() %>%
    html_table() %>%
    pluck(2) %>%
    slice(3) %>%
    pull(2)
}
Then use mutate() to add a new column with the info. If a link is not working, possibly() returns NA (NA_character_) for it instead of stopping the code:
df %>%
  mutate(
    info = map_chr(links, possibly(get_info, otherwise = NA_character_))
  )
# A tibble: 4 × 2
  links                                                 info
  <chr>                                                 <chr>
1 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711 rodinný dům
2 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703 rodinný dům
3 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999 NA
4 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789 rodinný dům
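The question also asks how to try a failed link again before giving up, which possibly() alone does not do. Below is a minimal sketch of one way to do that in base R. The helper name with_retry and its arguments times, pause and otherwise are hypothetical (they are not part of the original answer or of any package), and flaky and always_fails are stand-in functions that simulate timeouts so the example runs without network access.

```r
# Hypothetical helper (an assumption, not from the original answer):
# returns a wrapped version of `f` that is called up to `times` times,
# sleeping `pause` seconds between attempts, and that returns `otherwise`
# if every attempt raises an error.
with_retry <- function(f, times = 3, pause = 1, otherwise = NA_character_) {
  function(...) {
    for (attempt in seq_len(times)) {
      # Capture the error object instead of letting it stop the loop.
      result <- tryCatch(f(...), error = function(e) e)
      if (!inherits(result, "error")) {
        return(result)
      }
      message("Attempt ", attempt, " of ", times, " failed: ",
              conditionMessage(result))
      if (attempt < times) Sys.sleep(pause)
    }
    otherwise
  }
}

# Offline demonstration: a stand-in that fails twice, then succeeds.
calls <- 0
flaky <- function(x) {
  calls <<- calls + 1
  if (calls < 3) stop("simulated timeout")
  paste("ok:", x)
}
demo_ok <- with_retry(flaky, times = 5, pause = 0)("link1")

# A stand-in that never succeeds falls back to `otherwise`.
always_fails <- function(x) stop("simulated timeout")
demo_na <- with_retry(always_fails, times = 2, pause = 0)("link2")
```

Under these assumptions, the wrapper could take the place of possibly() in the pipeline above, for example map_chr(links, with_retry(get_info, times = 3, pause = 5)), so that a link that times out is retried a few times and only becomes NA after the last attempt fails. If you stay with httr, its built-in RETRY() function serves a similar purpose for individual requests.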