curl error (Could not resolve host: NA) while scraping in a loop

While this code for scraping prices from a webshop has worked perfectly fine for me over the last months, today I just got the following error message:

Error in curl::curl_fetch_memory(url, handle = handle) : 
  Could not resolve host: NA

The code I use is as follows.

This part builds the full URLs:

   #Scrape Galaxus
vec_galaxus<-vector()
i=0

input_galaxus <- input %>%
  filter(`Galaxus Artikel`!=0)


input_galaxus2<-paste0('https://www.galaxus.ch/',input_galaxus$`Galaxus Artikel`)

This is the scraping loop:

sess <- session(input_galaxus2[1])             #to start the session
for (j in input_galaxus2){
  sess <- sess %>% session_jump_to(j)         #jump to URL
  
  i = i + 1
  try(vec_galaxus[i] <- read_html(sess) %>%   #can read direct from sess
        html_nodes('.sc-1aeovxo-1.gvrGle') %>%
        html_text()%>%
        str_extract("[0-9]+") %>%
        as.integer())
  Sys.sleep(runif(1, min=0.2, max=0.5))
}

where part of my input "input_galaxus2" looks like this:

c("https://www.galaxus.ch/15758734", "https://www.galaxus.ch/7362734", 
"https://www.galaxus.ch/12073455", "https://www.galaxus.ch/20841274", 
"https://www.galaxus.ch/20589944 ", "https://www.galaxus.ch/13595276", 
"https://www.galaxus.ch/16255768", "https://www.galaxus.ch/6296373", 
"https://www.galaxus.ch/14513900", "https://www.galaxus.ch/14465626", 
"https://www.galaxus.ch/10592707", "https://www.galaxus.ch/19958785", 
"https://www.galaxus.ch/9858343", "https://www.galaxus.ch/14513913")

Does anybody know why suddenly this code gives me the above error message? Thanks in advance for your responses!

CodePudding user response：

If it were a different error, I'd suspect throttling, but this error does not really support that. Still, to rule it out (along with hitting a request limit on the server), try introducing a delay between pulls, perhaps a few seconds or a minute, just to see if that resolves things.
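
A minimal sketch of such a delay, in base R only. The helper name `fetch_with_delay`, its `min_wait`/`max_wait` arguments, and the `fetch_one` callback are hypothetical (they are not part of rvest or of your code); `fetch_one` stands in for whatever you run against a single URL.

```r
# Hypothetical helper: call `fetch_one` on each URL and pause a random
# number of seconds (between min_wait and max_wait) between requests.
fetch_with_delay <- function(urls, fetch_one, min_wait = 2, max_wait = 5) {
  results <- vector("list", length(urls))
  for (k in seq_along(urls)) {
    # no need to wait before the very first request
    if (k > 1) Sys.sleep(runif(1, min = min_wait, max = max_wait))
    # assign via list() so that a NULL result does not drop the element
    results[k] <- list(fetch_one(urls[k]))
  }
  results
}
```

You could try it without touching the network by passing a stub, for example `fetch_with_delay(c("a", "b"), toupper, min_wait = 0, max_wait = 0)`, and then swap in your real scraping function with a longer wait.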

Here's a method that will allow you to repeat your code until all URLs are pulled without error. Note that this may also need the "delay" I suggested above in order to not anger the server admins on the remote end (or firewall or whatever).

Create a list in which we'll store the results. Run this code only once; all the remaining steps should be repeatable without consequence.
```
out <- vector("list", length(input_galaxus2))
```
Prep the session. This may be repeatable depending on if you have authentication or other attributes.
```
sess <- session(input_galaxus2[1])             #to start the session
```
Iterate over the empty elements of your URLs and query as needed. If you get any errors, feel free to wait a little bit and re-run this code. A URL that already succeeded will not be re-attempted, so repeat as needed; eventually (assuming the failures are intermittent and all URLs are valid) you will get all results.

I don't think you need read_html in this pipe, but I'm not testing for fear of "slashdotting" the website. The point of this answer is to suggest a mechanism that allows you to reattempt efficiently.
```
empties <- which(sapply(out, is.null))
for (i in empties) {
  res <- tryCatch({
    sess %>%
      session_jump_to(input_galaxus2[i]) %>%
      html_nodes('.sc-1aeovxo-1.gvrGle') %>%
      html_text() %>%
      str_extract("[0-9]+") %>%
      as.integer()
  }, error = function(e) e)
  if (inherits(res, "error")) {
    warning(sprintf("failed (%i, %s): %s", i, input_galaxus2[i], conditionMessage(res)))
    # optional
    Sys.sleep(3)
  } else out[[i]] <- res
}
```
Note: this assumes that a NULL value means the previous attempt failed, was interrupted, or ... was not attempted. If NULL can be a valid and successful return value from your pull, then you should likely prefill out with some other "canary" value: choose something that you are more confident will "never" appear in real results, and change how you define empties above.
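
A minimal sketch of that "canary" idea, assuming a sentinel object with a made-up class name (`not_attempted` is hypothetical, and so is the short `urls` vector, which only stands in for `input_galaxus2` so the snippet runs on its own):

```r
# Hypothetical sentinel that marks "not attempted yet / failed last time".
not_attempted <- structure(list(), class = "not_attempted")

# Stand-in for input_galaxus2.
urls <- c("https://www.galaxus.ch/15758734",
          "https://www.galaxus.ch/7362734",
          "https://www.galaxus.ch/12073455")

# Prefill every slot with the sentinel instead of leaving it NULL.
out <- rep(list(not_attempted), length(urls))

# Simulate one successful pull for the second URL.
out[[2]] <- 500L

# Only the slots that still hold the sentinel need another attempt.
empties <- which(vapply(out, inherits, logical(1), what = "not_attempted"))
```

Here `empties` ends up holding the positions still to be fetched (1 and 3). If NULL itself can be a valid result, store it with `out[i] <- list(res)`, because in R `out[[i]] <- NULL` deletes the element rather than storing NULL.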

CodePudding user response：

Using purrr::map instead of a loop, without any Sys.sleep().

library(tidyverse)
library(rvest)

df <- tibble(
  links = c("https://www.galaxus.ch/15758734", "https://www.galaxus.ch/7362734", 
            "https://www.galaxus.ch/12073455", "https://www.galaxus.ch/20841274", 
            "https://www.galaxus.ch/20589944 ", "https://www.galaxus.ch/13595276", 
            "https://www.galaxus.ch/16255768", "https://www.galaxus.ch/6296373", 
            "https://www.galaxus.ch/14513900", "https://www.galaxus.ch/14465626", 
            "https://www.galaxus.ch/10592707", "https://www.galaxus.ch/19958785", 
            "https://www.galaxus.ch/9858343", "https://www.galaxus.ch/14513913")
)

get_prices <- function(link) {
  link %>% 
    read_html() %>%
    html_nodes(".sc-1aeovxo-1.gvrGle") %>%
    html_text2() %>% 
    str_remove_all("–")
}

df %>%  
  mutate(price= map(links, get_prices) %>% 
           as.numeric) 

# A tibble: 14 × 2
   links                              price
   <chr>                              <dbl>
 1 "https://www.galaxus.ch/15758734"   17.8
 2 "https://www.galaxus.ch/7362734"   500. 
 3 "https://www.galaxus.ch/12073455"  173  
 4 "https://www.galaxus.ch/20841274"  112  
 5 "https://www.galaxus.ch/20589944 "  25.4
 6 "https://www.galaxus.ch/13595276"  313  
 7 "https://www.galaxus.ch/16255768"   40  
 8 "https://www.galaxus.ch/6296373"    62.9
 9 "https://www.galaxus.ch/14513900"  539  
10 "https://www.galaxus.ch/14465626"  466. 
11 "https://www.galaxus.ch/10592707"   63.5
12 "https://www.galaxus.ch/19958785"   NA  
13 "https://www.galaxus.ch/9858343"     7.3
14 "https://www.galaxus.ch/14513913"  617