This code for scraping prices from a webshop has worked fine for me over the last few months, but today I got the following error message:
Error in curl::curl_fetch_memory(url, handle = handle) :
Could not resolve host: NA
The code I use is as follows:
This part is for getting the full URLs:
#Scrape Galaxus
vec_galaxus<-vector()
i=0
input_galaxus <- input %>%
filter(`Galaxus Artikel`!=0)
input_galaxus2<-paste0('https://www.galaxus.ch/',input_galaxus$`Galaxus Artikel`)
This is the scraping loop:
sess <- session(input_galaxus2[1]) #to start the session
for (j in input_galaxus2){
sess <- sess %>% session_jump_to(j) #jump to URL
i=i+1
try(vec_galaxus[i] <- read_html(sess) %>% #can read direct from sess
html_nodes('.sc-1aeovxo-1.gvrGle') %>%
html_text()%>%
str_extract("[0-9]+") %>%
as.integer())
Sys.sleep(runif(1, min=0.2, max=0.5))
}
where part of my input "input_galaxus2" looks like this:
c("https://www.galaxus.ch/15758734", "https://www.galaxus.ch/7362734",
"https://www.galaxus.ch/12073455", "https://www.galaxus.ch/20841274",
"https://www.galaxus.ch/20589944 ", "https://www.galaxus.ch/13595276",
"https://www.galaxus.ch/16255768", "https://www.galaxus.ch/6296373",
"https://www.galaxus.ch/14513900", "https://www.galaxus.ch/14465626",
"https://www.galaxus.ch/10592707", "https://www.galaxus.ch/19958785",
"https://www.galaxus.ch/9858343", "https://www.galaxus.ch/14513913")
Does anybody know why suddenly this code gives me the above error message? Thanks in advance for your responses!
CodePudding user response:
If it were a different error, I'd think it was throttling, but this error does not really support that. Still, to rule out throttling (that is, hitting a request-rate limit on the server), try introducing a delay between pulls, perhaps a few seconds or a minute, just to see if that resolves things.
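Since the error names NA as the host, it may also be worth ruling out a malformed entry in the URL vector before blaming the server. Whether that is the actual cause here is not established. The following is only a minimal sketch in base R, assuming a small hypothetical vector urls that stands in for input_galaxus2 (one entry carries a trailing space, as in the sample in the question, and one is missing):

```r
# Hypothetical stand-in for input_galaxus2: one entry has a trailing
# space, one is missing entirely.
urls <- c("https://www.galaxus.ch/15758734",
          "https://www.galaxus.ch/20589944 ",
          NA)

# Strip leading/trailing whitespace; NA entries stay NA.
urls_clean <- trimws(urls)

# Flag entries that are missing or do not look like an http(s) URL.
bad <- is.na(urls_clean) | !grepl("^https?://[^[:space:]]+$", urls_clean)

# Positions of the offending entries, and the URLs that are safe to request.
bad_idx <- which(bad)
urls_ok <- urls_clean[!bad]
```

If bad_idx is non-empty, those positions point back at the rows of the input that need fixing before any request is made.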
Here's a method that will allow you to repeat your code until all URLs are pulled without error. Note that this may also need the "delay" I suggested above in order to not anger the server admins on the remote end (or firewall or whatever).
Create a list in which we'll store the results. Run this code only once; all the remaining steps should be repeatable without consequence.
out <- vector("list", length(input_galaxus2))
Prep the session. This may be repeatable, depending on whether you have authentication or other attributes.
sess <- session(input_galaxus2[1]) #to start the session
Iterate over the empty elements of your URLs and query as needed. If you get any errors, feel free to wait a little bit and re-run this code. If a particular URL succeeded, it will not be re-attempted, so repeat as needed. Eventually (assuming the failures are intermittent and all URLs are valid) you will get all results.
I don't think you need read_html in this pipe, but I'm not testing for fear of "slashdotting" the website. The point of this answer is to suggest a mechanism that allows you to reattempt efficiently.
empties <- which(sapply(out, is.null))
for (i in empties) {
  res <- tryCatch({
    sess %>%
      session_jump_to(input_galaxus2[i]) %>%
      html_nodes('.sc-1aeovxo-1.gvrGle') %>%
      html_text() %>%
      str_extract("[0-9]+") %>%
      as.integer()
  }, error = function(e) e) # return the condition object instead of stopping
  if (inherits(res, "error")) {
    # res holds the caught condition; e only exists inside the handler
    warning(sprintf("failed (%i, %s): %s", i, input_galaxus2[i], conditionMessage(res)))
    # optional
    Sys.sleep(3)
  } else out[[i]] <- res
}
Note: this assumes that a NULL value means the previous attempt failed, was interrupted, or was not attempted. If NULL can be a valid and successful return value from your pull, then you should likely prefill the out list with some other "canary" value: choose something that you are more confident will "never" appear in real results, and change how you define empties above.
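The canary idea could be sketched roughly as follows. This is only an illustration in base R: the sentinel string, the helper is_pending, and the placeholder results are all hypothetical and not part of the original code.

```r
# Stands in for length(input_galaxus2); hypothetical value for illustration.
n <- 3

# Hypothetical sentinel that should never appear as a real scrape result.
canary <- "__NOT_FETCHED__"

# Prefill every slot with the sentinel instead of NULL.
out <- rep(list(canary), n)

# A slot is still pending only if it holds exactly the sentinel.
is_pending <- function(x) identical(x, canary)

# Pretend URL 2 succeeded but legitimately returned nothing, and URL 3
# returned a (placeholder) price. Assigning list(NULL) with single brackets
# stores NULL without deleting the element, unlike out[[2]] <- NULL.
out[2] <- list(NULL)
out[[3]] <- 42L

# Only slot 1 still holds the sentinel, so only it would be retried.
empties <- which(vapply(out, is_pending, logical(1)))
```

With this setup a NULL result counts as "done", and only slots still holding the sentinel are retried on the next run.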
CodePudding user response:
Using purrr::map instead of a loop, without any Sys.sleep():
library(tidyverse)
library(rvest)
df <- tibble(
links = c("https://www.galaxus.ch/15758734", "https://www.galaxus.ch/7362734",
"https://www.galaxus.ch/12073455", "https://www.galaxus.ch/20841274",
"https://www.galaxus.ch/20589944 ", "https://www.galaxus.ch/13595276",
"https://www.galaxus.ch/16255768", "https://www.galaxus.ch/6296373",
"https://www.galaxus.ch/14513900", "https://www.galaxus.ch/14465626",
"https://www.galaxus.ch/10592707", "https://www.galaxus.ch/19958785",
"https://www.galaxus.ch/9858343", "https://www.galaxus.ch/14513913")
)
get_prices <- function(link) {
link %>%
read_html() %>%
html_nodes(".sc-1aeovxo-1.gvrGle") %>%
html_text2() %>%
str_remove_all("–")
}
df %>%
mutate(price= map(links, get_prices) %>%
as.numeric)
# A tibble: 14 × 2
links price
<chr> <dbl>
1 "https://www.galaxus.ch/15758734" 17.8
2 "https://www.galaxus.ch/7362734" 500.
3 "https://www.galaxus.ch/12073455" 173
4 "https://www.galaxus.ch/20841274" 112
5 "https://www.galaxus.ch/20589944 " 25.4
6 "https://www.galaxus.ch/13595276" 313
7 "https://www.galaxus.ch/16255768" 40
8 "https://www.galaxus.ch/6296373" 62.9
9 "https://www.galaxus.ch/14513900" 539
10 "https://www.galaxus.ch/14465626" 466.
11 "https://www.galaxus.ch/10592707" 63.5
12 "https://www.galaxus.ch/19958785" NA
13 "https://www.galaxus.ch/9858343" 7.3
14 "https://www.galaxus.ch/14513913" 617
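If a single failing request should not abort the whole map, one option is to wrap the scraping function with purrr::possibly so that a failure yields NA instead of an error. The sketch below is an assumption-laden illustration: fake_get_price is a hypothetical stand-in for get_prices that never touches the network, and it fails on a missing link to mimic the error from the question.

```r
library(purrr)

# Hypothetical stand-in for get_prices(): errors on a missing link to mimic
# a failed request, otherwise returns a dummy number (not a real price).
fake_get_price <- function(link) {
  if (is.na(link)) stop("Could not resolve host: NA")
  nchar(link)
}

# possibly() returns a wrapped function that yields `otherwise`
# instead of raising an error.
safe_get_price <- possibly(fake_get_price, otherwise = NA_real_)

links <- c("https://www.galaxus.ch/15758734", NA)

# map_dbl() collects one numeric value per link; the failed one becomes NA.
prices <- map_dbl(links, safe_get_price)
```

In the real pipeline, the same wrapper could be applied to get_prices, at the cost of hiding the error message unless quiet = FALSE is passed to possibly.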