I have a series of websites with the same format that I want to scrape (e.g. www.website1.com, www.website2.com, www.website3.com, etc.)
I also have the following code for scraping the JSON from each website. I want to use a "timeout" function that forces the loop to skip to the next iteration if a certain amount of time (e.g. 1 second) has elapsed:
library(R.utils)
library(jsonlite)

res <- list()
part1 = "https://website"
part2 = "extension="
part3 = ".com"

for (i in 1:10) {
  print(i)
  tryCatch({
    res[[i]] <- withTimeout({
      url_i <- paste0(part1, i + 1, part2, i, part3)
      r_i <- data.frame(fromJSON(url_i))
      res[[i]] <- r_i
      print(i)
    }, timeout = 1)
  },
  TimeoutException = function(ex) {
    message("Timeout. Skipping.")
    res[[i]] <- NULL
  })
}
The loop seems to run, but the results are empty:
> res
[[1]]
NULL
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] 4
[[5]]
[1] 5
[[6]]
[1] 6
[[7]]
[1] 7
[[8]]
[1] 8
[[9]]
[1] 9
[[10]]
[1] 10
However, the "r_i" objects seem to have been created. For example, if I inspect "r_i", it's producing a valid result.
I can't understand why "r_i" is working, but storing the results in the list is not.
Does anyone have any ideas about this?
CodePudding user response:
The timeout of 1 second was too low; set it higher (maybe 10 is too much). You had two "res[[i]] <-" assignments in your code; there should be just one, to the left of "R.utils::withTimeout()". Also, the last expression inside "withTimeout" should return the "data.frame"; before, it stored just the "i"s from the "print", which should be moved one line up. Maybe "cat(i, '\r')" is better: it doesn't clutter the console and looks somewhat cool.
library(R.utils)
library(jsonlite)

res <- list()
for (i in 1:10) {
  tryCatch({
    res[[i]] <- withTimeout({
      url_i <- paste0(part1, i + 1, part2, i, part3)
      cat(i, '\r')
      # the last expression is the value returned and stored in res[[i]]
      data.frame(fromJSON(url_i))
    }, timeout = 10)
  },
  TimeoutException = function(ex) {
    message("Timeout. Skipping.")
    # note the <<-: a plain <- inside the handler function would only
    # modify a local copy of res
    res[[i]] <<- NA
  })
}
Not sure exactly which kind of output you want, but at least "res" contains something now.
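If the URLs can also fail with HTTP or JSON-parsing errors (not just timeouts), it may help to catch those as well. Here is a minimal sketch, assuming the same "part1"/"part2"/"part3" URL pieces from the question; letting the handlers return a value that "tryCatch" assigns avoids the scoping pitfall entirely:

```r
library(R.utils)
library(jsonlite)

res <- vector("list", 10)
for (i in 1:10) {
  # tryCatch returns either the withTimeout result or the handler's value,
  # so a single assignment outside the handlers is enough
  res[[i]] <- tryCatch(
    withTimeout({
      url_i <- paste0(part1, i + 1, part2, i, part3)
      data.frame(fromJSON(url_i))
    }, timeout = 10),
    TimeoutException = function(ex) {
      message("Timeout on ", i, ". Skipping.")
      NA
    },
    error = function(e) {
      message("Error on ", i, ": ", conditionMessage(e))
      NA
    }
  )
}
```

Since "TimeoutException" is listed before "error", timeouts are handled by the first handler and everything else (HTTP failures, malformed JSON) falls through to the second.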