I am working with the R programming language.
Suppose I have the following list of websites - there are 3 "good websites" (that work) and 2 "bad websites" (that don't work - this will "crash" the LOOP). This looks something like this:
# list of websites (called "links"):
head(links)
[1] "https://www.good_website_1.com"
[2] "https://www.bad_website_1.com"
[3] "https://www.good_website_2.com"
[4] "https://www.good_website_3.com"
[5] "https://www.bad_website_2.com"
> summary(links)
Length Class Mode
5 character character
> str(links)
chr [1:352] "https://good_website_1.com/" ...
For these links, I am running the following LOOP:
length = length(links)
results <- list()
for (i in 1:length)
{
url_i<-links[i]
page_i <-read_html(url_i)
b_i = page_i %>% html_nodes("div.one_third")
listanswer_i <- b_i %>% html_text() %>% strsplit("\\n")
mx_i <- max(sapply(listanswer_i , length))
df_i <- do.call(rbind , lapply(listanswer_i , \(x) if(length(x) == mx_i) x
else c(x , rep(NA , mx_i - length(x)))))
df_i <- data.frame(df_i)
results[[i]] <- df_i
print(results)
}
The above code will always "crash" (i.e. break), since some of the websites do not work:
Error in data.frame(...)
arguments imply differing number of rows: 1, 0
Normally, I fix this by manually selecting the "good" links - and then re-run the LOOP for these "good links":
good_list = links[c(1,3,4)]
links = good_list
length = length(links)
results <- list()
for (i in 1:length)
{
url_i<-links[i]
page_i <-read_html(url_i)
b_i = page_i %>% html_nodes("div.one_third")
listanswer_i <- b_i %>% html_text() %>% strsplit("\\n")
mx_i <- max(sapply(listanswer_i , length))
df_i <- do.call(rbind , lapply(listanswer_i , \(x) if(length(x) == mx_i) x
else c(x , rep(NA , mx_i - length(x)))))
df_i <- data.frame(df_i)
results[[i]] <- df_i
print(results)
}
My Question: But could something have been done that "forces" the LOOP to keep running and simply place a "row of NA's" in the "results data frame" when a "bad website" is encountered? This would save me the time of manually scanning through the original list of links and testing if each link is "good" or "bad".
Thanks!
CodePudding user response:
You can wrap the code in a tryCatch() call. Let me know if this works:
library(rvest)  # read_html(), html_nodes(), html_text() and the %>% pipe

results <- vector("list", length(links))
for (i in seq_along(links))
{
  # tryCatch() returns the value of whichever branch ran, so store that value
  results[[i]] <- tryCatch({
    url_i <- links[i]
    page_i <- read_html(url_i)
    b_i <- page_i %>% html_nodes("div.one_third")
    listanswer_i <- b_i %>% html_text() %>% strsplit("\\n")
    mx_i <- max(sapply(listanswer_i, length))
    df_i <- do.call(rbind, lapply(listanswer_i, \(x) if (length(x) == mx_i) x
                                  else c(x, rep(NA, mx_i - length(x)))))
    data.frame(df_i)
  }, error = function(e) {
    # bad website: keep a one-row data frame of NA and move on to the next link
    data.frame(X1 = NA)
  })
}
print(results)
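One detail worth spelling out: an assignment made inside the error handler stays local to that handler function, so it is usually simpler to assign the value that tryCatch() returns. Here is a minimal offline sketch of that pattern. The function scrape_one() and the vector links_demo are hypothetical stand-ins for the real scraping step and links, not part of the question's code.

```r
# Hypothetical stand-in for the scraping step: fails on "bad" inputs.
scrape_one <- function(link) {
  if (grepl("bad", link)) stop("cannot open ", link)
  data.frame(X1 = toupper(link), stringsAsFactors = FALSE)
}

links_demo <- c("good_1", "bad_1", "good_2")
results_demo <- vector("list", length(links_demo))

for (i in seq_along(links_demo)) {
  # Store whatever tryCatch() returns: the scraped data frame on success,
  # or a one-row NA placeholder when scrape_one() throws an error.
  results_demo[[i]] <- tryCatch(
    scrape_one(links_demo[i]),
    error = function(e) data.frame(X1 = NA_character_, stringsAsFactors = FALSE)
  )
}

# One row per link, with NA marking the link that failed.
combined <- do.call(rbind, results_demo)
```

Because every element of results_demo has the same single column, the pieces can be stacked with rbind; real scraped pages may differ in width, in which case something like dplyr::bind_rows() is probably more forgiving.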
CodePudding user response:
You can take a tidier approach and keep the loop running even when an error happens, using functions from the {purrr} package. To explain this with an example, I will try to read a bunch of CSV files with good and bad (I mean correct and incorrect) file paths. (I wish I could present this idea with web scraping like yours, but unfortunately I couldn't think of an example right now.)
Disclaimer: Sorry for this long answer. I just wanted to give the idea of using safely() with map() from the {purrr} package, and I hope I haven't made this hard to understand.
Say I have some CSV files in a folder named test, and I want to read them, take only the first five rows of each, and bind them into a single data frame. The test folder contains these files:
list.files("test", full.names = TRUE)
[1] "test/mtcars1.csv" "test/mtcars2.csv"
So I create the paths to read them and write a function that takes the path of a file, reads that CSV file, and returns only the head of the data.
library(purrr)  # map(), safely()
library(dplyr)  # %>%, bind_rows()
paths <- c("test/mtcars1.csv", "test/mtcars2.csv", "test/mtcars3.csv")
# Note that I have also inserted an incorrect path, because there's no mtcars3.csv
data_head_reading <- function(path) {
df <- readr::read_csv(path, progress = FALSE, show_col_types = FALSE)
df <- df %>% head(5)
return(df)
}
Now, instead of using an explicit for loop, I can use map() from the {purrr} package to apply that function over a vector of file paths, and map() will return a list of data frames.
map(.x = paths, .f = data_head_reading)
# Error: 'test/mtcars3.csv' does not exist in current
# working directory
But instead of returning anything, this gives us an error, which is expected because we gave it a bad (incorrect) path. Unfortunately, we also don't get the output for the two correct paths.
There is a way to keep the loop running even when an error occurs: safely() from {purrr}. We wrap our data_head_reading with safely() so that it captures both the return value and any error. To show what I mean:
safe_data_head_reading <- safely(data_head_reading) # wrapping
all_data <- map(.x = paths, .f = safe_data_head_reading) # Now trying again but with the wrapped function
all_data
[[1]]
[[1]]$result
# A tibble: 5 × 11
mpg cyl disp hp drat wt qsec vs am
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1
2 21 6 160 110 3.9 2.88 17.0 0 1
3 22.8 4 108 93 3.85 2.32 18.6 1 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0
5 18.7 8 360 175 3.15 3.44 17.0 0 0
# … with 2 more variables: gear <dbl>, carb <dbl>
[[1]]$error
NULL
[[2]]
[[2]]$result
# A tibble: 5 × 11
mpg cyl disp hp drat wt qsec vs am
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 30.4 4 95.1 113 3.77 1.51 16.9 1 1
2 15.8 8 351 264 4.22 3.17 14.5 0 1
3 19.7 6 145 175 3.62 2.77 15.5 0 1
4 15 8 301 335 3.54 3.57 14.6 0 1
5 21.4 4 121 109 4.11 2.78 18.6 1 1
# … with 2 more variables: gear <dbl>, carb <dbl>
[[2]]$error
NULL
[[3]]
[[3]]$result
NULL
[[3]]$error
<simpleError: 'test/mtcars3.csv' does not exist
in current working directory.>
Now the loop (the map() call) runs to completion and gives us both the errors and the correctly returned output. We can pull out the correctly returned output by extracting the result elements, which we can do using map(), like this:
data_without_err <- all_data %>%
map("result") %>%
bind_rows(.id = "df")
> data_without_err
# A tibble: 10 × 12
df mpg cyl disp hp drat wt qsec vs
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 21 6 160 110 3.9 2.62 16.5 0
2 1 21 6 160 110 3.9 2.88 17.0 0
3 1 22.8 4 108 93 3.85 2.32 18.6 1
4 1 21.4 6 258 110 3.08 3.22 19.4 1
5 1 18.7 8 360 175 3.15 3.44 17.0 0
6 2 30.4 4 95.1 113 3.77 1.51 16.9 1
7 2 15.8 8 351 264 4.22 3.17 14.5 0
8 2 19.7 6 145 175 3.62 2.77 15.5 0
9 2 15 8 301 335 3.54 3.57 14.6 0
10 2 21.4 4 121 109 4.11 2.78 18.6 1
# … with 3 more variables: am <dbl>, gear <dbl>,
# carb <dbl>
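If you also want to know which inputs failed, the error element can be inspected in the same way as result. The following is a small sketch on a toy example. It wraps base R's log() rather than the CSV reader, and the names safe_log, inputs, out, failed and results_ok are made up for illustration.

```r
library(purrr)

# log() errors on non-numeric input, so it makes a convenient toy example.
safe_log <- safely(log)
inputs <- list(10, "a", 100)
out <- map(inputs, safe_log)

# TRUE wherever the wrapped call threw an error.
failed <- map_lgl(out, ~ !is.null(.x$error))

# Keep only the results of the calls that succeeded.
results_ok <- map(out[!failed], "result")
```

Here failed flags the second input, and results_ok holds the two successful values. The same indexing applies unchanged to a list of file paths or links.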
This approach works with URLs too. Simply write a function that receives a URL (link), reads it, and does what is necessary using functions from rvest, then wrap that function with safely(). Map the safely wrapped function over the list of all links (good and bad) using map(), and extract the result from the mapped output.
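As a rough sketch of how that might look for the scraping code in the question: the helper scrape_page() below is a hypothetical wrapper around the question's steps, and links_demo uses a literal HTML string plus a non-existent file so that it runs offline. With real data you would pass the actual links vector instead. The otherwise argument of safely() supplies the value to use when a call fails, which gives the row of NA the question asks for.

```r
library(purrr)
library(rvest)
library(dplyr)

# Hypothetical helper wrapping the scraping steps from the question.
scrape_page <- function(url) {
  cells <- read_html(url) %>%
    html_nodes("div.one_third") %>%
    html_text() %>%
    strsplit("\\n")
  mx <- max(lengths(cells))
  # Pad shorter entries with NA so every row has the same width.
  padded <- lapply(cells, function(x) c(x, rep(NA, mx - length(x))))
  data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
}

# On error, return a one-row NA data frame instead of NULL.
safe_scrape_page <- safely(
  scrape_page,
  otherwise = data.frame(X1 = NA_character_, stringsAsFactors = FALSE)
)

# Offline stand-ins for links: one parsable page and one path that does not exist.
links_demo <- c(
  "<html><body><div class='one_third'>a\nb</div><div class='one_third'>c</div></body></html>",
  "no_such_page_for_this_demo.html"
)

all_pages <- map(links_demo, safe_scrape_page)

# Stack the results; the failed link contributes a single row of NA.
scraped <- all_pages %>%
  map("result") %>%
  bind_rows(.id = "link")
```

The link column records the position of each page in links_demo, so the NA row can be traced back to the link that failed.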