R: Running a LOOP without "Breaks"?

I am working with the R programming language.

Suppose I have the following list of websites - there are 3 "good websites" (that work) and 2 "bad websites" (that don't work - this will "crash" the LOOP). This looks something like this:

# list of website (called "links"):

head(links)
[1] "https://www.good_website_1.com"                                      
[2] "https://www.bad_website_1.com"  
[3] "https://www.good_website_2.com"  
[4] "https://www.good_website_3.com"              
[5] "https://www.bad_website_2.com"  

> summary(links)
   Length     Class      Mode 
      5 character character 

> str(links)
 chr [1:352] "https://good_website_1.com/" ...

For these links, I am running the following LOOP:

length = length(links)

results <- list()

for (i in 1:length)
   
{
   
    url_i<-links[i]
    page_i <-read_html(url_i)
  
    b_i = page_i %>% html_nodes("div.one_third")
   
    listanswer_i <- b_i %>% html_text() %>% strsplit("\\n")
   
    mx_i <- max(sapply(listanswer_i , length))
   
    df_i <- do.call(rbind , lapply(listanswer_i , \(x) if(length(x) == mx_i) x
                                   else c(x , rep(NA , mx_i - length(x)))))
   
    df_i <- data.frame(df_i)
    results[[i]] <- df_i
    print(results)
}

The above code will always "crash" (i.e. break), since there are websites that do not work:

Error in data.frame(...)
 arguments imply differing number of rows: 1, 0

Normally, I fix this by manually selecting the "good" links - and then re-run the LOOP for these "good links":

good_list = links[c(1,3,4)]
links = good_list

length = length(links)

results <- list()

for (i in 1:length)

{

    url_i<-links[i]
    page_i <-read_html(url_i)

    b_i = page_i %>% html_nodes("div.one_third")

    listanswer_i <- b_i %>% html_text() %>% strsplit("\\n")

    mx_i <- max(sapply(listanswer_i , length))

    df_i <- do.call(rbind , lapply(listanswer_i , \(x) if(length(x) == mx_i) x
                                   else c(x , rep(NA , mx_i - length(x)))))

    df_i <- data.frame(df_i)
    results[[i]] <- df_i
    print(results)
}

My Question: But could something have been done that "forces" the LOOP to keep running and simply place a "row of NA's" in the "results data frame" when a "bad website" is encountered? This would save me the time of manually scanning through the original list of links and testing if each link is "good" or "bad".

Thanks!

CodePudding user response：

You can wrap the code in a tryCatch() call. Let me know if this works:

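library(rvest)  # read_html(), html_nodes(), html_text(); also provides %>%

results <- list()

for (i in seq_along(links))

{
    # tryCatch() returns the value of the block, or the value of the error
    # handler if the block fails. Assigning that return value here (rather than
    # inside the handler, which has its own scope) is what fills results[[i]].
    results[[i]] <- tryCatch({
    url_i<-links[i]
    page_i <-read_html(url_i)

    b_i = page_i %>% html_nodes("div.one_third")

    listanswer_i <- b_i %>% html_text() %>% strsplit("\\n")

    mx_i <- max(sapply(listanswer_i , length))

    df_i <- do.call(rbind , lapply(listanswer_i , \(x) if(length(x) == mx_i) x
                                   else c(x , rep(NA , mx_i - length(x)))))

    data.frame(df_i)}, error = function(e) {
    # "bad website": keep going and store a one-row data frame of NA instead
    data.frame(X1 = NA)})
}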
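
As a minimal, self-contained sketch of the same pattern that runs without network access: the helper fake_scrape() below is a hypothetical stand-in for the read_html() pipeline and simply calls stop() for any link containing "bad". The names fake_scrape, demo_links, demo_results and combined are illustrative assumptions, not part of the question.

```r
# Hypothetical stand-in for the scraping step: fails for "bad" links
fake_scrape <- function(url) {
  if (grepl("bad", url)) stop("could not open ", url)
  data.frame(X1 = paste("content from", url))
}

demo_links <- c("https://www.good_website_1.com",
                "https://www.bad_website_1.com",
                "https://www.good_website_2.com")

demo_results <- vector("list", length(demo_links))

for (i in seq_along(demo_links)) {
  # On success tryCatch() returns the scraped data frame; on error it
  # returns whatever the handler returns, so the loop never stops
  demo_results[[i]] <- tryCatch(
    fake_scrape(demo_links[i]),
    error = function(e) {
      message("Skipping ", demo_links[i], ": ", conditionMessage(e))
      data.frame(X1 = NA_character_)  # one-row placeholder of NA
    }
  )
}

# Stack everything; the failed link appears as a row of NA
combined <- do.call(rbind, demo_results)
```

Printing a message in the handler is optional, but it leaves a record of which links were skipped without interrupting the run.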

CodePudding user response：

You can follow a tidier approach and keep a loop running even when an error happens by using functions from the {purrr} package. To explain this with an example, I will read a bunch of CSV files from good and bad (I mean correct and incorrect) file paths. (I wish I could present this idea with web scraping like yours, but unfortunately I couldn't think of an example right now.)

Disclaimer: Sorry for the long answer. I just wanted to give the idea of using safely() with map() from the {purrr} package, and I hope I haven't made it harder to understand.

Say I have some CSV files in a folder named test, and I want to read them, take only the first five rows of each, and then bind them into a single data frame.

So in the test folder, I have these files.

list.files("test", full.names = TRUE)

[1] "test/mtcars1.csv" "test/mtcars2.csv"

So I create the paths to read them and write a function that takes the path of a file, reads that CSV file, and returns only the head of the data.

library(purrr)  # map(), safely()
library(dplyr)  # bind_rows(), %>%

paths <- c("test/mtcars1.csv", "test/mtcars2.csv", "test/mtcars3.csv")
# Note that I have also inserted an incorrect path, because there's no mtcars3.csv

data_head_reading <- function(path) {
  df <- readr::read_csv(path, progress = FALSE, show_col_types = FALSE)
  df <- df %>% head(5)
  return(df)
}

Now, instead of using an explicit for-loop, I can use map() from the {purrr} package to apply that function over a vector of file paths; map() will return a list of data frames.

map(.x = paths, .f = data_head_reading)

# Error: 'test/mtcars3.csv' does not exist in current 
# working directory

But instead of returning anything, this gives us an error, which is expected because we gave a bad (incorrect) path. Unfortunately, we also don't get the output for the two correct paths.

There is a way to keep the loop running even when an error occurs: safely() from {purrr}. We wrap data_head_reading with safely(), so that each call captures both the return value and any error that happens. To show what I mean:

safe_data_head_reading <- safely(data_head_reading) # wrapping

all_data <- map(.x = paths, .f = safe_data_head_reading) # Now trying again, but with the wrapped function

all_data

[[1]]
[[1]]$result
# A tibble: 5 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6   160   110  3.9   2.62  16.5     0     1
2  21       6   160   110  3.9   2.88  17.0     0     1
3  22.8     4   108    93  3.85  2.32  18.6     1     1
4  21.4     6   258   110  3.08  3.22  19.4     1     0
5  18.7     8   360   175  3.15  3.44  17.0     0     0
# … with 2 more variables: gear <dbl>, carb <dbl>

[[1]]$error
NULL


[[2]]
[[2]]$result
# A tibble: 5 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  30.4     4  95.1   113  3.77  1.51  16.9     1     1
2  15.8     8 351     264  4.22  3.17  14.5     0     1
3  19.7     6 145     175  3.62  2.77  15.5     0     1
4  15       8 301     335  3.54  3.57  14.6     0     1
5  21.4     4 121     109  4.11  2.78  18.6     1     1
# … with 2 more variables: gear <dbl>, carb <dbl>

[[2]]$error
NULL


[[3]]
[[3]]$result
NULL

[[3]]$error
<simpleError: 'test/mtcars3.csv' does not exist 
in current working directory.>

Now the loop (the map() call) runs to completion and gives us both the errors and the correctly returned outputs. We can keep only the correctly returned outputs by extracting the result elements, which we can do using map(), like this:

data_without_err <- all_data %>% 
  map("result") %>% 
  bind_rows(.id = "df")

> data_without_err

# A tibble: 10 × 12
   df      mpg   cyl  disp    hp  drat    wt  qsec    vs
   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 1      21       6 160     110  3.9   2.62  16.5     0
 2 1      21       6 160     110  3.9   2.88  17.0     0
 3 1      22.8     4 108      93  3.85  2.32  18.6     1
 4 1      21.4     6 258     110  3.08  3.22  19.4     1
 5 1      18.7     8 360     175  3.15  3.44  17.0     0
 6 2      30.4     4  95.1   113  3.77  1.51  16.9     1
 7 2      15.8     8 351     264  4.22  3.17  14.5     0
 8 2      19.7     6 145     175  3.62  2.77  15.5     0
 9 2      15       8 301     335  3.54  3.57  14.6     0
10 2      21.4     4 121     109  4.11  2.78  18.6     1
# … with 3 more variables: am <dbl>, gear <dbl>,
#   carb <dbl>
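
Since safely() also keeps the error for each input, the same list can tell you which inputs failed and why. The following is a small sketch of that idea; toy_reader() is a hypothetical stand-in for data_head_reading() that needs no files on disk, and the names toy_paths, toy_out, failed, failed_paths and failed_msgs are assumptions made for illustration.

```r
library(purrr)

# Hypothetical stand-in for data_head_reading(): errors on a "missing" file
toy_reader <- function(path) {
  if (!grepl("[12]\\.csv$", path)) stop("'", path, "' does not exist")
  head(mtcars, 5)
}

toy_paths <- c("test/mtcars1.csv", "test/mtcars2.csv", "test/mtcars3.csv")
toy_out   <- map(toy_paths, safely(toy_reader))

# TRUE where the wrapped call recorded an error instead of a result
failed <- map_lgl(toy_out, \(res) !is.null(res$error))

failed_paths <- toy_paths[failed]  # the inputs that failed
failed_msgs  <- map_chr(toy_out[failed], \(res) conditionMessage(res$error))
```

Keeping the failed inputs around makes it easy to retry only those later, instead of scanning the full list by hand.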

This approach also works with URLs. Simply write a function that receives a URL (link), reads it, and does what is necessary using functions from rvest, then wrap that function with safely(). Then map the safely-wrapped function over the list of all links (good and bad) using map(), and extract the result elements from the mapped output.
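
A rough sketch of how that could look with rvest, under a few assumptions: scrape_one() is a hypothetical function, the selector "div.one_third" is taken from the question, and local temporary HTML files stand in for real websites so the example runs offline (read_html() accepts a file path as well as a URL). It uses possibly(), a sibling of safely() in {purrr} that returns a fallback value on error, which gives the "row of NA" the question asks for.

```r
library(rvest)   # read_html(), html_nodes(), html_text()
library(purrr)   # map(), possibly()
library(dplyr)   # bind_rows()

# Function for ONE link: one row per matched node
scrape_one <- function(url) {
  page <- read_html(url)
  txt  <- page %>% html_nodes("div.one_third") %>% html_text()
  data.frame(text = txt)
}

# Like safely(), but returns the fallback value when scrape_one() errors
scrape_or_na <- possibly(scrape_one,
                         otherwise = data.frame(text = NA_character_))

# Demo input: two local HTML files and one path that does not exist
good_1 <- tempfile(fileext = ".html")
good_2 <- tempfile(fileext = ".html")
writeLines("<html><body><div class='one_third'>A</div></body></html>", good_1)
writeLines(paste0("<html><body><div class='one_third'>B</div>",
                  "<div class='one_third'>C</div></body></html>"), good_2)
demo_links <- c(good_1, file.path(tempdir(), "no_such_page.html"), good_2)

# The "link" column records the position of each link in demo_links
scraped <- map(demo_links, scrape_or_na) %>% bind_rows(.id = "link")
```

With real URLs, demo_links would simply be the original links vector. If the error messages matter, safely() can be used in place of possibly() and the result elements extracted as shown earlier.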