I have hundreds of .xlsx files that have the same structure. Some of the files in this folder are from an outdated structure that I do not use anymore. I am trying to apply a function to the files I want, while removing the ones I no longer want. I am trying to delete all of the files where the first column in the file contains the string "Prices"
. This never occurs in the new files (the ones I want to keep), and if it does not contain the "Prices"
string, it will continue to use a function
I have created to assign the wrangled data into a new list.
I have a for-loop that looks like so:
estimates_list <- vector(mode = "list", length = length(list.files(".")) # list I created
new_list <- vector(mode = "list", length = length(list.files("."))
for(i in 1:length(list.files("."))){
estimates_list[[i]] <- read_excel(list.files(".")[i], col_names = F) #reads in all of the files to a list
if(grepl("^Prices", estimates_list[[i]][,1])){
file.remove(list.files(".")[i])
} else{
new_list[[i]] <- test_function(estimates_list[[i]])
}
}
Everything works just as I want it to, except when the loop gets to a file where the grepl("^Prices", estimates_list[[i]][,1])
is TRUE
at any point in the first column, the file file does not get deleted.
To give an example, Let's say I run grepl("^Prices", estimates_list[[1]][,1])
, then it will output a vector of FALSE FALSE ... FALSE FLASE
. Everything is FALSE
, so I want it to then apply the function in the else
statement.
Now, if I run grepl("^Prices", estimates_list[[24]][,1])
, then I get a vector of TRUE FALSE TRUE FALSE FALSE TRUE
. I want this file to be deleted and the loop to continue down the list of files until the end of the estimate_list object. I want to note that the "Prices"
string does vary throughout the first column across all files.
CodePudding user response:
Maybe the following will work.
The code below does not create the two list in the for
loop. In fact, it only creates new_list
and only after the for
loop. In the loop it uses a cells range spec output by cellranger::cell_limits
to read only the 1st column and deletes the file if the regex pattern "^Prices"
is found in the read in column.
Then, after updating the files vector fls
, it reads in the files that were not deleted and processes them in two lapply
loops. Processing is protected by tryCatch
. Errors, if any, are put in an errors list err_list
and new_list
is the good data list.
I believe that like this it is simpler to detect errors.
fls <- list.files(".", full.names = TRUE)
first_col <- cellranger::cell_limits(c(1, 1), c(NA, 1))
for(i in 1:length(fls)){
estimates_sheet <- readxl::read_excel(
path = fls[i],
range = first_col,
col_names = FALSE
)
if(any(grepl("^Prices", estimates_sheet[[1]]))){
file.remove(fls[i])
}
}
fls <- list.files(".", full.names = TRUE)
new_list <- lapply(fls, read_excel, col_names = FALSE)
new_list <- lapply(new_list, \(xl) {
tryCatch(test_function(xl), error = function(e) e)
})
err <- sapply(new_list, inherits, "error")
err_list <- new_list[err]
new_list <- new_list[!err]