Deleting files based on the contents of the file in R-CodePudding

I have hundreds of .xlsx files that have the same structure. Some of the files in this folder are from an outdated structure that I do not use anymore. I am trying to apply a function to the files I want, while removing the ones I no longer want. I am trying to delete all of the files where the first column in the file contains the string "Prices". This never occurs in the new files (the ones I want to keep), and if it does not contain the "Prices" string, it will continue to use a function I have created to assign the wrangled data into a new list.

I have a for-loop that looks like so:

estimates_list <- vector(mode = "list", length = length(list.files(".")) # list I created
new_list <- vector(mode = "list", length = length(list.files("."))

for(i in 1:length(list.files("."))){
  estimates_list[[i]] <- read_excel(list.files(".")[i], col_names = F) #reads in all of the files to a list

  if(grepl("^Prices", estimates_list[[i]][,1])){ 

    file.remove(list.files(".")[i])

 } else{ 

   new_list[[i]] <- test_function(estimates_list[[i]])

 }
}

Everything works just as I want it to, except when the loop gets to a file where the grepl("^Prices", estimates_list[[i]][,1]) is TRUE at any point in the first column, the file file does not get deleted.

To give an example, Let's say I run grepl("^Prices", estimates_list[[1]][,1]), then it will output a vector of FALSE FALSE ... FALSE FLASE. Everything is FALSE, so I want it to then apply the function in the else statement.

Now, if I run grepl("^Prices", estimates_list[[24]][,1]), then I get a vector of TRUE FALSE TRUE FALSE FALSE TRUE. I want this file to be deleted and the loop to continue down the list of files until the end of the estimate_list object. I want to note that the "Prices" string does vary throughout the first column across all files.

CodePudding user response：

Maybe the following will work.

The code below does not create the two list in the for loop. In fact, it only creates new_list and only after the for loop. In the loop it uses a cells range spec output by cellranger::cell_limits to read only the 1st column and deletes the file if the regex pattern "^Prices" is found in the read in column.

Then, after updating the files vector fls, it reads in the files that were not deleted and processes them in two lapply loops. Processing is protected by tryCatch. Errors, if any, are put in an errors list err_list and new_list is the good data list.

I believe that like this it is simpler to detect errors.

fls <- list.files(".", full.names = TRUE)
first_col <- cellranger::cell_limits(c(1, 1), c(NA, 1))
for(i in 1:length(fls)){
  estimates_sheet <- readxl::read_excel(
    path = fls[i], 
    range = first_col,
    col_names = FALSE
  )
  if(any(grepl("^Prices", estimates_sheet[[1]]))){ 
    file.remove(fls[i])
  }
}

fls <- list.files(".", full.names = TRUE)
new_list <- lapply(fls, read_excel, col_names = FALSE)

new_list <- lapply(new_list, \(xl) {
  tryCatch(test_function(xl), error = function(e) e)
})
err <- sapply(new_list, inherits, "error")
err_list <- new_list[err]
new_list <- new_list[!err]