loop (for) in R to read separately many files (from list of files)


I have a list of Excel files that I want to read. As they sometimes have different columns, I want to read them separately, using their location as the name.

I'm new to loops, and I can't get the files named differently. This is what I tried, but it only reads one (of 40) files.

library(readxl)

myfiles = list.files(path="Invoices/GLS/", 
                     pattern = "0_*", full.names = TRUE, recursive = T)
myfiles

for(val in myfiles) {
   val <- read_excel(val)
}

CodePudding user response:

Read them into a list using sapply -

myfiles = list.files(path="Invoices/GLS/", pattern = "0_*", 
                     full.names = TRUE, recursive = TRUE)

val <- sapply(myfiles, readxl::read_excel, simplify = FALSE)

names(val) should return the paths of the files.
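For example, to see the names and pull a single dataframe out of the list (a quick sketch building on the call above):

# the list is named by file path, so elements can be accessed by position or by name
names(val)
first_df <- val[[1]]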

CodePudding user response:

As mentioned by Ronak, you're probably better off with the sapply approach because it is less verbose and easier to read, and it keeps your environment from being cluttered with objects (which would only make processing more labour-intensive down the line). For the sake of illustration, I'll show an approach using a for loop first, followed by sapply() and lapply().


The reason you only have one dataframe in your environment is that you're assigning the data to the same object in each iteration. As a result, the object is overwritten every time that line runs in your loop, and at the end of the loop val will contain only the last file loaded via read_excel(val). To solve this, you could use the assign() approach.

for(val in seq_along(myfiles)) {
  # create name of object to assign file to
  obj_name <- paste0("df_", val)
  
  # Import file
  df_file <- read_excel(myfiles[val])
  
  # Assign it to a separate object in global environment
  assign(obj_name, df_file)
  
  # Ensure no df_file remains in environment when loop ends
  rm(df_file)
}
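
Once the loop finishes, each file sits in its own object (df_1, df_2, and so on). As a small sketch, you can confirm this by listing the objects the loop created:

# list all per-file objects created by the loop
ls(pattern = "^df_")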

Naturally, you could give your R objects other names than df_1, df_2, and so forth. You could also write the above in fewer lines of code, but that would defeat the purpose of illustrating the different steps.


As for sapply(simplify = FALSE) and lapply(), your data would not be loaded into separate objects in your environment. Instead, your datasets would be loaded into a list. As an extra step, you can easily make a single dataframe out of that list if you so desire (a short sketch follows below). Regardless, I'd opt for one list of 40 dataframes over 40 separate objects sitting in my environment.

A convenient feature of sapply() is the USE.NAMES argument, which defaults to TRUE. This sets the names of the elements of the resulting list to your X argument (your input vector of filenames). The simplify = FALSE prevents the function from trying to coerce your list to another structure (in this case, a matrix).

# load excel files into a list of dataframes, retaining filenames
df_list <- sapply(myfiles, read_excel, simplify = FALSE)
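
If you do want one single dataframe later on, a minimal sketch using dplyr (bind_rows() fills NA where columns differ, and .id records which file each row came from) could look like this:

library(dplyr)

# stack all dataframes in the list into one, keeping the file path as an id column
combined <- bind_rows(df_list, .id = "source_file")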

However, in my opinion, the names of your list elements can become quite messy if you're dealing with full file paths (a result of full.names = TRUE in list.files()). In the two-step approach below, you first load your dataframes into a list and then set the names of that list using a regular expression. The expression extracts only the last part of the file path, i.e. the actual name of your file (e.g. invoice542.xlsx).

# load excel files into a list of dataframes
df_list <- lapply(myfiles, read_excel)

# name list elements using regex
names(df_list) <- sub(r"{[\w\/]+(?:\/)}", "", myfiles, perl = TRUE)

Bear in mind that regex isn't my strong suit, so the above expression could probably be written more clearly and concisely.
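
As an aside, if the regular expression feels opaque, base R's basename() achieves the same result without one:

# basename() keeps only the part after the last path separator
names(df_list) <- basename(myfiles)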

And that's it for my first ever SO contribution!
