I'm looking for some help with the following. I have many files in a folder; each is a txt file containing 16 columns, like this:
head(a1)
v1 | v2 | ... | v16 |
---|---|---|---|
2.0742 | 1.1520 | ... | 5.6852 |
-1.4071 | 1.1848 | ... | 2.7629 |
which I want to transform into a single long column using the data.table library:
library(data.table)
setDT(a1)
a1<-melt(a1)[, .(value)]
v1 |
---|
2.0742 |
-1.4071 |
... |
2.7629 |
What I want to do is automate this with a for loop that reads each file in the folder, applies melt, and exports the transformed files into another folder. Any idea where to start?
CodePudding user response:
What I am able to do right now is do it one by one with the following code:
#Get the path of filenames
filenames <- list.files("C:/Users/env/OneDrive/Desktop/groups/group1", full.names = TRUE)
#Read them in a list
list_data <- lapply(filenames, read.table)
#Name them as per your choice (a1, a2, etc.)
names(list_data) <- paste('a', seq_along(filenames), sep = '')
#Create objects in global environment.
list2env(list_data, .GlobalEnv)
setDT(a1)
a1<-melt(a1)[, .(value)]
#export as csv
write.csv(a1, "C:/Users/env/OneDrive/Desktop/groups/a1.csv", row.names = FALSE)
#export file as rda
saveRDS(a1,"C:/Users/env/OneDrive/Desktop/groups/a1.Rda")
What I would like to do is to build a for loop to automate the procedure, starting from listing the files in the two folders like this:
folders <- list("C:/Users/env/OneDrive/Desktop/groups/group1","C:/Users/env/OneDrive/Desktop/groups/group2")
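One way to complete the loop, sketched here with fread/fwrite instead of read.table/write.csv, and with hypothetical in_dir/out_dir paths standing in for the group1/group2 folders (substitute your actual paths):

```r
library(data.table)

# Hypothetical directories standing in for the real group folders;
# substitute your own paths, e.g. the OneDrive ones above.
in_dir  <- file.path(tempdir(), "group1")
out_dir <- file.path(tempdir(), "group1_long")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)

# Create one small demo file so the sketch runs end to end.
fwrite(as.data.table(matrix(rnorm(32), ncol = 16)),
       file.path(in_dir, "a1.txt"))

# Read each file, melt all 16 columns into one long column, export as csv
for (f in list.files(in_dir, full.names = TRUE)) {
  dt   <- fread(f)
  long <- melt(dt, measure.vars = names(dt))[, .(value)]
  out  <- file.path(out_dir, paste0(tools::file_path_sans_ext(basename(f)), ".csv"))
  fwrite(long, out)
}
```

Passing measure.vars explicitly avoids melt's automatic variable guessing message; to also keep .Rda copies, add a saveRDS() call inside the loop as in the one-by-one code above.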
CodePudding user response:
According to OP's comments, there are 2 directories with 50 files each, every file holding 7000 rows and 16 columns. Assuming all columns are of type double, which requires 8 bytes each, the total data volume is roughly 100 MB and can be stored and processed in memory.
So, my suggestion is to read all data in one go and combine and process it in one large data.table in memory.
Here is what I would do using my preferred tools:
library(data.table)
library(magrittr)
file_names <- list.files(test_dir, full.names = TRUE)
all_wide <- lapply(file_names, fread) %>%
set_names(basename(file_names)) %>%
rbindlist(idcol = "file_name")
all_long <- melt(all_wide, id.vars = "file_name")
all_long
           file_name variable     value
              <char>   <fctr>     <num>
      1: File001.txt       V1  101.0000
      2: File001.txt       V1  101.0000
      3: File001.txt       V1  101.0000
      4: File001.txt       V1  101.0000
      5: File001.txt       V1  101.0001
     ---
5599996: File050.txt      V16 5016.0700
5599997: File050.txt      V16 5016.0700
5599998: File050.txt      V16 5016.0700
5599999: File050.txt      V16 5016.0700
5600000: File050.txt      V16 5016.0700
This processes all files in directory test_dir.
Memory consumption can be displayed by
tables()
       NAME      NROW NCOL  MB                         COLS KEY
1: all_long 5,600,000    3 107     file_name,variable,value
2: all_wide   350,000   17  45 file_name,V1,V2,V3,V4,V5,...
3:        d     7,000   16   1        V1,V2,V3,V4,V5,V6,...
Total: 153MB
The source of each row can be identified by file_name.
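If separate output files per source file are still wanted, as in the original question, the combined table can be split back out by file_name. A minimal sketch, using a small stand-in for all_long and a hypothetical out_dir:

```r
library(data.table)

# Small stand-in for the all_long table built above; use the real one in practice.
all_long <- data.table(
  file_name = rep(c("File001.txt", "File002.txt"), each = 4),
  variable  = factor(rep(c("V1", "V2"), times = 4)),
  value     = 1:8 / 10
)

# Hypothetical output directory for the per-file exports.
out_dir <- file.path(tempdir(), "long_out")
dir.create(out_dir, showWarnings = FALSE)

# Write one csv of long data per original source file;
# .BY holds the current group's file_name, .SD the remaining columns.
all_long[, fwrite(.SD, file.path(out_dir, sub("\\.txt$", ".csv", .BY$file_name))),
         by = file_name]
```

Grouped fwrite keeps everything in one data.table call, so no explicit loop over file names is needed.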
Data for testing
Warning: The code below will create a subdirectory and nfil files in the TMPDIR directory.
library(data.table)
nfil <- 50 # number of files
nrow <- 7000 # number of rows per file
ncol <- 16 # number of columns
test_dir <- file.path(tempdir(), paste0("files_in_", as.integer(Sys.time())))
print(test_dir)
dir.create(test_dir)
for (ifil in seq(nfil)) {
d <- data.table()
for (icol in seq(ncol)) set(d, , paste0("V", icol), ifil * 100 + icol + seq(nrow)/10^(ceiling(log10(nrow)) + 1))
fwrite(d, file.path(test_dir, sprintf("File%03i.txt", ifil)))
print(d)
}
dir(test_dir)