I'm looking for some help with the following. I have many files in a folder; each is a txt file containing 16 columns, like this:
head(a1)
v1 | v2 | ... | v16 |
---|---|---|---|
2.0742 | 1.1520 | ... | 5.6852 |
-1.4071 | 1.1848 | ... | 2.7629 |
which I want to transform into a single long column using the data.table library:
library(data.table)
setDT(a1)
a1<-melt(a1)[, .(value)]
v1 |
---|
2.0742 |
-1.4071 |
... |
2.7629 |
What I want to do is automate this with a for loop that reads each file in the folder, applies melt, and exports the transformed files into another folder. Any idea where to start?
CodePudding user response:
What I am able to do right now is do it one by one with the following code:
#Get the path of filenames
filenames <- list.files("C:/Users/env/OneDrive/Desktop/groups/group1", full.names = TRUE)
#Read them in a list
list_data <- lapply(filenames, read.table)
#Name them as per your choice (a1, a2, etc.)
names(list_data) <- paste('a', seq_along(filenames), sep = '')
#Create objects in global environment.
list2env(list_data, .GlobalEnv)
setDT(a1)
a1<-melt(a1)[, .(value)]
#export as csv
write.csv(a1, "C:/Users/env/OneDrive/Desktop/groups/a1.csv", row.names = FALSE)
#export file as rda
saveRDS(a1,"C:/Users/env/OneDrive/Desktop/groups/a1.Rda")
What I would like to do is to build a for loop to automate the procedure, starting from listing the files in the two folders like this:
folders <- list("C:/Users/env/OneDrive/Desktop/groups/group1","C:/Users/env/OneDrive/Desktop/groups/group2")
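One way to complete the loop, sketched here with fread/fwrite instead of read.table/write.csv, and with hypothetical in_dir/out_dir paths standing in for the group1/group2 folders (substitute your actual paths):

```r
library(data.table)

# Hypothetical directories standing in for the real group folders;
# substitute your own paths, e.g. the OneDrive ones above.
in_dir  <- file.path(tempdir(), "group1")
out_dir <- file.path(tempdir(), "group1_long")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)

# Create one small demo file so the sketch runs end to end.
fwrite(as.data.table(matrix(rnorm(32), ncol = 16)),
       file.path(in_dir, "a1.txt"))

# Read each file, melt all 16 columns into one long column, export as csv
for (f in list.files(in_dir, full.names = TRUE)) {
  dt   <- fread(f)
  long <- melt(dt, measure.vars = names(dt))[, .(value)]
  out  <- file.path(out_dir, paste0(tools::file_path_sans_ext(basename(f)), ".csv"))
  fwrite(long, out)
}
```

Passing measure.vars explicitly avoids melt's automatic variable guessing message; to also keep .Rda copies, add a saveRDS() call inside the loop as in the one-by-one code above.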
CodePudding user response:
According to OP's comments, there are 2 directories with 50 files each, every file holding 7000 rows and 16 columns. Assuming all columns are of type double, which requires 8 bytes each, the total data volume is roughly 100 MB and can be stored and processed in memory.
So, my suggestion is to read all data in one go and combine and process it in one large data.table in memory.
Here is what I would do using my preferred tools:
library(data.table)
library(magrittr)
file_names <- list.files(test_dir, full.names = TRUE)
all_wide <- lapply(file_names, fread) %>%
set_names(basename(file_names)) %>%
rbindlist(idcol = "file_name")
all_long <- melt(all_wide, id.vars = "file_name")
all_long
           file_name variable     value
              <char>   <fctr>     <num>
      1: File001.txt       V1  101.0000
      2: File001.txt       V1  101.0000
      3: File001.txt       V1  101.0000
      4: File001.txt       V1  101.0000
      5: File001.txt       V1  101.0001
     ---
5599996: File050.txt      V16 5016.0700
5599997: File050.txt      V16 5016.0700
5599998: File050.txt      V16 5016.0700
5599999: File050.txt      V16 5016.0700
5600000: File050.txt      V16 5016.0700
This processes all files in directory test_dir.
Memory consumption can be displayed by
tables()
       NAME      NROW NCOL  MB                         COLS KEY
1: all_long 5,600,000    3 107     file_name,variable,value
2: all_wide   350,000   17  45 file_name,V1,V2,V3,V4,V5,...
3:        d     7,000   16   1        V1,V2,V3,V4,V5,V6,...
Total: 153MB
The source of each row can be identified by file_name.
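If separate output files per source file are still wanted, as in the original question, the combined table can be split back out by file_name. A minimal sketch, using a small stand-in for all_long and a hypothetical out_dir:

```r
library(data.table)

# Small stand-in for the all_long table built above; use the real one in practice.
all_long <- data.table(
  file_name = rep(c("File001.txt", "File002.txt"), each = 4),
  variable  = factor(rep(c("V1", "V2"), times = 4)),
  value     = 1:8 / 10
)

# Hypothetical output directory for the per-file exports.
out_dir <- file.path(tempdir(), "long_out")
dir.create(out_dir, showWarnings = FALSE)

# Write one csv of long data per original source file;
# .BY holds the current group's file_name, .SD the remaining columns.
all_long[, fwrite(.SD, file.path(out_dir, sub("\\.txt$", ".csv", .BY$file_name))),
         by = file_name]
```

Grouped fwrite keeps everything in one data.table call, so no explicit loop over file names is needed.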
Data for testing
Warning: The code below will create a subdirectory and nfil files in the TMPDIR directory.
library(data.table)
nfil <- 50 # number of files
nrow <- 7000 # number of rows per file
ncol <- 16 # number of columns
test_dir <- file.path(tempdir(), paste0("files_in_", as.integer(Sys.time())))
print(test_dir)
dir.create(test_dir)
for (ifil in seq(nfil)) {
d <- data.table()
for (icol in seq(ncol)) set(d, , paste0("V", icol), ifil * 100 + icol + seq(nrow)/10^(ceiling(log10(nrow)) + 1))
fwrite(d, file.path(test_dir, sprintf("File%03i.txt", ifil)))
print(d)
}
dir(test_dir)