I have a total of 757 .csv files, about 80 MB each, totaling 57.4 GB of data.
I am trying to combine all of these files into one file. The code I used worked on a small sample of the data (52 files), but it is running into an error when I try to do it on all of the data.
This is the code I used for the sample:
# Reading in multiple files of genetic data at once
library(data.table)   # fread() comes from data.table

setwd()   # path to the folder holding the .csv files omitted here
files <- list.files(pattern = "\\.csv$")   # files whose names end in .csv
gene1 <- do.call(rbind, lapply(files, fread))
rm(files)
gene1 <- as.data.frame(unclass(gene1))
The problem I'm having is that RStudio crashes after a few minutes of combining the data, whereas the 52-file sample took only about a minute.
I'm guessing it's a memory issue. Is there a way to keep memory from filling up? I know Python has some sort of chunking option.
If you think the issue is something else, please let me know what you think.
Thank you!
CodePudding user response:
Here is a brute-force "chunk by 50" solution:
library(data.table)   # for fread() and data.table()

files <- list.files(pattern = "\\.csv$")

# Start (x) and end (y) file indices for chunks of 50 files (757 files in total)
chunks <- data.frame(x = seq(from = 1, to = 751, by = 50),
                     y = c(seq(from = 50, to = 750, by = 50), 757))

all_genes <- data.table()
for (i in 1:NROW(chunks)) {
  # Read one chunk of files, bind them, and add them to the running result
  gene1 <- do.call(rbind, lapply(files[chunks[i, "x"]:chunks[i, "y"]], fread))
  all_genes <- rbind(gene1, all_genes)
}
all_genes <- as.data.frame(unclass(all_genes))
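If holding the full ~57 GB result in memory is still too much, a variation on the same chunking idea is to append each chunk to a single output file on disk instead of accumulating everything in one data.table. A minimal sketch, reusing the files and chunks objects above; the output name all_genes.csv is just a placeholder:

out_file <- "all_genes.csv"   # placeholder output file name

for (i in 1:NROW(chunks)) {
  # rbindlist() binds the data.tables returned by fread() for this chunk
  chunk <- rbindlist(lapply(files[chunks[i, "x"]:chunks[i, "y"]], fread))
  # Write the header only for the first chunk, then append without a header
  fwrite(chunk, out_file, append = (i > 1))
  rm(chunk); gc()   # release the chunk before reading the next one
}

This keeps only one chunk (roughly 50 x 80 MB, about 4 GB of CSV) in memory at a time, at the cost of ending up with a combined file on disk rather than a data frame in the R session.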
CodePudding user response:
To reframe the question, I propose the command-line CSV Tool Kit (csvkit), in particular its csvstack tool:
https://csvkit.readthedocs.io/en/latest/scripts/csvstack.html
Stacking a set of homogeneous files (files with the same columns):
$ csvstack folder/*.csv > all_csvs.csv
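If csvkit is not already installed, it is a Python package distributed on PyPI, so installation is typically just (exact setup may vary by environment):

$ pip install csvkit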