I have a total of 757 .csv files, about 80 MB each, totaling 57.4 GB of data.
I am trying to combine all of these files into one file. The code I used worked on a small sample of the data (52 files), but it is running into an error when I try to do it on all of the data.
This is the code I used for the sample:
# Reading in multiple files of genetic data at once
library(data.table)   # fread() comes from data.table

setwd()   # path to the folder holding the .csv files omitted here
files <- list.files(pattern = "\\.csv$")   # files whose names end in .csv
gene1 <- do.call(rbind, lapply(files, fread))
rm(files)
gene1 <- as.data.frame(unclass(gene1))
The problem I'm having is that RStudio crashes after a few minutes of combining the data, whereas the 52-file sample took only about a minute.
I'm guessing it's a memory issue. Is there a way to keep memory from filling up? I know Python has some sort of chunking option.
If you think the issue is something else, please let me know what you think.
Thank you!
CodePudding user response:
Here is a brute-force "chunk by 50" solution:
library(data.table)   # for fread() and data.table()

files <- list.files(pattern = "\\.csv$")

# Start (x) and end (y) file indices for chunks of 50 files (757 files in total)
chunks <- data.frame(x = seq(from = 1, to = 751, by = 50),
                     y = c(seq(from = 50, to = 750, by = 50), 757))

all_genes <- data.table()
for (i in 1:NROW(chunks)) {
  # Read one chunk of files, bind them, and add them to the running result
  gene1 <- do.call(rbind, lapply(files[chunks[i, "x"]:chunks[i, "y"]], fread))
  all_genes <- rbind(gene1, all_genes)
}
all_genes <- as.data.frame(unclass(all_genes))
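If holding the full ~57 GB result in memory is still too much, a variation on the same chunking idea is to append each chunk to a single output file on disk instead of accumulating everything in one data.table. A minimal sketch, reusing the files and chunks objects above; the output name all_genes.csv is just a placeholder:

out_file <- "all_genes.csv"   # placeholder output file name

for (i in 1:NROW(chunks)) {
  # rbindlist() binds the data.tables returned by fread() for this chunk
  chunk <- rbindlist(lapply(files[chunks[i, "x"]:chunks[i, "y"]], fread))
  # Write the header only for the first chunk, then append without a header
  fwrite(chunk, out_file, append = (i > 1))
  rm(chunk); gc()   # release the chunk before reading the next one
}

This keeps only one chunk (roughly 50 x 80 MB, about 4 GB of CSV) in memory at a time, at the cost of ending up with a combined file on disk rather than a data frame in the R session.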
CodePudding user response:
To reframe the question, I propose the command-line CSV Tool Kit (csvkit), in particular its csvstack tool:
https://csvkit.readthedocs.io/en/latest/scripts/csvstack.html
Stacking a set of homogeneous files (files with the same columns):
$ csvstack folder/*.csv > all_csvs.csv
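If csvkit is not already installed, it is a Python package distributed on PyPI, so installation is typically just (exact setup may vary by environment):

$ pip install csvkit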