I have 2500 CSV files, all with the same columns and a variable number of observations.
Each file is approximately 3 MB (~10,000 observations per file).
Ideally, I would like to read all of these into a single dataframe.
Each file represents a generation and contains information on traits, phenotypes and allele frequencies.
While reading in this data, I am also trying to add an extra column to each file indicating its generation.
I have written the following code:
read_data <- function(ex_files, ex) {
  df <- NULL
  ex <- as.character(ex)
  for (n in 1:length(ex_files)) {
    # Build the path, e.g. "Experiment 1/all e1 gen3.csv", and read the file
    temp <- read.csv(paste("Experiment ", ex, "/all e", ex, " gen", as.character(n), ".csv", sep = ""))
    temp$generation <- n      # tag every row with its generation number
    df <- rbind(df, temp)     # grow the result one file at a time
  }
  return(df)
}
ex_files refers to the list of files (its length sets the number of iterations), while ex refers to the experiment number, since the experiment was performed in replicate (i.e. I have multiple experiments, each with 2500 CSV files).
I am currently running it (I hope it's written correctly!); however, it is taking quite a while (as expected). I'm wondering if there is a quicker way of doing this?
CodePudding user response:
It is inefficient to grow objects in a loop. List all the files that you want to read using list.files, then combine them into one dataframe with purrr::map_df; its .id argument adds a column called generation that gives a unique number to each file.
filenames <- list.files(pattern = '\\.csv', full.names = TRUE)  # all CSV files in the working directory
df <- purrr::map_df(filenames, read.csv, .id = 'generation')    # .id records each file's position
head(df)
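Note that with an unnamed filenames vector, the generation column created by .id is character ("1", "2", ...), so it may need as.integer() afterwards. If reading speed itself is the bottleneck, here is a rough sketch of an alternative (not part of this answer) using data.table, which is usually much faster than read.csv for thousands of files, assuming all files really do share identical columns:

# Sketch with data.table; fread() reads each file, rbindlist() binds everything once.
library(data.table)
filenames <- list.files(pattern = '\\.csv', full.names = TRUE)
dt_list <- lapply(filenames, fread)                 # read every file into a list
df <- rbindlist(dt_list, idcol = 'generation')      # idcol adds the list index (1, 2, ...) per file
head(df)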
CodePudding user response:
Try the plyr package:
filenames = list.files(pattern = '\\.csv', full.names = TRUE)
df = plyr::ldply(filenames, read.csv)
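The original question also wants a generation column, which this call does not add. One possibility, a sketch relying on plyr's documented .id argument for named lists (not part of the original answer), is to name the list before calling ldply:

# Sketch only: name each element so ldply's .id can record which file a row came from.
filenames = list.files(pattern = '\\.csv', full.names = TRUE)
files = as.list(filenames)
names(files) = seq_along(files)                        # 1, 2, ... used as the generation label
df = plyr::ldply(files, read.csv, .id = 'generation')  # 'generation' holds each file's index
head(df)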