I have 2500 CSV files, all with the same columns and a variable number of observations.
Each file is approximately 3 MB (~10,000 observations per file).
Ideally, I would like to read all of these into a single dataframe.
Each file represents a generation and contains information on traits, phenotypes and allele frequencies.
While reading in this data, I am also trying to add an extra column to each file indicating its generation.
I have written the following code:
read_data <- function(ex_files, ex) {
  df <- NULL
  ex <- as.character(ex)
  for (n in 1:length(ex_files)) {
    # Build the path, e.g. "Experiment 1/all e1 gen3.csv", and read the file
    temp <- read.csv(paste("Experiment ", ex, "/all e", ex, " gen", as.character(n), ".csv", sep = ""))
    temp$generation <- n      # tag every row with its generation number
    df <- rbind(df, temp)     # grow the result one file at a time
  }
  return(df)
}
ex_files refers to the list of files (its length sets the number of iterations), while ex refers to the experiment number, since the experiment was performed in replicate (i.e. I have multiple experiments, each with 2500 CSV files).
I am currently running it (I hope it's written correctly!); however, it is taking quite a while (as expected). I'm wondering if there is a quicker way of doing this?
CodePudding user response:
It is inefficient to grow objects in a loop. List all the files that you want to read using list.files, then combine them into one dataframe with purrr::map_df; its .id argument adds a column called generation that gives a unique number to each file.
filenames <- list.files(pattern = '\\.csv', full.names = TRUE)  # all CSV files in the working directory
df <- purrr::map_df(filenames, read.csv, .id = 'generation')    # .id records each file's position
head(df)
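Note that with an unnamed filenames vector, the generation column created by .id is character ("1", "2", ...), so it may need as.integer() afterwards. If reading speed itself is the bottleneck, here is a rough sketch of an alternative (not part of this answer) using data.table, which is usually much faster than read.csv for thousands of files, assuming all files really do share identical columns:

# Sketch with data.table; fread() reads each file, rbindlist() binds everything once.
library(data.table)
filenames <- list.files(pattern = '\\.csv', full.names = TRUE)
dt_list <- lapply(filenames, fread)                 # read every file into a list
df <- rbindlist(dt_list, idcol = 'generation')      # idcol adds the list index (1, 2, ...) per file
head(df)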
CodePudding user response:
Try the plyr package:
filenames = list.files(pattern = '\\.csv', full.names = TRUE)
df = plyr::ldply(filenames, read.csv)
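The original question also wants a generation column, which this call does not add. One possibility, a sketch relying on plyr's documented .id argument for named lists (not part of the original answer), is to name the list before calling ldply:

# Sketch only: name each element so ldply's .id can record which file a row came from.
filenames = list.files(pattern = '\\.csv', full.names = TRUE)
files = as.list(filenames)
names(files) = seq_along(files)                        # 1, 2, ... used as the generation label
df = plyr::ldply(files, read.csv, .id = 'generation')  # 'generation' holds each file's index
head(df)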