Home > Mobile >  Load in multiple CSV files and add suffix to column names in R
Load in multiple CSV files and add suffix to column names in R

Time:09-17

I am trying to load in a series of CSV files and then append a suffix to each column in the CSV (except for the primary key (subject_id). Each csv looks something like this

subject_id var1 var2
1 55 57
2 55 57

Imagine this csv file was titled data1 and subsequent files are titled data2, data3... etc.

For each csv that I load, I would like to to convert the table into something like

subject_id var1_data1 var2_data1
1 55 57
2 55 57
subject_id var1_data2 var2_data2
1 55 57
2 55 57

I know how to load in the datasets;

filenames <- list.files(path= "data", full.names = TRUE)

datasets <- lapply(filenames, read_csv) 

but I am struggling with figuring out how to write a loop/lapply statement to add the suffixes in the way I want.

The function below, will add a suffix but it is static.

lapply(datasets, function(df) {
  names(df)[-1] <- paste0(names(df)[-1], "_data1")
  df
})

The next thing I tried was to sandwhich a for loop in the middle of the function above

filenames2 <- sub('\\.csv$', '', list.files(path = "data"))

lapply(dataset3, function(df) {
  for (val in filenames2){
    names(df)[-1] <- paste0(names(df)[-1], val)
    df
  }
})

But this just changes everything to NULL/doesn't work. Does anyone have a thought on what might be the best way to proceed? I am also open to solutions in python, though R would be preferred.

Thank you!

CodePudding user response:

If every dataset has the same columns, another approach would be to make a single data.frame, where a column would be the dataset origin, here a way to do that.

datasets <-
  purrr::map_df(
    .x = filenames,
    .f = read_csv,
    .id = "dataset"
  )

CodePudding user response:

Suppose we have the files generated reproducibly in the Note at the end.

Then we get the file names in fnames and Map a function Read over them to read in each file and fix the names returning the fixed up data frame.

fnames <- Sys.glob("data*.csv")

Read <- function(f) {
  df <- read.csv(f)
  names(df)[-1] <- paste0(names(df[-1]), "_", sub(".csv$", "", basename(f)))
  df
}
L <- Map(Read, fnames)

str(L)

giving this named list:

List of 3
 $ data1.csv:'data.frame':      2 obs. of  3 variables:
  ..$ subject_id: int [1:2] 1 2
  ..$ var1_data1: int [1:2] 55 55
  ..$ var2_data1: int [1:2] 57 57
 $ data2.csv:'data.frame':      2 obs. of  3 variables:
  ..$ subject_id: int [1:2] 1 2
  ..$ var1_data2: int [1:2] 55 55
  ..$ var2_data2: int [1:2] 57 57
 $ data3.csv:'data.frame':      2 obs. of  3 variables:
  ..$ subject_id: int [1:2] 1 2
  ..$ var1_data3: int [1:2] 55 55
  ..$ var2_data3: int [1:2] 57 57

Note

Lines <- "subject_id var1 var2
1 55 57
2 55 57"
data1 <- data2 <- data3 <- read.table(text = Lines, header = TRUE)
for(f in c("data1", "data2", "data3")) write.csv(get(f), paste0(f, ".csv"), row.names = FALSE, quote = FALSE)
  • Related