merge dataframes in a function in R

Time:09-12

I have the following folder structure:

-test1
-test2
-test3
-test4
-test5

Within those folders there are .tsv files, which I am able to open via a function I wrote:

process <- function(f){
  df <- read.csv(f, sep = "\t", header = F)
  colnames(df) <- c("1","2","3","4","5","6","7","8","9")
  df <- df[-c(1, 2, 3, 4, 5, 6),]
  df <- df[c("1", "7")]
  df <- merge(df, df, by="1") 
  print(df)
}
files <- dir("path", recursive = T, full.names = T, pattern = "*.tsv")
sapply(files, process)

This prints the dataframes I need, but what I want is to automatically merge them into one, joining on column 1. The code above does not do that; instead I get the following error: Error in as.data.frame(y) : argument "y" is missing, with no default

CodePudding user response:

Here is a solution. Untested, since there are no data.

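process <- function(f){
  # Read a tab-separated file that has no header row
  df <- read.delim(f, header = FALSE)
  colnames(df) <- c("1","2","3","4","5","6","7","8","9")
  # Drop the first six rows
  df <- df[-c(1, 2, 3, 4, 5, 6),]
  # Return only columns "1" and "7"; the merge happens later, outside the function
  df[c("1", "7")]
}
# "\\.tsv$" is a regular expression matching file names that end in ".tsv"
files <- dir("path", recursive = TRUE, full.names = TRUE, pattern = "\\.tsv$")
# One data frame per file, collected in a list
df_list <- lapply(files, process)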

After reading the files into df_list, the following will merge (join) them by column "1", giving a wider result with one value column per file. The \(x, y) shorthand for function(x, y) requires R 4.1 or later.

df_final <- Reduce(\(x, y) merge(x, y, by = "1"), df_list)
names(df_final)[-1] <- sprintf("Var%d", seq_along(names(df_final)[-1]))
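
To see the merge step in isolation, here is a minimal sketch that uses three small in-memory data frames in place of the files. The names d1, d2, d3, toy_list and toy_wide, and all the values, are made up for illustration; they stand in for whatever process() returns on your real data.

```r
# Toy stand-ins for the per-file data frames returned by process();
# names and values are invented for illustration only.
d1 <- data.frame(`1` = c("g1", "g2", "g3"), `7` = c(10, 20, 30), check.names = FALSE)
d2 <- data.frame(`1` = c("g1", "g2", "g3"), `7` = c(11, 21, 31), check.names = FALSE)
d3 <- data.frame(`1` = c("g1", "g2", "g3"), `7` = c(12, 22, 32), check.names = FALSE)
toy_list <- list(d1, d2, d3)

# Fold merge() over the list: first d1 with d2, then that result with d3,
# always joining on column "1".
toy_wide <- Reduce(\(x, y) merge(x, y, by = "1"), toy_list)

# merge() adds suffixes to the clashing "7" columns, so rename every
# column except the key to Var1, Var2, ...
names(toy_wide)[-1] <- sprintf("Var%d", seq_along(names(toy_wide)[-1]))
```

Note that merge() by default keeps only the keys present in both inputs. If some files may lack certain keys and you want to keep them anyway, passing all = TRUE to merge() is one option.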

If instead you want to bind the files by rows, giving a longer result, use

df_final <- do.call(rbind, df_list)

Or, to know the files where the data comes from, include their names in a new column.

df_list2 <- lapply(seq_along(files), \(i) {
  cbind(data.frame(file = files[i]), df_list[[i]])
})
df_final_long <- do.call(rbind, df_list2)
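
A small sketch of the same idea on toy data, so it can be run without any files on disk. The paths in toy_files and the data frames in toy_list are hypothetical and only serve as an illustration.

```r
# Hypothetical file paths and toy data frames, invented for illustration.
toy_files <- c("test1/a.tsv", "test2/b.tsv")
toy_list <- list(
  data.frame(`1` = c("g1", "g2"), `7` = c(10, 20), check.names = FALSE),
  data.frame(`1` = c("g1", "g2"), `7` = c(11, 21), check.names = FALSE)
)

# Prepend a "file" column to each data frame (the single path is recycled
# to every row), then stack the pieces by rows.
toy_list2 <- lapply(seq_along(toy_files), \(i) {
  cbind(data.frame(file = toy_files[i]), toy_list[[i]])
})
toy_long <- do.call(rbind, toy_list2)
```

The file column then makes it possible to filter or split the long result by source, for example with split(toy_long, toy_long$file).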