I have the following folder structure:
-test1
-test2
-test3
-test4
-test5
Within those folders there are .tsv files, which I am able to open via a function I wrote:
process <- function(f){
  df <- read.csv(f, sep = "\t", header = F)
  colnames(df) <- c("1","2","3","4","5","6","7","8","9")
  df <- df[-c(1, 2, 3, 4, 5, 6),]
  df <- df[c("1", "7")]
  df <- merge(df, df, by="1")
  print(df)
}
files <- dir("path", recursive = T, full.names = T, pattern = "*.tsv")
sapply(files, process)
This prints the data frames I need, but what I want is to automatically merge them into one data frame, joining on column 1. The code above does not do that. I get the following error: Error in as.data.frame(y) : argument "y" is missing, with no default
Answer:
Here is a solution. Untested, since there are no data.
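process <- function(f){
  # read.delim() defaults to tab-separated input
  df <- read.delim(f, header = FALSE)
  colnames(df) <- c("1","2","3","4","5","6","7","8","9")
  # drop the first six rows
  df <- df[-c(1, 2, 3, 4, 5, 6),]
  # return only columns "1" and "7"; no merging or printing here
  df[c("1", "7")]
}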
files <- dir("path", recursive = T, full.names = T, pattern = "\\.tsv$")
df_list <- lapply(files, process)
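Since the question has no data, here is a minimal sketch for trying the reading step on made-up files. The folder name tsv_demo, the file name data.tsv and the numeric contents are assumptions for illustration only; the real files may look different.

```r
# Hypothetical sample data: two folders, each with one headerless,
# tab-separated file of 8 rows and 9 columns.
root <- file.path(tempdir(), "tsv_demo")
for (d in c("test1", "test2")) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
  m <- matrix(seq_len(8 * 9), nrow = 8, ncol = 9)
  write.table(m, file.path(root, d, "data.tsv"), sep = "\t",
              row.names = FALSE, col.names = FALSE)
}

process <- function(f) {
  df <- read.delim(f, header = FALSE)
  colnames(df) <- as.character(1:9)
  df <- df[-(1:6), ]   # drop the first six rows
  df[c("1", "7")]      # keep columns "1" and "7"
}

files <- dir(root, recursive = TRUE, full.names = TRUE, pattern = "\\.tsv$")
df_list <- lapply(files, process)
str(df_list)           # a list with one two-column data frame per file
```

Each element of df_list should hold the rows left after dropping the first six, restricted to columns "1" and "7".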
After reading the files into df_list, the following will merge (join) them by column "1", giving a wider result:
df_final <- Reduce(\(x, y) merge(x, y, by = "1"), df_list)
names(df_final)[-1] <- sprintf("Var%d", seq_along(names(df_final)[-1]))
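To see what the Reduce() merge produces, here is a small sketch on three made-up data frames standing in for df_list. The values are hypothetical, and function(x, y) is used in place of the \(x, y) shorthand so that it also runs on R versions before 4.1.

```r
# Hypothetical stand-ins for the data frames returned by process()
df_list <- list(
  data.frame("1" = c("a", "b", "c"), "7" = c(1, 2, 3), check.names = FALSE),
  data.frame("1" = c("a", "b", "c"), "7" = c(10, 20, 30), check.names = FALSE),
  data.frame("1" = c("a", "b", "d"), "7" = c(100, 200, 400), check.names = FALSE)
)

# Merge the first two, then merge that result with the third, and so on
df_final <- Reduce(function(x, y) merge(x, y, by = "1"), df_list)

# Give the value columns distinct names: Var1, Var2, ...
names(df_final)[-1] <- sprintf("Var%d", seq_along(names(df_final)[-1]))
print(df_final)
```

Note that merge() performs an inner join by default, so only keys present in every data frame survive (here "a" and "b"). Passing all = TRUE to merge() would keep every key and fill the gaps with NA.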
If instead you want to bind the files by rows, giving a longer result, use:
df_final <- do.call(rbind, df_list)
Or, to keep track of which file each row comes from, include the file names in a new column:
df_list2 <- lapply(seq_along(files), \(i) {
  cbind(data.frame(file = files[i]), df_list[[i]])
})
df_final_long <- do.call(rbind, df_list2)
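As a quick check of the long format, the same steps can be sketched on made-up inputs. The file paths and values below are hypothetical and only illustrate the shape of the result.

```r
# Hypothetical file paths and matching data frames
files <- c("test1/a.tsv", "test2/b.tsv")
df_list <- list(
  data.frame("1" = c("a", "b"), "7" = c(1, 2), check.names = FALSE),
  data.frame("1" = c("a", "b"), "7" = c(10, 20), check.names = FALSE)
)

# Prepend a 'file' column; the single file name is recycled over all rows
df_list2 <- lapply(seq_along(files), function(i) {
  cbind(data.frame(file = files[i]), df_list[[i]])
})

# Stack everything into one long data frame
df_final_long <- do.call(rbind, df_list2)
print(df_final_long)
```

The result has one row per original row, with columns file, "1" and "7", so the rows can later be grouped or filtered by source file.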