I have several data frames with the format shown below. I want to join/merge the data frames by species and extract the kmers column from each of them, such that the output contains one column with species and multiple kmers columns, one from each of the files. Each kmers column should then be given the name of the file from which it originated.
df1
reads taxReads kmers species
232 2323 23234 Bacteria
555 12 4545 Virus
df2
reads taxReads kmers species
12 23 56 Bacteria
932 1213 12 Virus
out
species df1 df2
Bacteria 23234 56
Virus 4545 12
I have tried making a script using join_all, but it does not select the correct column (kmers):
file_list = list.files(pattern="tsv$")
datalist = lapply(file_list, function(x){
dat = read.csv(file=x, header=T, sep = "\t")
names(dat)[2] = x
return(dat)
})
joined <- join_all(dfs = datalist,by = "species",type ="full" )
Answer:
I'll assume that you've read the files into a list of frames, named by the basename of each file (with the extension removed). Naming the list of frames dfs, we have
dfs <- list(
  df1 = structure(list(reads = c(232L, 555L), taxReads = c(2323L, 12L),
                       kmers = c(23234L, 4545L), species = c("Bacteria", "Virus")),
                  class = "data.frame", row.names = c(NA, -2L)),
  df2 = structure(list(reads = c(12L, 932L), taxReads = c(23L, 1213L),
                       kmers = c(56L, 12L), species = c("Bacteria", "Virus")),
                  class = "data.frame", row.names = c(NA, -2L))
)
dfs
# $df1
# reads taxReads kmers species
# 1 232 2323 23234 Bacteria
# 2 555 12 4545 Virus
# $df2
# reads taxReads kmers species
# 1 12 23 56 Bacteria
# 2 932 1213 12 Virus
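If the frames are not yet in such a named list, one way to build it is sketched below. This is only a sketch, assuming tab-separated files ending in .tsv. The temporary directory, the two demo files, and the name dfs_from_files are hypothetical and exist only to make the example self-contained.

```r
# Demo setup (hypothetical): write two small .tsv files into a fresh temp directory.
tmp <- file.path(tempdir(), "kmers_demo")
dir.create(tmp, showWarnings = FALSE)
write.table(data.frame(reads = c(232L, 555L), taxReads = c(2323L, 12L),
                       kmers = c(23234L, 4545L), species = c("Bacteria", "Virus")),
            file.path(tmp, "df1.tsv"), sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(reads = c(12L, 932L), taxReads = c(23L, 1213L),
                       kmers = c(56L, 12L), species = c("Bacteria", "Virus")),
            file.path(tmp, "df2.tsv"), sep = "\t", row.names = FALSE, quote = FALSE)

# Read every .tsv file (read.delim defaults to header = TRUE, sep = "\t").
file_list <- list.files(tmp, pattern = "\\.tsv$", full.names = TRUE)
dfs_from_files <- lapply(file_list, read.delim)

# Name each frame after its file: basename with the extension removed.
names(dfs_from_files) <- tools::file_path_sans_ext(basename(file_list))
names(dfs_from_files)
```

With real data, point list.files at the directory holding the files and skip the demo setup.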
From here, two steps:
Rename the kmers columns to the filename (sans extension), and filter out unneeded columns:
dfs <- Map(function(x, nm) {
  names(x)[names(x) == "kmers"] <- nm
  x[, c("species", nm)]
}, dfs, names(dfs))
dfs
# $df1
# species df1
# 1 Bacteria 23234
# 2 Virus 4545
# $df2
# species df2
# 1 Bacteria 56
# 2 Virus 12
Reduce with merge:
Reduce(function(d1, d2) merge(d1, d2, by = "species", all = TRUE), dfs)
# species df1 df2
# 1 Bacteria 23234 56
# 2 Virus 4545 12
This could be code-golfed here to just Reduce(merge, dfs), but I broke it out with a two-argument anonymous function so that you can control some of merge's options.
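One option worth controlling is all = TRUE, which matters when a species is missing from some of the files. A small sketch with made-up data (the frames a and b and the Archaea row are illustrative, not from the question):

```r
# Hypothetical frames, already reduced to species plus one kmers column each.
a <- data.frame(species = c("Bacteria", "Virus"), df1 = c(23234L, 4545L))
b <- data.frame(species = c("Bacteria", "Archaea"), df2 = c(56L, 7L))

# all = TRUE is a full outer join: unmatched species are kept and padded with NA.
full <- merge(a, b, by = "species", all = TRUE)

# The default (all = FALSE) is an inner join: only species present in both remain.
inner <- merge(a, b, by = "species")

nrow(full)   # 3 rows: Archaea, Bacteria, Virus
nrow(inner)  # 1 row: Bacteria
```

Note that a bare Reduce(merge, dfs) uses the inner-join default, so it would silently drop any species that does not appear in every file.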