Home > Software engineering >  join data frames for specific column
join data frames for specific column

Time:10-23

I have several data frame with format like below. I want to join/merge the data frames by species and extracting kmers from all data frames such that the out contains one column with species and multiple column with kmers, one form each of the files. The kmers column will then be give the name of the file from which it originated. df1

reads taxReads kmers species
232 2323 23234 Bacteria
555 12 4545 Virus

df2

reads taxReads kmers species
12 23 56 Bacteria
932 1213 12 Virus

out

species df1 df2
Bacteria 23234 56
Virus 4545 12

I have tried making a script using join_all, but it does not select the correct column (kmers):

file_list = list.files(pattern="tsv$")    

datalist = lapply(file_list, function(x){
  dat = read.csv(file=x, header=T, sep = "\t")
  names(dat)[2] = x
  return(dat)
})
joined <- join_all(dfs = datalist,by = "species",type ="full" )  

CodePudding user response:

I'll assume that you've read in the files into a list of frames, named by the basename of the file (with the extension removed). Naming the list-of-frames as dfs, we have

dfs <- list(df1 = structure(list(reads = c(232L, 555L), taxReads = c(2323L, 12L), kmers = c(23234L, 4545L), species = c("Bacteria", "Virus")), class = "data.frame", row.names = c(NA, -2L)), df2 = structure(list(reads = c(12L, 932L), taxReads = c(23L, 1213L), kmers = c(56L,12L), species = c("Bacteria", "Virus")), class = "data.frame", row.names = c(NA, -2L)))

dfs
# $df1
#   reads taxReads kmers  species
# 1   232     2323 23234 Bacteria
# 2   555       12  4545    Virus
# $df2
#   reads taxReads kmers  species
# 1    12       23    56 Bacteria
# 2   932     1213    12    Virus

From here, two steps:

  1. Rename the kmers columns to the filename (sans extension), and filter out unneeded columns,

    dfs <- Map(function(x, nm) { names(x)[names(x) == "kmers"] <- nm; x[, c("species", nm)]; }, dfs, names(dfs))
    dfs
    # $df1
    #    species   df1
    # 1 Bacteria 23234
    # 2    Virus  4545
    # $df2
    #    species df2
    # 1 Bacteria  56
    # 2    Virus  12
    
  2. Reduce with merge.

    Reduce(function(d1, d2) merge(d1, d2, by = "species", all = TRUE), dfs)
    #    species   df1 df2
    # 1 Bacteria 23234  56
    # 2    Virus  4545  12
    

    This could be code-golfed here with just Reduce(merge, dfs), but I broke it out with a two-arg anon-func so that you can control some of merge's options.

  • Related