Matching samples in R

Time:10-27

I made up two small data frames to explain my question; my real dataset is much bigger.

gene <- c("a", "b", "c", "a", "b", "c", "a", "b", "c")
sample <- c("a", "a", "a", "b", "b", "b", "c", "c", "c")
expression <- c("5", "6", "8", "3", "5", "7", "7", "8", "9")
df1 <- data.frame(gene, sample, expression)
df1

  gene sample expression
1    a      a          5
2    b      a          6
3    c      a          8
4    a      b          3
5    b      b          5
6    c      b          7
7    a      c          7
8    b      c          8
9    c      c          9

and

gene2 <- c("a", "b", "c", "a", "b", "c", "a", "b", "c")
sample2 <- c("1", "1", "1", "2", "2", "2", "3", "3", "3")
expression2 <- c("5.4", "6.3", "8", "3.2", "5.4", "7.2", "7.1", "8.2", "9.4")
df2 <- data.frame(gene2, sample2, expression2)
df2

  gene2 sample2 expression2
1     a       1         5.4
2     b       1         6.3
3     c       1           8
4     a       2         3.2
5     b       2         5.4
6     c       2         7.2
7     a       3         7.1
8     b       3         8.2
9     c       3         9.4

So I have two data frames with different sample identifiers, but the expression data should be roughly the same. For each sample, I want to find the closest-matching expression values and report the corresponding sample identifier from the other data frame. The result could look something like this:

  gene sample sample2 expression expression2
1    a      a       1          5         5.4
2    b      a       1          6         6.3
3    c      a       1          8           8
4    a      b       2          3         3.2
5    b      b       2          5         5.4
6    c      b       2          7         7.2
7    a      c       3          7         7.1
8    b      c       3          8         8.2
9    c      c       3          9         9.4

I think a rolling join might work, but I'm kind of lost on how to do it.

CodePudding user response:

You can do a rolling join with data.table:

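library(data.table)

# df1 and df2 are the two data frames from the question.
# Convert df1 to a data.table by reference and make expression numeric.
setDT(df1)[, expression := as.numeric(expression)]

# Add the join columns to df2. This assumes sample2 values 1, 2, 3 correspond,
# in order, to unique(df1$sample), i.e. the sample mapping is already known.
setDT(df2)[, `:=`(sample = unique(df1$sample)[as.numeric(sample2)],
                  gene = gene2,
                  expression = as.numeric(expression2))]

# For each row of df1, take the df2 row with the same gene and sample
# whose expression is nearest.
df <- df2[df1, on = .(gene, sample, expression), roll = "nearest"][, gene2 := NULL][]
setcolorder(df, rev(seq_along(df)))  # reverse the column order
df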

#    gene expression sample expression2 sample2
# 1:    a          5      a         5.4       1
# 2:    b          6      a         6.3       1
# 3:    c          8      a           8       1
# 4:    a          3      b         3.2       2
# 5:    b          5      b         5.4       2
# 6:    c          7      b         7.2       2
# 7:    a          7      c         7.1       3
# 8:    b          8      c         8.2       3
# 9:    c          9      c         9.4       3
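
The join above relies on the sample-to-sample2 mapping already being known (through unique(df1$sample)[as.numeric(sample2)]). If the mapping itself has to be inferred, one possible approach is to compare whole expression profiles per sample. The following is only a sketch in base R, assuming both data frames contain the same genes and every gene is measured in every sample; the names m1, m2, dist_mat and sample_map are made up for illustration, and summed absolute difference is just one reasonable choice of distance.

```r
# Data from the question, with expression stored as numeric.
df1 <- data.frame(
  gene = rep(c("a", "b", "c"), times = 3),
  sample = rep(c("a", "b", "c"), each = 3),
  expression = c(5, 6, 8, 3, 5, 7, 7, 8, 9)
)
df2 <- data.frame(
  gene2 = rep(c("a", "b", "c"), times = 3),
  sample2 = rep(c("1", "2", "3"), each = 3),
  expression2 = c(5.4, 6.3, 8, 3.2, 5.4, 7.2, 7.1, 8.2, 9.4)
)

# Gene-by-sample matrices (rows = genes, columns = samples).
# Assumption: one value per gene/sample pair, so mean() just returns it.
m1 <- tapply(df1$expression, list(df1$gene, df1$sample), mean)
m2 <- tapply(df2$expression2, list(df2$gene2, df2$sample2), mean)
m2 <- m2[rownames(m1), , drop = FALSE]  # align gene order with m1

# Summed absolute difference between every sample and every sample2 profile.
# Rows of dist_mat are samples from df1, columns are samples from df2.
dist_mat <- sapply(colnames(m2), function(s2) colSums(abs(m1 - m2[, s2])))

# For each sample (row), pick the sample2 (column) with the smallest distance.
sample_map <- colnames(dist_mat)[apply(dist_mat, 1, which.min)]
names(sample_map) <- rownames(dist_mat)
sample_map
```

With the example data this maps a to 1, b to 2 and c to 3. The resulting sample_map could then be used to add a sample column to df2 before joining on gene and sample.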

CodePudding user response:

You can use split (to compare gene by gene), outer (to build a distance matrix) and apply (to find, for each row, the column with the minimum value). mapply wraps everything together:

data:

df1 <- data.frame(gene, sample, expression, stringsAsFactors = FALSE)
df2 <- data.frame(gene2, sample2, expression2, stringsAsFactors = FALSE)

df1$expression <- as.numeric(df1$expression)
df2$expression2 <- as.numeric(df2$expression2)

code:

do.call(
  rbind,
  mapply(
    function(x, y){
      j <- apply(
        abs(outer(x$expression, y$expression2, FUN = "-")), 1, which.min
      )
      cbind(x, y[j,])
    },
    split(df1, df1$gene),
    split(df2, df2$gene2),
    SIMPLIFY = FALSE
  )
)
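
To make the outer / which.min step concrete, here is a small sketch for a single gene, using the values of gene "a" from the question. The variable names x, y, d and j are only for illustration.

```r
x <- c(5, 3, 7)        # expression of gene "a" in samples a, b, c
y <- c(5.4, 3.2, 7.1)  # expression2 of gene "a" in samples 1, 2, 3

# d[i, k] is the absolute difference between x[i] and y[k].
d <- abs(outer(x, y, FUN = "-"))

# For each row (sample in df1), the index of the closest value in y.
j <- apply(d, 1, which.min)
j
```

Here j is 1, 2, 3, so each value in x is paired with the value in y at the same position. Note that this matching is done independently for each gene, so in a larger dataset different genes of the same sample could in principle be matched to different sample2 values.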