I made up a data frame to explain my question; my real dataset is much bigger.
gene <- c("a", "b", "c", "a", "b", "c", "a", "b", "c")
sample <- c("a", "a", "a", "b", "b", "b", "c", "c", "c")
expression <- c("5", "6", "8", "3", "5", "7", "7", "8", "9")
df1 <- data.frame(gene, sample, expression)
df1
gene sample expression
1 a a 5
2 b a 6
3 c a 8
4 a b 3
5 b b 5
6 c b 7
7 a c 7
8 b c 8
9 c c 9
and
gene2 <- c("a", "b", "c", "a", "b", "c", "a", "b", "c")
sample2 <- c("1", "1", "1", "2", "2", "2", "3", "3", "3")
expression2 <- c("5.4", "6.3", "8", "3.2", "5.4", "7.2", "7.1", "8.2", "9.4")
df2 <- data.frame(gene2, sample2, expression2)
df2
gene2 sample2 expression2
1 a 1 5.4
2 b 1 6.3
3 c 1 8
4 a 2 3.2
5 b 2 5.4
6 c 2 7.2
7 a 3 7.1
8 b 3 8.2
9 c 3 9.4
So I have two data frames with different sample identifiers, but the expression data should be roughly the same. What I want to do is find, for each sample, the closest matching expression values and report the corresponding sample identifiers. The result could look something like this:
gene sample sample2 expression expression2
1 a a 1 5 5.4
2 b a 1 6 6.3
3 c a 1 8 8
4 a b 2 3 3.2
5 b b 2 5 5.4
6 c b 2 7 7.2
7 a c 3 7 7.1
8 b c 3 8 8.2
9 c c 3 9 9.4
I think a rolling join might work, but I'm kind of lost on this.
CodePudding user response:
You can do a rolling join with data.table:
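library(data.table)
# assumes df1 and df2 are the two data frames from the question
setDT(df1)[, expression := as.numeric(expression)]
# map sample2 (1, 2, 3) onto df1's sample labels by position and add numeric join columns
setDT(df2)[, ":="(sample = unique(df1$sample)[as.numeric(sample2)],
                  gene = gene2,
                  expression = as.numeric(expression2))]
# for each row of df1, roll to the df2 row with the same gene and sample and the nearest expression
df <- df2[df1, on = .(gene, sample, expression), roll = "nearest"][, gene2 := NULL][]
setcolorder(df, rev(seq_along(df)))
df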
# gene expression sample expression2 sample2
# 1: a 5 a 5.4 1
# 2: b 6 a 6.3 1
# 3: c 8 a 8 1
# 4: a 3 b 3.2 2
# 5: b 5 b 5.4 2
# 6: c 7 b 7.2 2
# 7: a 7 c 7.1 3
# 8: b 8 c 8.2 3
# 9: c 9 c 9.4 3
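The join above relies on sample2 values 1, 2, 3 lining up with samples a, b, c by position. If that correspondence is not known in advance, a minimal sketch is to roll on gene and expression only and let the nearest value pick the sample2. The names dt1, dt2 and matched are made up for this illustration, and the data is retyped from the question as numeric:

```r
library(data.table)

# example data from the question, with expression stored as numeric
dt1 <- data.table(
  gene = rep(c("a", "b", "c"), times = 3),
  sample = rep(c("a", "b", "c"), each = 3),
  expression = c(5, 6, 8, 3, 5, 7, 7, 8, 9)
)
dt2 <- data.table(
  gene = rep(c("a", "b", "c"), times = 3),
  sample2 = rep(c("1", "2", "3"), each = 3),
  expression2 = c(5.4, 6.3, 8, 3.2, 5.4, 7.2, 7.1, 8.2, 9.4)
)

# copy expression2 into a column named like dt1's, so the original
# value is still available after the join
dt2[, expression := expression2]

# for each row of dt1: same gene, nearest expression in dt2
matched <- dt2[dt1, on = .(gene, expression), roll = "nearest"]
matched[, .(gene, sample, sample2, expression, expression2)]
```

On this toy data every row of dt1 is paired with the sample2 shown in the desired output. On real data, two samples with similar values for a gene can be matched to the same sample2, so the per-gene matches may need to be summarised per sample afterwards.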
CodePudding user response:
You can use split (to compare genes), outer (to create a distance matrix) and apply (to find, for each row, the column with the minimum value). Using mapply you can wrap everything together:
data:
df1 <- data.frame(gene, sample, expression, stringsAsFactors = FALSE)
df2 <- data.frame(gene2, sample2, expression2, stringsAsFactors = FALSE)
df1$expression <- as.numeric(df1$expression)
df2$expression2 <- as.numeric(df2$expression2)
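Before the full code, the distance-matrix step can be seen on a single gene. This is only an illustrative sketch: x and y are hypothetical named vectors holding the gene "a" values from the question, with the sample identifiers as names.

```r
# expression of gene "a" in each data frame (values from the question)
x <- c(a = 5, b = 3, c = 7)               # names: sample
y <- c("1" = 5.4, "2" = 3.2, "3" = 7.1)   # names: sample2

# rows: samples of x, columns: samples of y, cells: absolute difference
d <- abs(outer(x, y, FUN = "-"))

# for each row, the index of the column with the smallest difference
j <- apply(d, 1, which.min)
names(y)[j]
# "1" "2" "3"
```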
code:
do.call(
  rbind,
  mapply(
    function(x, y){
      j <- apply(
        abs(outer(x$expression, y$expression2, FUN = "-")), 1, which.min
      )
      cbind(x, y[j,])
    },
    split(df1, df1$gene),
    split(df2, df2$gene2),
    SIMPLIFY = FALSE
  )
)
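The code above matches one gene at a time. Since the question asks for the closest match per sample, one possible variation is to compare whole sample profiles across all genes. The sketch below is base R only; it assumes every gene is measured exactly once in every sample, and the names m1, m2, dist_mat and mapping are made up for this example.

```r
# example data from the question, with expression stored as numeric
df1 <- data.frame(
  gene = rep(c("a", "b", "c"), times = 3),
  sample = rep(c("a", "b", "c"), each = 3),
  expression = c(5, 6, 8, 3, 5, 7, 7, 8, 9)
)
df2 <- data.frame(
  gene2 = rep(c("a", "b", "c"), times = 3),
  sample2 = rep(c("1", "2", "3"), each = 3),
  expression2 = c(5.4, 6.3, 8, 3.2, 5.4, 7.2, 7.1, 8.2, 9.4)
)

# gene x sample matrices (rows: genes, columns: samples)
m1 <- tapply(df1$expression, list(df1$gene, df1$sample), mean)
m2 <- tapply(df2$expression2, list(df2$gene2, df2$sample2), mean)
m2 <- m2[rownames(m1), , drop = FALSE]  # align gene order with m1

# summed absolute difference between each df1 sample (rows)
# and each df2 sample (columns)
dist_mat <- sapply(colnames(m2), function(s2) colSums(abs(m1 - m2[, s2])))

# for each df1 sample, the df2 sample with the smallest total difference
mapping <- data.frame(
  sample = rownames(dist_mat),
  sample2 = colnames(dist_mat)[apply(dist_mat, 1, which.min)]
)
mapping
```

On the example data this pairs a with 1, b with 2 and c with 3. The mapping can then be merged back onto df1 and df2 to get the wide table from the question. The summed absolute difference is just one choice of distance; a correlation-based measure may suit real expression data better.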