I have a data frame with many rows (>9000) and columns (148). The first column has a unique code for experiments, the other columns are populated with the name of the clones tested in the experiment. I want a matrix that has the number of common clones among each experiment (pairwise).
Example of my dataset:
Exp_No Clone1 Clone2 Clone3 Clone4
Exp1 Egxn2 Egxn11 Egxn6 Egxn13
Exp2 Egxn4 Egxn13 Egxn16 Egxn6
Exp3 Egxn2 Egxn6 Egxn11 Egxn18
Exp4 Egxn6 Egxn14 Egxn4 Egxn18
Exp5 Egxn2 Egxn11 Egxn6 Egxn13
Exp6 Egxn4 Egxn2 Egxn5 Egxn18
What I need:
Exp1 Exp2 2
Exp1 Exp3 3
Exp1 Exp4 1
Exp1 Exp5 4
Exp1 Exp6 1
Exp2 Exp3 1
Exp2 Exp4 2
...
and so on for all pairs of rows. Any suggestion? Thank you in advance, been on this for a few hours! I could not find a way to resolve this.
CodePudding user response:
You can stack all of the Clone
columns and then merge it to itself by the names of the clones.
Since dplyr 1.1.0
library(dplyr)
df_long <- df %>%
tidyr::pivot_longer(contains('Clone'), names_to = NULL)
df_long %>%
inner_join(df_long, by = join_by(value, y$Exp_No > x$Exp_No)) %>%
count(Exp_No.x, Exp_No.y)
# # A tibble: 15 × 3
# Exp_No.x Exp_No.y n
# <chr> <chr> <int>
# 1 Exp1 Exp2 2
# 2 Exp1 Exp3 3
# 3 Exp1 Exp4 1
# 4 Exp1 Exp5 4
# 5 Exp1 Exp6 1
# 6 Exp2 Exp3 1
# 7 Exp2 Exp4 2
# 8 Exp2 Exp5 2
# 9 Exp2 Exp6 1
# 10 Exp3 Exp4 2
# 11 Exp3 Exp5 3
# 12 Exp3 Exp6 2
# 13 Exp4 Exp5 1
# 14 Exp4 Exp6 2
# 15 Exp5 Exp6 1
dplyr 1.0.0
or older
df_long %>%
inner_join(df_long, by = "value") %>%
filter(Exp_No.y > Exp_No.x) %>%
count(Exp_No.x, Exp_No.y)
Data
df <- read.table(text = "
Exp_No Clone1 Clone2 Clone3 Clone4
Exp1 Egxn2 Egxn11 Egxn6 Egxn13
Exp2 Egxn4 Egxn13 Egxn16 Egxn6
Exp3 Egxn2 Egxn6 Egxn11 Egxn18
Exp4 Egxn6 Egxn14 Egxn4 Egxn18
Exp5 Egxn2 Egxn11 Egxn6 Egxn13
Exp6 Egxn4 Egxn2 Egxn5 Egxn18", header = TRUE)
CodePudding user response:
The result will have a very large number of rows (tens of millions). For such a large dataset, tcrossprod
on a sparse matrix should give good performance. Demonstrating with 10K rows and 148 columns:
library(Matrix) # for sparse matrices
library(data.table) # final solution will be stored as a data.table
m <- as(
triu( # get the upper triangle of the symmetric matrix
tcrossprod( # tcrossprod to get the pairwise common clone counts
# convert the data.frame to a sparse matrix with nrow(df) rows and
# length(unique(unlist(df[,-1]))) columns (the number of unique clones
# in the dataset)
sparseMatrix(
rep(1:1e4, 148),
as.numeric(gsub("Egxn", "", unlist(df[,-1], 0, 0))),
x = 1L
)
), k = 1 # don't keep the diagonal (comparing rows with themselves)
), "dgTMatrix" # set the result as a triangular matrix
)
# build the final answer
dtPairs <- setorder(
data.table(
# the Exp1 and Exp2 columns are row indices from df
# sparse matrix indices are zero-based, so add one
Exp1 = attr(m, "i") 1L,
Exp2 = attr(m, "j") 1L,
Common = attr(m, "x")
), Exp1, Exp2 # sort by Exp1 then by Exp2
)
dtPairs[1:10,]
#> Exp1 Exp2 Common
#> 1: 1 2 21
#> 2: 1 3 29
#> 3: 1 4 24
#> 4: 1 5 25
#> 5: 1 6 32
#> 6: 1 7 17
#> 7: 1 8 23
#> 8: 1 9 20
#> 9: 1 10 25
#> 10: 1 11 24
nrow(dtPairs)
#> [1] 49995000
Data:
df <- cbind(
data.frame(Exp_No = paste0("Exp", 1:1e4)),
matrix(
paste0("Egxn", replicate(1e4, sample(1e3, 148))),
1e4, 148, 1, list(NULL, paste0("Clone", 1:148))
)
)