Pairwise count of common values in rows of a data.frame-CodePudding

I have a data frame with many rows (>9000) and columns (148). The first column has a unique code for experiments, the other columns are populated with the name of the clones tested in the experiment. I want a matrix that has the number of common clones among each experiment (pairwise).

Example of my dataset:

Exp_No    Clone1    Clone2   Clone3    Clone4
Exp1      Egxn2     Egxn11   Egxn6     Egxn13
Exp2      Egxn4     Egxn13   Egxn16    Egxn6
Exp3      Egxn2     Egxn6    Egxn11    Egxn18
Exp4      Egxn6     Egxn14   Egxn4     Egxn18
Exp5      Egxn2     Egxn11   Egxn6     Egxn13
Exp6      Egxn4     Egxn2    Egxn5     Egxn18

What I need:

Exp1  Exp2  2
Exp1  Exp3  3
Exp1  Exp4  1    
Exp1  Exp5  4
Exp1  Exp6  1
Exp2  Exp3  1
Exp2  Exp4  2
...

and so on for all pairs of rows. Any suggestion? Thank you in advance, been on this for a few hours! I could not find a way to resolve this.

CodePudding user response：

You can stack all of the Clone columns and then merge it to itself by the names of the clones.

Since `dplyr 1.1.0`

library(dplyr)

df_long <- df %>%
  tidyr::pivot_longer(contains('Clone'), names_to = NULL)

df_long %>%
  inner_join(df_long, by = join_by(value, y$Exp_No > x$Exp_No)) %>%
  count(Exp_No.x, Exp_No.y)

# # A tibble: 15 × 3
#    Exp_No.x Exp_No.y     n
#    <chr>    <chr>    <int>
#  1 Exp1     Exp2         2
#  2 Exp1     Exp3         3
#  3 Exp1     Exp4         1
#  4 Exp1     Exp5         4
#  5 Exp1     Exp6         1
#  6 Exp2     Exp3         1
#  7 Exp2     Exp4         2
#  8 Exp2     Exp5         2
#  9 Exp2     Exp6         1
# 10 Exp3     Exp4         2
# 11 Exp3     Exp5         3
# 12 Exp3     Exp6         2
# 13 Exp4     Exp5         1
# 14 Exp4     Exp6         2
# 15 Exp5     Exp6         1

`dplyr 1.0.0` or older

df_long %>%
  inner_join(df_long, by = "value") %>%
  filter(Exp_No.y > Exp_No.x) %>%
  count(Exp_No.x, Exp_No.y)

Data

df <- read.table(text = "
Exp_No    Clone1    Clone2   Clone3    Clone4
Exp1      Egxn2     Egxn11   Egxn6     Egxn13
Exp2      Egxn4     Egxn13   Egxn16    Egxn6
Exp3      Egxn2     Egxn6    Egxn11    Egxn18
Exp4      Egxn6     Egxn14   Egxn4     Egxn18
Exp5      Egxn2     Egxn11   Egxn6     Egxn13
Exp6      Egxn4     Egxn2    Egxn5     Egxn18", header = TRUE)

CodePudding user response：

The result will have a very large number of rows (tens of millions). For such a large dataset, tcrossprod on a sparse matrix should give good performance. Demonstrating with 10K rows and 148 columns:

library(Matrix) # for sparse matrices
library(data.table) # final solution will be stored as a data.table

m <- as(
  triu( # get the upper triangle of the symmetric matrix
    tcrossprod( # tcrossprod to get the pairwise common clone counts
      # convert the data.frame to a sparse matrix with nrow(df) rows and
      # length(unique(unlist(df[,-1]))) columns (the number of unique clones
      # in the dataset)
      sparseMatrix(
        rep(1:1e4, 148),
        as.numeric(gsub("Egxn", "", unlist(df[,-1], 0, 0))),
        x = 1L
      )
    ), k = 1 # don't keep the diagonal (comparing rows with themselves)
  ), "dgTMatrix" # set the result as a triangular matrix
)

# build the final answer
dtPairs <- setorder(
  data.table(
    # the Exp1 and Exp2 columns are row indices from df
    # sparse matrix indices are zero-based, so add one
    Exp1 = attr(m, "i")   1L,
    Exp2 = attr(m, "j")   1L,
    Common = attr(m, "x")
  ), Exp1, Exp2 # sort by Exp1 then by Exp2
)

dtPairs[1:10,]
#>     Exp1 Exp2 Common
#>  1:    1    2     21
#>  2:    1    3     29
#>  3:    1    4     24
#>  4:    1    5     25
#>  5:    1    6     32
#>  6:    1    7     17
#>  7:    1    8     23
#>  8:    1    9     20
#>  9:    1   10     25
#> 10:    1   11     24
nrow(dtPairs)
#> [1] 49995000

Data:

df <- cbind(
  data.frame(Exp_No = paste0("Exp", 1:1e4)),
  matrix(
    paste0("Egxn", replicate(1e4, sample(1e3, 148))),
    1e4, 148, 1, list(NULL, paste0("Clone", 1:148))
  )
)

Since dplyr 1.1.0

dplyr 1.0.0 or older

Data

Since `dplyr 1.1.0`

`dplyr 1.0.0` or older