Sampling submatrices in R-CodePudding

Given a matrix like mat

> set.seed(1)
> mat <- matrix(rbinom(100,1,0.5),10,10)
> rownames(mat) <- paste0(sample(LETTERS[1:2],10,replace=T),c(1:nrow(mat)))
> colnames(mat) <- paste0(sample(LETTERS[1:2],10,replace=T),c(1:ncol(mat)))
> mat
    A1 A2 A3 B4 B5 B6 B7 A8 B9 B10
B1   0  0  1  0  1  0  1  0  0   0
B2   0  0  0  1  1  1  0  1  1   0
B3   1  1  1  0  1  0  0  0  0   1
A4   1  0  0  0  1  0  0  0  0   1
A5   0  1  0  1  1  0  1  0  1   1
A6   1  0  0  1  1  0  0  1  0   1
A7   1  1  0  1  0  0  0  1  1   0
B8   1  1  0  0  0  1  1  0  0   0
A9   1  0  1  1  1  1  0  1  0   1
A10  0  1  0  0  1  0  1  1  0   1

I want to sample submatrices of the form:

   A  B
A  0  1
B  1  0

EDIT: Specifically, the submatrix should contain a 1 in the A-row/B-column, a 1 in the B-row/A-column, and 0s in the other two cells.

I can do this by randomly selecting one A-row and one B-row, then randomly selecting one A-column and one B-column, then checking whether it has the desired pattern. But, I'm trying to find a more efficient method that will work even in large/sparse matrices. Thanks!

CodePudding user response：

You can sample on the dimension names:

set.seed(1)
dims <- lapply(dimnames(mat), \(x) c(sample(which(grepl("A", x)), 1), sample(which(grepl("B", x)), 1)))

mat[dims[[1]], dims[[2]]]

   A3 B4
A4  0  0
B8  0  0

CodePudding user response：

A solution that first finds all matching submatrices, then samples on them.

library(Matrix)

set.seed(1)
m <- as(matrix(rbinom(100,1,0.5),10,10),"sparseMatrix")
m2 <- as(m[-1, -ncol(m)]*m[-nrow(m), -1]*!(m[-1,-1] | m[-nrow(m), -ncol(m)]), "dgTMatrix")
length(m2@i) # number of matching submatrices
#> [1] 7
# sample matching submatrices
idx <- sample(length(m2@i), 5)
# put the row, column indices of the top left of the submatrices into a
# data.frame for further processing
(topleft <- data.frame(i = m2@i[idx], j = m2@j[idx]))   1L
#>   i j
#> 1 1 6
#> 2 6 9
#> 3 5 1
#> 4 7 7
#> 5 5 8

# timing on a 10k square matrix with 1M elements
idx <- sample(1e8, 1e6)
m <- sparseMatrix(i = ((idx - 1) %% 1e4)   1, j = ((idx - 1) %/% 1e4)   1)
system.time({
  m2 <- as(m[-1, -ncol(m)]*m[-nrow(m), -1]*!(m[-1,-1]   m[-nrow(m), -ncol(m)]), "dgTMatrix")
  # sample 1k matching submatrices
  idx <- sample(length(m2@i), 1e3)
  # put the row, column indices of the top left of the submatrices into a
  # data.frame for further processing
  topleft <- data.frame(i = m2@i[idx], j = m2@j[idx])   1L
})
#>    user  system elapsed 
#>    3.23    1.17    4.43
length(m2@i) # number of matching submatrices
#> [1] 9841