Home > Mobile >  Efficient subsetting of each column of a matrix with a column of indices of another matrix
Efficient subsetting of each column of a matrix with a column of indices of another matrix

Time:10-22

I am currently facing this problem while working on R.

I have a big matrix, say A, of size (M x K) and a smaller matrix, say B, of size (N x K), with of course N < M. So B has the same number of columns as A, but less rows.

A possible example can be:

A <- replicate(K, rnorm(M))
B <- replicate(K, sample(M, N))

My goal is to subset each column of A with the correspondent column of B in an efficent way, since the dimension of the matrices is huge.

A possible numerical example is:

A <- [1 2 3     B <- [1 1 3   --->   [1 2 9
      4 5 6           3 2 2]          7 5 6]
      7 8 9]          

This means I am looking for something better than parsing column-by-column using a simple for loop:

 for(i in 1:K)
 { 
   A[B[,i],i] 
 }

Thank you for helping me to solve this apparently trivial, but for me challenging problem.

CodePudding user response:

See this question and ?'[':

When indexing arrays by [ a single argument i can be a matrix with as many columns as there are dimensions of x; the result is then a vector with elements corresponding to the sets of indices in each row of i.

For your example, it would be:

A <- matrix(1:9, 3, byrow = TRUE)
B <- matrix(c(1, 1, 3, 3, 2, 2), 2, byrow = TRUE)
C <- matrix(A[cbind(c(B), rep(1:ncol(B), each = nrow(B)))], nrow(B))

UPDATE: For efficiency, Mossa is right. There's not going to be much of a difference in performance in this case between using a matrix of indices and looping. Almost all of your time is in generating A. Generating B even takes longer than generating C:

M <- K <- 1e4; N <- 1e3

> system.time(A <- replicate(K, rnorm(M)))
   user  system elapsed 
  7.110   0.922   8.030 
> system.time(B <- replicate(K, sample(M, N)))
   user  system elapsed 
  0.727   0.033   0.760 
> system.time(C <- matrix(A[cbind(c(B), rep(1:ncol(B), each = nrow(B)))], nrow(B)))
   user  system elapsed 
  0.474   0.186   0.659

You can speed up the generation of A slightly by sampling all in one go, but you don't have much room for improvement in any of the A, B, C operations:

> system.time(A <- matrix(rnorm(M*K), M))
   user  system elapsed 
   6.82    0.74    7.56

CodePudding user response:

I believe that you've arrived at a pretty reasonable method:

    C <- matrix(NA, nrow = N, ncol = K)
    for (i in seq_len(K))
    {
      C[, i] <- A[B[, i], i]
    }
    C
  • Related