I have a large sparse Matrix. Based on the number of elements in a row, I want to remove the row of the sparse Matrix. The first step is to obtain the indexes of the row that has less that 5 elements.
number of rows = 4,000,000 number of columns = 250,000
The sparse matrix looks somewhat like this...
A header | C1 | C2 | C3 | C4 | C5 | ... | n |
---|---|---|---|---|---|---|---|
First | . | 1 | 1 | 1 | 1 | . | . |
Second | . | . | 1 | 1 | . | . | . |
Third | 1 | . | . | . | . | . | . |
Fourth | 1 | . | . | . | 1 | . | . |
nth | 1 | . | . | . | . | . | . |
I could use rowSums, and remove the row based on the output.
items <- c()
for(i in 1: nrow(sparse_matrix){
if(rowSums(sparse_matrix)[i] < 3){
items <- append(items, i)
}
}
However, this takes 1hr 30mins to go through around 10,000 rows which is really slow.
What would be an efficient solution to this?
CodePudding user response:
First, you calculate rowSums(sparse_matrix)
at every iteration of the loop, which is inefficient. Second, doing append(items, i)
in a loop is also inefficient.
My solution without a loop at all:
items = which(rowSums(sparse_matrix) < 3)
CodePudding user response:
Following up @AndreyShabalin's answer with an example:
example
Set up a matrix with the specified dimensions and 1 million non-zero elements:
nr <- 4e6; nc <- 2.5e4; ns <- 1e6
m <- Matrix(0, nrow = nr, ncol = nc)
set.seed(101)
i <- sample(nr, size = ns, replace = TRUE)
j <- sample(nc, size = ns, replace = TRUE)
m[cbind(i,j)] <- 1
subset
system.time(m2 <- m[rowSums(m) > 3, ]) ## 0.03 seconds
nrow(m2) ## 543
By avoiding (1) recomputing rowSums()
and (2) not growing a vector, we make this very fast.
library(microbenchmark)
microbenchmark(
logical = m[rowSums(m) > 3, ],
which = m[which(rowSums(m) > 3), ]
)
For reasons I don't understand, the solution with which
is actually slightly faster than the logical-indexing solution (median of 30 vs 35 milliseconds ...)
Unit: milliseconds
expr min lq mean median uq max neval cld
logical 33.13682 35.07974 45.65183 35.61339 37.09483 160.3382 100 b
which 27.90955 29.90988 37.06513 30.38864 31.76437 153.8702 100 a