How to modify non-zero elements of a large sparse matrix based on a second sparse matrix in R-CodePudding

I have two large sparse matrices (about 41,000 x 55,000 in size). The density of nonzero elements is around 10%. They both have the same row index and column index for nonzero elements.

I now want to modify the values in the first sparse matrix if values in the second matrix are below a certain threshold.

library(Matrix)

# Generating the example matrices.
set.seed(42)

# Rows with values.
i <- sample(1:41000, 227000000, replace = TRUE)
# Columns with values.
j <- sample(1:55000, 227000000, replace = TRUE)

# Values for the first matrix.
x1 <- runif(227000000)
# Values for the second matrix.
x2 <- sample(1:3, 227000000, replace = TRUE)

# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)

I now get the rows, columns and values from the first matrix in a new matrix. This way, I can simply subset them and only the ones I am interested in remain.

# Getting the positions and values from the matrices.
position_matrix_from_m1 <- rbind(i = m1@i, j = summary(m1)$j, x = m1@x)
position_matrix_from_m2 <- rbind(i = m2@i, j = summary(m2)$j, x = m2@x)

# Subsetting to get the elements of interest.
position_matrix_from_m1 <- position_matrix_from_m1[,position_matrix_from_m1[3,] > 0 & position_matrix_from_m1[3,] < 0.05]

# We add 1 to the values, since the sparse matrix is 0-based.
position_matrix_from_m1[1,] <- position_matrix_from_m1[1,]   1
position_matrix_from_m1[2,] <- position_matrix_from_m1[2,]   1

Now I am getting into trouble. Overwriting the values in the second matrix takes too long. I let it run for several hours and it did not finish.

# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0

I thought about pasting the row and column information together. Then I have a unique identifier for each value. This also takes too long and is probably just very bad practice.

# We would get the unique identifiers after the subsetting.
m1_identifiers <- paste0(position_matrix_from_m1[1,], "_", position_matrix_from_m1[2,])
m2_identifiers <- paste0(position_matrix_from_m2[1,], "_", position_matrix_from_m2[2,])

# Now, I could use which and get the position of the values I want to change.
# This also uses to much memory. 
m2_identifiers_of_interest <- which(m2_identifiers %in% m1_identifiers)

# Then I would modify the x values in the position_matrix_from_m2 matrix and overwrite m2@x in the sparse matrix object.

Is there a fundamental error in my approach? What should I do to run this efficiently?

CodePudding user response：

Is there a fundamental error in my approach?

Yes. Here it is.

# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0

Syntax as mat[rn, cn] (whether mat is a dense or sparse matrix) is selecting all rows in rn and all columns in cn. So you get a length(rn) x length(cn) matrix. Here is a small example:

A <- matrix(1:9, 3, 3)
#     [,1] [,2] [,3]
#[1,]    1    4    7
#[2,]    2    5    8
#[3,]    3    6    9
rn <- 1:2
cn <- 2:3
A[rn, cn]
#     [,1] [,2]
#[1,]    4    7
#[2,]    5    8

What you intend to do is to select (rc[1], cn[1]), (rc[2], cn[2]) ..., only. The correct syntax is then mat[cbind(rn, cn)]. Here is a demo:

A[cbind(rn, cn)]
#[1] 4 8

So you need to fix your code to:

m2[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 1
m1[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 0

Oh wait... Based on your construction of position_matrix_from_m1, this is just

ij <- t(position_matrix_from_m1[1:2, ])
m2[ij] <- 1
m1[ij] <- 0

Now, let me explain how you can do better. You have underused summary(). It returns a 3-column data frame, giving (i, j, x) triplet, where both i and j are index starting from 1. You could have worked with this nice output directly, as follows:

# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# you never seem to use `position_matrix_from_m2` so I skip it

# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)

Now you can do:

ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2[ij] <- 1
m1[ij] <- 0

Is there a even better solution? Yes! Note that nonzero elements in m1 and m2 are located in the same positions. So basically, you just need to change m2@x according to m1@x.

ind <- m1@x > 0 & m1@x < 0.05
m2@x[ind] <- 1
m1@x[ind] <- 0

A complete R session

I don't have enough RAM to create your large matrix, so I reduced your problem size a little bit for testing. Everything worked smoothly.

library(Matrix)
# Generating the example matrices.
set.seed(42)

## reduce problem size to what my laptop can bear with
squeeze <- 0.1

# Rows with values.
i <- sample(1:(41000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Columns with values.
j <- sample(1:(55000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)

# Values for the first matrix.
x1 <- runif(227000000 * squeeze ^ 2)
# Values for the second matrix.
x2 <- sample(1:3, 227000000 * squeeze ^ 2, replace = TRUE)

# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)

## give me more usable RAM
rm(i, j, x1, x2)

##
## fix to your code
##

m1a <- m1
m2a <- m2

# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)

# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)

ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2a[ij] <- 1
m1a[ij] <- 0

##
## the best solution
##

m1b <- m1
m2b <- m2

ind <- m1@x > 0 & m1@x < 0.05
m2b@x[ind] <- 1
m1b@x[ind] <- 0

##
## they are identical
##
all.equal(m1a, m1b)
#[1] TRUE
all.equal(m2a, m2b)
#[1] TRUE

Caveat:

I know that some people may propose

m1c <- m1
m2c <- m2

logi <- m1 > 0 & m1 < 0.05
m2c[logi] <- 1
m1c[logi] <- 0

It looks completely natural in R's syntax. But trust me, it is extremely slow for large matrices.