Convert to matrix but keep one diagonal to NULL in R

Time:03-02

I have a huge dataset that looks like this. To save some memory, I want to calculate the pairwise distances but leave the upper triangle of the matrix as NULL.

library(tidyverse)
library(stringdist)
#> 
#> Attaching package: 'stringdist'
#> The following object is masked from 'package:tidyr':
#> 
#>     extract

df3 <- tibble(fruits=c("apple","banana","ananas","apple","ananas","apple","ananas"),
              position=c("135","135","135","136","137","138","138"), 
              counts = c(100,200,100,30,40,50,100))

stringdistmatrix(df3$fruits, method=c("osa"), nthread = 4) %>% 
  as.matrix()
#>   1 2 3 4 5 6 7
#> 1 0 5 5 0 5 0 5
#> 2 5 0 2 5 2 5 2
#> 3 5 2 0 5 0 5 0
#> 4 0 5 5 0 5 0 5
#> 5 5 2 0 5 0 5 0
#> 6 0 5 5 0 5 0 5
#> 7 5 2 0 5 0 5 0

Created on 2022-03-01 by the reprex package (v2.0.1)

However, when I convert my stringdistmatrix to a matrix (this step is essential for me), the upper triangle gets filled with numbers.

Is there any way to convert to a matrix but keep the upper triangle NULL and save memory?

I want my data to look like this:

  1 2 3 4 5 6
2 5          
3 5 2        
4 0 5 5      
5 5 2 0 5    
6 0 5 5 0 5  
7 5 2 0 5 0 5

CodePudding user response:

I think you may need to use sparse matrices, which the Matrix package provides. You can learn more about them at: Sparse matrix

library(Matrix)

m <- sparseMatrix(i = c(1:3, 2:3, 3), j = c(1:3, 1:2, 1), x = 1, triangular = TRUE)

m

#> 3 x 3 sparse Matrix of class "dtCMatrix"
#>           
#> [1,] 1 . .
#> [2,] 1 1 .
#> [3,] 1 1 1
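
Applied to a distance matrix, the same idea could look roughly like the sketch below. It is only an illustration under some assumptions: base R's adist() (Levenshtein distance) stands in for stringdistmatrix(), so the values can differ from the OSA distances in the question, and tril() from Matrix is used to keep only the part strictly below the diagonal. One caveat: a sparse matrix does not store zeros, so a genuine distance of 0 (identical strings) cannot be told apart from an empty upper-triangle cell.

```r
library(Matrix)

# Stand-in data: adist() replaces stringdistmatrix() so that only
# base R and Matrix are needed for this sketch.
fruits <- c("apple", "banana", "ananas", "apple", "ananas", "apple", "ananas")
d <- adist(fruits)
storage.mode(d) <- "double"   # make sure the values are stored as doubles

# Convert to a sparse matrix and keep only the entries below the diagonal.
lt <- tril(Matrix(d, sparse = TRUE), k = -1)
lt
```

Printing lt shows dots in the diagonal and upper triangle, like the small example above.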

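To check the size of the matrices, you can use the object.size function.

For small matrices, using the sparse representation makes no difference (in the first example below it is even slightly larger), but for larger matrices it uses less memory. Note that the test matrix here contains no zeros: the saving comes from Matrix detecting that it is symmetric and storing only one triangle, which works out to about 24% for n = 300: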

library(Matrix)

n <- 30
m1 <- matrix(1,n,n)
m2 <- Matrix(m1, sparse = TRUE) 

object.size(m1)
#> 7416 bytes

object.size(m2)
#> 7432 bytes

n <- 300
m1 <- matrix(1,n,n)
m2 <- Matrix(m1, sparse = TRUE) 

object.size(m1)
#> 720216 bytes

object.size(m2)
#> 544728 bytes
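
For comparison, it may be worth checking whether the "dist" object is enough on its own. When stringdistmatrix() is called with a single vector it returns a "dist" object, which stores only the n(n-1)/2 values below the diagonal and already prints as a lower triangle. The sketch below uses base R's adist() plus as.dist() as a stand-in for stringdistmatrix(), and dist_at() is a hypothetical helper (not part of any package) that looks up a pair without expanding to a full matrix.

```r
# Stand-in data: adist() (Levenshtein) replaces stringdistmatrix() here.
fruits <- c("apple", "banana", "ananas", "apple", "ananas", "apple", "ananas")
full <- adist(fruits)   # full 7 x 7 matrix
d <- as.dist(full)      # keeps only the values below the diagonal

n <- attr(d, "Size")
length(d) == n * (n - 1) / 2   # 21 stored values instead of 49

# Hypothetical helper: distance for the pair (i, j), read directly
# from the column-wise lower-triangle vector inside the "dist" object.
dist_at <- function(d, i, j) {
  if (i == j) return(0)
  n <- attr(d, "Size")
  lo <- min(i, j)
  hi <- max(i, j)
  d[n * (lo - 1) - lo * (lo - 1) / 2 + hi - lo]
}

dist_at(d, 3, 2) == full[3, 2]

# For a larger matrix the "dist" form needs roughly half the memory.
big <- matrix(1, 300, 300)
as.numeric(object.size(as.dist(big))) < as.numeric(object.size(big))
```

Whether this helps depends on why the matrix step is essential; any code that needs a real matrix will still have to expand it.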

CodePudding user response:

It's not really common to have empty values in a matrix, but you can fill the upper triangle (and the diagonal) with NAs:

library(tidyverse)
library(stringdist)

mat <- stringdistmatrix(df3$fruits, method=c("osa"), nthread = 4) %>% 
  as.matrix() 

mat[!lower.tri(mat)] <- NA

output

> mat
   1  2  3  4  5  6  7
1 NA NA NA NA NA NA NA
2  5 NA NA NA NA NA NA
3  5  2 NA NA NA NA NA
4  0  5  5 NA NA NA NA
5  5  2  0  5 NA NA NA
6  0  5  5  0  5 NA NA
7  5  2  0  5  0  5 NA

data

df3 <- tibble(fruits=c("apple","banana","ananas","apple","ananas","apple","ananas"),
              position=c("135","135","135","136","137","138","138"), 
              counts = c(100,200,100,30,40,50,100))
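
One thing to keep in mind with the NA approach: it changes how the matrix looks, not how much memory it takes, because an NA occupies the same space as any other value in a dense matrix. A quick check, as a sketch that uses base R's adist() as a stand-in for stringdistmatrix():

```r
# Stand-in data: adist() (Levenshtein) replaces stringdistmatrix() here.
fruits <- c("apple", "banana", "ananas", "apple", "ananas", "apple", "ananas")
mat <- adist(fruits)

size_before <- object.size(mat)
mat[!lower.tri(mat)] <- NA   # blank out the diagonal and the upper triangle
size_after <- object.size(mat)

# Every cell still occupies a slot, so the footprint is unchanged.
as.numeric(size_before) == as.numeric(size_after)
```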
