how to coerce a data.frame into a sparse matrix in R

Time:02-14

I am trying to follow the cui2vecWorkflow example by creating a matrix similar to term_cooccurrence_matrix.rda, which has the following structure:

> cooc<-get(load('~/development/cui2vec/vignettes/term_cooccurrence_matrix.rda'))
> str(cooc)
Formal class 'dsCMatrix' [package "Matrix"] with 7 slots
  ..@ i       : int [1:2366] 0 1 2 0 1 2 3 4 3 5 ...
  ..@ p       : int [1:101] 0 1 2 3 7 8 10 17 19 27 ...
  ..@ Dim     : int [1:2] 100 100
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:100] "C0016875" "C0162770" "C0024730" "C0038689" ...
  .. ..$ : chr [1:100] "C0016875" "C0162770" "C0024730" "C0038689" ...
  ..@ x       : num [1:2366] 412 6286 8280 118 110 ...
  ..@ uplo    : chr "U"
  ..@ factors : list()

The dataframe I have looks like:

> test
        CUI1     CUI2 Count
1   C0000699 C3894683     2
2   C0000699 C0101725     1
3   C0000699 C1882413     3
4   C0000699 C0245531     3
5   C0000699 C0068475     2
6   C0000699 C0538927     3
7   C0000699 C0724693     1
8   C0000699 C0216784     2
9   C0000699 C2248020     1
10  C0000699 C0069449     3
...

But when I read it in and convert it to a matrix, the result obviously does not have the same structure:

> mat <- as.matrix(test)
> str(mat)
 chr [1:1000000, 1:3] "C0000699" "C0000699" "C0000699" "C0000699" "C0000699" "C0000699" "C0000699" "C0000699" ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:3] "CUI1" "CUI2" "Count" 
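
The character result above follows from how as.matrix treats a data.frame with mixed column types: as soon as one column is non-numeric, every cell is stored as a string. A minimal sketch on a two-row stand-in for test (the name df and its values are illustrative only) shows why a later numeric coercion can only produce NA for the CUI columns:

```r
# Two-row stand-in for `test`; the values are illustrative only
df <- data.frame(
  CUI1  = c("C0000699", "C0000699"),
  CUI2  = c("C3894683", "C0101725"),
  Count = c(2, 1),
  stringsAsFactors = FALSE
)

# One non-numeric column is enough to make the whole matrix character
m <- as.matrix(df)
is.character(m)
dim(m)

# A CUI string cannot be parsed as a number, so it becomes NA
num <- suppressWarnings(as.numeric(m[, "CUI1"]))
num
```

So the problem is not the sparse coercion itself but the shape of the input: the data frame is a list of (row, column, value) triplets, not the matrix the triplets describe.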

I then take the next step and coerce the matrix mat to a sparse matrix:

> mat <- as(mat,  "sparseMatrix")
> str(mat)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:3000000] 0 1 2 3 4 5 6 7 8 9 ...
  ..@ p       : int [1:4] 0 1000000 2000000 3000000
  ..@ Dim     : int [1:2] 1000000 3
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:3] "CUI1" "CUI2" "Count"
  ..@ x       : num [1:3000000] NA NA NA NA NA NA NA NA NA NA ...
  ..@ factors : list()

but the structure looks wrong.

Trying this, I get an error:

> mat <- as(mat,  "dsCMatrix")
Error in asMethod(object) : 
  not a symmetric matrix; consider forceSymmetric() or symmpart()
In addition: Warning message:
In storage.mode(from) <- "double" : NAs introduced by coercion

So I try this:

> mat <- as(forceSymmetric(mat),  "dsCMatrix")
Error in forceSymmetric(mat) : 
  invalid class 'NA' to dup_mMatrix_as_geMatrix

(I haven't been able to find any examples for how to construct a matrix of the class structure("dsCMatrix", package = "Matrix") from a data.frame, so I am just winging it).

It looks like the Dim and Dimnames aren't defined properly, along with the value of x.

CodePudding user response:

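Following user20650's comment, first coerce the CUI* columns to factors with the same levels, then use xtabs to create a sparse matrix, and finally make it symmetric with forceSymmetric. Note that forceSymmetric mirrors only one triangle (the upper one by default), which is enough for the sample data here because every pair falls in the upper triangle.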

txt <- '
        CUI1     CUI2 Count
1   C0000699 C3894683     2
2   C0000699 C0101725     1
3   C0000699 C1882413     3
4   C0000699 C0245531     3
5   C0000699 C0068475     2
6   C0000699 C0538927     3
7   C0000699 C0724693     1
8   C0000699 C0216784     2
9   C0000699 C2248020     1
10  C0000699 C0069449     3
'
test <- read.table(textConnection(txt), header = TRUE)

library(Matrix)

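# Use one shared set of levels so rows and columns cover the same CUIs in the same order
levls <- Reduce(union, test[1:2])
test[1:2] <- lapply(test[1:2], factor, levels = levls)
# Cross-tabulate the counts into a sparse square matrix
res <- xtabs(Count ~ CUI1 + CUI2, data = test, sparse = TRUE)
# Mirror the upper triangle to obtain a symmetric dsCMatrix
res <- forceSymmetric(res)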
class(res)
#> [1] "dsCMatrix"
#> attr(,"package")
#> [1] "Matrix"
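
One caveat worth checking on the full data: forceSymmetric() builds the result from a single triangle (the upper one by default), so any pair that happens to be stored below the diagonal is dropped rather than mirrored. The sketch below uses a made-up 3 x 3 matrix (m and its values are assumptions for illustration) to contrast that with adding the transpose first:

```r
library(Matrix)

# Made-up example: one entry above the diagonal, one below it
m <- sparseMatrix(i = c(1, 3), j = c(2, 1), x = c(4, 7), dims = c(3, 3))

# Default behaviour: only the upper triangle is used, so the (3, 1) entry is lost
s_upper <- forceSymmetric(m)

# Adding the transpose first keeps both entries. Caveat: a pair present in
# both directions is summed, and diagonal entries are doubled.
s_sum <- forceSymmetric(m + t(m))

s_upper[3, 1]
s_sum[3, 1]
```

Which variant is appropriate depends on whether the input lists each unordered pair once or in both directions.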

Created on 2022-02-13 by the reprex package (v2.0.1)
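
As an alternative sketch (not part of the answer above), the same kind of object can be built directly from the triplets with Matrix::sparseMatrix() and symmetric = TRUE. This assumes each row of the data frame is one (CUI1, CUI2, Count) triplet; the three-row data frame and the name cooc are illustrative. Because symmetric = TRUE expects all entries in one triangle, the indices are reordered with pmin()/pmax() first:

```r
library(Matrix)

# Illustrative triplets; the third pair would land below the diagonal as given
test <- data.frame(
  CUI1  = c("C0000699", "C0000699", "C0101725"),
  CUI2  = c("C3894683", "C0101725", "C3894683"),
  Count = c(2, 1, 5),
  stringsAsFactors = FALSE
)

# Shared index space for rows and columns
levls <- union(test$CUI1, test$CUI2)
i <- match(test$CUI1, levls)
j <- match(test$CUI2, levls)

# Put every pair in the upper triangle (row index <= column index)
cooc <- sparseMatrix(
  i = pmin(i, j), j = pmax(i, j), x = test$Count,
  dims = c(length(levls), length(levls)),
  dimnames = list(levls, levls),
  symmetric = TRUE
)
class(cooc)
```

With this approach, repeated pairs are added together by sparseMatrix(), so a pair that appears in both directions ends up as the sum of its counts.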
