Suppose I have a list of lists of data, and a list containing the row number of each data row. How can I convert them to a sparse matrix?
Example:
import numpy as np
data = np.array([[1,2,3],[4,5,6],[7,8,9]])
indices = np.array([0,0,4]) # row number, sum when duplicated
expected output is:
[[5, 7, 9], # row 0: [5,7,9] = [1,2,3] + [4,5,6]
[0, 0, 0],
[0, 0, 0],
[7, 8, 9]] # row 4
I understand that I can construct it using scipy.sparse.csr_matrix with data, and row, col, or indptr, but I have already calculated data and indices. Is there a way to simply construct a sparse matrix using these two? Thanks!
CodePudding user response:
According to the documentation, there is a constructor that utilizes the CSR information directly:
csr_matrix((data, indices, indptr), [shape=(M, N)])
So in your specific case, you could write it like:
from scipy.sparse import csr_matrix
data = np.array([1,2,3,4,5,6,7,8,9])
indices = np.array([0,1,2,0,1,2,0,1,2]) # col numbers
indptr = np.array([0,6,6,6,9]) # row pointers
mat = csr_matrix((data, indices, indptr), shape=(4, 3))
For an example of how the CSR format works, you can take a look at sparse matrices. I will explain the code nonetheless:
First, the data needs to be flattened to a single list. The indices of the CSR format are the column indices, while indptr points to the rows.
An indptr value of 0 at position 0 in the list tells us that the 1st row (position 0) of the matrix starts after 0 data entries. Similarly, a value of 6 at position 1 in the list tells us that the 2nd row (position 1) of the matrix starts after 6 data entries.
The column-indices list behaves as you would expect: data[i] is positioned in column indices[i].
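To make the role of indptr concrete, here is a minimal sketch that rebuilds the dense matrix by hand from the three arrays above, using only NumPy. The names n_rows, n_cols and dense are introduced just for this illustration, and n_cols is taken from the shape=(4, 3) argument.

```python
import numpy as np

# Arrays copied from the answer above.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
indices = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])  # column numbers
indptr = np.array([0, 6, 6, 6, 9])               # row pointers

n_rows = len(indptr) - 1  # one pointer per row, plus a final end marker
n_cols = 3                # from shape=(4, 3)

dense = np.zeros((n_rows, n_cols), dtype=data.dtype)
for r in range(n_rows):
    start, stop = indptr[r], indptr[r + 1]
    # data[start:stop] holds the entries of row r; an empty slice means an empty row.
    # np.add.at accumulates entries that share a column instead of overwriting them.
    np.add.at(dense[r], indices[start:stop], data[start:stop])

print(dense)
```

Row 0 owns data[0:6], so both [1,2,3] and [4,5,6] land in it, while rows 1 and 2 own empty slices.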
CodePudding user response:
In [131]: data = np.array([[1,2,3],[4,5,6],[7,8,9]])
...: indices = np.array([0,0,3]) # row number, sum when duplicated
I corrected the indices for 0-based indexing.
We don't need sparse to sum the duplicates. There's an np.add.at that does this nicely:
In [135]: res = np.zeros((4,3),int)
In [136]: np.add.at(res, indices, data)
In [137]: res
Out[137]:
array([[5, 7, 9],
[0, 0, 0],
[0, 0, 0],
[7, 8, 9]])
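As a side note on why np.add.at is used here rather than plain indexed addition, the following sketch compares the two. The name buffered is made up for the comparison; res is the same array as in the session above.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
indices = np.array([0, 0, 3])

# Indexed += reads res[indices] once, adds, and writes back,
# so a repeated row index keeps only one of its contributions.
buffered = np.zeros((4, 3), dtype=int)
buffered[indices] += data

# np.add.at is unbuffered: every occurrence of an index is accumulated.
res = np.zeros((4, 3), dtype=int)
np.add.at(res, indices, data)

print(buffered[0])  # not the full sum for row 0
print(res[0])       # the full sum for row 0
```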
If we make a csr from that:
In [141]: M = sparse.csr_matrix(res)
In [142]: M
Out[142]:
<4x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
In [143]: M.data
Out[143]: array([5, 7, 9, 7, 8, 9])
In [144]: M.indices
Out[144]: array([0, 1, 2, 0, 1, 2], dtype=int32)
In [145]: M.indptr
Out[145]: array([0, 3, 3, 3, 6], dtype=int32)
To make a csr directly, it's often easier to use the coo style of inputs. They are easier to understand.
Those inputs are three 1-D arrays of the same size:
In [160]: data.ravel()
Out[160]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [161]: row = np.repeat(indices,3)
In [162]: row
Out[162]: array([0, 0, 0, 0, 0, 0, 3, 3, 3])
In [163]: col = np.tile(np.arange(3),3)
In [164]: col
Out[164]: array([0, 1, 2, 0, 1, 2, 0, 1, 2])
In [165]: M1 = sparse.coo_matrix((data.ravel(),(row, col)))
In [166]: M1.data
Out[166]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
The coo format leaves the inputs as given; but on conversion to csr duplicates are summed.
In [168]: M2 = M1.tocsr()
In [169]: M2
Out[169]:
<4x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
In [170]: M2.data
Out[170]: array([5, 7, 9, 7, 8, 9])
In [171]: M2.indices
Out[171]: array([0, 1, 2, 0, 1, 2], dtype=int32)
In [172]: M2.indptr
Out[172]: array([0, 3, 3, 3, 6], dtype=int32)
In [173]: M2.A
Out[173]:
array([[5, 7, 9],
[0, 0, 0],
[0, 0, 0],
[7, 8, 9]])
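The coo route above can be wrapped into a small helper. This is only a sketch: the function name rows_to_csr and its n_rows parameter are made up for illustration, and it assumes data is a 2-D array whose i-th row belongs in matrix row indices[i].

```python
import numpy as np
from scipy import sparse

def rows_to_csr(data, indices, n_rows=None):
    """Hypothetical helper: scatter the rows of `data` into a CSR matrix,
    summing rows that share the same target row index."""
    data = np.asarray(data)
    indices = np.asarray(indices)
    n_items, n_cols = data.shape
    row = np.repeat(indices, n_cols)           # row index of every element
    col = np.tile(np.arange(n_cols), n_items)  # column index of every element
    if n_rows is None:
        n_rows = int(indices.max()) + 1        # smallest shape that fits
    coo = sparse.coo_matrix((data.ravel(), (row, col)), shape=(n_rows, n_cols))
    return coo.tocsr()                         # conversion sums duplicates

M = rows_to_csr([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [0, 0, 3])
print(M.toarray())
```

Passing the shape explicitly matters when the last rows of the matrix are empty, since they cannot be inferred from the indices alone.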
@Erik shows how to use the csr format directly:
In [174]: M3 = sparse.csr_matrix((data.ravel(), col, [0,6,6,6,9]))
In [175]: M3
Out[175]:
<4x3 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Row format>
In [176]: M3.A
Out[176]:
array([[5, 7, 9],
[0, 0, 0],
[0, 0, 0],
[7, 8, 9]])
In [177]: M3.indices
Out[177]: array([0, 1, 2, 0, 1, 2, 0, 1, 2], dtype=int32)
Note this has 9 nonzero elements; it hasn't summed the duplicates for storage (though the .A display shows them summed). To sum, we need an extra step:
In [179]: M3.sum_duplicates()
In [180]: M3.data
Out[180]: array([5, 7, 9, 7, 8, 9])
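The indptr list [0,6,6,6,9] was written by hand above. One possible way to derive it from the row numbers is sketched below. It assumes the row numbers are already sorted in non-decreasing order (CSR stores rows contiguously), and the names counts, n_rows and n_cols are introduced only for this example.

```python
import numpy as np
from scipy import sparse

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
indices = np.array([0, 0, 3])   # row numbers; assumed sorted
n_rows, n_cols = 4, data.shape[1]

# The direct CSR layout only matches data.ravel() if the rows are in order.
assert np.all(np.diff(indices) >= 0), "row numbers must be sorted"

# Stored entries per matrix row: each data row contributes n_cols entries.
counts = np.bincount(indices, minlength=n_rows) * n_cols
indptr = np.concatenate(([0], np.cumsum(counts)))
col = np.tile(np.arange(n_cols), len(indices))

M = sparse.csr_matrix((data.ravel(), col, indptr), shape=(n_rows, n_cols))
nnz_before = M.nnz      # duplicates are still stored separately
M.sum_duplicates()      # merges entries that share (row, column), in place
nnz_after = M.nnz

print(indptr, nnz_before, nnz_after)
print(M.toarray())
```

If the row numbers are not sorted, the coo inputs are the simpler choice, since they do not depend on the order of the entries.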