SciPy: create a sparse row matrix from a list of indices and a list-of-lists of data

Suppose I have a list of lists of data, and a list containing the row number of each data row. How do I convert them to a sparse matrix?

Example:

import numpy as np
data = np.array([[1,2,3],[4,5,6],[7,8,9]])
indices = np.array([0,0,4]) # row number, sum when duplicated

expected output is:

[[5, 7, 9], # row 0: [5,7,9] = [1,2,3] + [4,5,6]
 [0, 0, 0],
 [0, 0, 0],
 [7, 8, 9]] # row 4

I understand that I can construct it using scipy.sparse.csr_matrix from data plus row and col, or from data, indices and indptr. But I have already calculated data and the row numbers. Is there a way to construct a sparse matrix directly from these two? Thanks!

CodePudding user response：

According to the documentation, there is a constructor that utilizes the CSR information directly:

csr_matrix((data, indices, indptr), [shape=(M, N)])

So in your specific case, you could write it like:

import numpy as np
from scipy.sparse import csr_matrix

data = np.array([1,2,3,4,5,6,7,8,9]) # flattened values
indices = np.array([0,1,2,0,1,2,0,1,2]) # col numbers
indptr = np.array([0,6,6,6,9]) # row pointers

mat = csr_matrix((data, indices, indptr), shape=(4, 3))

For an example of how the CSR format works, you can look at a reference on sparse matrices. I will explain the code nonetheless:

First, the data needs to be flattened to a single list. The indices of the CSR format relate to the column-indices, while the indptr is used to point to the rows.

An indptr value of 0 at position 0 tells us that the 1st row (position 0) of the matrix starts after 0 data entries. Similarly, a value of 6 at position 1 tells us that the 2nd row (position 1) starts after 6 data entries, so the 1st row holds data[0:6].

The column-indices list is as you would expect it to behave: data[i] is positioned in column indices[i].
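As a minimal sketch of the explanation above, the column indices and row pointers can be derived from the 2-D data and the row numbers instead of being typed by hand. This assumes the row numbers are already sorted (CSR stores each row's entries contiguously), uses the 0-based row numbers [0, 0, 3], and takes the output shape (4, 3) as given; the names rows, col_indices and entries_per_row are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
rows = np.array([0, 0, 3])   # row number of each data row; assumed sorted
n_rows, n_cols = 4, 3        # assumed output shape

# Column indices: every data row fills columns 0..n_cols-1.
col_indices = np.tile(np.arange(n_cols), len(rows))

# Count the stored entries per output row, then take the running total.
entries_per_row = np.bincount(rows, minlength=n_rows) * n_cols
indptr = np.concatenate(([0], np.cumsum(entries_per_row)))

mat = csr_matrix((data.ravel(), col_indices, indptr), shape=(n_rows, n_cols))

# Row i of the matrix owns the slice [indptr[i]:indptr[i + 1]] of data and indices.
for i in range(n_rows):
    start, stop = indptr[i], indptr[i + 1]
    print(i, mat.data[start:stop], mat.indices[start:stop])

dense = mat.toarray()  # duplicate entries in row 0 are added together here
```

Empty rows show up as repeated values in indptr, which is why 6 appears three times in a row.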

CodePudding user response：

In [131]: data = np.array([[1,2,3],[4,5,6],[7,8,9]])
     ...: indices = np.array([0,0,3]) # row number, sum when duplicated

I corrected the indices for 0-based indexing: a 4-row matrix has rows 0 to 3, so the last row number is 3, not 4.

We don't need sparse to sum the duplicates. There's a np.add.at that does this nicely:

In [135]: res = np.zeros((4,3),int)
In [136]: np.add.at(res, indices, data)
In [137]: res
Out[137]: 
array([[5, 7, 9],
       [0, 0, 0],
       [0, 0, 0],
       [7, 8, 9]])
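To see why np.add.at is needed here rather than plain indexed addition, the following small sketch compares the two on the same inputs. The variable names buffered and accumulated are illustrative only.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
indices = np.array([0, 0, 3])

# Buffered fancy-index update: row 0 appears twice in indices, but it ends up
# holding only one of the two contributions, not their sum.
buffered = np.zeros((4, 3), dtype=int)
buffered[indices] += data

# Unbuffered ufunc.at: every occurrence of row 0 is accumulated.
accumulated = np.zeros((4, 3), dtype=int)
np.add.at(accumulated, indices, data)

print(buffered[0], accumulated[0])
```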

If we make a csr from that:

In [141]: M = sparse.csr_matrix(res)
In [142]: M
Out[142]: 
<4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [143]: M.data
Out[143]: array([5, 7, 9, 7, 8, 9])
In [144]: M.indices
Out[144]: array([0, 1, 2, 0, 1, 2], dtype=int32)
In [145]: M.indptr
Out[145]: array([0, 3, 3, 3, 6], dtype=int32)

To make a csr directly, it's often easier to use the coo style of inputs. They are easier to understand.

Those inputs are three 1-D arrays of the same length (data, row and col):

In [160]: data.ravel()
Out[160]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [161]: row = np.repeat(indices,3)
In [162]: row
Out[162]: array([0, 0, 0, 0, 0, 0, 3, 3, 3])
In [163]: col = np.tile(np.arange(3),3)
In [164]: col
Out[164]: array([0, 1, 2, 0, 1, 2, 0, 1, 2])
In [165]: M1 = sparse.coo_matrix((data.ravel(),(row, col)))
In [166]: M1.data
Out[166]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])

The coo format leaves the inputs as given; but on conversion to csr duplicates are summed.

In [168]: M2 = M1.tocsr()
In [169]: M2
Out[169]: 
<4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [170]: M2.data
Out[170]: array([5, 7, 9, 7, 8, 9])
In [171]: M2.indices
Out[171]: array([0, 1, 2, 0, 1, 2], dtype=int32)
In [172]: M2.indptr
Out[172]: array([0, 3, 3, 3, 6], dtype=int32)

In [173]: M2.A
Out[173]: 
array([[5, 7, 9],
       [0, 0, 0],
       [0, 0, 0],
       [7, 8, 9]])
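The coo route can be wrapped in a small function. This is only a sketch: rows_to_csr is a hypothetical helper name, not part of SciPy, and it assumes data is a 2-D array whose rows all have the same length. Unlike the direct csr constructor, the row numbers do not need to be sorted.

```python
import numpy as np
from scipy import sparse

def rows_to_csr(data, row_numbers, n_rows):
    """Hypothetical helper: place each row of `data` at `row_numbers[i]`,
    summing rows that share a row number."""
    data = np.asarray(data)
    row_numbers = np.asarray(row_numbers)
    n_cols = data.shape[1]
    row = np.repeat(row_numbers, n_cols)                 # one row number per value
    col = np.tile(np.arange(n_cols), len(row_numbers))   # 0..n_cols-1, repeated
    coo = sparse.coo_matrix((data.ravel(), (row, col)), shape=(n_rows, n_cols))
    return coo.tocsr()                                   # duplicates are summed here

# Same values as above, given in a different (unsorted) order.
M = rows_to_csr([[7, 8, 9], [1, 2, 3], [4, 5, 6]], [3, 0, 0], n_rows=4)
print(M.toarray())
```

Passing the shape explicitly avoids relying on the largest row number to infer the number of rows.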

@Erik shows how to use the csr format directly:

In [174]: M3 = sparse.csr_matrix((data.ravel(), col, [0,6,6,6,9]))
In [175]: M3
Out[175]: 
<4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 9 stored elements in Compressed Sparse Row format>
In [176]: M3.A
Out[176]: 
array([[5, 7, 9],
       [0, 0, 0],
       [0, 0, 0],
       [7, 8, 9]])
In [177]: M3.indices
Out[177]: array([0, 1, 2, 0, 1, 2, 0, 1, 2], dtype=int32)

Note this has 9 stored elements: the duplicates have not been summed in storage (though the .A display shows them summed). To sum them, we need an extra step:

In [179]: M3.sum_duplicates()
In [180]: M3.data
Out[180]: array([5, 7, 9, 7, 8, 9])
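As a self-contained sketch of that last step, the stored-element count can be checked before and after the call. The names nnz_before and nnz_after are illustrative only.

```python
import numpy as np
from scipy import sparse

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
col = np.tile(np.arange(3), 3)
indptr = np.array([0, 6, 6, 6, 9])

M = sparse.csr_matrix((data, col, indptr), shape=(4, 3))
nnz_before = M.nnz     # counts stored entries, duplicates included

M.sum_duplicates()     # modifies M in place
nnz_after = M.nnz

print(nnz_before, nnz_after, M.indptr)
```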