scipy `SparseEfficiencyWarning` when division on rows of csr_matrix


Suppose I already had a csr_matrix:

import numpy as np
from scipy.sparse import csr_matrix
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1., 2., 3., 4., 5., 6.])
mat = csr_matrix((data, indices, indptr), shape=(3, 3))
print(mat.A)

[[1. 0. 2.]
 [0. 0. 3.]
 [4. 5. 6.]]

It's simple to divide a single row of this csr_matrix:

mat[0] /= 2
print(mat.A)

[[0.5 0.  1. ]
 [0.  0.  3. ]
 [4.  5.  6. ]]

However, if I want to divide multiple rows at once (starting again from the original mat), it throws a warning:

mat[np.array([0,1])]/=np.array([[1],[2]])
print(mat.A)

[[1.  0.  2. ]
 [0.  0.  1.5]
 [4.  5.  6. ]]

SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_arrayXarray(i, j, x)

Why does division on multiple rows change the sparsity? The warning suggests switching to lil_matrix, but when I checked the code of tolil():

def tolil(self, copy=False):
    lil = self._lil_container(self.shape, dtype=self.dtype)

    self.sum_duplicates()
    ptr,ind,dat = self.indptr,self.indices,self.data
    rows, data = lil.rows, lil.data

    for n in range(self.shape[0]):
        start = ptr[n]
        end = ptr[n + 1]
        rows[n] = ind[start:end].tolist()
        data[n] = dat[start:end].tolist()

    return lil

which basically loops over all rows, and I don't think that's necessary in my case. What is the correct way to divide just a few rows of a csr_matrix? Thanks!

CodePudding user response:

Your matrix:

In [208]: indptr = np.array([0, 2, 3, 6])
     ...: indices = np.array([0, 2, 2, 0, 1, 2])
     ...: data = np.array([1., 2., 3., 4., 5., 6.])
     ...: mat = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
In [209]: mat
Out[209]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [210]: mat.A
Out[210]: 
array([[1., 0., 2.],
       [0., 0., 3.],
       [4., 5., 6.]])

Simple division just changes the mat.data values, in-place:

In [211]: mat/= 3
In [212]: mat
Out[212]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [213]: mat *= 3
In [214]: mat.A
Out[214]: 
array([[1., 0., 2.],
       [0., 0., 3.],
       [4., 5., 6.]])

The division on the RHS of your statement produces an np.matrix object:

In [215]: mat[np.array([0,1])]/np.array([[1],[2]])
Out[215]: 
matrix([[1. , 0. , 2. ],
        [0. , 0. , 1.5]])

Assigning that to the mat subset produces the warning:

In [216]: mat[np.array([0,1])] = _
/usr/local/lib/python3.8/dist-packages/scipy/sparse/_index.py:146: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_arrayXarray(i, j, x)

Your warning and mine both occur in the set step:

self._set_arrayXarray(i, j, x)

If I divide again I don't get the warning:

In [217]: mat[np.array([0,1])]/=np.array([[1],[2]])
In [218]: mat
Out[218]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse Row format>

Why? Because after the first assignment, mat has 9 stored elements, not the original 6, so [217] doesn't change the sparsity structure.
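This effect can be checked directly with mat.nnz, which counts stored entries, including any explicit zeros the assignment introduces; a minimal sketch starting from the question's matrix (the warning is silenced here just to keep the demo quiet):

```python
import warnings
import numpy as np
from scipy.sparse import csr_matrix

mat = csr_matrix(np.array([[1., 0., 2.],
                           [0., 0., 3.],
                           [4., 5., 6.]]))
print(mat.nnz)                          # 6 stored elements

with warnings.catch_warnings():
    warnings.simplefilter("ignore")     # suppress SparseEfficiencyWarning for the demo
    mat[np.array([0, 1])] /= np.array([[1], [2]])

print(mat.nnz)                          # explicit zeros are now stored (9 in the session above)
mat.eliminate_zeros()
print(mat.nnz)                          # back to 6
```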

Strip the explicit zeros so mat is back to 6 stored elements, and we get the warning again:

In [219]: mat.eliminate_zeros()
In [220]: mat
Out[220]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [221]: mat[np.array([0,1])]/=np.array([[1],[2]])
/usr/local/lib/python3.8/dist-packages/scipy/sparse/_index.py:146: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_arrayXarray(i, j, x)

and a change in sparsity:

In [222]: mat
Out[222]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse Row format>

Assigning a sparse-matrix version of the [215] result doesn't trigger the warning:

In [223]: mat.eliminate_zeros()
In [224]: m1=sparse.csr_matrix(mat[np.array([0,1])]/np.array([[1],[2]]))
In [225]: m1
Out[225]: 
<2x3 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>
In [226]: mat[np.array([0,1])]=m1
In [227]: mat
Out[227]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>

===

The [215] division is best understood as an ndarray operation, not a sparse one:

In [232]: mat[np.array([0,1])]/np.array([[1],[2]])
Out[232]: 
matrix([[1.     , 0.     , 2.     ],
        [0.     , 0.     , 0.09375]])
In [233]: mat[np.array([0,1])].todense()/np.array([[1],[2]])
Out[233]: 
matrix([[1.     , 0.     , 2.     ],
        [0.     , 0.     , 0.09375]])

The details of this division are found in sparse._base.py, in mat._divide, with different actions depending on whether the other operand is a scalar, a dense array, or a sparse matrix. Sparse matrix division does not implement broadcasting.

As a general rule, matrix multiplication is the most efficient sparse calculation. In fact, operations like row or column sums are implemented with it, and so are some forms of indexing. Element-wise calculations are fine if they can be applied to the M.data array without regard to row or column indices (e.g. square, power, scalar multiplication). M.multiply is element-wise, but without the full broadcasting power of dense arrays. Sparse division is even more limited.

edit

sklearn has some utilities to perform certain kinds of sparse actions that it needs, like scaling and normalizing.

In [274]: from sklearn.utils import sparsefuncs

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.utils

https://scikit-learn.org/stable/modules/generated/sklearn.utils.sparsefuncs.inplace_row_scale.html#sklearn.utils.sparsefuncs.inplace_row_scale

With the sample mat:

In [275]: mat
Out[275]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [276]: mat.A
Out[276]: 
array([[1.    , 0.    , 2.    ],
       [0.    , 0.    , 0.1875],
       [4.    , 5.    , 6.    ]])

Applying the row scaling:

In [277]: sparsefuncs.inplace_row_scale(mat,np.array([10,20,1]))
In [278]: mat
Out[278]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [279]: mat.A
Out[279]: 
array([[10.  ,  0.  , 20.  ],
       [ 0.  ,  0.  ,  3.75],
       [ 4.  ,  5.  ,  6.  ]])

The scaling array has to match the number of rows. In your case you'd take the reciprocal of your [[1],[2]] divisor, and pad it with 1 so it acts on all rows.
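Concretely, building that scale vector for the question's [[1],[2]] divisor over rows 0 and 1 of a 3-row matrix might look like this (a sketch; reciprocals are needed because the row scaling multiplies):

```python
import numpy as np

divisor = np.array([[1], [2]])      # the per-row divisors from the question
scale = np.ones(3)                  # one entry per row; 1.0 leaves a row unchanged
scale[:2] = 1.0 / divisor.ravel()   # reciprocals, since inplace_row_scale multiplies
print(scale)                        # [1.  0.5 1. ]
```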

Looking at the source, I see it uses sparsefuncs.inplace_csr_row_scale, which in turn does:

X.data *= np.repeat(scale, np.diff(X.indptr))

The details of this action are:

In [283]: mat.indptr
Out[283]: array([0, 2, 3, 6], dtype=int32)
In [284]: np.diff(mat.indptr)
Out[284]: array([2, 1, 3], dtype=int32)
In [285]: np.repeat(np.array([10,20,1]), _)
Out[285]: array([10, 10, 20,  1,  1,  1])
In [286]: mat.data
Out[286]: array([100., 200.,  75.,   4.,   5.,   6.])

So it expands the scale array into an array that matches the data in shape. The in-place *= array multiplication is then easy.
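The same np.repeat trick can be applied directly, without sklearn; a minimal sketch on the question's original matrix, dividing row 1 by 2:

```python
import numpy as np
from scipy.sparse import csr_matrix

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1., 2., 3., 4., 5., 6.])
mat = csr_matrix((data, indices, indptr), shape=(3, 3))

scale = np.array([1.0, 0.5, 1.0])                  # divide row 1 by 2, leave the rest alone
mat.data *= np.repeat(scale, np.diff(mat.indptr))  # expand per-row scale to per-element
print(mat.toarray())
```

Since only mat.data is touched, the sparsity structure never changes and no warning is raised.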

CodePudding user response:

It seems that the problem lies in the way the matrix data is accessed:

mat[np.array([0,1])]/=np.array([[1],[2]])

The direct access to the rows seems to be the problem. I did not find any reliable documentation as to why this is the case, but I think it forces construction of the rows with all the 0 values that the CSR format would normally skip.

A way to mitigate the problem is to first rewrite the divisor np.array([[1],[2]]) as reciprocals 1/x to get [[1],[0.5]], and then create a diagonal csr_matrix from it, so that we have

mat_x = csr_matrix(([1,0.5,1], [0,1,2], [0,1,2,3]), shape=(3, 3))

If we then multiply this matrix from the left, it scales the rows, and we get the desired output without any warnings:

mat = mat_x * mat
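Putting the whole idea together as a runnable sketch; scipy.sparse.diags is used here to build the diagonal scaling matrix instead of spelling out the CSR arrays by hand:

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1., 2., 3., 4., 5., 6.])
mat = csr_matrix((data, indices, indptr), shape=(3, 3))

# Reciprocals of the desired row divisors [1, 2, 1]:
mat_x = diags([1.0, 0.5, 1.0], format='csr')
mat = mat_x @ mat            # left-multiplying by a diagonal matrix scales the rows
print(mat.toarray())
```

Because this is an ordinary sparse matrix product, no sparsity structure is modified in place and no SparseEfficiencyWarning appears.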