Scipy function that can do np.diff() with compressed sparse column matrix-CodePudding

I want to compute discrete difference of identity matrix. The code below use numpy and scipy.

import numpy as np
from scipy.sparse import identity
from scipy.sparse import csc_matrix

x = identity(4).toarray()
y = csc_matrix(np.diff(x, n=2))
print(y)

I would like to improve performance or memory usage. Since identity matrix produce many zeros, it would reduce memory usage to perform calculation in compressed sparse column(csc) format. However, np.diff() does not accept csc format, so converting between csc and normal format using csc_matrix would slow it down a bit.

Normal format

x = identity(4).toarray()
print(x)
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

csc format

x = identity(4)
print(x)
  (0, 0)    1.0
  (1, 1)    1.0
  (2, 2)    1.0
  (3, 3)    1.0

Thanks

CodePudding user response：

Here is my hacky solution to get the sparse matrix as you want.

L - the length of the original identity matrix,
n - the parameter of np.diff.

In your question they are:

L = 4
n = 2

My code produces the same y as your code, but without the conversions between csc and normal formats.

Your code:

from scipy.sparse import identity, csc_matrix

x = identity(L).toarray()
y = csc_matrix(np.diff(x, n=n))

My code:

from scipy.linalg import pascal

def get_data(n, L):
    nums = pascal(n   1, kind='lower')[-1].astype(float)
    minuses_from = n % 2   1
    nums[minuses_from : : 2] *= -1
    return np.tile(nums, L - n)

data = get_data(n, L)
row_ind = (np.arange(n   1)   np.arange(L - n).reshape(-1, 1)).flatten()
col_ind = np.repeat(np.arange(L - n), n   1)

y = csc_matrix((data, (row_ind, col_ind)), shape=(L, L - n))

I have noticed that after applying np.diff to the identity matrix n times, the values of the columns are the binomial coefficients with their signs alternating. This is my variable data.

Then I am just constructing the csc_matrix.

CodePudding user response：

Unfortunately, it does not seem that SciPy provides any tools for this kind of sparse matrix manipulation. Regardless, by cleverly manipulating the indices and data of the entries one can emulate np.diff(x,n) in a straightforward fashion.

Given 2D NumPy array (matrix) of dimension MxN, np.diff() multiplies each column (of column index y) with -1 and adds the next column to it (column index y 1). Difference of order k is just the iterative application of k differences of order 1. A difference of order 0 is just the returns the input matrix.

The method below makes use of this, iterateively eliminating duplicate entries by addition through sum_duplicates(), reducing the number of columns by one, and filtering non-valid indices.

def csc_diff(x, n):
    '''Emulates np.diff(x,n) for a sparse matrix by iteratively taking difference of order 1'''
    assert isinstance(x, csc_matrix) or (isinstance(x, np.ndarray) & len(x.shape) == 2), "Input matrix must be a 2D np.ndarray or csc_matrix."
    assert isinstance(n, int) & n >= 0, "Integer n must be larger or equal to 0."
    
    if n >= x.shape[1]:
        return csc_matrix(([], ([], [])), shape=(x.shape[0], 0))
    
    if isinstance(x, np.ndarray):
        x = csc_matrix(x)
        
    # set-up of data/indices via column-wise difference
    if(n > 0):
        for k in range(1,n 1):
            # extract data/indices of non-zero entries of (current) sparse matrix
            M, N = x.shape
            idx, idy = x.nonzero()
            dat = x.data
        
            # difference: this row (y) * (-1)   next row (y 1)
            idx = np.concatenate((idx, idx))
            idy = np.concatenate((idy, idy-1))
            dat = np.concatenate(((-1)*dat, dat))
            
            # filter valid indices
            validInd = (0<=idy) & (idy<N-1)

            # x_diff: csc_matrix emulating np.diff(x,1)'s output'
            x_diff =  csc_matrix((dat[validInd], (idx[validInd], idy[validInd])), shape=(M, N-1))
            x_diff.sum_duplicates()
            
            x = x_diff

    return x

Moreover, the method outputs an empty csc_matrix of dimension Mx0 when the difference order is larger or equal to the number of columns of the input matrix. This is why the output is identical, see

csc_diff(x, 2).toarray()
> array([[ 1.,  0.],
         [-2.,  1.],
         [ 1., -2.],
         [ 0.,  1.]])

which is identical to

np.diff(x.toarray(), 2)
> array([[ 1.,  0.],
         [-2.,  1.],
         [ 1., -2.],
         [ 0.,  1.]])

This identity holds for other difference orders, too

(csc_diff(x, 0).toarray() == np.diff(x.toarray(), 0)).all()
>True

(csc_diff(x, 3).toarray() == np.diff(x.toarray(), 3)).all()
>True

(csc_diff(x, 13).toarray() == np.diff(x.toarray(), 13)).all()
>True