How to find the indices of columns that are not entirely zeros of a sparse matrix

Time:09-14

I have a large sparse array (Python csr). How can I find the indices of columns that are not entirely zeros?

For example, suppose the matrix is s, constructed below:

In [13]: import  scipy.sparse as sparse

In [14]: s=sparse.dok_matrix((2,4))

In [15]: s[0,0]=8; s[0,3]=9

In [16]:  print (s.toarray())
[[8. 0. 0. 9.]
 [0. 0. 0. 0.]]

The indices of the columns of s that are not entirely zero should be [0, 3].

CodePudding user response:

from scipy.sparse import csr_matrix

A = csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])

nonzero_indices = A.nonzero()

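nonzero() returns a tuple of two arrays: the row indices and the column indices of the nonzero entries. The distinct values in the second array are the columns that contain at least one nonzero entry. For example:

# Print each (row, column) pair that holds a nonzero value
for row, col in zip(*nonzero_indices):
    print(row, col)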

will give

0 0
0 1
1 2
2 0
2 2
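To get from these (row, column) pairs to the column indices the question asks for, one option is to keep only the distinct column values. The following is a minimal sketch applied to the matrix s from the question; the variable names rows, cols and nonzero_cols are illustrative and not part of the original answer.

```python
import numpy as np
import scipy.sparse as sparse

# Rebuild the example matrix from the question
s = sparse.dok_matrix((2, 4))
s[0, 0] = 8
s[0, 3] = 9

# nonzero() returns (row_indices, column_indices) of the nonzero entries
rows, cols = s.nonzero()

# Distinct column indices, sorted ascending
nonzero_cols = np.unique(cols)
print(nonzero_cols)
```

For the example matrix this yields columns 0 and 3. Note that this materializes one index per nonzero entry, so it may use more memory than necessary on very large matrices.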

CodePudding user response:

I think you can use:

import numpy as np
np.nonzero((s!=0).sum(0))[1]

output: [0, 3]
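The [1] index above relies on sum(0) returning a 2-D result, which is what the scipy.sparse matrix classes do. As a hedged variant, the sketch below flattens the per-column counts first, so it does not depend on the shape of the result; the names col_counts and nonzero_cols are illustrative only.

```python
import numpy as np
import scipy.sparse as sparse

# Rebuild the example matrix from the question
s = sparse.dok_matrix((2, 4))
s[0, 0] = 8
s[0, 3] = 9

# (s != 0) marks nonzero entries; summing over axis 0 counts them per column.
# Counting entries rather than summing values avoids columns whose values cancel out.
col_counts = np.asarray((s != 0).sum(axis=0)).ravel()

# Indices of columns with at least one nonzero entry
nonzero_cols = np.flatnonzero(col_counts)
print(nonzero_cols)
```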

CodePudding user response:

While the sum(0) solution is correct, it seems to me it does a bit of extra work. What you need is easy to do in csc format.

import numpy as np
t = s.tocsc()
np.nonzero(t.indptr[:-1] != t.indptr[1:])[0]

If you need to process a large matrix this might work better, though I didn't test it.

Explanation

t.indptr stores, for each column in CSC format (each row in CSR), offsets into t.indices and t.data. The row indices of the entries in column i are stored in t.indices[t.indptr[i]:t.indptr[i+1]]: t.indptr[i] is the starting position and t.indptr[i+1] is the ending position (exclusive). For your example, in CSC format t.indptr = [0, 1, 1, 1, 2]. If t.indptr[i] == t.indptr[i+1], then column i (row i in CSR) has no stored elements.

The conversion to CSC is the most costly operation here, and it is linear in the number of non-zero elements.
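The indptr layout described above can be checked end to end with a short sketch on the matrix from the question. The call to eliminate_zeros() is an added precaution, not part of the original answer: indptr counts stored entries, so an explicitly stored 0 would otherwise make a column look non-empty. The names per_column and nonzero_cols are illustrative.

```python
import numpy as np
import scipy.sparse as sparse

# Rebuild the example matrix from the question
s = sparse.dok_matrix((2, 4))
s[0, 0] = 8
s[0, 3] = 9

t = s.tocsc()
t.eliminate_zeros()  # drop explicitly stored zeros (in place), if any

# Number of stored entries in each column: indptr[i+1] - indptr[i]
per_column = np.diff(t.indptr)

# Columns whose slice of t.indices is non-empty
nonzero_cols = np.flatnonzero(per_column)

print(t.indptr)
print(nonzero_cols)
```

Using np.diff is equivalent to comparing t.indptr[:-1] with t.indptr[1:], and it also gives the number of nonzero entries per column as a by-product.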
