I have a large sparse matrix (SciPy, CSR format). How can I find the indices of the columns that are not entirely zero?
For example, suppose the matrix is s, constructed below:
In [13]: import scipy.sparse as sparse
In [14]: s=sparse.dok_matrix((2,4))
In [15]: s[0,0]=8; s[0,3]=9
In [16]: print (s.toarray())
[[8. 0. 0. 9.]
 [0. 0. 0. 0.]]
The indices of the nonzero columns of s are [0, 3].
CodePudding user response:
from scipy.sparse import csr_matrix
A = csr_matrix([[1,2,0],[0,0,3],[4,0,5]])
nonzero_indices = A.nonzero()
nonzero() returns a tuple of two arrays: the row indices and the column indices of the nonzero entries. The columns you're looking for are the distinct values in the second array. For example:
for i, _ in enumerate(nonzero_indices[0]):
    print(nonzero_indices[0][i], nonzero_indices[1][i])
will give
0 0
0 1
1 2
2 0
2 2
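To get from those coordinate pairs to the column list the question asks for, one option is to deduplicate the column array. This is only a minimal sketch applied to the 2x4 matrix from the question; the names rows, cols and nonzero_cols are mine, and building s from a dense array is just for brevity.

```python
import numpy as np
from scipy.sparse import csr_matrix

# The 2x4 example matrix from the question, built directly in CSR form.
s = csr_matrix(np.array([[8.0, 0.0, 0.0, 9.0],
                         [0.0, 0.0, 0.0, 0.0]]))

# nonzero() returns (row_indices, col_indices) of every nonzero entry.
rows, cols = s.nonzero()

# Each column index appears once per nonzero entry, so deduplicate.
nonzero_cols = np.unique(cols)
print(nonzero_cols)
```

For this matrix the result should be columns 0 and 3. Note that this materializes one index pair per nonzero entry, so it may use more memory than necessary on a very large matrix.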
CodePudding user response:
I think you can use:
import numpy as np
np.nonzero((s!=0).sum(0))[1]
output: [0, 3]
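Here is the same idea spelled out step by step, as a sketch. I convert to CSR first and flatten the per-column counts to a 1-D array, which is my own variation on the one-liner above (it avoids relying on the result being a 2-D matrix); the names s_csr, counts and nonzero_cols are hypothetical.

```python
import numpy as np
import scipy.sparse as sparse

# Rebuild the matrix from the question.
s = sparse.dok_matrix((2, 4))
s[0, 0] = 8
s[0, 3] = 9
s_csr = s.tocsr()

# (s_csr != 0) is a sparse boolean matrix; summing along axis 0
# counts the nonzero entries in each column.
counts = np.asarray((s_csr != 0).sum(axis=0)).ravel()

# Columns whose count is greater than zero are the ones we want.
nonzero_cols = np.flatnonzero(counts)
print(nonzero_cols)
```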
CodePudding user response:
While the sum(0) solution is correct, it seems to me it does a bit of extra work. What you need is easy to get from the CSC format.
import numpy as np
t = s.tocsc()
np.nonzero(t.indptr[:-1] != t.indptr[1:])[0]
If you need to process a large matrix this might work better, though I didn't test it.
Explanation
t.indptr stores, for each column in CSC format (each row in CSR), the offsets into t.indices and t.data. The row indices of the entries in column i are stored in t.indices[t.indptr[i]:t.indptr[i+1]]: t.indptr[i] is the starting position and t.indptr[i+1] is the ending position (exclusive). For your example, in CSC format t.indptr = [0, 1, 1, 1, 2]. If t.indptr[i] == t.indptr[i+1], then column i (row i in CSR) has no stored entries.
The conversion to CSC is the most costly operation here, and it is linear in the number of nonzero elements.
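Putting the explanation together on the matrix from the question, as a rough sketch rather than a benchmarked solution. The call to eliminate_zeros() is my addition: indptr counts stored entries, so an explicitly stored 0 would otherwise make a column look nonzero. The names per_col and nonzero_cols are hypothetical.

```python
import numpy as np
import scipy.sparse as sparse

# Rebuild the matrix from the question.
s = sparse.dok_matrix((2, 4))
s[0, 0] = 8
s[0, 3] = 9

t = s.tocsc()
t.eliminate_zeros()  # assumption: drop any explicitly stored zeros first

# indptr[i+1] - indptr[i] is the number of stored entries in column i.
per_col = np.diff(t.indptr)

# Columns with at least one stored entry.
nonzero_cols = np.flatnonzero(per_col)
print(nonzero_cols)
```

If the matrix is already known to contain no explicitly stored zeros, the eliminate_zeros() call can be skipped.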