I have a CSR matrix, and I want to be able to retrieve the column indices and the values.
Here is how I create the matrix (using csr_matrix from scipy.sparse):
indptr = np.empty(nbr_of_rows + 1) # nbr_of_rows = 134,465
indptr[0] = 0
for i in range(1, len(indptr)):
    indptr[i] = indptr[i-1] + len(data[i-1]) # type(data) = list ; len(data) = 134,465 ; type(data[0]) = numpy.ndarray (each subarray has a different length)
data = np.concatenate(data).ravel() # now I have type(data) = numpy.ndarray ; len(data) = 2,821,574
ind = np.concatenate(ind).ravel() # same as above
X = csr_matrix((data, ind, indptr), shape=(nbr_of_rows, nbr_of_columns)) # nbr_of_columns = 3,991
print(f"The matrix has a shape of {X.shape} and a sparsity of {(1 - (X.nnz / (X.shape[0] * X.shape[1]))): .2%}.")
# OUT: The matrix has a shape of (134465, 3991) and a sparsity of 99.47%.
So far so good (at least I think so). But now, even though I manage to retrieve the column indices, I can’t successfully retrieve the values:
np.alltrue(ind == X.nonzero()[1]) # True
np.alltrue(data == X[X.nonzero()]) # False
When I look deeper, I find that almost all the values match (only a small number of mismatches):
len(data) == len(X[X.nonzero()].tolist()[0]) # True
len(np.argwhere((data==X[X.nonzero()]) == False)) # 2184
So I get "only" 2,184 wrong values out of 2,821,574 total values.
Can someone please help me in getting all the correct values from my CSR matrix?
CodePudding user response:
Depending on the type of the values you store in the matrix, numpy.float64 or numpy.int64, perhaps, the following post might answer your question: https://github.com/scipy/scipy/issues/13329#issuecomment-753541268
In particular, the comment "Apparently I don't get an error when data is a numpy array rather than a list." suggests that passing data as a numpy.array rather than a list could solve your problem.
Hopefully, this at least sets you on the right track.
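As a minimal sketch of that suggestion (rows_data and rows_ind are hypothetical stand-ins for the per-row lists, not the asker's actual data), the inputs can be flattened into NumPy arrays with explicit dtypes before calling csr_matrix, with indptr built from the cumulative row lengths:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical per-row inputs: one array of values and one of column indices per row.
rows_data = [np.array([1.23, 2.0]), np.array([3.0]), np.array([])]
rows_ind = [np.array([1, 2]), np.array([3]), np.array([], dtype=int)]

# indptr[i] is the offset of row i's first stored value; the last entry is the total count.
row_lengths = np.array([len(r) for r in rows_data], dtype=np.int64)
indptr = np.concatenate(([0], np.cumsum(row_lengths)))

# Flatten to 1-D arrays with explicit dtypes (float values, integer indices).
data = np.concatenate(rows_data).astype(np.float64)
ind = np.concatenate(rows_ind).astype(np.int64)

X = csr_matrix((data, ind, indptr), shape=(3, 4))
print(X.toarray())
```

Whether this removes the mismatches in the question is not something the sketch can confirm; it only makes the dtypes of all three inputs explicit.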
CodePudding user response:
Without your data I can't replicate your problem, and I probably wouldn't want to even if I had it, given how large the array is.
But I'll try to illustrate what I expect to happen when constructing a matrix this way. From another question I have a small matrix in an IPython session:
In [60]: Mx
Out[60]:
<1x3 sparse matrix of type '<class 'numpy.intc'>'
with 2 stored elements in Compressed Sparse Row format>
In [61]: Mx.A
Out[61]: array([[0, 1, 2]], dtype=int32)
nonzero returns the coo format indices, row and col:
In [62]: Mx.nonzero()
Out[62]: (array([0, 0], dtype=int32), array([1, 2], dtype=int32))
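A small self-contained sketch of the same idea, using a made-up 2x3 matrix rather than Mx: the two arrays returned by nonzero() line up with the row and col attributes of the coo form.

```python
import numpy as np
from scipy import sparse

# Made-up example matrix (not the Mx from the session above).
M = sparse.csr_matrix(np.array([[0, 1, 2],
                                [3, 0, 0]]))

rows, cols = M.nonzero()   # row and column index of every nonzero entry
coo = M.tocoo()            # same entries in coordinate (coo) format

print(rows, cols)
print(coo.row, coo.col, coo.data)
```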
The csr attributes are:
In [63]: Mx.data,Mx.indices,Mx.indptr
Out[63]:
(array([1, 2], dtype=int32),
array([1, 2], dtype=int32),
array([0, 2], dtype=int32))
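To read the column indices and values of a single row straight from these attributes, slice indices and data with consecutive indptr entries. The sketch below uses a made-up matrix; row_entries is a hypothetical helper name, not a SciPy function.

```python
import numpy as np
from scipy import sparse

def row_entries(X, i):
    """Return (column indices, values) stored for row i of a CSR matrix."""
    start, stop = X.indptr[i], X.indptr[i + 1]
    return X.indices[start:stop], X.data[start:stop]

# Made-up example matrix.
X = sparse.csr_matrix(np.array([[0.0, 1.23, 2.0, 0.0],
                                [0.0, 0.0, 0.0, 3.0],
                                [0.0, 0.0, 0.0, 0.0]]))

for i in range(X.shape[0]):
    cols, vals = row_entries(X, i)
    print(i, cols, vals)
```

An empty row simply yields two empty slices, because its two indptr entries are equal.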
Now let's make a new matrix, using the attributes of Mx. Assuming you constructed your indptr, indices, and data correctly, this should imitate what you've done:
In [64]: newM = sparse.csr_matrix((Mx.data, Mx.indices, Mx.indptr))
In [65]: newM.A
Out[65]: array([[0, 1, 2]], dtype=int32)
data matches between the two matrices:
In [68]: Mx.data==newM.data
Out[68]: array([ True, True])
The ids of the data arrays don't match, but their bases do. See my recent answer for why this is relevant: https://stackoverflow.com/a/74543855/901925
In [75]: id(Mx.data.base), id(newM.data.base)
Out[75]: (2255407394864, 2255407394864)
That means changes to newM will appear in Mx:
In [77]: newM[0,1] = 100
In [78]: newM.A
Out[78]: array([[ 0, 100, 2]], dtype=int32)
In [79]: Mx.A
Out[79]: array([[ 0, 100, 2]], dtype=int32)
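If that sharing is not wanted, one option is to pass copy=True to the constructor (or copy the arrays first). A minimal sketch with made-up arrays, using np.shares_memory to check whether each matrix still points at the original buffer:

```python
import numpy as np
from scipy import sparse

# Made-up CSR components for a 1x3 matrix.
data = np.array([1.0, 2.0])
indices = np.array([1, 2], dtype=np.int32)
indptr = np.array([0, 2], dtype=np.int32)

shared = sparse.csr_matrix((data, indices, indptr), shape=(1, 3))
copied = sparse.csr_matrix((data, indices, indptr), shape=(1, 3), copy=True)

# Report whether each matrix reuses the memory of the original data array.
print(np.shares_memory(data, shared.data))
print(np.shares_memory(data, copied.data))

# An in-place change to data cannot reach the copied matrix.
data[0] = 100.0
print(copied.toarray())
```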
fuller test
Let's try a small scale test of your code:
In [92]: data = np.array([[1.23,2],[3],[]],object); ind = np.array([[1,2],[3],[]],object)
...: indptr = np.empty(4)
...: indptr[0] = 0
...: for i in range(1, 4):
...:     indptr[i] = indptr[i-1] + len(data[i-1])
...: data = np.concatenate(data).ravel()
...: ind = np.concatenate(ind).ravel() # same as above
In [93]: data,ind,indptr
Out[93]: (array([1.23, 2. , 3. ]), array([1., 2., 3.]), array([0., 2., 3., 3.]))
And the sparse matrix:
In [94]: X = sparse.csr_matrix((data, ind, indptr), shape=(3,3))
In [95]: X
Out[95]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>
data matches:
In [96]: X.data
Out[96]: array([1.23, 2. , 3. ])
In [97]: data == X.data
Out[97]: array([ True, True, True])
and is in fact a view:
In [98]: data[1] += .23; data
Out[98]: array([1.23, 2.23, 3. ])
In [99]: X.A
Out[99]:
array([[0. , 1.23, 2.23],
[0. , 0. , 0. ],
[3. , 0. , 0. ]])
oops
I made an error in specifying the X shape:
In [110]: X = sparse.csr_matrix((data, ind, indptr), shape=(3,4))
In [111]: X.A
Out[111]:
array([[0. , 1.23, 2.23, 0. ],
[0. , 0. , 0. , 3. ],
[0. , 0. , 0. , 0. ]])
In [112]: X.data
Out[112]: array([1.23, 2.23, 3. ])
In [113]: X.nonzero()
Out[113]: (array([0, 0, 1], dtype=int32), array([1, 2, 3], dtype=int32))
In [114]: X[X.nonzero()]
Out[114]: matrix([[1.23, 2.23, 3. ]])
In [115]: data
Out[115]: array([1.23, 2.23, 3. ])
In [116]: data == X[X.nonzero()]
Out[116]: matrix([[ True, True, True]])
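If the same comparison fails on the real data, two things worth ruling out (both are guesses, since the original arrays are not available) are NaN values, which never compare equal to themselves, and repeated column indices within a row, which leave the matrix in non-canonical form. The sketch below uses made-up arrays and the has_canonical_format attribute; comparing against X.data directly avoids the round trip through X.nonzero().

```python
import numpy as np
from scipy import sparse

# Made-up components: row 0 stores column 1 twice, and one value is NaN.
data = np.array([1.0, 2.0, np.nan])
ind = np.array([1, 1, 3])
indptr = np.array([0, 2, 3, 3])
X = sparse.csr_matrix((data, ind, indptr), shape=(3, 4))

# NaN != NaN, so an elementwise == reports False wherever a NaN is stored.
n_nan = int(np.isnan(X.data).sum())

# False here means some row has unsorted or repeated column indices.
canonical = X.has_canonical_format

# The stored values and indices can be checked without any indexing.
same_values = np.array_equal(X.data, data, equal_nan=True)
same_indices = np.array_equal(X.indices, ind)
print(n_nan, canonical, same_values, same_indices)
```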