thanks for taking the time to look at my question.
I'm trying to create a program that can add an array of values to each "column" in a sparse matrix, like so:
A = [[1,1,0],
[0,0,1],
[1,0,1]]
B = [[1],
[0],
[1]]
A B = [[2,2,1],
[0,0,1],
[2,1,2]]
Represented in the coordinate format for sparse matrices it would look as follows:
A = [[0,0,1],
[2,0,1],
[0,1,1],
[1,2,1],
[2,2,1]]
B = [[0,0,1],
[2,0,1]]
A B = [[0,0,2],
[2,0,2],
[0,1,2],
[2,1,1],
[0,2,1],
[1,2,1],
[2,2,2]]
I am dealing with large matrices that must be represented in a sparse manner due to their size. I need to be able to add a column of values to each column in a matrix, but with an algorithm that deals exclusively with the sparse triplets.
I have spent all day on this, literally 10 hours straight, and I am genuinely stunned I haven't been able to find a good answer for this anywhere. Performing this operation with multiplication is simple and highly efficient, but there doesn't seem to be any time and space-efficient solution built into scipy or numpy to do this, or if there is, it will kill me when I find out. I attempted to implement a solution, but it ended up being horribly inefficient.
Essentially, my solution, which does technically work but just with terrible time efficiency, follows these steps:
- Check for shared values on rows between A and B, add the relevant triplet values together
- Add the unique values from A
- Add B_row_i, x_i, B_value_i for i in the columns of the matrix, checking to see that we're not adding a verbatim value from our A triplets.
At least, I think that's what it does... I'm totally burnt out now and I started to phase out while coding. If anyone could suggest and fast solutions that would be highly appreciated!
from scipy.sparse import coo_matrix
from tqdm import tqdm
class SparseCoordinates:
def __init__(self,coo_a,coo_b):
self.shape = coo_a.shape
self.coo_a_rows = coo_a.row
self.coo_a_cols = coo_a.col
self.coo_a_data = coo_a.data
self.coo_b_rows = coo_b.row
self.coo_b_cols = coo_b.col
self.coo_b_data = coo_b.data
self.coo_c_rows = []
self.coo_c_cols = []
self.coo_c_data = []
def __check(self,a,b,c,lr,lc,lv):
for i in range(len(lr)):
if lr[i] == a and lc[i] == b and lv[i] == c:
return True
return False
def __check_shared_rows(self):
for i in tqdm(range(len(self.coo_a_rows))):
for j in range(len(self.coo_b_rows)):
if self.coo_a_rows[i] == self.coo_b_rows[j]:
self.coo_c_rows.append(self.coo_a_rows[i])
self.coo_c_cols.append(self.coo_a_cols[i])
self.coo_c_data.append(self.coo_a_data[i] self.coo_b_data[j])
def __add_unique_from_a(self):
a_unique = set(self.coo_a_rows) - set(self.coo_b_rows)
for i in tqdm(range(len(self.coo_a_rows))):
if self.coo_a_rows[i] in a_unique:
self.coo_c_rows.append(self.coo_a_rows[i])
self.coo_c_cols.append(self.coo_a_cols[i])
self.coo_c_data.append(self.coo_a_data[i])
def __add_all_remaining_from_b(self):
for i in tqdm(range(len(self.coo_b_rows))):
for j in range(self.shape[1]):
if not self.__check(self.coo_b_rows[i],j,self.coo_b_data[i],
self.coo_a_rows,self.coo_a_cols,self.coo_a_data):
self.coo_c_rows.append(self.coo_b_rows[i])
self.coo_c_cols.append(j)
self.coo_c_data.append(self.coo_b_data[i])
def add(self):
self.__check_shared_rows()
self.__add_unique_from_a()
self.__add_all_remaining_from_b()
return coo_matrix((self.coo_c_data,(self.coo_c_rows,self.coo_c_cols)),shape=self.shape)
CodePudding user response:
For numpy
arrays, A B
does the job because of broadcasting
. broadcasting
is implemented at core iteration levels, taking advantage of strides
. scipy.sparse
does not implement broadcasting
. If B
is expanded to (3,3) matrix to match A
addition does work
In [76]: A = np.array([[1,1,0],
...: [0,0,1],
...: [1,0,1]])
...:
...: B = np.array([[1],
...: [0],
...: [1]])
In [77]: B.shape
Out[77]: (3, 1)
In [78]: A B
Out[78]:
array([[2, 2, 1],
[0, 0, 1],
[2, 1, 2]])
Sparse:
In [79]: from scipy import sparse
In [81]: M=sparse.csr_matrix(A);N=sparse.csr_matrix(B)
Matrix multiplication is well developed for sparse matrices:
In [82]: M@N
Out[82]:
<3x1 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
In [84]: N.T@M
Out[84]:
<1x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Column format>
Matrix multiplication is used for row or column indexing, and for summing.
Define a helper:
In [86]: O=sparse.csr_matrix([1,1,1])
In [87]: O
Out[87]:
<1x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
In [88]: N@O
Out[88]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
Use that in the sum:
In [89]: M N@O
Out[89]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
In [90]: _.A
Out[90]:
array([[2, 2, 1],
[0, 0, 1],
[2, 1, 2]])
This matrix is less sparse than M
- fewer 0's. That reduces the benefits of using sparse matrices.
coo_matrix
format is used to create matrices, and to build new ones, as with sparse.bmat
, sparse.hstack
and sparse.vstack
, but csr_matrix
is used most math. Sometimes it is possible to do custom math by iterating on 'rows' via the intptr
array. There have been various SO answers which do that. Math using the coo
attributes is generally not a good idea.
However, because coo
duplicates are summed when converted to csr
, I could probably set up a reasonable summation of the coo formats of M
and N
.
In [91]: Mo=M.tocoo(); No=N.tocoo()
In [95]: Oo=O.tocoo()
In [98]: rows = np.concatenate((Mo.row, Oo.row))
In [99]: cols = np.concatenate((Mo.col, Oo.col))
In [100]: data = np.concatenate((Mo.data, Oo.data))
In [101]: sparse.coo_matrix((data,(rows,cols)),M.shape)
Out[101]:
<3x3 sparse matrix of type '<class 'numpy.int64'>'
with 8 stored elements in COOrdinate format>
In [102]: _.A
Out[102]:
array([[2, 2, 1],
[0, 0, 1],
[1, 0, 1]])
Here I joined the attributes of Mo
and Oo
. I could do the same with the attributes of No
, but since I already has Oo
, I decided it was worth the work. I'll leave that up to you.
In [103]: Oo.row,Oo.col,Oo.data
Out[103]:
(array([0, 0, 0], dtype=int32),
array([0, 1, 2], dtype=int32),
array([1, 1, 1]))
In [104]: No.row,No.col,No.data
Out[104]: (array([0, 2], dtype=int32), array([0, 0], dtype=int32), array([1, 1]))
I don't know if this coo
approach is worth the effort.