I am building out a function to convert coo row index (illustrated as 'rows' below) to csr row pointers (results stored in 'rows_v3' below) without using any package.
The logic is accurate and works on the example below. However, when I ran this on a row with a length of 100k, the code keeps running on forever without completing. I suspect this had to do with the fact that I have nested loops in the function below - is that right? And if so, how can I tweak this function to address this runtime issue?
rows = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6]
# number of elements in csr_ptrs
unique_rows = len(set(rows)) 1
rows_v3 = [0] * unique_rows
for i in range(0, unique_rows):
if i == 0:
rows_v3[i] = 0
else:
nzz = len([x for x in rows if x == (i-1)])
rows_v3[i] = rows_v3[i-1] nzz
rows_v3
CodePudding user response:
There's no need to nest loops for this.
rows = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6]
# You need to use this instead of the length of the unique elements
# Rows with no non-zero values are still rows
# Even this will miss all-zero rows at the bottom of the matrix
n_rows = max(rows) 1
# Create a pointer list
indptr = [0] * n_rows
# Count the number of values in each row
for i in rows:
indptr[i] = 1
# Do a cumsum
for i in range(n_rows - 1):
indptr[i 1] = indptr[i]
# Add a zero on the front of the pointer list
# If that's the style of indptr you're doing
intptr = [0] indptr
CodePudding user response:
First lets make a sparse matrix for reference:
In [36]: from scipy import sparse
In [37]: rows = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6]
In [38]: rows = np.array(rows)
In [39]: data = np.ones(rows.shape, dtype=int)
In [40]: cols = np.arange(len(data))
In [41]: M = sparse.coo_matrix((data, (rows, cols)))
In [42]: M
Out[42]:
<7x22 sparse matrix of type '<class 'numpy.int64'>'
with 22 stored elements in COOrdinate format>
In [43]: M
Out[43]:
<7x22 sparse matrix of type '<class 'numpy.int64'>'
with 22 stored elements in COOrdinate format>
In [44]: M.A
Out[44]:
array([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]])
I created cols
so that there are no duplicates. Otherwise conversion to csr will sum duplicates, and make our task more complicated.
In [45]: Mr = M.tocsr()
In [46]: Mr.indptr
Out[46]: array([ 0, 3, 6, 10, 14, 17, 20, 22], dtype=int32)
To get the indptr
directly (again no-duplicates):
In [48]: x = np.diff(rows)
In [49]: x
Out[49]: array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0])
Add start and ending 1's:
In [50]: x1 = np.concatenate(([1], x, [1]))
In [51]: x1
Out[51]:
array([1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0,
1])
In [52]: np.nonzero(x1)[0]
Out[52]: array([ 0, 3, 6, 10, 14, 17, 20, 22])