Function to convert coo row index to csr row pointers without using scipy-CodePudding

I am building out a function to convert coo row index (illustrated as 'rows' below) to csr row pointers (results stored in 'rows_v3' below) without using any package.

The logic is accurate and works on the example below. However, when I ran this on a row with a length of 100k, the code keeps running on forever without completing. I suspect this had to do with the fact that I have nested loops in the function below - is that right? And if so, how can I tweak this function to address this runtime issue?

rows = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6]
# number of elements in csr_ptrs
unique_rows = len(set(rows))  1

rows_v3 = [0] * unique_rows
for i in range(0, unique_rows):
  if i == 0:
    rows_v3[i] = 0
  else:
    nzz = len([x for x in rows if x == (i-1)])
    rows_v3[i] = rows_v3[i-1]   nzz
rows_v3

CodePudding user response：

There's no need to nest loops for this.

rows = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6]

# You need to use this instead of the length of the unique elements
# Rows with no non-zero values are still rows
# Even this will miss all-zero rows at the bottom of the matrix
n_rows = max(rows)   1

# Create a pointer list
indptr = [0] * n_rows

# Count the number of values in each row
for i in rows:
    indptr[i]  = 1

# Do a cumsum
for i in range(n_rows - 1):
    indptr[i   1]  = indptr[i]

# Add a zero on the front of the pointer list
# If that's the style of indptr you're doing
intptr = [0]   indptr

CodePudding user response：

First lets make a sparse matrix for reference:

In [36]: from scipy import sparse
In [37]: rows = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6]
In [38]: rows = np.array(rows)
In [39]: data = np.ones(rows.shape, dtype=int)
In [40]: cols = np.arange(len(data))
In [41]: M = sparse.coo_matrix((data, (rows, cols)))
In [42]: M
Out[42]: 
<7x22 sparse matrix of type '<class 'numpy.int64'>'
    with 22 stored elements in COOrdinate format>
In [43]: M
Out[43]: 
<7x22 sparse matrix of type '<class 'numpy.int64'>'
    with 22 stored elements in COOrdinate format>
In [44]: M.A
Out[44]: 
array([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]])

I created cols so that there are no duplicates. Otherwise conversion to csr will sum duplicates, and make our task more complicated.

In [45]: Mr = M.tocsr()
In [46]: Mr.indptr
Out[46]: array([ 0,  3,  6, 10, 14, 17, 20, 22], dtype=int32)

To get the indptr directly (again no-duplicates):

In [48]: x = np.diff(rows)
In [49]: x
Out[49]: array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0])

Add start and ending 1's:

In [50]: x1 = np.concatenate(([1], x, [1]))
In [51]: x1
Out[51]: 
array([1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0,
       1])
In [52]: np.nonzero(x1)[0]
Out[52]: array([ 0,  3,  6, 10, 14, 17, 20, 22])