Create an sparse matrix from a list of tuples having the indexes of the column where is a 1-CodePudding

Problem:

I have a list of tuples, which each tuple represents a column of a 2D-array and each element of the tuple represents the index of that column of the array that is a 1; the other entries that aren't in that tuple, are 0.

I want to create an sparse matrix with this list of tuples in an efficient way (trying to not use for loops).

Example:

# init values
list_tuples = [
 (0, 2, 4),
 (0, 2, 3),
 (1, 3, 4)
]

n = length(list_tuples)   1
m = 5 # arbritrary, however n >= max([ei for ei in list_tuples])   1

# what I need is a function which accepts this tuples and give the shape of the array
# (at least the row size, because the column size can be infered from the list of tuples)
A = some_function(list_tuples, array_shape = (m, n))

Then what I expect to have is an array of the form:

[   
 [1, 1, 0]
 [0, 0, 1]  
 [1, 1, 0]
 [0, 1, 1]
 [1, 0, 1]
]

CodePudding user response：

Your values are the indices that are required for the compressed sparse column format. You'll also need the indptr array, which for your data is the cumulative sum of the lengths of the tuples (prepended with 0). The data array would be an array of ones with the same length as the sum of the lengths of the tuples, which you can get from the last element of the cumulative sum. Here's how that looks with your example:

In [45]: from scipy.sparse import csc_matrix

In [46]: list_tuples = [
    ...:  (0, 2, 4),
    ...:  (0, 2, 3),
    ...:  (1, 3, 4)
    ...: ]

In [47]: indices = sum(list_tuples, ())  # Flatten the tuples into one sequence.

In [48]: indptr = np.cumsum([0]   [len(t) for t in list_tuples])

In [49]: a = csc_matrix((np.ones(indptr[-1], dtype=int), indices, indptr))

In [50]: a
Out[50]: 
<5x3 sparse matrix of type '<class 'numpy.int64'>'
    with 9 stored elements in Compressed Sparse Column format>

In [51]: a.A
Out[51]: 
array([[1, 1, 0],
       [0, 0, 1],
       [1, 1, 0],
       [0, 1, 1],
       [1, 0, 1]])

Note that csc_matrix inferred the number of rows from the maximum that it found in the indices. You can use the shape parameter to override that, e.g.

In [52]: b = csc_matrix((np.ones(indptr[-1], dtype=int), indices, indptr), shape=(7, len(list_tuples)))

In [53]: b
Out[53]: 
<7x3 sparse matrix of type '<class 'numpy.int64'>'
    with 9 stored elements in Compressed Sparse Column format>

In [54]: b.A
Out[54]: 
array([[1, 1, 0],
       [0, 0, 1],
       [1, 1, 0],
       [0, 1, 1],
       [1, 0, 1],
       [0, 0, 0],
       [0, 0, 0]])

You can also generate a coo_matrix pretty easily. The flattened list_tuples gives the row indices, and np.repeat can be used to create the column indices:

In [63]: from scipy.sparse import coo_matrix

In [64]: i = sum(list_tuples, ())  # row indices

In [65]: j = np.repeat(range(len(list_tuples)), [len(t) for t in list_tuples])

In [66]: c = coo_matrix((np.ones(len(i), dtype=int), (i, j)))

In [67]: c
Out[67]: 
<5x3 sparse matrix of type '<class 'numpy.int64'>'
    with 9 stored elements in COOrdinate format>

In [68]: c.A
Out[68]: 
array([[1, 1, 0],
       [0, 0, 1],
       [1, 1, 0],
       [0, 1, 1],
       [1, 0, 1]])