Efficiently convert NumPy 2D array of counts to zero-padded 2D array of indices?

I have a NumPy 2D array of n rows (observations) × m columns (features), where each element is the number of times that feature was observed. I need to convert it to a 2D array of feature indices, where each feature index is repeated as many times as its count in the original array, and each row is left-padded with zeros so that all rows have the same length.

This seems like it should be a simple combo of np.where with np.repeat or just expansion using indexing, but I'm not seeing it. Here's a very slow, loopy solution (way too slow to use in practice):

import numpy as np

# Loopy solution (way too slow!)
def convert_2Dcountsarray_to_zeropaddedindices(countsarray2D):
    rowsums = np.sum(countsarray2D, axis=1)
    max_rowsum = np.max(rowsums)
    out = []
    for row_idx, row in enumerate(countsarray2D):
        # Left-pad with zeros so all out_rows have the same length
        out_row = [0] * int(max_rowsum - rowsums[row_idx])
        for ele_idx, count in enumerate(row):
            # Append feature index ele_idx once per observed count
            out_row.extend([ele_idx] * int(count))
        out.append(out_row)
    return np.array(out)

# Working example
countsarray2D = np.array( [[1,2,0,1,3],
                           [0,0,0,0,3],
                           [0,1,1,0,0]] )

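# Shift all features up by 1 (i.e. add a dummy feature 0 we will use for padding).
# dtype=int keeps the counts integral; np.zeros defaults to float, which np.repeat rejects as repeat counts.
countsarray2D = np.hstack( (np.zeros((len(countsarray2D),1), dtype=int), countsarray2D) )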

print(convert_2Dcountsarray_to_zeropaddedindices(countsarray2D))

# Desired result:
array([[1, 2, 2, 4, 5, 5, 5],
       [0, 0, 0, 0, 5, 5, 5],
       [0, 0, 0, 0, 0, 2, 3]])

User response:

One solution would be to flatten the array and use np.repeat.

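This solution requires first prepending a column to countsarray2D that holds the number of padding zeros each row needs. Start from the original 5-column countsarray2D (before the dummy-feature shift in the question): the prepended column becomes index 0, so the real features are shifted up by 1 automatically. This can be done as follows: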

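counts = countsarray2D.sum(axis=1)                  # total count per row
max_count = counts.max()                            # length of the longest output row
zeros_to_add = max_count - counts                   # padding zeros needed per row
countsarray2D = np.c_[zeros_to_add, countsarray2D]  # prepend the padding counts as column 0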

The new countsarray2D is then:

array([[0, 1, 2, 0, 1, 3],
       [4, 0, 0, 0, 0, 3],
       [5, 0, 1, 1, 0, 0]])

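Now, we can flatten the array and use np.repeat. An index array A is used as the input array, while countsarray2D determines the number of times each index value should be repeated. Because every row of the new countsarray2D sums to max_count, the flat output can be reshaped into n_rows rows of equal length.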

n_rows, n_cols = countsarray2D.shape
A = np.tile(np.arange(n_cols), (n_rows, 1))
np.repeat(A, countsarray2D.flatten()).reshape(n_rows, -1)
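
To see why the flatten-and-reshape step lines up, here is a minimal sketch on a smaller, made-up pair of arrays (the names `reps`, `flat`, and `out` are illustrative and not part of the answer). When np.repeat is called without an axis, it flattens its input and repeats each element by the matching entry of the repeat counts:

```python
import numpy as np

# Index grid: every row is 0, 1, 2
A = np.array([[0, 1, 2],
              [0, 1, 2]])

# Hypothetical repeat counts; both rows sum to 3, so the result is rectangular
reps = np.array([[2, 0, 1],
                 [0, 3, 0]])

# Without axis=, np.repeat works on the flattened A, one count per element
flat = np.repeat(A, reps.ravel())
print(flat)   # [0 0 2 1 1 1]

# Equal row sums mean the flat output splits evenly into rows
out = flat.reshape(2, -1)
print(out)
```

If the row sums differed, the flat array could not be reshaped into rows of equal length, which is why the padding column is added first.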

Final result:

array([[1, 2, 2, 4, 5, 5, 5],
       [0, 0, 0, 0, 5, 5, 5],
       [0, 0, 0, 0, 0, 2, 3]])
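
For convenience, the steps above can be collected into one function. This is only a sketch: the function name `counts_to_padded_indices` is made up here, and it assumes the input is the original (unshifted) array of non-negative integer counts.

```python
import numpy as np

def counts_to_padded_indices(counts2d):
    """Sketch: expand integer counts into zero-padded, 1-based feature indices."""
    counts2d = np.asarray(counts2d)
    row_totals = counts2d.sum(axis=1)
    # Padding zeros each row needs to reach the longest row
    zeros_to_add = row_totals.max() - row_totals
    # Column 0 now holds the padding counts; real features shift up by 1
    augmented = np.c_[zeros_to_add, counts2d]
    n_rows, n_cols = augmented.shape
    # Each row of the index grid is 0, 1, ..., n_cols - 1
    index_grid = np.tile(np.arange(n_cols), (n_rows, 1))
    # Repeat each index by its count, then split back into rows
    return np.repeat(index_grid, augmented.ravel()).reshape(n_rows, -1)

counts = np.array([[1, 2, 0, 1, 3],
                   [0, 0, 0, 0, 3],
                   [0, 1, 1, 0, 0]])
result = counts_to_padded_indices(counts)
print(result)
```

The intermediate index grid has the same shape as the augmented counts array, so memory use grows with n rows × (m + 1) columns plus the size of the output.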