I have a numpy 2D array of n rows (observations) by m columns (features), where each element is the number of times that feature was observed. I need to convert it to a zero-padded 2D array of feature indices, where each feature index is repeated as many times as its count in the original 2D array.
This seems like it should be a simple combo of np.where with np.repeat, or just expansion using indexing, but I'm not seeing it. Here's a very slow, loopy solution (way too slow to use in practice):
# Loopy solution (way too slow!)
import numpy as np

def convert_2Dcountsarray_to_zeropaddedindices(countsarray2D):
    rowsums = np.sum(countsarray2D, 1)
    max_rowsum = np.max(rowsums)
    out = []
    for row_idx, row in enumerate(countsarray2D):
        # Padding zeros so all out_rows are the same length
        out_row = [0] * int(max_rowsum - rowsums[row_idx])
        for ele_idx in range(len(row)):
            # int() because np.repeat rejects float repeat counts
            out_row.extend(np.repeat(ele_idx, int(row[ele_idx])))
        out.append(out_row)
    return np.array(out)
# Working example
countsarray2D = np.array([[1, 2, 0, 1, 3],
                          [0, 0, 0, 0, 3],
                          [0, 1, 1, 0, 0]])
# Shift all features up by 1 (i.e. add a dummy feature 0 we will use for padding)
# dtype=int keeps the counts as integers; np.zeros defaults to float
countsarray2D = np.hstack((np.zeros((len(countsarray2D), 1), dtype=int), countsarray2D))
print(convert_2Dcountsarray_to_zeropaddedindices(countsarray2D))
# Desired result:
array([[1, 2, 2, 4, 5, 5, 5],
       [0, 0, 0, 0, 5, 5, 5],
       [0, 0, 0, 0, 0, 2, 3]])
Answer:
One solution would be to flatten the array and use np.repeat.
This solution first requires prepending a column to countsarray2D (the original counts, without the dummy feature-0 column from the question) that holds the number of padding zeros each row needs. This can be done as follows:
counts = countsarray2D.sum(axis=1)
max_count = max(counts)
zeros_to_add = max_count - counts
countsarray2D = np.c_[zeros_to_add, countsarray2D]
The new countsarray2D is then:
array([[0, 1, 2, 0, 1, 3],
       [4, 0, 0, 0, 0, 3],
       [5, 0, 1, 1, 0, 0]])
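The padding column matters because it makes every row sum to the same total, which is what allows the flat repeated output to be reshaped into a rectangle later. A minimal sketch of that check, using the example data from the question (the names `counts2d`, `padded`, and `padded_totals` are illustrative, not from the original code):

```python
import numpy as np

# Original counts from the question (no dummy column added)
counts2d = np.array([[1, 2, 0, 1, 3],
                     [0, 0, 0, 0, 3],
                     [0, 1, 1, 0, 0]])

row_totals = counts2d.sum(axis=1)              # total count per row
zeros_to_add = row_totals.max() - row_totals   # padding needed per row
padded = np.c_[zeros_to_add, counts2d]         # prepend padding as column 0

# After padding, every row sums to the same total
padded_totals = padded.sum(axis=1)
print(padded_totals)  # [7 7 7]
```

Since column 0 now carries the padding count, index 0 takes the role of the dummy padding feature and the real features are numbered from 1.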
Now we can flatten the array and use np.repeat. An index array A is used as the input array, while countsarray2D determines how many times each index value is repeated.
n_rows, n_cols = countsarray2D.shape
A = np.tile(np.arange(n_cols), (n_rows, 1))
np.repeat(A, countsarray2D.flatten()).reshape(n_rows, -1)
Final result:
array([[1, 2, 2, 4, 5, 5, 5],
       [0, 0, 0, 0, 5, 5, 5],
       [0, 0, 0, 0, 0, 2, 3]])
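For convenience, the steps above can be gathered into a single function. This is only a sketch: the function name `counts_to_padded_indices` is made up for illustration, and it assumes a non-empty 2D array of non-negative integer counts, passed in without the dummy column.

```python
import numpy as np

def counts_to_padded_indices(counts2d):
    """Expand per-feature counts into zero-padded rows of 1-based feature indices."""
    counts2d = np.asarray(counts2d, dtype=int)
    row_totals = counts2d.sum(axis=1)
    # Number of padding zeros each row needs to reach the longest row
    pad = row_totals.max() - row_totals
    # Column 0 holds the padding count, so index 0 acts as the padding value
    padded = np.c_[pad, counts2d]
    n_rows, n_cols = padded.shape
    # One row of column indices per observation
    index_grid = np.tile(np.arange(n_cols), (n_rows, 1))
    # np.repeat flattens index_grid, repeats each index by its count,
    # and the equal row totals let the result be reshaped into rows
    return np.repeat(index_grid, padded.ravel()).reshape(n_rows, -1)

example = np.array([[1, 2, 0, 1, 3],
                    [0, 0, 0, 0, 3],
                    [0, 1, 1, 0, 0]])
result = counts_to_padded_indices(example)
print(result)
```

On the example from the question, this prints the same array as the final result above.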