I have an array of repeated values that are used to match datapoints to some ID. How can I replace the IDs with counting up index values in a vectorized manner?
Consider the following minimal example:
import numpy as np
n_samples = 10
ids = np.random.randint(0,500, n_samples)
lengths = np.random.randint(1,5, n_samples)
x = np.repeat(ids, lengths)
print(x)
Output:
[129 129 129 129 173 173 173 207 207 5 430 147 143 256 256 256 256 230 230 68]
Desired solution:
indices = np.arange(n_samples)
y = np.repeat(indices, lengths)
print(y)
Output:
[0 0 0 0 1 1 1 2 2 3 4 5 6 7 7 7 7 8 8 9]
However, in the real code, I do not have access to variables like ids
and lengths
, but only x
.
It does not matter what the values in x
are, I just want an array with counting up integers which are repeated the same amount as in x
.
I can come up with solutions using for-loops or np.unique
, but both are too slow for my use case.
Has anyone an idea for a fast algorithm that takes an array like x
and returns an array like y
?
CodePudding user response:
You can do:
y = np.r_[False, x[1:] != x[:-1]].cumsum()
Or with one less temporary array:
y = np.empty(len(x), int)
y[0] = 0
np.cumsum(x[1:] != x[:-1], out=y[1:])
print(y)