Home > Mobile >  Apply a cumcount to each group of identical integers
Apply a cumcount to each group of identical integers

Time:01-02

Say I have the following array of ascending-sort integers (some may be negative):

a = np.array([ 1,  1,  1,  1, 10, 10, 20, 20, 20, 30, 40, 40, 40, 40])

I want to turn it into this:

a = np.array([ 1,  2,  3,  4, 10, 11, 20, 21, 22, 30, 40, 41, 42, 43])

...where each integer in each group of the same integers gets incremented, so for the first 1's:

  1 1 1 1  <--- these are the numbers from the array
  0 1 2 3  <--- these are counts of the number for its group
  -------
  1 2 3 4

Is there a more efficient way to do this than the below?

a = np.array([ 1,  1,  1,  1, 10, 10, 20, 20, 20, 30, 40, 40, 40, 40])
ones = (a == np.pad(a, (1,0))[:-1]).astype(int)
ones[ones == 0] = -np.diff(np.concatenate(([0.], np.cumsum(ones != 0)[ones == 0])))
new_a = a   ones.cumsum()

Note that array will always be in ascending order (lowest to highest), and the numbers will always be integers, and some may be negative.


Explanation, if you don't understand:

I actually already got this working, with the help of this post. What I'm doing right now is generating an array like this, where 0 marks the first of a group of identical numbers and 1 marks the rest:

1  1  1  1 10 10 20 20 20 30 40 40 40 40
0  1  1  1  0  1  0  1  1  0  0  1  1  1
^ first 1   ^ first 10     ^ first 30
                  ^ first 20  ^ first 40

and then using the technique from the above-linked post to cumsum all the ones in that array:

# Shift `a` by one and compare it with the original array
>>> ones = (a == np.pad(a, (1,0))[:-1]).astype(int)
>>> ones
array([0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1])

# This line is from the linked post (modified, of course)
>>> ones[ones == 0] = -np.diff(np.concatenate(([0.], np.cumsum(ones != 0)[ones == 0])))
>>> ones
array([ 0,  1,  1,  1, -3,  1, -1,  1,  1, -2,  0,  1,  1,  1])

>>> ones.cumsum()
array([0, 1, 2, 3, 0, 1, 0, 1, 2, 0, 0, 1, 2, 3])

Now, we can add that resulting array to the original one:

>>> a
array([ 1,  1,  1,  1, 10, 10, 20, 20, 20, 30, 40, 40, 40, 40])

>>> a   ones.cumsum()
array([ 1,  2,  3,  4, 10, 11, 20, 21, 22, 30, 40, 41, 42, 43])

CodePudding user response:

Using np.unique might be a bit more elegant:

u, i = np.unique(a, return_index=True)   # Indices where the sums restart
b = np.ones_like(a)
b[i] = u
b[i[1:]] -= np.add.reduceat(b, i)[:-1]   # Subtract the sum of the prior region from the next
result = b.cumsum()

Since the array is already sorted, you can shortcut to that part of np.unique:

i = np.r_[0, np.flatnonzero(np.diff(a))   1]  # Get the indices directly from the diff
b = np.ones_like(a)
b[i] = a[i]
b[i[1:]] -= np.add.reduceat(b, i)[:-1]
result = b.cumsum()

But wait, the sum of each region is just the length plus the start value minus one. That eliminates the need to sum twice:

i = np.r_[0, np.flatnonzero(np.diff(a))   1]
b = np.ones_like(a)
b[i] = a[i]
b[i[1:]] -= np.diff(i)   a[i[:-1]] - 1  # Simpler way to sum the prior region
result = b.cumsum()

You can simplify just a little further. Given that a[i[k]] is the start of a run, a[i[k] - 1] is the same as a[i[k - 1]]. In other words, the start of the previous run is the same as the last element in the previous run:

d = np.diff(a)
i = np.r_[0, np.flatnonzero(d)   1]
b = np.ones_like(a)
b[0] = a[0]
b[i[1:]] = d[i[1:] - 1] - np.diff(i)   1 # Current region minus prior, reusing diff
result = b.cumsum()

Either of the last two versions should be better then what you're currently doing.

The code above is written for simplicity and speed. If you want to make it shorter and more illegible, and you are using Python 3.8 , you can start throwing around the walrus operator:

i = np.r_[0, np.flatnonzero(d := np.diff(a))   1]
(b := np.ones_like(a))[0] = a[0]
b[i[1:]] = d[i[1:] - 1] - np.diff(i)   1
result = b.cumsum()

Since walrus evaluates left-to-right, you can create one final travesty:

(b := np.ones_like(a))[0] = a[0]
b[(i := np.r_[0, np.flatnonzero(d := np.diff(a))   1])[1:]] = d[i[1:] - 1] - np.diff(i)   1
result = b.cumsum()

Similar for the other approach:

(b := np.ones_like(a))[i := np.r_[0, np.flatnonzero(np.diff(a))   1]] = a[i]
b[i[1:]] -= np.diff(i)   a[i[:-1]] - 1
result = b.cumsum()

CodePudding user response:

I am not sure if this is very efficient, but it is a one-liner:

np.hstack([x   np.r_[:x.size] for x in np.split(a, np.flatnonzero(np.diff(a)) 1)])
  • Related