Home > Back-end >  numpy cumulative count in linear time
numpy cumulative count in linear time

Time:07-16

If I have a data like this: [144 144 144 144 143 143 143 93 93 93 93 93 93 93 93 93 93] And I want to make data like this: [0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, ....]

But, I tried to use the function in the link below, I got this: [3 2 1 0 2 1 0 9 0 6 5 4 3 2 1 7 8]

How can I fixed it?

def grp_range(a):
count = np.unique(a,return_counts=1)[1]
idx = count.cumsum()
id_arr = np.ones(idx[-1],dtype=int)
id_arr[0] = 0
id_arr[idx[:-1]] = -count[:-1] 1
out = id_arr.cumsum()[np.argsort(a).argsort()]

return out

How to use numpy to get the cumulative count by unique values in linear time?

CodePudding user response:

To speed up bpfrd version you can use numba. On my machine ~40 times faster as bpfrd version and ~10 times faster to ouroboros1 versions!

from numba import jit
import numpy as np

@jit(nopython=True)
def numba_style(a):
    prev = a[0]
    idx = 0
    c = [idx]

    for i in range(1, len(a)):
        if a[i] == prev:
            idx  = 1
        else:
            prev = a[i]
            idx = 0
        c.append(idx) 

    return np.array(c)

Function Timing (mean ± std. dev. of 3 runs, 3 loops each)
list_style 818 µs ± 116 µs per loop
grp_range 230 µs ± 81.3 µs per loop
cumcount 170 µs ± 73.9 µs per loop
np_style_2 165 µs ± 86.2 µs per loop
np_style 118 µs ± 79.3 µs per loop
numba_style 19.2 µs ± 887 ns per loop

Performance Testing

Speed comparison also to ouroboros1 versions. Define functions:

def list_style(a):
    prev = a[0]
    idx = 0
    c = [idx]

    for i in range(1, len(a)):
        if a[i] == prev:
            idx  = 1
        else:
            prev = a[i]
            idx = 0
        c.append(idx) 

    return c

def dfill(a):
    n = a.size
    b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0]   1, [n]])
    return np.arange(n)[b[:-1]].repeat(np.diff(b))

def argunsort(s):
    n = s.size
    u = np.empty(n, dtype=np.int64)
    u[s] = np.arange(n)
    return u

def cumcount(a):
    n = a.size
    s = a.argsort(kind='mergesort')
    i = argunsort(s)
    b = a[s]
    return (np.arange(n) - dfill(b))[i]

def grp_range(a):
    count = np.unique(a,return_counts=1)[1]
    idx = count.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -count[:-1] 1
    out = id_arr.cumsum()[np.argsort(a, kind='mergesort').argsort()] # adjustment here
    return out

def np_style(a):
    unique, index, counts = np.unique(a, return_counts=True, return_index=True)
    arg_s = np.argsort(index)
    return np.concatenate(list(map(np.arange,counts[arg_s])), axis=0)

def np_style_2(a):
    aa = a np.arange(len(a))
    unique, index = np.unique(a, return_index=True)
    for uni, ind in zip(unique, index):
        aa[a==uni] -= aa[ind]
    return aa

And the actual testing with a slightly longer array:

n =   4,   3,  9, 15, 24, 100, 2500, 555
t = 144, 143, 93, 85, 84, 100, 250, 555
a = np.concatenate([[t_i]*n_i for t_i, n_i in zip(t,n)])

%timeit -n 3 -r 3 grp_range(a)
%timeit -n 3 -r 3 np_style(a)
%timeit -n 3 -r 3 np_style_2(a)
%timeit -n 3 -r 3 cumcount(a)
%timeit -n 3 -r 3 list_style(a)
%timeit -n 3 -r 3 numba_style(a)

CodePudding user response:

try this:

a = [144, 144, 144, 144, 143, 143, 143, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93]
prev = a[0]
idx = 0
c = [idx]

for i in range(1, len(a)):
    if a[i] == prev:
        idx  = 1
    else:
        prev = a[i]
        idx = 0
    c.append(idx) 

c
>[0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

CodePudding user response:

TL;DR: within your function change

out = id_arr.cumsum()[np.argsort(a).argsort()]

into

out = id_arr.cumsum()[np.argsort(a, kind='mergesort').argsort()]

If speed is a concern, use the solution offered by @piRSquared in the post mentioned. You'll need the first three functions mentioned. So:

import numpy as np

arr = np.array([144, 144, 144, 144, 143, 143, 143, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93])

def dfill(a):
    n = a.size
    b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0]   1, [n]])
    return np.arange(n)[b[:-1]].repeat(np.diff(b))

def argunsort(s):
    n = s.size
    u = np.empty(n, dtype=np.int64)
    u[s] = np.arange(n)
    return u

def cumcount(a):
    n = a.size
    s = a.argsort(kind='mergesort')
    i = argunsort(s)
    b = a[s]
    return (np.arange(n) - dfill(b))[i]

cumcount(arr) # will get you desired output

The accepted answer from the referenced post is actually incorrect.

The problem lies with the fact that np.argsort uses quicksort as the default sorting algorithm. For a stable sort, we need mergesort (see the comments by @MartijnPieters on the matter here).

So, in your slightly adjusted function we need:

import numpy as np

def grp_range(a):
    count = np.unique(a,return_counts=1)[1]
    idx = count.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -count[:-1] 1
    out = id_arr.cumsum()[np.argsort(a, kind='mergesort').argsort()] # adjustment here
    return out

Testing a couple of examples:

# OP's example
arr = np.array([144, 144, 144, 144, 143, 143, 143, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93])

arr_result = grp_range(arr)
print(arr_result)
# [0 1 2 3 0 1 2 0 1 2 3 4 5 6 7 8 9] (correct)

# OP's example, with mixed sequence (note addition 143, 144 at arr1[9:11])
arr_alt = np.array([144, 144, 144, 144, 143, 143, 143, 93, 93, 143, 144, 93, 93, 93, 93, 93, 93])

arr_alt_result = grp_range(arr_alt)
print(arr_alt_result)
# [0 1 2 3 0 1 2 0 1 3 4 2 3 4 5 6 7] (correct) (note: arr_alt_result[9:11] == array([3, 4], dtype=int32))

As mentioned above, the solution offered by @piRSquared will be faster than this with the same results.


A final aside. The sequence posted is in descending order. If this is true for the actual data you are working with, you could do something like this:

import numpy as np

arr = np.array([144, 144, 144, 144, 143, 143, 143, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93])

count = np.unique(arr,return_counts=1)[1][::-1] # from ascending to descending
out = np.concatenate(list(map(np.arange,count)), axis=0)

# out: [0 1 2 3 0 1 2 0 1 2 3 4 5 6 7 8 9]

or this:

from collections import Counter

arr = [144, 144, 144, 144, 143, 143, 143, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93]
count_dict = Counter(arr)

out = list()
for v in count_dict.items():
    out.extend([*range(v[1])])

or indeed, use the answer provided by @bpfrd.

  • Related