Python: get maximum occurrence in array


I implemented code to get the maximum occurrence count of any value in a numpy array. The numba version performs well, but it has limitations, and I wonder whether it can be improved to handle the general case.

numba implementation

import numba as nb
import numpy as np
import collections

@nb.njit("int64(int64[:])")
def max_count_unique_num(x):
    """
    Counts the maximum number of occurrences among the unique integers in x.

    Args:
        x (numpy array): Integer array.

    Returns:
        Int

    """
    # get maximum value
    m = x[0]
    for v in x:
        if v > m:
            m = v

    if m == 0:
        return x.size

    # count each unique value
    num = np.zeros(m + 1, dtype=x.dtype)
    for k in x:
        num[k] += 1
    # maximum count
    m = 0
    for k in num:
        if k > m:
            m = k
    return m
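
For reference, a minimal call looks like this (the sample values below are made up purely for illustration):

x = np.array([1, 1, 2, 2, 2, 3], dtype=np.int64)
print(max_count_unique_num(x))  # 3, since the value 2 occurs three times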

For comparison, I also implemented versions based on numpy's unique and collections.Counter:

def np_unique(x):
    """ Counts maximum occurrence using numpy's unique. """
    ux, uc = np.unique(x, return_counts=True)
    return uc.max()


def counter(x):
    """ Counts maximum occurrence using collections.Counter. """
    counts = collections.Counter(x)
    return max(counts.values())

timeit

Edit: Added np.bincount for an additional comparison, as suggested by @MechanicPig.

In [1]: x = np.random.randint(0, 2000, size=30000).astype(np.int64)
In [2]: %timeit max_count_unique_num(x)
30 µs ± 387 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [3]: %timeit np_unique(x)
1.14 ms ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [4]: %timeit counter(x)
2.68 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [6]: %timeit counter(x)
3.07 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %timeit np_unique(x)
1.3 ms ± 7.35 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [8]: %timeit max_count_unique_num(x)
490 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [9]: x = np.random.randint(0, 2000, size=30000).astype(np.int64)
In [10]: %timeit np.bincount(x).max()
32.3 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [11]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [12]: %timeit np.bincount(x).max()
830 µs ± 6.09 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

The limitations of the numba implementation are quite obvious: it is only efficient when all values in x are small positive integers, its performance drops significantly for very large integers, and it cannot handle floats or negative values.

Any way I can generalize the implementation and keep the speed?

CodePudding user response:

You can do that without using numpy too.

arr = [1,1,2,2,3,3,4,5,6,1,3,5,7,1]
counts = list(map(list(arr).count, set(arr)))
list(set(arr))[counts.index(max(counts))]

If you want to use numpy then try this,

arr = np.array([1,1,2,2,3,3,4,5,6,1,3,5,7,1])
uniques, counts = np.unique(arr, return_counts = True)
uniques[np.where(counts == counts.max())]

Both do essentially the same job: they return the most frequent value (use max(counts) or counts.max() if you want the occurrence count itself). To check which method is more efficient, you can do something like this:

import time

time_i = time.time()
<arr declaration>  # declare the array once here; re-creating it inside the loop would inflate the total time and bias the comparison against the numpy method
for i in range(10**5):
    <method you want>
time_f = time.time()
print(time_f - time_i)

When I ran this I got 0.39 seconds for the first method and 2.69 for the second one. So it's pretty safe to say that the first method is more efficient.
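
As a concrete, runnable sketch of that comparison (my filled-in version of the template above; the array values and variable names are just for illustration):

import time
import numpy as np

arr = [1, 1, 2, 2, 3, 3, 4, 5, 6, 1, 3, 5, 7, 1]

time_i = time.time()
for _ in range(10**5):
    counts = list(map(arr.count, set(arr)))
    mode_list = list(set(arr))[counts.index(max(counts))]
time_f = time.time()
print("list-based:", time_f - time_i)

np_arr = np.array(arr)
time_i = time.time()
for _ in range(10**5):
    uniques, counts = np.unique(np_arr, return_counts=True)
    mode_np = uniques[np.where(counts == counts.max())]
time_f = time.time()
print("numpy-based:", time_f - time_i)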

CodePudding user response:

What I want to say is that your implementation is essentially numpy.bincount. If you want to make it general, you can consider encoding (ranking) the original data first:

def encode(ar):
    # Equivalent to numpy.unique(ar, return_inverse=True)[1] when ar.ndim == 1
    flatten = ar.ravel()
    perm = flatten.argsort()
    sort = flatten[perm]

    mask = np.concatenate(([False], sort[1:] != sort[:-1]))
    encoded = np.empty(sort.shape, np.int64)
    encoded[perm] = mask.cumsum()
    encoded.shape = ar.shape

    return encoded

def count_max(ar):
    return max_count_unique_num(encode(ar))
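
For example, a small sketch with made-up values (max_count_unique_num is the njit function from the question):

x = np.array([-1.5, 2.0, 2.0, -1.5, 2.0, 7.25])
print(encode(x))     # [0 1 1 0 1 2]: each element replaced by the rank of its unique value
print(count_max(x))  # 3, since 2.0 occurs three times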

CodePudding user response:

What about shortening your code:

@nb.njit("int64(int64[:])", fastmath=True)
def shortened(x):
    num = np.zeros(x.max()   1, dtype=x.dtype)
    for k in x:
        num[k]  = 1

    return num.max()

or parallelized:

@nb.njit("int64(int64[:])", parallel=True, fastmath=True)
def shortened_paralleled(x):
    num = np.zeros(x.max()   1, dtype=x.dtype)
    for k in nb.prange(x.size):
        num[x[k]]  = 1

    return num.max()

Parallelizing wins for larger data sizes. Note, however, that the parallel loop has a data race (several threads may increment the same bin of num at once), so it can return different results on different runs and needs to be fixed before it can be relied upon; one possible fix is sketched below.
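
A hedged sketch of one way to avoid the race, by giving each chunk of the input its own histogram row and merging at the end (the chunk count of 8 is an arbitrary assumption; ideally tune it to the number of threads):

@nb.njit("int64(int64[:])", parallel=True)
def shortened_parallel_safe(x):
    m = x.max() + 1
    n_chunks = 8  # arbitrary; ideally close to the number of threads
    chunk = (x.size + n_chunks - 1) // n_chunks
    num = np.zeros((n_chunks, m), dtype=np.int64)
    # each prange iteration fills its own row, so there is no race
    for c in nb.prange(n_chunks):
        start = c * chunk
        end = min(start + chunk, x.size)
        for i in range(start, end):
            num[c, x[i]] += 1
    # merge the per-chunk histograms and take the largest count
    best = 0
    for v in range(m):
        total = 0
        for c in range(n_chunks):
            total += num[c, v]
        if total > best:
            best = total
    return best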

For handling floats (or negative values) with Numba:

@nb.njit("int8(float64[:])", fastmath=True)
def shortened_float(x):
    num = np.zeros(x.size, dtype=np.int8)
    for k in x:
        for j in range(x.shape[0]):
            if k == x[j]:
                num[j]  = 1

    return num.max()
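
For illustration (made-up values):

x = np.array([0.5, 0.5, -1.0, 0.5, 3.25], dtype=np.float64)
print(shortened_float(x))  # 3, since 0.5 occurs three times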

IMO, np.unique(x, return_counts=True)[1].max() is the best choice: it handles both integers and floats and is already very fast. Numba can be faster for integers (it depends on the data size; performance degrades for larger data, as far as I know because of the explicit loops rather than vectorized array operations), but for floats the Numba code would need further optimization to be competitive, and I don't think it can beat NumPy's unique, particularly for large data.

Note: np.bincount can handle only non-negative integers.
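
A small sketch (the helper name is mine) of how to extend the bincount approach to negative integers by shifting the data by its minimum, assuming the value range is small enough to allocate the histogram:

def bincount_max_count(x):
    # shift so the smallest value maps to bin 0, then count occurrences
    return np.bincount(x - x.min()).max()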
