I've noticed that np.unique
might get slower in some cases if not passing True
to return_index
parameter.
a = np.ones(shape = (1000, 50), dtype=int)
a[:,-7:] = [10000, -4750, -4750, 95, 95, 95, 95]
arr = np.cumsum(a.ravel())
%timeit np.unique(arr)
%timeit np.unique(arr, return_index=True)
%timeit np.unique(arr, return_index=True, return_inverse=True)
%timeit np.unique(arr, return_index=True, return_inverse=True, return_counts=True)
1.14 ms ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
711 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
955 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.3 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
It doesn't occur usually with other kinds of data. What is happening here?
CodePudding user response:
The difference occurs because the data in your example is sorted. unique
sorts the data, and when return_index=True
, a stable merge-sort algorithn is used. When merge-sort is applied to data that is already sorted, the algorithm will make just one pass through the data, so it is very fast.
For example, in the follow, arr
is an array of nondecreasing values:
In [10]: arr = np.random.randint(0, 3, size=50000).cumsum()
In [11]: arr
Out[11]: array([ 1, 3, 4, ..., 49892, 49892, 49894])
The default sort algorithm takes almost 8 times as long as the merge-sort:
In [12]: %timeit np.sort(arr)
386 µs ± 6.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [13]: %timeit np.sort(arr, kind='mergesort')
49.5 µs ± 708 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
You can see the code that ends up doing the actual work of finding the unique values here: https://github.com/numpy/numpy/blob/6ff787b93d46cca6d31c370cfd9543ed573a98fc/numpy/lib/arraysetops.py#L320-L361
CodePudding user response:
arr = np.round(np.random.rand(50000) * 50).astype('int')
%timeit np.unique(arr)
%timeit np.unique(arr, return_index=True)
%timeit np.unique(arr, return_index=True, return_inverse=True)
%timeit np.unique(arr, return_index=True, return_inverse=True, return_counts=True)
274 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
352 µs ± 539 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
397 µs ± 609 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
414 µs ± 555 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Data in your array are all unique. Random data with duplicates didn't reproduce this behavior. My best guess is NumPy has some optimization for near-all-unique arrays when return_index is specified.