numpy argsort slow performance-CodePudding

My sincere apologies, in advance if this question seems quite basic.

Given:

>>> import numpy as np
>>> import time
>>> A = np.random.rand( int(1e5), int(5e4) ) # large numpy array

Goal:

>>> bt=time.time(); B=np.argsort(A,axis=1);et=time.time();print(f"Took {(et-bt):.2f} s")

However, it takes quite long time to calculate an array of indices:

# Took 316.90 s

Question:

Is there any other time efficient ways to do this?

Cheers,

CodePudding user response：

The input array A has a shape of (100_000, 50_000) and contains np.float64 values by default. This means that you need 8 * 100_000 * 50_000 / 1024**3 = 37.2 Gio of memory just for this array. You also likely need the same amount of space for the output matrix B (which should contains items of type np.int64). This means you need a machine with at least 74.4 Gio, not to mentions the space required for the operating system (OS) and running software (so probably at least 80 Gio). If you do not have such a memory space, then your OS will use your storage device as a swap memory, which is much much slower.

Assuming you have such a memory space available, such a computation is very expensive. It is mainly due to the page faults when filling the B array, and also the fact that B is pretty huge as well as the default implementation of Numpy do the computation sequentially. You can speed the computation up using a parallel Numba code and a smaller output. Here is an example:

import numba as nb

@nb.njit('int32[:,::1](float64[:,::1])', parallel=True)
def fastSort(a):
    b = np.empty(a.shape, dtype=np.int32)
    for i in nb.prange(a.shape[0]):
        b[i,:] = np.argsort(a[i,:])
    return b

Note that the Numba's implementation of argsort is less efficient than the one of Numpy but the parallel version should be much faster if the target machine have mny cores and a good memory bandwidth.

Here are the results on my 6-core machine on a matrix of size (10_000, 50_000) (10 times smaller because I do not have 80 Gio of RAM):

Original implementation: 28.14 s
Sequential Numba:        38.70 s
Parallel Numba:           6.79 s

The resulting solution is thus 4.1 times faster.

Note that you could even use the type uint16 for the items of B in this specific case as the size of each line is less than 2**16 = 65536. This will likely not be significantly faster but it should save some additional memory. The resulting required memory will be 46.5 Gio. You can further reduce the amount of memory needed using the np.float32 type (often at the expense of a loss of accuracy).

If you want to improve the execution time further, then you need to implement a faster implementation of argsort for your specific needs using a low-level language like C or C . But be aware that beating Numpy is far from being easy if you are not an experienced programmer in such a language or not familiar with low-level optimizations. If you are interested in such a solution, a good start is probably to implement a radix sort.