Faster numpy array indexing when using condition (numpy.where)?


I have a huge numpy array with shape (50000000, 3) and I'm using:

x = array[np.where((array[:,0] == value) | (array[:,1] == value))]

to get the part of the array that I want. But this way seems to be quite slow. Is there a more efficient way of performing the same task with numpy?

CodePudding user response:

np.where is highly optimized and I doubt anyone can write faster code than the one in the latest Numpy version (disclaimer: I am one of the people who optimized it). That being said, the main issue here is not so much np.where as the conditional expression, which creates a temporary boolean array. This is unfortunately the idiomatic way to do it in Numpy, and there is not much you can do about it as long as you use only Numpy with the same input layout.

One reason it is not very efficient is that the input data layout is inefficient. Indeed, assuming array is stored contiguously in memory using the default row-major ordering, array[:,0] == value reads only 1 item out of every 3 in memory. Due to the way CPU caches work (i.e. cache lines, prefetching, etc.), 2/3 of the memory bandwidth is wasted. Moreover, the output boolean array also needs to be written, and filling a newly created array is a bit slow due to page faults. Note that array[:,1] == value will almost certainly reload the data from RAM because of the size of the input (which cannot fit in most CPU caches). RAM is slow, and it is getting slower relative to the computational speed of CPUs and caches. This problem, called the "memory wall", was observed a few decades ago and is not expected to be fixed any time soon.

Also note that the logical-or creates yet another array that is read from and written to RAM. A better data layout is a (3, 50000000) transposed array that is contiguous in memory (note that np.transpose alone does not produce a contiguous array; it only returns a view).
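For illustration, here is a minimal sketch (not from the original answer) of what such a transposed, contiguous layout could look like; array and value are assumed to be the same objects as in the question, and the one-off np.ascontiguousarray copy is itself expensive, so this only pays off if the data can be produced in this layout from the start:

import numpy as np

# Hypothetical sketch: build a C-contiguous (3, 50000000) array so that
# each of the three rows is stored contiguously in memory.
array_t = np.ascontiguousarray(array.T)  # one-off copy, itself costly

# The same selection now streams through two contiguous rows instead of
# reading strided columns.
mask = (array_t[0] == value) | (array_t[1] == value)
x = array_t[:, mask].T  # transpose back to match the original row layout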

Another reason for the performance issue is that Numpy tends not to be well optimized for operating on very small axes.

One main solution is to create the input in a transposed layout if possible. Another solution is to write Numba or Cython code. Here is a Numba implementation operating on the non-transposed input:

import numpy as np
import numba as nb

# Compilation for the most frequent types.
# Please pick the right ones so to speed up the compilation time. 
@nb.njit(['(uint8[:,::1],uint8)', '(int32[:,::1],int32)', '(int64[:,::1],int64)', '(float64[:,::1],float64)'], parallel=True)
def select(array, value):
    n = array.shape[0]
    mask = np.empty(n, dtype=np.bool_)
    for i in nb.prange(n):
        mask[i] = array[i, 0] == value or array[i, 1] == value
    return mask

x = array[select(array, value)]

Note that I used a parallel implementation since the or operator is sub-optimal with Numba (the only solution seems to be native code or Cython) and also because the RAM cannot be fully saturated with one thread on some platforms, such as computing servers. Also note that it can be faster to use array[np.where(select(array, value))[0]], depending on the result of select. Indeed, if the matching rows are randomly scattered or very few, then np.where can be faster since it has special optimizations for these cases that boolean indexing does not perform.

Note that np.where is not particularly optimized inside a Numba function, since Numba uses its own implementation of Numpy functions and they are sometimes not as well optimized for large arrays. A faster implementation would build x in parallel, but this is not trivial to do with Numba since the number of output items is not known ahead of time and each thread must know where to write its data, not to mention that Numpy is already fairly fast at doing this sequentially as long as the output size is predictable.
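As a complement, here is a hedged sketch of what the same kernel could look like when the data is already stored in the transposed (3, 50000000) layout discussed above; select_transposed is a hypothetical name and the type signatures are assumptions that should be adjusted to the actual dtypes:

import numpy as np
import numba as nb

# Sketch for a C-contiguous (3, n) input (an assumption, not the author's code).
@nb.njit(['(int64[:,::1], int64)', '(float64[:,::1], float64)'], parallel=True)
def select_transposed(array_t, value):
    n = array_t.shape[1]
    mask = np.empty(n, dtype=np.bool_)
    row0 = array_t[0]  # contiguous 1D views of the two columns of interest
    row1 = array_t[1]
    for i in nb.prange(n):
        mask[i] = row0[i] == value or row1[i] == value
    return mask

x = array_t[:, select_transposed(array_t, value)].T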
