When passing a numpy.ndarray of uint8 to numpy.logical_and, it runs significantly faster if I apply ndarray.view(bool) to its inputs.
a = np.random.randint(0, 255, 1000 * 1000 * 100, dtype=np.uint8)
b = np.random.randint(0, 255, 1000 * 1000 * 100, dtype=np.uint8)
%timeit np.logical_and(a, b)
126 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.logical_and(a.view(bool), b.view(bool))
20.9 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Can someone explain why this is happening?
Furthermore, why doesn't numpy.logical_and automatically apply view(bool) to an array of uint8? (Is there any situation where we shouldn't use view(bool)?)
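For what it's worth, view only reinterprets the underlying bytes in place, so I can see at least two caveats. A quick illustrative check (I am assuming nonzero bytes read back as True, which seems to hold on my build):

import numpy as np

x8 = np.array([0, 1, 2], dtype=np.uint8)
print(x8.view(bool))         # [False  True  True] -- nonzero bytes read back as True,
                             # though numpy only strictly defines bool arrays holding 0 or 1

x16 = np.arange(4, dtype=np.uint16)
print(x16.view(bool).shape)  # (8,) -- raw bytes are reinterpreted, so the length doubles
                             # and the result is meaningless for dtypes wider than 1 byte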
EDIT:
It seems that this is an issue with the Windows environment. I just tried the same thing in the official Python Docker container (which is Debian-based) and found no difference between the two.
My environment:
- OS: Windows 10 Pro 21H2
- CPU: AMD Ryzen 9 5900X
- Python: Python 3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)] on win32
- numpy: 1.22.2
CodePudding user response:
This is a performance issue of the current Numpy implementation. I can also reproduce this problem on Windows (using an Intel Skylake Xeon processor with Numpy 1.20.3). np.logical_and(a, b) executes very inefficient scalar assembly code based on slow conditional jumps, while np.logical_and(a.view(bool), b.view(bool)) executes relatively fast SIMD instructions.

Currently, Numpy uses a specific implementation for bool types. The general-purpose implementation can be significantly slower if the compiler used to build Numpy fails to automatically vectorize the code, which is apparently the case on Windows (and explains why this is not the case on other platforms, since the compiler is likely not exactly the same). The Numpy code can be improved for non-bool types. Note that the vectorization of Numpy is an ongoing work and we plan to optimize this soon.
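In the meantime, a user-side workaround is to do the reinterpretation manually. A minimal sketch (my own, not a Numpy API): it relies on uint8 and bool both being 1 byte wide, and note that strictly speaking Numpy only defines bool arrays holding 0 or 1, so comparing against zero is the always-defined spelling, at the cost of temporary copies:

import numpy as np

a = np.random.randint(0, 255, 10_000_000, dtype=np.uint8)
b = np.random.randint(0, 255, 10_000_000, dtype=np.uint8)

# Zero-copy reinterpretation: only valid because uint8 and bool are both 1 byte wide
res_view = np.logical_and(a.view(bool), b.view(bool))

# Always-defined alternative: != 0 materializes genuine 0/1 booleans (two temporaries)
res_safe = np.logical_and(a != 0, b != 0)

assert np.array_equal(res_view, res_safe)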
Deeper analysis
Here is the assembly code executed by np.logical_and(a, b):
Block 24:
cmp byte ptr [r8], 0x0 ; Read a[i]
jz <Block 27> ; Jump to block 27 if a[i]==0
Block 25:
cmp byte ptr [r9], 0x0 ; Read b[i]
jz <Block 27> ; Jump to block 27 if b[i]==0
Block 26:
mov al, 0x1 ; al = 1
jmp <Block 28> ; Skip the next instruction
Block 27:
xor al, al ; al = 0
Block 28:
mov byte ptr [rdx], al ; result[i] = al
inc r8 ; i += 1
inc rdx
inc r9
sub rcx, 0x1
jnz <Block 24> ; Loop again while i<a.shape[0]
As you can see, the loop uses several data-dependent conditional jumps per item of a and b read. This is very inefficient here, since with random values the processor cannot predict which branch will be taken. As a result, the processor stalls for a few cycles on each misprediction (typically about 10 cycles on modern x86 processors).
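If branch misprediction is indeed the culprit, feeding the same scalar loop perfectly predictable data should hide most of the cost. A quick sketch to check this (the exact timings depend on the machine and the Numpy build):

import numpy as np
from timeit import timeit

n = 1000 * 1000 * 100
a = np.random.randint(0, 255, n, dtype=np.uint8)  # branches effectively random
b = np.random.randint(0, 255, n, dtype=np.uint8)
z = np.zeros(n, dtype=np.uint8)                   # first branch always taken

print(timeit(lambda: np.logical_and(a, b), number=10))  # mispredicts constantly
print(timeit(lambda: np.logical_and(z, z), number=10))  # predicted perfectly

On an affected build, the all-zero case should run several times faster even though it executes the same scalar loop.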
Here is the assembly code executed by np.logical_and(a.view(bool), b.view(bool)):
Block 15:
movdqu xmm1, xmmword ptr [r10] ; xmm1 = a[i:i+16]
movdqu xmm0, xmmword ptr [rbx+r10*1] ; xmm0 = b[i:i+16]
lea r10, ptr [r10+0x10] ; i += 16
pcmpeqb xmm1, xmm2 ; \
pandn xmm1, xmm0 ; | Complex sequence to just do:
pcmpeqb xmm1, xmm2 ; | xmm1 &= xmm0
pandn xmm1, xmm3 ; /
movdqu xmmword ptr [r14+r10*1-0x10], xmm1 ; result[i:i+16] = xmm1
sub rcx, 0x1
jnz <Block 15> ; Loop again while i!=a.shape[0]//16
This code uses the SIMD instruction set called SSE, which works on 128-bit wide registers. There are no data-dependent conditional jumps. This code is far more efficient, as it operates on 16 items at once, and each iteration should be much faster.
Note that this last code is not optimal either, as most modern x86 processors (like your AMD one) support the 256-bit AVX-2 instruction set (potentially twice as fast). Moreover, the compiler generates an inefficient sequence of SIMD instructions to perform the logical AND that could be optimized: it seems to assume the booleans can hold values other than 0 or 1. That being said, the input arrays are too big to fit in your CPU cache, so this version is bound by the throughput of your RAM, unlike the first one. This is why the SIMD-friendly code is not drastically faster. The difference between the two versions should be much bigger with arrays smaller than 1 MiB on your processor (as on almost all other modern processors).
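To check the memory-bound claim, one can rerun the comparison with a size that fits in the CPU cache. A sketch (the exact ratio depends on the CPU and the Numpy build):

import numpy as np
from timeit import timeit

for n in (100 * 1000, 100 * 1000 * 1000):  # ~0.1 MiB per array vs ~100 MiB per array
    a = np.random.randint(0, 255, n, dtype=np.uint8)
    b = np.random.randint(0, 255, n, dtype=np.uint8)
    t_u8 = timeit(lambda: np.logical_and(a, b), number=10)
    t_bool = timeit(lambda: np.logical_and(a.view(bool), b.view(bool)), number=10)
    print(f"n={n}: bool view is {t_u8 / t_bool:.1f}x faster")

With cache-resident arrays, the SIMD loop is no longer limited by RAM bandwidth, so the gap should widen considerably.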