No speedup when summing uint16 vs uint64 arrays with NumPy?-CodePudding

I have to do a large number of operations (additions) on relatively small integers, and I started considering which datatype would give the best performance on a 64 bit machine.

I was convinced that adding together 4 uint16 would take the same time as one uint64, since the ALU could make 4 uint16 additions using only 1 uint64 adder. (Carry propagation means this doesn't work that easily for a single 64-bit adder, but this is how integer SIMD instructions work.)

Apparently this is not the case:

In [3]: data = np.random.rand(10000)

In [4]: int16 = data.astype(np.uint16)

In [5]: int64 = data.astype(np.uint64)

In [6]: int32 = data.astype(np.uint32)

In [7]: float32 = data.astype(np.float32)

In [8]: float64 = data.astype(np.float64)

In [9]: %timeit int16.sum()
13.4 µs ± 43.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [10]: %timeit int32.sum()
13.9 µs ± 347 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [11]: %timeit int64.sum()
9.33 µs ± 47.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [12]: %timeit float32.sum()
5.79 µs ± 6.51 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [13]: %timeit float64.sum()
6 µs ± 3.54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

All int operations take the same time (more than the float ops) with no speedup. Is this due the C implementation of numpy not being fully optimized or is there some hardware limitation that I am not aware of?

CodePudding user response：

TL;DR: I made an experimental analysis on Numpy 1.21.1. Experimental results show that np.sum does NOT (really) make use of SIMD instructions: no SIMD instruction are used for integers, and scalar SIMD instructions are used for floating-point numbers! Moreover, Numpy converts the integers to 64-bits values for smaller integer types by default so to avoid overflows!

Note that this may not reflect all Numpy versions since there is an ongoing work to provide SIMD support for commonly used functions (the version Numpy 1.22.0rc1 not yet released continue this long-standing work). Moreover, the compiler or the processor used may significantly impact the results. The following experiments have been done using a Numpy retrieved from pip on a Debian Linux with a i5-9600KF processor.

Under the hood of `np.sum`

For floating-point numbers, Numpy uses a pairwise algorithm which is known to be quite numerically stable while being relatively fast. This can be seen in the code, but also simply using a profiler: TYPE_pairwise_sum is the C function called to compute the sum at runtime (where TYPE is DOUBLE or FLOAT).

For integers, Numpy use a classical naive reduction. The C function called is ULONG_add_avx2 on AVX2-compatible machines. It also surprisingly convert items to 64-bit ones if the type is not np.int64.

Here is the hot part of the assembly code executed by the DOUBLE_pairwise_sum function

  3,65 │ a0:┌─→add        $0x8,%rcx                  ; Loop iterator
  3,60 │    │  prefetcht0 (%r8,%rax,1)               ; prefetch data
  9,46 │    │  vaddsd     (%rax,%rbp,1),%xmm1,%xmm1  ; Accumulate an item in xmm1[0]
  4,65 │    │  vaddsd     (%rax),%xmm0,%xmm0         ; Same for xmm0[0]
  6,91 │    │  vaddsd     (%rax,%rdx,1),%xmm4,%xmm4  ; Etc.
  7,77 │    │  vaddsd     (%rax,%rdx,2),%xmm7,%xmm7
  7,41 │    │  vaddsd     (%rax,%r10,1),%xmm2,%xmm2
  7,27 │    │  vaddsd     (%rax,%rdx,4),%xmm6,%xmm6
  6,80 │    │  vaddsd     (%rax,%r11,1),%xmm3,%xmm3
  7,46 │    │  vaddsd     (%rax,%rbx,1),%xmm5,%xmm5
  3,46 │    │  add        %r12,%rax                  ; Switch to the next items (x8)
  0,13 │    ├──cmp        %rcx,%r9                   ; Should the loop continue?
  3,27 │    └──jg         a0                         ; Jump to the beginning if so

The Numpy compiled code clearly uses the scalar SIMD instruction vaddsd (computing only a single double-precision item) although it successfully unroll the loop 8 times. The same code is generated for FLOAT_pairwise_sum: vaddss is called 8 times.

For the np.uint32, here is the hot part of the generated assembly code:

  2,37 │160:┌─→add          $0x1,%rax     ; loop iterator
 95,95 │    │  add          (%rdi),%rdx   ; Accumulate the values in %rdx
  0,06 │    │  add          %r10,%rdi     ; switch to the next item
       │    ├──cmp          %rsi,%rax     ; Should the loop
  1,08 │    └──jne          160           ; Jump to the beginning if so

Numpy obviously does not use SIMD instructions for np.uint32 type. It does not even unroll the loop. The add (%rdi),%rdx instruction performing the accumulation is the bottleneck here due to the data dependency on the accumulator. The same loops can be seen for np.uint64(despite the name of the function isULONG_add_avx2`).

However, the np.uint32 version call the C function _aligned_contig_cast_uint_to_ulong in order to convert the integer items to a wider type. Numpy does that to avoid integer overflows. The same thing can be seen for the types np.uint8 and np.uint16 (although the name of the function differs). Hopefully, this function makes use of SIMD instructions (SSE) but still takes a significant portion of the execution time (~30% of the np.sum time).

EDIT: as pointed out by @user2357112supportsMonica, the dtype parameter of np.sum can be explicitly specified. When it match with the dtype of the input array, then the conversion is not performed. This results in a smaller execution time at the expense of a possible overflow.

Benchmark results

Here is the result on my machine:

uint16: 7.17 µs ± 80 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
uint32: 7.11 µs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
uint64: 5.05 µs ± 8.57 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
float32: 2.88 µs ± 9.27 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)
float64: 3.06 µs ± 10.6 ns per loop (mean ± std. dev. of 7 runs, 20000 loops each)

First of all, not that the results are very similar to the ones provided in the question meaning that the behavior seen on my machine can be successfully reproduced on other one. Thus, the explanation should also be consistent.

As you can see, the 64-bit version is faster than the other integer-based versions. This is due to the overhead of the conversion. The two first are equally fast because of the scalar loop and the add instruction being equally-fast for 8-bit, 16-bit and 32-bit integers (this should be true for most 64-bit mainstream platforms). The integer implementations are slower than the floating-point ones because of the lack of (proper) loop unrolling.

The floating-point implementations are equally-fast due to the scalar SIMD instructions. Indeed, the instructions vaddss (for np.float32) and vaddsd (for np.float64) have the same latency and throughput (at least on all modern Intel processors). Thus the throughput of the two implementation is the same since the loop of the two implementation is similar (same unrolling).

Additional notes

This is sad np.sum does not fully make use of SIMD instructions as this would speed up a lot the computations using it (especially small integers). However, several points make it hard for both compilers and developers to use SIMD instructions:

the type of the accumulator is not the same as the one of the items (cf. overflows);
the floating-point addition is not commutative, nor associative (unless flags like -ffast-math are used);
SIMD instructions are not portable (the cost, the length and the capabilities differ between architectures).

Under the hood of np.sum

Benchmark results

Additional notes

Under the hood of `np.sum`