We know that numpy arrays are stored in C order, so .sum(axis=1) should be faster than .sum(axis=0). But I find that
arr = np.ones([64, 64])
arr.flags
%timeit arr.sum(axis=1) # 4.15 µs ± 97.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.sum(axis=0) # 4.67 µs ± 188 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But when the size changes to 10000
arr = np.ones([10000, 10000])
arr.flags
%timeit arr.sum(axis=1) # 109 ms ± 2.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr.sum(axis=0) # 57.5 ms ± 2.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
CodePudding user response:
First of all, I cannot reproduce the effect with the latest version of Numpy (dev) nor with version 1.21.5. I get 30.5 ms and 36.5 ms respectively (so the opposite behaviour). Thus, the results are likely dependent on the hardware.
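If you want to check this on your own machine, here is a small plain-Python script reproducing the benchmark from the question with the timeit module (nothing beyond standard NumPy is assumed):

import numpy as np
import timeit

arr = np.ones([10000, 10000])
print(arr.flags['C_CONTIGUOUS'])  # True: rows are contiguous in memory

# Time both reductions; the ratio depends on the NumPy version and on
# the hardware (SIMD support, cache sizes, memory throughput).
for axis in (1, 0):
    t = timeit.timeit(lambda: arr.sum(axis=axis), number=10) / 10
    print(f"axis={axis}: {t * 1e3:.1f} ms per call")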
On one hand, arr.sum(axis=1) uses a pairwise summation algorithm. This algorithm is not so simple to implement efficiently. Numpy unrolls the leaf computation to speed it up, but it is not manually vectorized. In fact, the compiler (GCC) also fails to vectorize it automatically because it is far from trivial. Numpy also benefits from prefetching instructions to speed the computation up and make it more cache friendly. That being said, this is not very useful on modern processors. The cache accesses are not perfect, but the L1 cache and cache lines reduce the overhead of the non-contiguous accesses.
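For reference, here is a simplified pure-Python sketch of what pairwise summation does (NumPy's real implementation is in C with a larger unrolled leaf block, so this only illustrates the recursion, not the performance):

import numpy as np

def pairwise_sum(x, leaf_size=8):
    # Below the leaf size, fall back to a plain sequential sum.
    if len(x) <= leaf_size:
        return sum(x)
    # Otherwise split the block in two halves and sum each half
    # recursively, which keeps the rounding error low compared to
    # a naive left-to-right loop.
    half = len(x) // 2
    return pairwise_sum(x[:half], leaf_size) + pairwise_sum(x[half:], leaf_size)

row = np.ones(10000)
print(pairwise_sum(row), row.sum())  # 10000.0 10000.0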
On the other hand, arr.sum(axis=0) is fully manually vectorized. This is the case because it is easier to vectorize an operation vertically (e.g. array + array) than horizontally (e.g. a reduction). On a recent x86-64 architecture with AVX/AVX-2, it should make use of this SIMD instruction set, which is capable of computing 4 doubles at a time. That being said, the operation is not unrolled, and the result is accumulated in an array that can be bigger than the L1 cache. Indeed, the reduction is done row by row and each row takes 80 KB, while for example my i5-9600KF processor has a 32 KB L1 cache. This means data has to be constantly reloaded from the L2 cache in this case.
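To make that access pattern concrete, here is a rough NumPy model of the row-by-row accumulation (an assumption about the general strategy, not the actual C implementation):

import numpy as np

arr = np.ones([10000, 10000])

# Accumulate each row into a running result. The inner operation
# (acc += row) is a simple element-wise add, which is easy to
# vectorize with SIMD, but acc and each row take 10000 * 8 bytes
# = 80 KB, more than a 32 KB L1 cache, so data keeps being
# reloaded from the L2 cache.
acc = np.zeros(arr.shape[1])
for row in arr:
    acc += row

print(np.allclose(acc, arr.sum(axis=0)))  # True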
Depending on the target architecture of the processor, the cache size and the memory, results can differ. Theoretically, nothing prevents Numpy from implementing code that saturates the memory throughput, but as of now, the former use-case is more complex to optimize. Since the Numpy code is already quite complex and the operation tends to be memory-bound on most machines (at least for a given core), we have not optimized the former case yet.