Numba parallelization with prange is slower when used more threads

Time: 06-15

I tried some simple code to parallelize a loop with Numba and prange. But for some reason, when I use more threads, instead of getting faster it gets slower. Why is this happening? (CPU: Ryzen 7 2700X, 8 cores / 16 threads, 3.7 GHz)

import numpy as np
from numba import njit, prange, set_num_threads, get_num_threads

@njit(parallel=True, fastmath=True)
def test1():
    x = np.empty((10, 10))
    for i in prange(10):
        for j in range(10):
            x[i, j] = i * j  # operator garbled in the original; '*' assumed
Number of threads : 1
897 ns ± 18.3 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 2
1.68 µs ± 262 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 3
2.4 µs ± 163 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 4
4.12 µs ± 294 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 5
4.62 µs ± 283 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 6
5.01 µs ± 145 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 7
5.52 µs ± 194 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 8
4.85 µs ± 140 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 9
6.47 µs ± 348 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 10
6.88 µs ± 120 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 11
7.1 µs ± 154 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 12
7.47 µs ± 159 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 13
7.91 µs ± 160 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 14
9.04 µs ± 472 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 15
9.74 µs ± 581 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 16
11 µs ± 967 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
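For reference, the output lines above follow timeit's "mean ± std. dev." format. A minimal standard-library harness along these lines can produce such measurements (a sketch: the `bench` name and parameters are illustrative, not part of the original benchmark; for the Numba case one would call `set_num_threads(n)` before timing `test1`, as noted in the comment):

```python
import statistics
import timeit

def bench(fn, runs=10, loops=10_000):
    """Time fn like IPython's %timeit: mean +/- std. dev. over `runs`,
    each run averaging `loops` calls. Returns (mean_ns, std_ns)."""
    per_call = [timeit.timeit(fn, number=loops) / loops for _ in range(runs)]
    mean_ns = statistics.mean(per_call) * 1e9
    std_ns = statistics.stdev(per_call) * 1e9
    return mean_ns, std_ns

# For the Numba benchmark, per thread count n, one would first call
# set_num_threads(n) and then bench(test1); here a trivial stand-in:
mean_ns, std_ns = bench(lambda: sum(range(100)))
print(f"{mean_ns:.3g} ns ± {std_ns:.3g} ns per loop")
```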

CodePudding user response:

This is completely normal. Numba needs to create threads and distribute the work between them so they can execute the computation in parallel. Numba can use different threading backends; the default is generally OpenMP, and the default OpenMP implementation is usually IOMP (the OpenMP runtime of ICC/Clang), which tries to create threads only once. Still, sharing the work between threads is far more expensive than iterating over 100 values: a modern mainstream processor can execute the two nested loops sequentially in less than 0.1-0.2 µs, and Numba should also be able to unroll both loops. The overhead of calling a Numba function is itself generally a few hundred nanoseconds, and allocating the NumPy array should take far longer than the loops themselves.

Additionally, there are other overheads that make this code significantly slower with multiple threads even if the above were negligible. For example, false sharing causes the writes to be mostly serialized, and thus slower than if they were done by one single thread (because of a cache-line bouncing effect operating on the LLC on x86-64 platforms).
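The dispatch cost can be seen even without Numba. Below is an illustrative standard-library sketch (not the original benchmark, and assuming the intended expression in the question was `i * j`) that fills the same 10×10 grid sequentially and by handing rows to a thread pool. The pooled version does identical work, but for 100 tiny writes the cost of dispatching tasks dominates, so it measures slower:

```python
from concurrent.futures import ThreadPoolExecutor
import timeit

N = 10

def fill_serial():
    # Plain sequential double loop, analogous to the Numba kernel.
    x = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            x[i][j] = i * j
    return x

def fill_row(i):
    return [i * j for j in range(N)]

def fill_threaded(pool):
    # Same result, but each row is dispatched to a worker thread.
    return list(pool.map(fill_row, range(N)))

with ThreadPoolExecutor(max_workers=4) as pool:
    assert fill_serial() == fill_threaded(pool)
    t_serial = timeit.timeit(fill_serial, number=500)
    t_pool = timeit.timeit(lambda: fill_threaded(pool), number=500)
    # For such tiny work, dispatch overhead dominates: t_pool >> t_serial.
```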

Note that the time to create a thread is generally significantly more than 1 µs, because a system call is required.
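That creation cost is easy to measure with the standard library (an illustrative sketch; the exact figure varies by OS, interpreter, and load):

```python
import threading
import time

def spawn_and_join():
    # Creating an OS thread involves a system call (e.g. clone on Linux).
    t = threading.Thread(target=lambda: None)
    t.start()
    t.join()

start = time.perf_counter()
for _ in range(100):
    spawn_and_join()
elapsed_us = (time.perf_counter() - start) / 100 * 1e6
print(f"~{elapsed_us:.1f} µs per thread create+join")
```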

Put shortly: use threads only when the work to do is big enough and can be parallelized efficiently.
