Julia: parallelise function of vectors

This is a simplified version of my code:

function simula(v)
    a = zeros(length(v))
    for i in 1:1000000
        a -= v
        a[1] = 3.

        a = a.^2 
    end
end

v = ones(100)
_ = simula(v) #compile
@time Threads.@threads for i in 1:10
    simula(v)
end

If I run this code with 1, 2 or 4 threads, nothing changes in terms of time. Do you know why? My intuition is that the operations that involve vectors already use multithreading. Is there a way of parallelising it? My goal is to run the function simula on a cluster with 100s of cores. Thanks in advance.

CodePudding user response：

By default, Julia uses only 1 thread. You can check this with the task manager of your operating system. To use more, set the JULIA_NUM_THREADS environment variable before starting Julia (set it to the number of cores). On Windows, you can do that in the default batch interpreter with set JULIA_NUM_THREADS=4 (for 4 threads). That being said, the program scales poorly.
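A quick way to confirm that the setting took effect is to ask Julia itself. The snippet below is a minimal sketch, assuming Julia 1.5 or later, where the `--threads` command-line flag is available as an alternative to the environment variable; the `ids` array is only an illustrative name.

```julia
# Start Julia with e.g. `julia --threads 4 script.jl`,
# or set JULIA_NUM_THREADS=4 before launching.
using Base.Threads

println("Threads available: ", nthreads())

# Record which thread handled each iteration (illustrative only).
ids = zeros(Int, 10)
@threads for i in 1:10
    ids[i] = threadid()
end
println("Thread ids used: ", sort(unique(ids)))
```

If this prints a single thread, the loop in the question runs serially no matter how the code is written.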

The main reason it does not scale is the garbage collector (GC), which is put under pressure. Indeed, the serial program spends 22% of its time in the GC, while the parallel version using 4 threads spends 45% (on a 6-core machine). This is a lot, and it indicates too many allocations: a -= v and a = a.^2 each create a new vector on every iteration. You can rewrite your code using plain loops; it is then faster sequentially and also scales much better:

function simula(v)
    a = zeros(length(v))
    for i in 1:1000000
        for j in 1:length(a)
            a[j] -= v[j]
        end

        a[1] = 3.

        for j in 1:length(a)
            a[j] *= a[j]
        end
    end
end

A simpler implementation uses the dot-based (in-place broadcasting) operators, as pointed out by @OscarSmith:

function simula(v)
    a = zeros(length(v))
    for i in 1:1000000
        a .-= v
        a[1] = 3.
        a .= a.^2
    end
end

The serial version is about 15 times faster than before, and the parallel version using 4 threads is about 40 times faster than before. With 6 threads, it is about 50 times faster than before. The parallel version is 3.7 times faster than the serial one. This is not too bad given the small number of iterations.
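To see where the difference comes from, one can compare how much memory a single update step allocates in each style. This is only a sketch: step_alloc and step_inplace! are hypothetical helper names that isolate one iteration of the loop, and the exact byte counts depend on the Julia version.

```julia
# Allocating update, as in the question: each step creates new arrays.
function step_alloc(a, v)
    a -= v          # same as a = a - v: allocates a new vector
    a[1] = 3.0
    a = a.^2        # allocates another new vector
    return a
end

# In-place update, as in the broadcast version: reuses the memory of `a`.
function step_inplace!(a, v)
    a .-= v
    a[1] = 3.0
    a .= a.^2       # fused broadcast, writes directly into `a`
    return a
end

v = ones(100)
a = zeros(100)

# Call both once first so compilation is not counted.
step_alloc(a, v)
step_inplace!(copy(a), v)

bytes_alloc = @allocated step_alloc(a, v)
bytes_inplace = @allocated step_inplace!(a, v)
println("allocating step: ", bytes_alloc, " bytes")
println("in-place step:   ", bytes_inplace, " bytes")
```

Multiplied by a million iterations per call and several threads, the allocating version keeps the GC busy, which fits the GC percentages reported above.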

Note that the program does nothing, and compilers can often optimize out code that does nothing (i.e. remove such parts from the program). You should always benchmark code with visible effects to avoid biases in the benchmark (and misleading conclusions).
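One way to give the benchmark a visible effect is to return a value from the function and store it. The sketch below is an assumption about how that could look, not part of the original answer: simula_sum, the keyword n, and the results array are names introduced here for illustration.

```julia
using Base.Threads

# Same in-place loop as the broadcast version, but returning a value
# so that the work is observable. `n` is a hypothetical parameter
# added to make the iteration count adjustable.
function simula_sum(v; n = 1_000_000)
    a = zeros(length(v))
    for i in 1:n
        a .-= v
        a[1] = 3.0
        a .= a.^2
    end
    return sum(a)
end

v = ones(100)
simula_sum(v; n = 1)            # compile before timing

results = zeros(10)             # one slot per iteration, so no data race
@time @threads for i in 1:10
    results[i] = simula_sum(v)
end
println(results)
```

Each iteration writes to its own element of results, so the threads never touch the same memory location and no locking is needed.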