I'm trying to understand multi-threading behavior in Julia and I've noticed the following two blocks of code behave differently in Julia v1.6.3 (I'm running Julia in Atom in some script.jl):
acc = 0
Threads.@threads for i in 1:1000
    global acc
    println(Threads.threadid(), ",", acc)
    acc += 1
end
acc
and
acc = 0
Threads.@threads for i in 1:1000
    global acc
    acc += 1
end
acc
Notice the only difference is that I removed println(Threads.threadid(), ",", acc) in the latter case. As a result, the first block gives me 1000 every time I run it, while the second block gives me some number < 1000 (due to a race condition).
I'm entirely new to Julia's parallel computing (or parallel computing in general), so I would appreciate any explanation of what is happening here and why a single print line changes the behavior of the code block.
CodePudding user response:
You have several threads mutating the shared state acc at the same time, so you end up with a race condition. However, println takes relatively long compared to the addition, and only one println executes at a time, which largely serializes the loop; hence, for small loops, you have a good chance of observing a "correct" result. Both of your loops are incorrect, though.
When mutating exactly the same shared state from many threads, you need either to introduce locking or to use an atomic variable.
- For fast, short-running loops, use SpinLock, as in:
julia> acc = 0;
julia> u = Threads.SpinLock();
julia> Threads.@threads for i in 1:1000
           global acc
           Threads.lock(u) do
               acc += 1
           end
       end
julia> acc
1000
- The second option is ReentrantLock, which is generally better for longer-running loops (acquiring it takes much longer than a SpinLock) with heterogeneous times within loop steps (it does not burn CPU time "spinning" like the SpinLock does):
julia> acc = 0
0
julia> u = ReentrantLock();
julia> Threads.@threads for i in 1:1000
           global acc
           Threads.lock(u) do
               acc += 1
           end
       end
julia> acc
1000
- If you are mutating a primitive value (as in your case), atomic operations will be the fastest (please note how I read the value out of the Atomic with acc2[]):
julia> acc2 = Threads.Atomic{Int}(0)
Base.Threads.Atomic{Int64}(0)
julia> Threads.@threads for i in 1:1000
           global acc2
           Threads.atomic_add!(acc2, 1)
       end
julia> acc2[]
1000
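One detail of the atomic API worth knowing: Threads.atomic_add! returns the value the Atomic held before the addition, which can be handy, e.g., for handing out unique indices to threads. A minimal illustration (my own toy variable a, not part of the code above):
julia> a = Threads.Atomic{Int}(10);
julia> Threads.atomic_add!(a, 5)   # mutates a in place, returns the old value
10
julia> a[]
15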
CodePudding user response:
You probably know this, but in real life all of this should be in a function; your performance will be disastrous if you use a global variable, and with a function you'd be miles ahead with just a single-threaded implementation. While users of "slow" programming languages often reach for parallelism immediately to speed performance, with Julia usually your best approach is to first analyze the performance of a single-threaded implementation (using tools like the profiler) and fix any problems you discover. Especially for newcomers to Julia, it's not uncommon to make your code ten- or a hundred-fold faster this way, and in such cases you might feel that's all you need.
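To make that workflow concrete, a minimal profiling session might look like the sketch below (the function work and the numbers are placeholders of mine, not part of the code in this thread). Profile is a standard library, so nothing extra needs to be installed:
julia> using Profile
julia> work(n) = sum(i % 2 for i in 1:n);
julia> @profile for _ in 1:100
           work(10_000_000)
       end
julia> Profile.print()   # per-line sample counts show where the time goes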
Indeed, sometimes the single-threaded implementation will be faster, because threading introduces its own overhead. We can illustrate that easily here. I'm going to make one modification to your code above: instead of adding 1 on each iteration, I'm going to add i % 2, which adds 1 if i is odd and 0 if i is even. I'm doing that because once you put this in a function, if all you do is add 1, Julia's compiler is smart enough to figure out what you're doing and just return the answer without actually running the loop; since we want the loop to run, we have to make it a bit trickier so the compiler can't figure out the answer ahead of time.
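To see that effect directly, here is a minimal sketch (the function name addone is mine): when the increment is the constant 1, @btime reports essentially the same tiny time regardless of n, a telltale sign that the loop has been replaced by a closed-form answer.
julia> using BenchmarkTools
julia> function addone(n)
           acc = 0
           for i in 1:n
               acc += 1    # the compiler can reduce this whole loop to acc = n
           end
           return acc
       end;
julia> @btime addone(1_000);       # expect nanoseconds, independent of n
julia> @btime addone(1_000_000);   # roughly the same time: the loop was folded away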
First, let's try the fastest of the threaded implementations above (I started Julia with julia -t4 to use 4 threads):
julia> using BenchmarkTools
julia> acc2 = Threads.Atomic{Int}(0)
Base.Threads.Atomic{Int64}(0)
julia> @btime Threads.@threads for i in 1:1000
           global acc2
           Threads.atomic_add!(acc2, i % 2)
       end
  12.983 μs (21 allocations: 1.86 KiB)
julia> @btime Threads.@threads for i in 1:1000000
           global acc2
           Threads.atomic_add!(acc2, i % 2)
       end
  27.532 ms (22 allocations: 1.89 KiB)
Is this fast or slow? Let's first put this in a function and see if it helps:
julia> function lockadd(n)
           acc = Threads.Atomic{Int}(0)
           Threads.@threads for i = 1:n
               Threads.atomic_add!(acc, i % 2)
           end
           return acc[]
       end
lockadd (generic function with 1 method)
julia> @btime lockadd(1000)
9.737 μs (22 allocations: 1.88 KiB)
500
julia> @btime lockadd(1000000)
13.356 ms (22 allocations: 1.88 KiB)
500000
So we've gained a factor of 2 (on the bigger job) by putting it in a function. However, an even better threading strategy is lock-free threading: give each thread its own acc and then add all the separate accs at the end.
julia> function threadedadd(n)
           accs = zeros(Int, Threads.nthreads())
           Threads.@threads for i = 1:n
               accs[Threads.threadid()] += i % 2
           end
           return sum(accs)
       end
threadedadd (generic function with 1 method)
julia> @btime threadedadd(1000)
2.967 μs (22 allocations: 1.97 KiB)
500
julia> @btime threadedadd(1000000)
56.852 μs (22 allocations: 1.97 KiB)
500000
For the longer loop, we've gained a speedup of more than 200x! That's a very nice speedup indeed.
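One refinement worth knowing, sketched below with my own function name chunkedadd: because every iteration writes into the small shared accs array, neighboring entries can contend for the same cache line ("false sharing"); accumulating into a thread-local variable and writing once per thread avoids that.
julia> function chunkedadd(n)
           nt = Threads.nthreads()
           partials = zeros(Int, nt)
           Threads.@threads for t in 1:nt
               s = 0               # thread-local accumulator
               for i in t:nt:n     # strided slice of 1:n for thread t
                   s += i % 2
               end
               partials[t] = s     # a single shared write per thread
           end
           return sum(partials)
       end;
julia> chunkedadd(1000)
500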
However, let's try a simple single-threaded implementation:
julia> function addacc(n)
           acc = 0
           for i in 1:n
               acc += i % 2
           end
           return acc
       end
addacc (generic function with 1 method)
julia> @btime addacc(1000)
43.218 ns (0 allocations: 0 bytes)
500
julia> @btime addacc(1000000)
41.068 μs (0 allocations: 0 bytes)
500000
This is 70x faster than the threaded implementation on the small job, and is faster even on the larger one. For completeness, let's compare that to the same code that uses global state:
julia> acc = 0;
julia> @btime for i in 1:1000
           global acc
           acc += i % 2
       end
  20.158 μs (1000 allocations: 15.62 KiB)
julia> @btime for i in 1:1000000
           global acc
           acc += i % 2
       end
  20.455 ms (1000000 allocations: 15.26 MiB)
Horrendous.
There are, of course, cases where parallelism makes a difference, but it's typically for much more complicated tasks. You still shouldn't use it unless you've already optimized a single-threaded implementation.
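To give a feel for such a case, here is a sketch of my own (the functions heavy and heavy_serial are hypothetical) in which each iteration does enough work to amortize the threading overhead:
julia> function heavy(n)
           accs = zeros(Float64, Threads.nthreads())
           Threads.@threads for i in 1:n
               accs[Threads.threadid()] += sum(sin(i * j) for j in 1:1_000)
           end
           return sum(accs)
       end;
julia> function heavy_serial(n)
           acc = 0.0
           for i in 1:n
               acc += sum(sin(i * j) for j in 1:1_000)
           end
           return acc
       end;
julia> @btime heavy(10_000);         # with julia -t4, expect a real speedup
julia> @btime heavy_serial(10_000);  # typically several times slower here
Here the cost of a thousand sin evaluations per iteration dwarfs the scheduling overhead, which is exactly when @threads starts to pay off.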
So the two important morals of the story:
- read Julia's performance tips, analyze the performance of your code, and fix any bottlenecks
- reach for parallelism only after you've exhausted all single-threaded options.