I'm trying to understand multi-threading behavior in Julia and I've noticed the following two blocks of code behave differently in Julia v1.6.3 (I'm running Julia in Atom in some script.jl):
acc = 0
Threads.@threads for i in 1:1000
    global acc
    println(Threads.threadid(), ",", acc)
    acc += 1
end
acc
and
acc = 0
Threads.@threads for i in 1:1000
    global acc
    acc += 1
end
acc
Notice the only difference is that I removed println(Threads.threadid(), ",", acc) in the latter case. As a result, the first block gives me 1000 every time I run it, while the second block gives me some number < 1000 (due to a race condition).
I'm entirely new to Julia's parallel computing (or parallel computing in general), so I would appreciate any explanation of what is happening here and why a single print line changes the behavior of the code block.
CodePudding user response:
You have several threads mutating the shared state acc at the same time, so you end up with a race condition. However, println takes relatively long compared to the addition, and only one println executes at a time, which largely serializes the loop; hence, for small loops, you have a good chance of observing a "correct" result. Both of your loops are incorrect, though.
When mutating exactly the same shared state from many threads, you need either to introduce locking or to use an atomic variable.
- For fast, short-running loops, use SpinLock, as in:
julia> acc = 0;
julia> u = Threads.SpinLock();
julia> Threads.@threads for i in 1:1000
           global acc
           Threads.lock(u) do
               acc += 1
           end
       end
julia> acc
1000
- The second option is ReentrantLock, which is generally better for longer-running loops (acquiring it takes much longer than a SpinLock) with heterogeneous times within loop steps (it does not burn CPU time "spinning" like the SpinLock does):
julia> acc = 0
0
julia> u = ReentrantLock();
julia> Threads.@threads for i in 1:1000
           global acc
           Threads.lock(u) do
               acc += 1
           end
       end
julia> acc
1000
- If you are mutating a primitive value (as in your case), atomic operations will be the fastest (please note how I read the value out of the Atomic with acc2[]):
julia> acc2 = Threads.Atomic{Int}(0)
Base.Threads.Atomic{Int64}(0)
julia> Threads.@threads for i in 1:1000
           global acc2
           Threads.atomic_add!(acc2, 1)
       end
julia> acc2[]
1000
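One detail of the atomic API worth knowing: Threads.atomic_add! returns the value the Atomic held before the addition, which can be handy, e.g., for handing out unique indices to threads. A minimal illustration (my own toy variable a, not part of the code above):
julia> a = Threads.Atomic{Int}(10);
julia> Threads.atomic_add!(a, 5)   # mutates a in place, returns the old value
10
julia> a[]
15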
CodePudding user response:
You probably know this, but in real life all of this should be in a function; your performance will be disastrous if you use a global variable, and with a function you'd be miles ahead with just a single-threaded implementation. While users of "slow" programming languages often reach for parallelism immediately to speed performance, with Julia usually your best approach is to first analyze the performance of a single-threaded implementation (using tools like the profiler) and fix any problems you discover. Especially for newcomers to Julia, it's not uncommon to make your code ten- or a hundred-fold faster this way, and in such cases you might feel that's all you need.
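To make that workflow concrete, a minimal profiling session might look like the sketch below (the function work and the numbers are placeholders of mine, not part of the code in this thread). Profile is a standard library, so nothing extra needs to be installed:
julia> using Profile
julia> work(n) = sum(i % 2 for i in 1:n);
julia> @profile for _ in 1:100
           work(10_000_000)
       end
julia> Profile.print()   # per-line sample counts show where the time goes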
Indeed, sometimes the single-threaded implementation will be faster, because threading introduces its own overhead. We can illustrate that easily here. I'm going to make one modification to your code above: instead of adding 1 on each iteration, I'm going to add i % 2, which adds 1 if i is odd and 0 if i is even. I'm doing that because once you put this in a function, if all you do is add 1, Julia's compiler is smart enough to figure out what you're doing and just return the answer without actually running the loop; since we want the loop to run, we have to make it a bit trickier so the compiler can't figure out the answer ahead of time.
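To see that effect directly, here is a minimal sketch (the function name addone is mine): when the increment is the constant 1, @btime reports essentially the same tiny time regardless of n, a telltale sign that the loop has been replaced by a closed-form answer.
julia> using BenchmarkTools
julia> function addone(n)
           acc = 0
           for i in 1:n
               acc += 1    # the compiler can reduce this whole loop to acc = n
           end
           return acc
       end;
julia> @btime addone(1_000);       # expect nanoseconds, independent of n
julia> @btime addone(1_000_000);   # roughly the same time: the loop was folded away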
First, let's try the fastest of the threaded implementations above (I started Julia with julia -t4 to use 4 threads):
julia> using BenchmarkTools
julia> acc2 = Threads.Atomic{Int}(0)
Base.Threads.Atomic{Int64}(0)
julia> @btime Threads.@threads for i in 1:1000
           global acc2
           Threads.atomic_add!(acc2, i % 2)
       end
  12.983 μs (21 allocations: 1.86 KiB)
julia> @btime Threads.@threads for i in 1:1000000
           global acc2
           Threads.atomic_add!(acc2, i % 2)
       end
  27.532 ms (22 allocations: 1.89 KiB)
Is this fast or slow? Let's first put this in a function and see if it helps:
julia> function lockadd(n)
           acc = Threads.Atomic{Int}(0)
           Threads.@threads for i = 1:n
               Threads.atomic_add!(acc, i % 2)
           end
           return acc[]
       end
lockadd (generic function with 1 method)
julia> @btime lockadd(1000)
9.737 μs (22 allocations: 1.88 KiB)
500
julia> @btime lockadd(1000000)
13.356 ms (22 allocations: 1.88 KiB)
500000
So we've gained a factor of 2 (on the bigger job) by putting it in a function. However, an even better threading strategy is lock-free threading: give each thread its own acc and then add all the separate accs at the end.
julia> function threadedadd(n)
           accs = zeros(Int, Threads.nthreads())
           Threads.@threads for i = 1:n
               accs[Threads.threadid()] += i % 2
           end
           return sum(accs)
       end
threadedadd (generic function with 1 method)
julia> @btime threadedadd(1000)
2.967 μs (22 allocations: 1.97 KiB)
500
julia> @btime threadedadd(1000000)
56.852 μs (22 allocations: 1.97 KiB)
500000
For the longer loop, we've gained a speedup of more than 200x! That's a very nice speedup indeed.
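One refinement worth knowing, sketched below with my own function name chunkedadd: because every iteration writes into the small shared accs array, neighboring entries can contend for the same cache line ("false sharing"); accumulating into a thread-local variable and writing once per thread avoids that.
julia> function chunkedadd(n)
           nt = Threads.nthreads()
           partials = zeros(Int, nt)
           Threads.@threads for t in 1:nt
               s = 0               # thread-local accumulator
               for i in t:nt:n     # strided slice of 1:n for thread t
                   s += i % 2
               end
               partials[t] = s     # a single shared write per thread
           end
           return sum(partials)
       end;
julia> chunkedadd(1000)
500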
However, let's try a simple single-threaded implementation:
julia> function addacc(n)
           acc = 0
           for i in 1:n
               acc += i % 2
           end
           return acc
       end
addacc (generic function with 1 method)
julia> @btime addacc(1000)
43.218 ns (0 allocations: 0 bytes)
500
julia> @btime addacc(1000000)
41.068 μs (0 allocations: 0 bytes)
500000
This is 70x faster than the threaded implementation on the small job, and is faster even on the larger one. For completeness, let's compare that to the same code that uses global state:
julia> acc = 0;
julia> @btime for i in 1:1000
           global acc
           acc += i % 2
       end
  20.158 μs (1000 allocations: 15.62 KiB)
julia> @btime for i in 1:1000000
           global acc
           acc += i % 2
       end
  20.455 ms (1000000 allocations: 15.26 MiB)
Horrendous.
There are, of course, cases where parallelism makes a difference, but it's typically for much more complicated tasks. You still shouldn't use it unless you've already optimized a single-threaded implementation.
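To give a feel for such a case, here is a sketch of my own (the functions heavy and heavy_serial are hypothetical) in which each iteration does enough work to amortize the threading overhead:
julia> function heavy(n)
           accs = zeros(Float64, Threads.nthreads())
           Threads.@threads for i in 1:n
               accs[Threads.threadid()] += sum(sin(i * j) for j in 1:1_000)
           end
           return sum(accs)
       end;
julia> function heavy_serial(n)
           acc = 0.0
           for i in 1:n
               acc += sum(sin(i * j) for j in 1:1_000)
           end
           return acc
       end;
julia> @btime heavy(10_000);         # with julia -t4, expect a real speedup
julia> @btime heavy_serial(10_000);  # typically several times slower here
Here the cost of a thousand sin evaluations per iteration dwarfs the scheduling overhead, which is exactly when @threads starts to pay off.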
So the two important morals of the story:
- read Julia's performance tips, analyze the performance of your code, and fix any bottlenecks
- reach for parallelism only after you've exhausted all single-threaded options.