I have a simulation program written in Julia that does something equivalent to this as part of its main loop:
# Some fake data
M = [randn(100,100) for m=1:100, n=1:100]
W = randn(100,100)
work = zip(W,M)
result = mapreduce(x -> x[1] * x[2], +, work)
In other words, a simple sum of weighted matrices. Timing the above code yields
0.691084 seconds (79.03 k allocations: 1.493 GiB, 70.59% gc time, 2.79% compilation time)
I am surprised by the large number of memory allocations, as this problem should be possible to do in place. To see whether it was my use of mapreduce that was wrong, I also tested the following equivalent implementation:
@time begin
    res = zeros(100, 100)
    for m = 1:100
        for n = 1:100
            res += W[m, n] * M[m, n]
        end
    end
end
which gave
0.442521 seconds (50.00 k allocations: 1.491 GiB, 70.81% gc time)
So, if I wrote this in C or Fortran it would be simple to do all of this in-place. Is this impossible in Julia? Or am I missing something here...?
Answer:
It is possible to do it in place like this:
function ws(W, M)
    res = zeros(100, 100)
    for m = 1:100
        for n = 1:100
            @. res += W[m, n] * M[m, n]
        end
    end
    return res
end
and the timing is:
julia> @time ws(W, M);
0.100328 seconds (2 allocations: 78.172 KiB)
Note that in order to perform this operation in-place I used broadcasting (I could also have used explicit inner loops; the effect would be the same).
The problem with your code is that the line
res += W[m,n] * M[m,n]
produces two allocations:
- When you do the multiplication W[m,n] * M[m,n], a new matrix is allocated.
- When you do the addition res += ... (which is just res = res + ...), another matrix is allocated.
By using broadcasting with @. you perform an in-place operation; see https://docs.julialang.org/en/v1/manual/mathematical-operations/#man-dot-operators for more explanations.
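To make the difference concrete, here is a small sketch (A and B are placeholder arrays used only for this illustration):

A = randn(3, 3); B = randn(3, 3)
res = zeros(3, 3)

res = 2.0 * A + 3.0 * B       # allocates temporaries for 2.0*A and 3.0*B, plus a new result matrix
@. res = 2.0 * A + 3.0 * B    # fuses the whole right-hand side into one loop writing into the existing res
res .= 2.0 .* A .+ 3.0 .* B   # what the @. line expands to: dotted calls plus dot assignment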
Additionally, note that I have wrapped the code inside a function. If you do not, then access to both W
and M
is type unstable (they are non-constant globals), which also causes allocations; see https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables.
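For completeness, the same idea can be written without hard-coding the size. This is only a sketch, and the name weighted_sum! is mine, not anything from the question:

function weighted_sum!(res, W, M)
    fill!(res, 0)
    for idx in eachindex(W, M)
        @. res += W[idx] * M[idx]    # in-place broadcast: no allocation inside the loop
    end
    return res
end

res = weighted_sum!(zero(first(M)), W, M)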
Answer:
I'd like to add something to Bogumił's answer. The missing broadcast is the main problem, but in addition, the loop and the mapreduce variant differ in a fundamental semantic way.
The purpose of mapreduce is to reduce by an associative operation with identity element init, in an unspecified order. In particular, this includes the (theoretical) option of running parts in parallel, and it doesn't really play well with mutation. From the docs:
The associativity of the reduction is implementation-dependent. Additionally, some implementations may reuse the return value of f for elements that appear multiple times in itr. Use mapfoldl or mapfoldr instead for guaranteed left or right associativity and invocation of f for every value.
and
It is unspecified whether init is used for non-empty collections.
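In other words, an in-place accumulator cannot safely be combined with mapreduce. Something like the following sketch is not guaranteed to give the right answer, precisely because of the quoted caveats:

# Tempting, but not guaranteed to be correct: mapreduce may or may not use init,
# and may reuse intermediate values, so mutating the accumulator here is unsafe.
unsafe = mapreduce(x -> x[1] * x[2], (acc, y) -> (@. acc += y), work; init = zeros(100, 100))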
What the loop variant really corresponds to is a fold, which has a well-defined order and initial (not necessarily identity) element and can thus use an in-place reduction operator:
Like reduce, but with guaranteed left associativity. If provided, the keyword argument init will be used exactly once.
julia> @benchmark foldl((acc, (w, m)) -> (@. acc += w * m), $work; init=$(zero(W)))
BenchmarkTools.Trial: 45 samples with 1 evaluation.
Range (min … max): 109.967 ms … 118.251 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 112.639 ms ┊ GC (median): 0.00%
Time (mean ± σ): 112.862 ms ± 1.154 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▃█ ▁▄▃
▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄███▆███▄▁▄▁▁▄▁▁▄▁▁▁▁▁▄▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
110 ms Histogram: frequency by time 118 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark mapreduce(Base.splat(*), +, $work)
BenchmarkTools.Trial: 12 samples with 1 evaluation.
Range (min … max): 403.100 ms … 458.882 ms ┊ GC (min … max): 4.53% … 3.89%
Time (median): 445.058 ms ┊ GC (median): 4.04%
Time (mean ± σ): 440.042 ms ± 16.792 ms ┊ GC (mean ± σ): 4.21% ± 0.92%
▁ ▁ ▁ ▁ ▁ ▁ ▁▁▁ █ ▁
█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁█▁█▁▁▁▁███▁▁▁▁▁█▁▁▁█ ▁
403 ms Histogram: frequency by time 459 ms <
Memory estimate: 1.49 GiB, allocs estimate: 39998.
Think of it this way: if you wrote the function as a parallel for loop with (+) reduction, iteration would also have an unspecified order, and you would have memory overhead from copying the individual results to the accumulating thread.
Thus, there is a trade-off. In your example, allocation/copying dominates. In other cases, the mapped operation might dominate, and a parallel reduction (with unspecified order, but copying overhead) might be worth it.
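For illustration, a minimal sketch of what such a parallel variant could look like (threaded_weighted_sum is a hypothetical name; this assumes Julia is started with multiple threads):

using Base.Threads

function threaded_weighted_sum(W, M; ntasks = nthreads())
    chunks = Iterators.partition(eachindex(W, M), cld(length(W), ntasks))
    tasks = map(chunks) do chunk
        @spawn begin
            acc = zero(first(M))              # one accumulator per task: the copying overhead
            for idx in chunk
                @. acc += W[idx] * M[idx]     # in-place accumulation within each task
            end
            acc
        end
    end
    return reduce(+, fetch.(tasks))           # combine the per-task partial sums at the end
end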