I have a simulation program written in Julia that does something equivalent to this as part of its main loop:
# Some fake data
M = [randn(100,100) for m=1:100, n=1:100]
W = randn(100,100)
work = zip(W,M)
result = mapreduce(x -> x[1] * x[2], +, work)
In other words, a simple sum of weighted matrices. Timing the above code yields
0.691084 seconds (79.03 k allocations: 1.493 GiB, 70.59% gc time, 2.79% compilation time)
I am surprised by the large number of memory allocations, as this problem should be possible to do in place. To see whether it was my use of mapreduce that was wrong, I also tested the following equivalent implementation:
@time begin
    res = zeros(100, 100)
    for m = 1:100
        for n = 1:100
            res += W[m, n] * M[m, n]
        end
    end
end
which gave
0.442521 seconds (50.00 k allocations: 1.491 GiB, 70.81% gc time)
So, if I wrote this in C or Fortran it would be simple to do all of this in-place. Is this impossible in Julia? Or am I missing something here...?
Answer:
It is possible to do it in place like this:
function ws(W, M)
    res = zeros(100, 100)
    for m = 1:100
        for n = 1:100
            @. res += W[m, n] * M[m, n]
        end
    end
    return res
end
and the timing is:
julia> @time ws(W, M);
0.100328 seconds (2 allocations: 78.172 KiB)
Note that in order to perform this operation in-place I used broadcasting (I could also have used explicit inner loops; the effect would be the same).
The problem with your code is that the line
res += W[m,n] * M[m,n]
produces two allocations:
- When you do the multiplication W[m,n] * M[m,n], a new matrix is allocated.
- When you do the addition res += ... (which is just res = res + ...), another matrix is allocated.
By using broadcasting with @. you perform an in-place operation; see https://docs.julialang.org/en/v1/manual/mathematical-operations/#man-dot-operators for more explanations.
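To make the difference concrete, here is a small sketch (A and B are placeholder arrays used only for this illustration):

A = randn(3, 3); B = randn(3, 3)
res = zeros(3, 3)

res = 2.0 * A + 3.0 * B       # allocates temporaries for 2.0*A and 3.0*B, plus a new result matrix
@. res = 2.0 * A + 3.0 * B    # fuses the whole right-hand side into one loop writing into the existing res
res .= 2.0 .* A .+ 3.0 .* B   # what the @. line expands to: dotted calls plus dot assignment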
Additionally, note that I have wrapped the code inside a function. If you do not, then access to both W
and M
is type unstable (they are non-constant globals), which also causes allocations; see https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables.
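For completeness, the same idea can be written without hard-coding the size. This is only a sketch, and the name weighted_sum! is mine, not anything from the question:

function weighted_sum!(res, W, M)
    fill!(res, 0)
    for idx in eachindex(W, M)
        @. res += W[idx] * M[idx]    # in-place broadcast: no allocation inside the loop
    end
    return res
end

res = weighted_sum!(zero(first(M)), W, M)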
Answer:
I'd like to add something to Bogumił's answer. The missing broadcast is the main problem, but in addition, the loop and the mapreduce variant differ in a fundamental semantic way.
The purpose of mapreduce is to reduce by an associative operation with identity element init, in an unspecified order. In particular, this includes the (theoretical) option of running parts in parallel, and it doesn't really play well with mutation. From the docs:
The associativity of the reduction is implementation-dependent. Additionally, some implementations may reuse the return value of f for elements that appear multiple times in itr. Use mapfoldl or mapfoldr instead for guaranteed left or right associativity and invocation of f for every value.
and
It is unspecified whether init is used for non-empty collections.
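In other words, an in-place accumulator cannot safely be combined with mapreduce. Something like the following sketch is not guaranteed to give the right answer, precisely because of the quoted caveats:

# Tempting, but not guaranteed to be correct: mapreduce may or may not use init,
# and may reuse intermediate values, so mutating the accumulator here is unsafe.
unsafe = mapreduce(x -> x[1] * x[2], (acc, y) -> (@. acc += y), work; init = zeros(100, 100))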
What the loop variant really corresponds to is a fold, which has a well-defined order and initial (not necessarily identity) element and can thus use an in-place reduction operator:
Like reduce, but with guaranteed left associativity. If provided, the keyword argument init will be used exactly once.
julia> @benchmark foldl((acc, (w, m)) -> (@. acc += w * m), $work; init=$(zero(W)))
BenchmarkTools.Trial: 45 samples with 1 evaluation.
Range (min … max): 109.967 ms … 118.251 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 112.639 ms ┊ GC (median): 0.00%
Time (mean ± σ): 112.862 ms ± 1.154 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▃█ ▁▄▃
▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄███▆███▄▁▄▁▁▄▁▁▄▁▁▁▁▁▄▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
110 ms Histogram: frequency by time 118 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark mapreduce(Base.splat(*), +, $work)
BenchmarkTools.Trial: 12 samples with 1 evaluation.
Range (min … max): 403.100 ms … 458.882 ms ┊ GC (min … max): 4.53% … 3.89%
Time (median): 445.058 ms ┊ GC (median): 4.04%
Time (mean ± σ): 440.042 ms ± 16.792 ms ┊ GC (mean ± σ): 4.21% ± 0.92%
▁ ▁ ▁ ▁ ▁ ▁ ▁▁▁ █ ▁
█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁█▁█▁▁▁▁███▁▁▁▁▁█▁▁▁█ ▁
403 ms Histogram: frequency by time 459 ms <
Memory estimate: 1.49 GiB, allocs estimate: 39998.
Think of it this way: if you wrote the function as a parallel for loop with (+) reduction, iteration would also have an unspecified order, and you would have memory overhead from copying the individual results to the accumulating thread.
Thus, there is a trade-off. In your example, allocation/copying dominates. In other cases, the mapped operation might dominate, and a parallel reduction (with unspecified order, but copying overhead) might be worth it.
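For illustration, a minimal sketch of what such a parallel variant could look like (threaded_weighted_sum is a hypothetical name; this assumes Julia is started with multiple threads):

using Base.Threads

function threaded_weighted_sum(W, M; ntasks = nthreads())
    chunks = Iterators.partition(eachindex(W, M), cld(length(W), ntasks))
    tasks = map(chunks) do chunk
        @spawn begin
            acc = zero(first(M))              # one accumulator per task: the copying overhead
            for idx in chunk
                @. acc += W[idx] * M[idx]     # in-place accumulation within each task
            end
            acc
        end
    end
    return reduce(+, fetch.(tasks))           # combine the per-task partial sums at the end
end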