Performance differences when looping over arrays of structs


I have recently started using Julia to speed up some code previously written in Python. I only have prior experience with Python, so this is my first time caring about performance, and I have found some strange behavior when looping over an array of structs. I define a new struct Gaussian, which represents a 2D Gaussian function, and a function intensity(), which calculates the amplitude of the function at a given position:

struct Gaussian{T<:Float32}
    x0::T
    y0::T
    A::T
    a::T
    b::T
    c::T
end

function intensity(
    model::Gaussian,
    x::Float32,
    y::Float32
    )
    gaussian_value::Float32 = model.A*exp(
        -(
            model.a * (x - model.x0)^2 +
            2 * model.b * (x - model.x0) * (y - model.y0) +
            model.c * (y - model.y0)^2
        )
    )
    return gaussian_value
end

Then, I make an array of 2000 random instances of Gaussian:

function build_array()
    length = 2000
    random_pos = [rand(Float32, (1, 2)) for i in 1:length]
    random_A = rand(Float32, (length, 1))
    random_a = rand(Float32, (length, 1))
    random_b = rand(Float32, (length, 1))
    random_c = rand(Float32, (length, 1));

    gaussians::Array{Gaussian} = []

    for (pos, A, a, b, c) in zip(
        random_pos,
        random_A,
        random_a,
        random_b,
        random_c
        )
        new_gaussian = Gaussian(pos..., A, a, b, c)
        push!(gaussians, new_gaussian)
    end
    
    return gaussians
end

gaussians = build_array()

When I benchmark a single call to the intensity() function, it takes ~100 ns with 1 allocation (makes sense). I would expect that looping over the array of Gaussians should then take 2000 * 100 ns = 200 μs. However, it actually takes about twice as long:

function total_intensity1(gaussian_list::Array{Gaussian})
    total = sum(intensity.(gaussian_list, Float32(0.75), Float32(0.11)))
end

function total_intensity2(gaussian_list::Array{Gaussian})
    total::Float32 = 0.
    for gaussian in gaussian_list
        total += intensity(gaussian, Float32(0.75), Float32(0.11))
    end
    return total 
end

@btime sum(intensity.(gaussians, Float32(0.75), Float32(0.11)))
@btime begin
    total::Float32 = 0.
    for gauss in gaussians
    total += intensity(gauss, Float32(0.75), Float32(0.11))
    end
    total
end
@btime total_intensity1(gaussians)
@btime total_intensity2(gaussians)
  397.700 μs (16004 allocations: 258.02 KiB)
  285.800 μs (8980 allocations: 234.06 KiB)
  396.100 μs (16002 allocations: 257.95 KiB)
  396.700 μs (16001 allocations: 250.02 KiB)

The number of allocations is also much larger than I would expect, and there is a difference between the second and fourth methods even though the code is pretty much the same. My questions:

  1. Where do these differences come from?
  2. How can I improve the performance of the code?

CodePudding user response:

I don't have time to figure this one out in detail unfortunately, but the first thing I'd say is check whether this is a benchmarking artifact: gaussians is a global variable, which should be interpolated into the benchmark using $.
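Concretely, here's a minimal sketch of what I mean (assuming BenchmarkTools is loaded and total_intensity1 is defined as in your post):

```julia
using BenchmarkTools

# Without $, gaussians is an untyped global inside the benchmark
# expression, so every call pays dynamic-dispatch overhead and
# extra allocations that your real code wouldn't.
@btime total_intensity1(gaussians)

# With $, the value is interpolated into the benchmark, so it
# behaves like a concretely typed local variable.
@btime total_intensity1($gaussians)
```

The second form is the one whose timing reflects the function itself rather than the global lookup.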

As to your function, the type annotations are not doing anything for performance here, and will make your function less composable (e.g. you won't be able to autodiff through it, given you're restricting everything to Float32).

Here's how I would write it:

function intensity(m, x, y)
    (; x0, y0, A, a, b, c) = m # destructure the input
    A * exp( -(a * (x - x0)^2 + 2b * (x - x0) * (y - y0) + c * (y - y0)^2) )
end

With that I'm getting:

  231.100 μs (12001 allocations: 195.44 KiB)
  231.500 μs (12001 allocations: 195.44 KiB)
  229.200 μs (12000 allocations: 187.50 KiB)
  229.300 μs (12000 allocations: 187.50 KiB)

which is about 100μs faster than your original version on my machine.

CodePudding user response:

Your struct definition Gaussian{T<:Float32} is somewhat nonsensical. Float32 is a concrete type and cannot have any subtypes, so T can only ever be Float32. Therefore, either remove the restriction, replace it with something else (e.g. Real), or take away the type parameter entirely.
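For instance, a minimal sketch with the bound widened to Real, so the same struct works for any real element type:

```julia
# T<:Real admits Float32, Float64, etc.; each field still has a
# concrete type once T is fixed, so performance is unaffected.
struct Gaussian{T<:Real}
    x0::T
    y0::T
    A::T
    a::T
    b::T
    c::T
end

Gaussian(0.5f0, 0.1f0, 1.0f0, 2.0f0, 0.0f0, 2.0f0)  # a Gaussian{Float32}
Gaussian(0.5, 0.1, 1.0, 2.0, 0.0, 2.0)              # a Gaussian{Float64}
```

(The alternative, dropping the parameter and declaring each field ::Float32 directly, is equally fine if you only ever need Float32.)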

This is bad:

gaussians::Array{Gaussian} = []

It creates a Vector{Any} which is then converted to a Vector{Gaussian}. Worse, Vector{Gaussian} is not a Vector{Gaussian{Float32}}. So either remove the type parameter entirely, or make sure to use it:

# good:
gaussians = Vector{Gaussian{Float32}}()
gaussians = Gaussian{Float32}[] # same as above

# bad
gaussians = Vector{Gaussian}()
# very bad, don't use this style, put types on the right hand side when constructing.
gaussians::Vector{Gaussian} = []

Same here, bad style

total::Float32 = 0.

Do this instead

total = Float32(0.0)
# or use Float32 literal
total = 0.0f0

BTW, you'll have to modify some of your function definitions:

total_intensity1(gaussian_list::Array{Gaussian})

should be

total_intensity1(gaussian_list::Array{<:Gaussian})

There's more, but this is a start.
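The reason the signature needs <:Gaussian is that Julia's parametric types are invariant; a quick sketch (assuming a parametric Gaussian struct as above):

```julia
# Invariance: a Vector of the concrete type is NOT a subtype of a
# Vector of the unparameterized (UnionAll) type...
Vector{Gaussian{Float32}} <: Vector{Gaussian}    # false

# ...but it IS covered by the explicitly covariant form:
Vector{Gaussian{Float32}} <: Vector{<:Gaussian}  # true
```

So a method taking ::Array{Gaussian} will simply not match the Vector{Gaussian{Float32}} you actually want to pass.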

Edit: OK, a few more things:

  1. rand(Float32, (length, 1)): length is a super important function in Base, so it's normally good not to shadow it like this. Also, make vectors instead of matrices:

    rand(Float32, (N, 1)) # this is an Nx1 matrix

    rand(Float32, N) # this is a length-N vector

  2. push!(gaussians, new_gaussian) This iteratively resizes the vector over and over. When you know the size of the vector as in your case, it is better to pre-allocate:

    gaussians = Vector{Gaussian{Float32}}(undef, 2000)

  3. You can avoid an unnecessary allocation here:

    total = sum(intensity.(gaussian_list, Float32(0.75), Float32(0.11)))

    like this:

    total = sum(g -> intensity(g, 0.75f0, 0.11f0), gaussian_list)

Explanation: sum(f.(x)) first creates the array f.(x), then sums it, while sum(f, x) just applies f to each element before adding it to the sum.
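Putting the pieces together, here is a sketch of how build_array and the summation might look with these changes applied (assuming the Gaussian struct and intensity function from the post; N = 2000 matches your setup):

```julia
function build_array(N::Int = 2000)
    # Plain length-N vectors instead of N×1 matrices.
    xs = rand(Float32, N)
    ys = rand(Float32, N)
    As = rand(Float32, N)
    as = rand(Float32, N)
    bs = rand(Float32, N)
    cs = rand(Float32, N)

    # Pre-allocate with a concrete element type instead of push!-ing
    # into an empty vector that gets resized repeatedly.
    gaussians = Vector{Gaussian{Float32}}(undef, N)
    for i in 1:N
        gaussians[i] = Gaussian(xs[i], ys[i], As[i], as[i], bs[i], cs[i])
    end
    return gaussians
end

# sum(f, xs) applies f elementwise without materializing a temporary array.
total_intensity(gs) = sum(g -> intensity(g, 0.75f0, 0.11f0), gs)
```

With a concretely typed vector and no intermediate broadcast array, the loop should run allocation-free apart from the vector itself.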
