Accelerate rownorm(A*R')


In the algorithm I am writing, I didn't expect the following part to be the bottleneck. Here is a trimmed-down version of my code:

using LinearAlgebra

A = rand(1000,100)
R = triu(rand(100,100))
for i = 1:300
    R = triu(rand(100,100))
    @views nrms = norm.(eachrow(A[i:end, :] * R'))
end

Is there a way to accelerate the computation of nrms?

I could perfectly well store A transposed instead of how I am storing it now if that helps, but the impact seems minimal; for example,

@views nrms = norm.(eachcol(conj(R) * AT[:, i:end]))

with AT = copy(transpose(A)). I also tried writing a manual loop that avoids storing the product A[i:end, :] * R', but that was always much slower, since BLAS gemm was then not used.
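
Roughly, the manual loop looked like this (an illustrative sketch; the name rownorm_manual and the exact loop order are made up for this write-up):

function rownorm_manual(A, R, i0)
    m, n = size(A)
    nrms = zeros(m - i0 + 1)
    for k in i0:m
        s = 0.0
        for i in 1:n
            acc = 0.0
            # R is upper triangular, so R'[j, i] = R[i, j] is nonzero only for j >= i
            for j in i:n
                acc += A[k, j] * R[i, j]
            end
            s += acc^2
        end
        nrms[k - i0 + 1] = sqrt(s)
    end
    return nrms
end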

CodePudding user response:

Beating BLAS performance on (relatively) big matrix operations is quite challenging, but it is possible with LoopVectorization. Here's a manual loop with the @tturbo annotation, which is the multi-threaded version of @turbo.

This does not take advantage of the triangular shape of R, because LoopVectorization only works on rectangular index regions, though that will probably change in the future; a further speedup of ~2x should then be possible.

using LoopVectorization

function rownorm2(A::AbstractArray{S}, R::AbstractArray{T}) where {S, T}
    P = float(promote_type(S, T))
    vals = zeros(P, size(A, 1))
    @tturbo for k in axes(A, 1)
        valk = zero(P)
        for i in axes(R, 2)
            vali = zero(P)
            # for j in i:lastindex(R, 1)  # <- does not work with LoopVectorization
            for j in axes(R, 1)
                vali += A[k, j] * R[j, i]
            end
            valk += vali^2
        end
        vals[k] = sqrt(valk)
    end
    return vals
end
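
As an aside (not part of the @tturbo approach above): on the BLAS side, the triangular structure can be exploited by wrapping R in LinearAlgebra's LowerTriangular type, which dispatches the multiplication to the triangular routine trmm. Whether this actually beats a plain gemm depends on the BLAS implementation, so treat it as a sketch to benchmark, not a guaranteed win:

using LinearAlgebra

# LowerTriangular is a zero-copy wrapper; the product below uses BLAS trmm,
# which skips the structurally zero half of R.
rownorm_tri(A, R) = norm.(eachrow(A * LowerTriangular(R)))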

I use a slightly modified version of your code, with R lower triangular instead of upper:

rownorm(A, R) = norm.(eachrow(A * R))

Benchmark:

A = rand(1000,100);
R = tril(rand(100,100));

julia> rownorm(A, R) ≈ rownorm2(A, R)
true

julia> @btime rownorm($A, $R);
  393.700 μs (5 allocations: 828.34 KiB)

julia> @btime rownorm2($A, $R);
  77.100 μs (1 allocation: 7.94 KiB)

This is on a laptop with 6 cores and 12 threads. You must start Julia with threads enabled. The achieved performance depends on what sort of SIMD vectorization your computer supports.
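
For reference, threads are enabled at startup (the auto option assumes Julia 1.7 or later):

$ julia --threads=auto

julia> Threads.nthreads()  # must be > 1 for @tturbo to run multi-threaded
12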

CodePudding user response:

This question is quite difficult, as optimization usually relies on extra information from beyond the algorithm in focus. As phrased now, with random matrix inputs, it almost amounts to the generic question of speeding up an already optimized matrix multiplication (as Strassen's algorithm does, though not by much at these sizes). DNF impressively recovers all the speed of the BLAS implementation in pure Julia by using the extra bit of context that only the norms are needed as output, not the whole product matrix!

The idea behind this answer is to exploit the specific loop structure of the calculation in the question, i.e. a constant big matrix (1000x100) multiplied by 300 smaller (100x100) triangular matrices. To squeeze out more performance, consider generating the triangular matrices in a transposed fashion: first the first columns of all 300 matrices, then the second columns, and so on. In code:

function way1()  # original method from OP
    A = rand(1000,100)
    nrms = zeros(1000,300)
    for i = 1:300
        R = triu(rand(100,100))
        @views nrms[:,i] = norm.(eachrow(A * R'))
    end
    nrms
end

function way2()  # new transposed method
    A = rand(1000,100)
    nrms = zeros(1000,300)
    for i = 1:100
        R = rand(i,300)  # first i entries of column i of all 300 triangular factors
        @views nrms .+= (A[:,1:i] * R).^2  # add the squared i-th column of every product
    end
    nrms .= sqrt.(nrms)
end
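
To see why the column-wise accumulation in way2 is equivalent, here is a quick sanity check on a single upper triangular factor (added for illustration; not part of the original answer):

using LinearAlgebra

A = rand(1000, 100)
B = triu(rand(100, 100))        # one upper triangular factor
direct = norm.(eachrow(A * B))  # reference row norms
acc = zeros(1000)
for i in 1:100
    # column i of A*B only involves the first i columns of A
    @views acc .+= (A[:, 1:i] * B[1:i, i]) .^ 2
end
acc .= sqrt.(acc)
@assert direct ≈ acc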

The transposed method fits nicely into the memory layout and gives the following benchmark:

julia> @btime way1();
  247.788 ms (2704 allocations: 291.53 MiB)

julia> @btime way2();
  138.539 ms (398 allocations: 243.50 MiB)

Therefore, if the original usage allows this kind of generation, there is a roughly 40% time saving.
