Performance differences in Eigen between auto and Eigen::Ref and concrete type


I am trying to understand how Eigen::Ref works to see if I can take some advantage of it in my code.

I have designed a benchmark like this:

static void value(benchmark::State &state) {
  for (auto _ : state) {
    const Eigen::Matrix<double, Eigen::Dynamic, 1> vs =
        Eigen::Matrix<double, 9, 1>::Random();
    auto start = std::chrono::high_resolution_clock::now();

    const Eigen::Vector3d v0 = vs.segment<3>(0);
    const Eigen::Vector3d v1 = vs.segment<3>(3);
    const Eigen::Vector3d v2 = vs.segment<3>(6);
    const Eigen::Vector3d vt = v0 + v1 + v2;
    const Eigen::Vector3d v = vt.transpose() * vt * vt + vt;

    benchmark::DoNotOptimize(v);
    auto end = std::chrono::high_resolution_clock::now();

    auto elapsed_seconds =
        std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
    state.SetIterationTime(elapsed_seconds.count());
  }
}

I have two more tests like this: one using const Eigen::Ref<const Eigen::Vector3d> and one using auto for v0, v1, v2 and vt.
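The ref test looks essentially like this (a sketch; the auto test is identical except that const auto replaces the Ref type):

static void ref(benchmark::State &state) {
  for (auto _ : state) {
    const Eigen::Matrix<double, Eigen::Dynamic, 1> vs =
        Eigen::Matrix<double, 9, 1>::Random();
    auto start = std::chrono::high_resolution_clock::now();

    // Same computation; only the types of the intermediates change.
    const Eigen::Ref<const Eigen::Vector3d> v0 = vs.segment<3>(0);
    const Eigen::Ref<const Eigen::Vector3d> v1 = vs.segment<3>(3);
    const Eigen::Ref<const Eigen::Vector3d> v2 = vs.segment<3>(6);
    // Ref<const ...> evaluates the sum expression into its own internal storage.
    const Eigen::Ref<const Eigen::Vector3d> vt = v0 + v1 + v2;
    const Eigen::Vector3d v = vt.transpose() * vt * vt + vt;

    benchmark::DoNotOptimize(v);
    auto end = std::chrono::high_resolution_clock::now();

    auto elapsed_seconds =
        std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
    state.SetIterationTime(elapsed_seconds.count());
  }
}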

The results of these benchmarks are:

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
value/manual_time               23.4 ns          113 ns     29974946
ref/manual_time                 23.0 ns          111 ns     29934053
with_auto/manual_time           23.6 ns          112 ns     29891056

As you can see, all the tests behave exactly the same. So I thought that maybe the compiler was doing its magic and decided to test with -O0. These are the results:

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
value/manual_time               2475 ns         3070 ns       291032
ref/manual_time                 2482 ns         3077 ns       289258
with_auto/manual_time           2436 ns         3012 ns       263170

Again, the three cases behave the same.

If I understand correctly, the first case, using Eigen::Vector3d, should be slower, as it has to make the copies, perform the v0 + v1 + v2 operation and store the result, and then perform another operation and store again.

The auto case should be the fastest, as it should skip all of these intermediate writes.

The ref case, I think, should be as fast as auto. If I understand correctly, all my intermediates can be stored in a reference to a const Eigen::Vector3d, so the copies should be skipped, right?

Why are the results all the same? Am I misunderstanding something, or is the benchmark just badly designed?

CodePudding user response:

One big issue with the benchmark is that you measure the time inside the hot benchmarking loop. The thing is, measuring the time itself takes some time, and it can be far more expensive than the actual computation. In fact, I think this is what is happening in your case. Indeed, with Clang 13 at -O3, here is the assembly code that is actually benchmarked (available on GodBolt):

        mov     rbx, rax
        mov     rax, qword ptr [rsp + 24]
        cmp     rax, 2
        jle     .LBB0_17
        cmp     rax, 5
        jle     .LBB0_17
        cmp     rax, 8
        jle     .LBB0_17
        mov     rax, qword ptr [rsp + 16]
        movupd  xmm0, xmmword ptr [rax]
        movsd   xmm1, qword ptr [rax + 16]      # xmm1 = mem[0],zero
        movupd  xmm2, xmmword ptr [rax + 24]
        addpd   xmm2, xmm0
        movupd  xmm0, xmmword ptr [rax + 48]
        addsd   xmm1, qword ptr [rax + 40]
        addpd   xmm0, xmm2
        addsd   xmm1, qword ptr [rax + 64]
        movapd  xmm2, xmm0
        mulpd   xmm2, xmm0
        movapd  xmm3, xmm2
        unpckhpd        xmm3, xmm2                      # xmm3 = xmm3[1],xmm2[1]
        addsd   xmm3, xmm2
        movapd  xmm2, xmm1
        mulsd   xmm2, xmm1
        addsd   xmm2, xmm3
        movapd  xmm3, xmm1
        mulsd   xmm3, xmm2
        unpcklpd        xmm2, xmm2                      # xmm2 = xmm2[0,0]
        mulpd   xmm2, xmm0
        addpd   xmm2, xmm0
        movapd  xmmword ptr [rsp + 32], xmm2
        addsd   xmm3, xmm1
        movsd   qword ptr [rsp + 48], xmm3

This code can be executed in a few dozen cycles, so probably in less than 10-15 ns on a modern 4-5 GHz x86 processor. Meanwhile, high_resolution_clock::now() should use an RDTSC/RDTSCP instruction that also takes dozens of cycles to complete. For example, on a Skylake processor it should take about 25 cycles (similar on newer Intel processors), and on an AMD Zen processor about 35-38 cycles. Additionally, it adds a synchronization that may not be representative of the actual application. Please consider measuring the time of a benchmarking loop with many iterations instead.
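For illustration, a minimal sketch of what that could look like (the benchmark name value_loop is made up, and the same Eigen and Google Benchmark includes as in the question are assumed). The manual clock calls and SetIterationTime are dropped, and the framework times the whole loop:

static void value_loop(benchmark::State &state) {
  for (auto _ : state) {
    // Same work as before, but no per-iteration clock calls: Google Benchmark
    // times the whole loop and reports the average time per iteration.
    const Eigen::Matrix<double, Eigen::Dynamic, 1> vs =
        Eigen::Matrix<double, 9, 1>::Random();
    const Eigen::Vector3d v0 = vs.segment<3>(0);
    const Eigen::Vector3d v1 = vs.segment<3>(3);
    const Eigen::Vector3d v2 = vs.segment<3>(6);
    const Eigen::Vector3d vt = v0 + v1 + v2;
    const Eigen::Vector3d v = vt.transpose() * vt * vt + vt;
    benchmark::DoNotOptimize(v);
  }
}
// No UseManualTime(): the per-iteration clock overhead disappears from the
// measurement. The time now also includes Random(), which you could exclude
// with state.PauseTiming()/ResumeTiming() at their own (smaller) cost.
BENCHMARK(value_loop);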

CodePudding user response:

Because everything happens inside a single function, the compiler can do escape analysis and optimize away the copies into the temporary vectors.

To check this, I put the code in a function and looked at the assembler:

Eigen::Vector3d foo(const Eigen::VectorXd& vs)
{
    const Eigen::Vector3d v0 = vs.segment<3>(0);
    const Eigen::Vector3d v1 = vs.segment<3>(3);
    const Eigen::Vector3d v2 = vs.segment<3>(6);
    const Eigen::Vector3d vt = v0 + v1 + v2;
    return vt.transpose() * vt * vt + vt;
}

which turns into this assembler:

        push    rax
        mov     rax, qword ptr [rsi + 8]
...
        mov     rax, qword ptr [rsi]
        movupd  xmm0, xmmword ptr [rax]
        movsd   xmm1, qword ptr [rax + 16]
        movupd  xmm2, xmmword ptr [rax + 24]
        addpd   xmm2, xmm0
        movupd  xmm0, xmmword ptr [rax + 48]
        addsd   xmm1, qword ptr [rax + 40]
        addpd   xmm0, xmm2
        addsd   xmm1, qword ptr [rax + 64]
...
        movupd  xmmword ptr [rdi], xmm2
        addsd   xmm3, xmm1
        movsd   qword ptr [rdi + 16], xmm3
        mov     rax, rdi
        pop     rcx
        ret

Notice how the only memory operations are two general-purpose register loads to get the start pointer and length, then a couple of loads to get the vector contents into registers, before the result is written to memory at the end.

This only works because we are dealing with fixed-size vectors. With VectorXd, copies would definitely take place.
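For illustration, here is a hypothetical dynamic-size variant of the same function (the name foo_dynamic is made up, and .dot() forms the scalar factor explicitly since the sizes are no longer known at compile time). Every intermediate is now a heap-allocated VectorXd, so the copies cannot be elided the same way:

Eigen::VectorXd foo_dynamic(const Eigen::VectorXd& vs)
{
    const Eigen::VectorXd v0 = vs.segment(0, 3);  // heap allocation + copy
    const Eigen::VectorXd v1 = vs.segment(3, 3);  // heap allocation + copy
    const Eigen::VectorXd v2 = vs.segment(6, 3);  // heap allocation + copy
    const Eigen::VectorXd vt = v0 + v1 + v2;      // yet another allocation
    return vt * vt.dot(vt) + vt;
}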

Alternative benchmarks

Ref is typically used on function calls. Why not try it with a function that cannot be inlined? Or come up with an example where escape analysis cannot work and the objects really have to be materialized. Something like this:

struct Foo
{
public:
    Eigen::Vector3d v0;
    Eigen::Vector3d v1;
    Eigen::Vector3d v2;
    
    Foo(const Eigen::VectorXd& vs) __attribute__((noinline));
    Eigen::Vector3d operator()() const __attribute__((noinline));
};

Foo::Foo(const Eigen::VectorXd& vs)
: v0(vs.segment<3>(0)),
  v1(vs.segment<3>(3)),
  v2(vs.segment<3>(6))
{}
Eigen::Vector3d Foo::operator()() const
{
    const Eigen::Vector3d vt = v0 + v1 + v2;
    return vt.transpose() * vt * vt + vt;
}
Eigen::Vector3d bar(const Eigen::VectorXd& vs)
{
    Foo f(vs);
    return f();
}

By splitting initialization and usage into non-inline functions, the copies really have to happen. Of course, this changes the entire use case; you have to decide whether that is still relevant to you.
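As a rough sketch of where Ref typically pays off at such non-inlined call boundaries (the function names here are made up): a Ref parameter can map a contiguous segment of a larger vector directly, while a const Eigen::Vector3d& parameter forces the segment to be evaluated into a temporary at the call site:

__attribute__((noinline))
double sum3_ref(Eigen::Ref<const Eigen::Vector3d> v)
{
    return v.sum();  // reads the caller's memory directly, no temporary
}

__attribute__((noinline))
double sum3_copy(const Eigen::Vector3d& v)
{
    return v.sum();
}

double baz(const Eigen::VectorXd& vs)
{
    // The segment maps directly into the Ref...
    double a = sum3_ref(vs.segment<3>(0));
    // ...but must be evaluated into a temporary Vector3d to bind to const Vector3d&.
    double b = sum3_copy(vs.segment<3>(3));
    return a + b;
}

That call-boundary situation, rather than a fully inlined loop body, is where you would expect to see a difference between Ref and the concrete type.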
