I am trying to understand how Eigen::Ref works to see if I can take some advantage of it in my code.
I have designed a benchmark like this
static void value(benchmark::State &state) {
  for (auto _ : state) {
    const Eigen::Matrix<double, Eigen::Dynamic, 1> vs =
        Eigen::Matrix<double, 9, 1>::Random();
    auto start = std::chrono::high_resolution_clock::now();
    const Eigen::Vector3d v0 = vs.segment<3>(0);
    const Eigen::Vector3d v1 = vs.segment<3>(3);
    const Eigen::Vector3d v2 = vs.segment<3>(6);
    const Eigen::Vector3d vt = v0 + v1 + v2;
    const Eigen::Vector3d v = vt.transpose() * vt * vt + vt;
    benchmark::DoNotOptimize(v);
    auto end = std::chrono::high_resolution_clock::now();
    auto elapsed_seconds =
        std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
    state.SetIterationTime(elapsed_seconds.count());
  }
}
I have two more tests like this one: one using const Eigen::Ref<const Eigen::Vector3d> and one using auto for v0, v1, v2 and vt.
The results of these benchmarks are:
Benchmark Time CPU Iterations
--------------------------------------------------------------------
value/manual_time 23.4 ns 113 ns 29974946
ref/manual_time 23.0 ns 111 ns 29934053
with_auto/manual_time 23.6 ns 112 ns 29891056
As you can see, all the tests behave exactly the same. So I thought that maybe the compiler was doing its magic and decided to test with -O0. These are the results:
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
value/manual_time 2475 ns 3070 ns 291032
ref/manual_time 2482 ns 3077 ns 289258
with_auto/manual_time 2436 ns 3012 ns 263170
Again, the three cases behave the same.
If I understand correctly, the first case, using Eigen::Vector3d, should be slower, as it has to make copies, perform the v0 + v1 + v2 operation and store the result, and then perform another operation and store again. The auto case should be the fastest, as it should skip all those writes. The ref case, I think, should be as fast as the auto case: if I understand correctly, all my operations can be stored in a reference to a const Eigen::Vector3d, so the copies should be skipped, right?
Why are the results all the same? Am I misunderstanding something, or is the benchmark just badly designed?
CodePudding user response:
One big issue with the benchmark is that you measure the time inside the hot benchmarking loop. The thing is, measuring the time itself takes time, and it can be far more expensive than the actual computation. In fact, I think this is what is happening in your case. Indeed, with Clang 13 at -O3, here is the assembly code actually benchmarked (available on GodBolt):
mov rbx, rax
mov rax, qword ptr [rsp + 24]
cmp rax, 2
jle .LBB0_17
cmp rax, 5
jle .LBB0_17
cmp rax, 8
jle .LBB0_17
mov rax, qword ptr [rsp + 16]
movupd xmm0, xmmword ptr [rax]
movsd xmm1, qword ptr [rax + 16] # xmm1 = mem[0],zero
movupd xmm2, xmmword ptr [rax + 24]
addpd xmm2, xmm0
movupd xmm0, xmmword ptr [rax + 48]
addsd xmm1, qword ptr [rax + 40]
addpd xmm0, xmm2
addsd xmm1, qword ptr [rax + 64]
movapd xmm2, xmm0
mulpd xmm2, xmm0
movapd xmm3, xmm2
unpckhpd xmm3, xmm2 # xmm3 = xmm3[1],xmm2[1]
addsd xmm3, xmm2
movapd xmm2, xmm1
mulsd xmm2, xmm1
addsd xmm2, xmm3
movapd xmm3, xmm1
mulsd xmm3, xmm2
unpcklpd xmm2, xmm2 # xmm2 = xmm2[0,0]
mulpd xmm2, xmm0
addpd xmm2, xmm0
movapd xmmword ptr [rsp + 32], xmm2
addsd xmm3, xmm1
movsd qword ptr [rsp + 48], xmm3
This code can be executed in a few dozen cycles, so probably in less than 10-15 ns on a modern 4-5 GHz x86 processor. Meanwhile, high_resolution_clock::now() should use an RDTSC/RDTSCP instruction that also takes dozens of cycles to complete. For example, on a Skylake processor it takes about 25 cycles (similar on newer Intel processors), and on an AMD Zen processor about 35-38 cycles. Additionally, it adds a synchronization that may not be representative of the actual application. Please consider measuring the time of a benchmarking loop with many iterations instead.
CodePudding user response:
Because everything happens inside a function, the compiler can do escape analysis and optimize away the copies into the vectors.
To check this, I put the code in a function, to look at the assembler:
Eigen::Vector3d foo(const Eigen::VectorXd& vs)
{
    const Eigen::Vector3d v0 = vs.segment<3>(0);
    const Eigen::Vector3d v1 = vs.segment<3>(3);
    const Eigen::Vector3d v2 = vs.segment<3>(6);
    const Eigen::Vector3d vt = v0 + v1 + v2;
    return vt.transpose() * vt * vt + vt;
}
which turns into this assembler
push rax
mov rax, qword ptr [rsi + 8]
...
mov rax, qword ptr [rsi]
movupd xmm0, xmmword ptr [rax]
movsd xmm1, qword ptr [rax + 16]
movupd xmm2, xmmword ptr [rax + 24]
addpd xmm2, xmm0
movupd xmm0, xmmword ptr [rax + 48]
addsd xmm1, qword ptr [rax + 40]
addpd xmm0, xmm2
addsd xmm1, qword ptr [rax + 64]
...
movupd xmmword ptr [rdi], xmm2
addsd xmm3, xmm1
movsd qword ptr [rdi + 16], xmm3
mov rax, rdi
pop rcx
ret
Notice how the only memory operations are two general-purpose register loads to get the start pointer and length, then a couple of loads to get the vector contents into registers, before the result is written to memory at the end.
This only works because we deal with fixed-size vectors. With VectorXd, copies would definitely take place.
Alternative benchmarks
Ref is typically used on function calls. Why not try it with a function that cannot be inlined? Or come up with an example where escape analysis cannot work and the objects really have to be materialized. Something like this:
struct Foo
{
public:
    Eigen::Vector3d v0;
    Eigen::Vector3d v1;
    Eigen::Vector3d v2;

    Foo(const Eigen::VectorXd& vs) __attribute__((noinline));
    Eigen::Vector3d operator()() const __attribute__((noinline));
};

Foo::Foo(const Eigen::VectorXd& vs)
    : v0(vs.segment<3>(0)),
      v1(vs.segment<3>(3)),
      v2(vs.segment<3>(6))
{}

Eigen::Vector3d Foo::operator()() const
{
    const Eigen::Vector3d vt = v0 + v1 + v2;
    return vt.transpose() * vt * vt + vt;
}

Eigen::Vector3d bar(const Eigen::VectorXd& vs)
{
    Foo f(vs);
    return f();
}
By splitting initialization and usage into non-inlined functions, the copies really have to be made. Of course, this changes the entire use case; you have to decide whether it is still relevant to you.