I have a long vector Eigen::VectorXd X, and I would like to update it segment-by-segment using one of the following functions:
void Foo1(Eigen::Ref<Eigen::VectorXd> x) {
  // Update x.
}

Eigen::VectorXd Foo2() {
  Eigen::VectorXd x;
  // Update x.
  return x;
}
int main() {
  const int LARGE_NUMBER = ...;  // Approximately in the range ~[600, 1000].
  const int SIZES[] = {...};     // Entries roughly in the range ~[2, 20].
  Eigen::VectorXd X(LARGE_NUMBER);
  int j = 0;
  for (int i = 0; i < LARGE_NUMBER; i += SIZES[j]) {
    // Option (1).
    Foo1(X.segment(i, SIZES[j]));
    // Option (2).
    X.segment(i, SIZES[j]) = Foo2();
    ++j;
  }
  return 0;
}
Given the above specifications, which option would be the most efficient? I would say (1), because it would directly modify the memory without creating any temporaries. However, compiler optimizations could potentially make (2) perform better -- e.g., see this post.
Secondly, consider the following functions:
void Foo3(const Eigen::Ref<const Eigen::VectorXd>& x) {
  // Use x.
}

void Foo4(const Eigen::VectorXd& x) {
  // Use x.
}
Is calling Foo3 with segments of X guaranteed to always be at least as efficient as calling Foo4 with the same segments? That is, is Foo3(X.segment(...)) always at least as efficient as Foo4(X.segment(...))?
Answer:
Given the above specifications, which option would be the most efficient?
Most likely option 1, as you have guessed. It depends on what the update entails, of course, so you may need some benchmarking. But in general, the cost of the allocation outweighs the minor optimizations that a freshly allocated object enables. Plus, option 2 incurs the additional cost of copying the result.
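If you do benchmark it, a minimal sketch could look like the one below. The update bodies, the sizes N and kSeg, and the repetition count are placeholders of my own (Foo2 takes the segment size, as in Example 1 further down), and a real measurement should add warm-up and use a proper benchmarking harness:
#include <Eigen/Dense>
#include <chrono>
#include <iostream>

// Placeholder updates standing in for the real ones.
void Foo1(Eigen::Ref<Eigen::VectorXd> x) { x.array() += 1.0; }
Eigen::VectorXd Foo2(int n) { return Eigen::VectorXd::Constant(n, 1.0); }

int main() {
  const int N = 800;   // stand-in for LARGE_NUMBER
  const int kSeg = 8;  // stand-in for a typical entry of SIZES
  Eigen::VectorXd X = Eigen::VectorXd::Zero(N);

  auto time = [](auto&& body) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 10000; ++rep) body();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
  };

  const double t1 = time([&] {  // option (1): update in place through a Ref
    for (int i = 0; i + kSeg <= N; i += kSeg) Foo1(X.segment(i, kSeg));
  });
  const double t2 = time([&] {  // option (2): allocate, return, copy into the segment
    for (int i = 0; i + kSeg <= N; i += kSeg) X.segment(i, kSeg) = Foo2(kSeg);
  });

  // Print something that depends on X so the compiler cannot discard the work.
  std::cout << "option 1: " << t1 << " s, option 2: " << t2
            << " s, checksum " << X.sum() << "\n";
}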
Is calling Foo3 with segments of X guaranteed to always be at least as efficient as calling Foo4 with the same segments?
If you call Foo4(x.segment(...)), it allocates a new vector and copies the segment into it. That is significantly more expensive than Foo3, and the only thing you gain is that the new vector will be properly aligned. This is only a minor benefit on modern CPUs, so I would expect Foo3 to be more efficient.
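To make the copy visible, here is a small sketch where the function bodies (mine, purely for illustration) just print which memory they end up reading:
#include <Eigen/Dense>
#include <iostream>

void Foo3(const Eigen::Ref<const Eigen::VectorXd>& x) {
  std::cout << "Foo3 sees data at " << x.data() << "\n";  // same address as the segment: no copy
}

void Foo4(const Eigen::VectorXd& x) {
  std::cout << "Foo4 sees data at " << x.data() << "\n";  // address of a freshly allocated temporary
}

int main() {
  Eigen::VectorXd X = Eigen::VectorXd::Random(100);
  std::cout << "segment data at   " << X.segment(10, 5).data() << "\n";
  Foo3(X.segment(10, 5));  // the Ref maps the existing memory
  Foo4(X.segment(10, 5));  // binds to a temporary VectorXd: allocation plus copy
}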
Note that there is one option that you have not considered: Use templates.
template <class Derived>
void Foo1(const Eigen::MatrixBase<Derived>& x) {
  Eigen::MatrixBase<Derived>& mutable_x = const_cast<Eigen::MatrixBase<Derived>&>(x);
  // Update mutable_x.
}
The const-cast is annoying but harmless. Please refer to Eigen's documentation on that topic. https://eigen.tuxfamily.org/dox/TopicFunctionTakingEigenTypes.html
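A minimal usage sketch of this templated Foo1, with a placeholder update of my own (setConstant) so it compiles on its own:
#include <Eigen/Dense>

template <class Derived>
void Foo1(const Eigen::MatrixBase<Derived>& x) {
  // Cast away the constness, as described in the Eigen documentation linked above.
  Eigen::MatrixBase<Derived>& mutable_x = const_cast<Eigen::MatrixBase<Derived>&>(x);
  mutable_x.setConstant(1.0);  // placeholder update
}

int main() {
  Eigen::VectorXd X = Eigen::VectorXd::Zero(100);
  Foo1(X);                // full vector: Eigen knows the data is aligned
  Foo1(X.segment(3, 7));  // segment: same code, writes directly into X's memory
}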
Overall, the templated approach allows approximately the same performance as if you had inlined the function body. In your particular case it may not be any faster than the Ref-based Foo1, though, because a plain segment and a Ref object have basically the same performance.
Efficiency of accessing Ref vs. Vector
Let's look in more detail at the performance of computations on an Eigen::Vector, an Eigen::Ref<Vector>, an Eigen::Matrix, and an Eigen::Ref<Matrix>. Eigen::Block (the return type of Vector.segment() or Matrix.block()) is functionally identical to Ref, so I don't bother mentioning it further.
- Vector and Matrix guarantee that the array as a whole is aligned to 16-byte boundaries. That allows operations to use aligned memory accesses (e.g. movapd in this instance).
- Ref does not guarantee alignment and therefore requires unaligned accesses (e.g. movupd). On very old CPUs this used to have a significant performance penalty. These days it is less relevant. It is nice to have alignment, but it is no longer the be-all and end-all for vectorization. To quote Agner on that topic [1]:
Some microprocessors have a penalty of several clock cycles when accessing misaligned data that cross a cache line boundary.
Most XMM instructions without VEX prefix that read or write 16-byte memory operands require that the operand is aligned by 16. Instructions that accept unaligned 16-byte operands can be quite inefficient on older processors. However, this restriction is largely relieved with the AVX and later instruction sets. AVX instructions do not require alignment of memory operands, except for the explicitly aligned instructions. Processors that support the AVX instruction set generally handle misaligned memory operands very efficiently.
- All four data types guarantee that the inner dimension (the only dimension of a vector, a single column of a matrix) is stored consecutively, so Eigen can vectorize along this dimension.
- Ref does not guarantee that elements along the outer dimension are stored consecutively. There may be a gap from one column to the next. This means that operations like Matrix + Matrix or Matrix * Scalar can use a single loop over all elements in all rows and columns, while Ref + Ref needs a nested loop with an outer loop over all columns and an inner loop over all rows (see the stride sketch after this list).
- Neither Ref nor Matrix guarantees proper alignment for a specific column. Therefore most matrix operations such as matrix-vector products need to use unaligned accesses.
- If you create a vector or matrix inside a function, this may help escape and alias analysis. However, Eigen already assumes no aliasing in most instances and the code that Eigen creates leaves little room for the compiler to add anything. Therefore it is rarely a benefit.
- There are differences in the calling convention. For example, in Foo(Eigen::Ref<Vector>) the object is passed by value. Ref has a pointer, a size, and no destructor, so it will be passed in two registers. This is very efficient. It is less good for Ref<Matrix>, which consumes 4 registers (pointer, rows, columns, outer stride). Foo(const Eigen::Ref<const Vector>&) would create a temporary object on the stack and pass its pointer to the function. Vector Foo() returns an object that has a destructor, so the caller allocates space on the stack and then passes a hidden pointer to the function. Usually these differences are not significant, but they exist and may be relevant in code that does very little computation across many function calls.
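To illustrate the stride point referenced above, a small sketch (the sizes are arbitrary choices of mine):
#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::MatrixXd M = Eigen::MatrixXd::Zero(8, 8);

  // A 4x4 block of an 8x8 matrix: each column is contiguous (inner stride 1),
  // but consecutive columns are 8 doubles apart (the outer stride), leaving a gap.
  Eigen::Ref<Eigen::MatrixXd> block = M.block(2, 2, 4, 4);
  std::cout << "block inner stride: " << block.innerStride()
            << ", outer stride: " << block.outerStride() << "\n";  // prints 1 and 8

  // A vector segment is fully contiguous, so Ref<VectorXd> only loses the
  // 16-byte alignment guarantee, not the consecutive layout.
  Eigen::VectorXd X = Eigen::VectorXd::Zero(100);
  Eigen::Ref<Eigen::VectorXd> seg = X.segment(3, 10);
  std::cout << "segment inner stride: " << seg.innerStride() << "\n";  // prints 1
}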
With these differences in mind, let's look at the specific case at hand. You have not specified what the update method does, so I have to make some assumptions.
The computations will always be the same so we only have to look at memory allocations and accesses.
Example 1:
void Foo1(Eigen::Ref<Eigen::VectorXd> x) {
  x = Eigen::VectorXd::LinSpaced(x.size(), 0., 1.);
}

Eigen::VectorXd Foo2(int n) {
  return Eigen::VectorXd::LinSpaced(n, 0., 1.);
}

x.segment(..., n) = Foo2(n);
Foo1 does one unaligned memory write. Foo2 does one allocation and one aligned memory write into the temporary vector. Then it copies to the segment. That will use one aligned memory read and an unaligned memory write. Therefore Foo1 is clearly better in all circumstances.
Example 2:
void Foo3(Eigen::Ref<Eigen::VectorXd> x) {
  x = x * x.maxCoeff();
}

Eigen::VectorXd Foo4(const Eigen::Ref<const Eigen::VectorXd>& x) {
  return x * x.maxCoeff();
}

Eigen::VectorXd Foo5(const Eigen::Ref<const Eigen::VectorXd>& x) {
  Eigen::VectorXd rtrn = x;
  rtrn = rtrn * rtrn.maxCoeff();
  return rtrn;
}
Both Foo3 and 4 do two unaligned memory reads from x (one for the maxCoeff, one for the multiplication). After that, they behave the same as Foo1 and 2. Therefore Foo3 is always better than 4.
Foo5 does one allocation, one unaligned memory read, and one aligned memory write for the initial copy, then two aligned reads and one aligned write for the computation. After that follows the copy outside the function (same as with Foo2). This is still a lot more than what Foo3 does, but if you perform many more memory accesses on the vector, it may become worthwhile at some point. I doubt it, but such cases may exist.
The main take-away is this: Since you ultimately want to store the results in segments of an existing vector, you can never fully escape the unaligned memory accesses. So it is not worth worrying about them too much.
Template vs. Ref
A quick rundown of the differences:
The templated version will (if written properly) work on all data types and all memory layouts. For example if you pass a full vector or matrix, it can exploit the alignment.
There are cases where Ref will simply not compile, or work differently than expected. As written above, Ref guarantees that the inner dimension is stored consecutively. The call Foo1(Matrix.row(1)) will not work, because a matrix row is not stored consecutively in Eigen. And if you call a function with const Eigen::Ref<const Vector>&, Eigen will copy the row into a temporary vector. The templated version will work in these cases, but of course it cannot vectorize.
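Here is a small sketch of these three behaviors. The function names and printed addresses are mine; the const-Ref call uses transpose() so the 1xN row matches the Nx1 vector shape expected by Ref<const VectorXd>:
#include <Eigen/Dense>
#include <iostream>

void ByMutableRef(Eigen::Ref<Eigen::VectorXd> x) { x.setZero(); }

void ByConstRef(const Eigen::Ref<const Eigen::VectorXd>& x) {
  // For a non-contiguous argument this address belongs to a freshly copied
  // temporary, not to M's storage.
  std::cout << "const Ref sees " << x.data() << "\n";
}

template <class Derived>
void ByTemplate(const Eigen::MatrixBase<Derived>& x) {
  // Works directly on the original expression, whatever its layout.
  std::cout << "template sees  " << x.derived().data() << "\n";
}

int main() {
  Eigen::MatrixXd M = Eigen::MatrixXd::Random(4, 4);
  std::cout << "row(1) lives at " << M.row(1).data() << "\n";

  // ByMutableRef(M.row(1));         // does not compile: a row of a column-major matrix is not contiguous
  ByConstRef(M.row(1).transpose());  // compiles, but copies the row into a temporary vector
  ByTemplate(M.row(1));              // compiles and reads M's memory directly, though it cannot vectorize
}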
The Ref version has some benefits:
- It is clearer to read and has fewer chances to go wrong with unexpected inputs.
- You can put it in a cpp file, and it creates less redundant code. Depending on your use case, more compact code may be more beneficial or appropriate (see the sketch below).
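For the last point, a minimal sketch of what that split can look like (file and function names are hypothetical):
// update.h (hypothetical file name)
#pragma once
#include <Eigen/Dense>

// Ref version: only this declaration lives in the header; a single definition in
// update.cpp serves every caller, so no code is duplicated per argument type.
void UpdateSegment(Eigen::Ref<Eigen::VectorXd> x);

// Templated version: the definition must be visible to every caller, and each
// distinct argument type (VectorXd, a segment, a block, ...) instantiates its own copy.
template <class Derived>
void UpdateSegmentTpl(const Eigen::MatrixBase<Derived>& x) {
  Eigen::MatrixBase<Derived>& mutable_x = const_cast<Eigen::MatrixBase<Derived>&>(x);
  mutable_x.setZero();  // placeholder update
}

// update.cpp (hypothetical file name)
#include "update.h"

void UpdateSegment(Eigen::Ref<Eigen::VectorXd> x) { x.setZero(); }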