Speed-up Eigen C++ transpose?


I know that these 'Eigen speed-up' questions arise regularly, but after reading many of them and trying several compiler flags I cannot get a better time with C++ Eigen compared with the traditional way of performing a transpose. Actually, using cache blocking is much more efficient. The following is the code:

#include <cstdio>
#include <ctime>
#include <cstdlib>
#include <iostream>
#include <Eigen/Dense>

#define min( a, b ) ( ((a) < (b)) ? (a) : (b) )

int main(){
    const int n = 10000;
    const int csize = 32;
    float **a, **b;
    clock_t cputime1, cputime2;
    int i,j,k,ii,jj,kk;
  
    // Allocating memory for array/matrix
    a = new float * [n];
    for (i=0; i<n; i++){
        a[i] = new float [n];
    }
    b = new float * [n];
    for (i=0; i<n; i++){
        b[i] = new float[n];
    }
    // eigen matrices
    Eigen::MatrixXf M1 = Eigen::MatrixXf::Constant(n, n, 0.0);
    Eigen::MatrixXf M2 = Eigen::MatrixXf::Constant(n, n, 0.0);
    
    // Filling matrices with zeros
    for(i=0; i<n; ++i)
        for (j=0; j<n; ++j)
            a[i][j] = 0;
    for(i=0; i<n; ++i)
        for (j=0; j<n; ++j)
            b[i][j] = 0;

    // Direct (inefficient) transposition
    cputime1 = clock();
    for (i=0; i<n; ++i)
        for (j=0; j<n; ++j)
            a[i][j] = b[j][i];
    cputime2 = clock() - cputime1;
    std::printf("Time for transposition: %f\n", ((double)cputime2)/CLOCKS_PER_SEC);

    // Transposition using cache-blocking
    cputime1 = clock();
    for (ii=0; ii<n; ii+=csize)
        for (jj=0; jj<n; jj+=csize)
            for (i=ii; i<min(n,ii+csize-1); ++i)
                for (j=jj; j<min(n,jj+csize-1); ++j)
                    a[i][j] = b[j][i];
    cputime2 = clock() - cputime1;
    std::printf("Time for transposition: %f\n", ((double)cputime2)/CLOCKS_PER_SEC);

    // eigen
    cputime1 = clock();
    M1.noalias() = M2.transpose();
    cputime2 = clock() - cputime1;
    std::printf("Time for transposition with eigen: %f\n", ((double)cputime2)/CLOCKS_PER_SEC);

    // use data
    std::cout << a[n/2][n/2] << std::endl;
    std::cout << b[n/2][n/2] << std::endl;
    std::cout << M1(n/2,n/2) << std::endl;

    return 0;
}

And the compiling command I am using is

g++ -fno-math-errno -ffast-math -march=native -fopenmp -O2 -msse2 -DNDEBUG blocking_and_eigen.cpp

with results

Time for transposition: 1.926674
Time for transposition: 0.280653
Time for transposition with eigen: 2.018217

I am using Eigen 3.4.0 and g++ 11.2.0.

Do you have any suggestions to improve the Eigen performance? Thanks in advance.

CodePudding user response:

As suggested by INS in the comments, the actual copying of the matrix is what causes the performance drop. I slightly modified your example to fill the matrices with some numbers instead of all zeros (to avoid any kind of optimisation):

for(i=0; i<n; ++i) {
    for (j=0; j<n; ++j) {
        a[i][j] = i+j;
        M1(i,j) = i+j;
    }
}
for(i=0; i<n; ++i) {
    for (j=0; j<n; ++j) {
        b[i][j] = i+j;
        M1(i,j) = i+j;
    }
}

Also, I modified the final printing statements to do a full check over the result (when the transpose is not done in place, the check is performed against M2 instead):

    for (i=0; i<n; ++i)
        for (j=0; j<n; ++j)
            if (a[i][j] != M1(i,j))
                std::cout << "Diff here!" << std::endl;

Then I tried several tests:

  1. Preallocation and assignment

    Eigen::MatrixXf M2 = Eigen::MatrixXf::Constant(n, n, 0.0);
    // ... some code here ...
    M2 = M1.transpose();

  2. Copy constructor

    Eigen::MatrixXf M2(M1.transpose());

  3. In place

    M1.transposeInPlace();

  4. Copy construction using auto and C++17

    auto M2{ M1.transpose() };

This last one is the most puzzling: the performance is outstanding. I think there are two parts to the story. If I print the typeid names of M2 for cases 2 and 4, they are different; the names are mangled, but they give us a clue:

N5Eigen6MatrixIfLin1ELin1ELi0ELin1ELin1EEE
N5Eigen9TransposeINS_6MatrixIfLin1ELin1ELi0ELin1ELin1EEEEE

The auto keyword resolves to a different type, specific to the transposed matrix. The second part of the story is that M1 is not modified afterwards, so either the compiler elides the copy or, most likely, the Eigen::Transpose expression (https://eigen.tuxfamily.org/dox/classEigen_1_1Transpose.html) only keeps a reference to the original matrix and never copies it.
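
To make that concrete, here is a minimal sketch (my own addition, not part of the original answer) that prints the static types and shows that the auto variable behaves like a view over M1 rather than a copy; the small matrix size and variable names are arbitrary:

    #include <iostream>
    #include <typeinfo>
    #include <Eigen/Dense>

    int main(){
        Eigen::MatrixXf M1 = Eigen::MatrixXf::Random(4, 4);

        auto T = M1.transpose();              // Eigen::Transpose<Eigen::MatrixXf>: a lazy expression, no data copied
        Eigen::MatrixXf M2 = M1.transpose();  // forces evaluation into a brand new matrix

        std::cout << typeid(T).name() << std::endl;   // mangled Eigen::Transpose<...>
        std::cout << typeid(M2).name() << std::endl;  // mangled Eigen::Matrix<...>

        M1(0,0) = 42.0f;                      // the expression sees the change, the evaluated copy does not
        std::cout << T(0,0) << " " << M2(0,0) << std::endl;
        return 0;
    }

Nothing is actually computed when the auto variable is created; the transpose is evaluated only when the expression is assigned to a plain MatrixXf or read element by element, which is why case 4 below takes essentially no time.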

Results

Test   Direct (s)   Cache block (s)   Eigen (s)
1      2.633        0.312             1.861
2      2.599        0.262             1.968
3      2.602        0.262             0.216
4      2.552        0.280             0.000002
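
For completeness, here is a self-contained sketch that runs the four variants in a single program. It is my own re-creation under the same compile flags as above, not the exact code behind the table, so treat the details (Random initialisation, labels, reuse of M1 for cases 3 and 4) as assumptions:

    #include <cstdio>
    #include <ctime>
    #include <Eigen/Dense>

    int main(){
        const int n = 10000;
        Eigen::MatrixXf M1 = Eigen::MatrixXf::Random(n, n);
        Eigen::MatrixXf M2(n, n);
        clock_t t0;

        t0 = clock();                              // 1. preallocation and assignment
        M2 = M1.transpose();
        std::printf("assignment:       %f\n", ((double)(clock() - t0))/CLOCKS_PER_SEC);

        t0 = clock();                              // 2. copy constructor
        Eigen::MatrixXf M3(M1.transpose());
        std::printf("copy constructor: %f\n", ((double)(clock() - t0))/CLOCKS_PER_SEC);

        t0 = clock();                              // 3. in place (square matrix, swaps coefficients inside M1)
        M1.transposeInPlace();
        std::printf("in place:         %f\n", ((double)(clock() - t0))/CLOCKS_PER_SEC);

        t0 = clock();                              // 4. auto keeps the lazy Eigen::Transpose expression, nothing is evaluated
        auto M4{ M1.transpose() };
        std::printf("auto expression:  %f\n", ((double)(clock() - t0))/CLOCKS_PER_SEC);

        // touch the results so the compiler cannot discard the work
        std::printf("%f %f %f\n", M2(n/2,n/2), M3(n/2,n/2), M4(n/2,n/2));
        return 0;
    }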