Armadillo: Inefficient chaining of .t()


Consider the following two ways of doing the same thing.

#include <armadillo>

arma::Mat<double> B(5000, 5000, arma::fill::randu);
arma::Mat<double> C(5000, 500,  arma::fill::randu);

Okay, two dense matrices in memory. Now I want to multiply them into a new matrix, but with B transposed. Method 1:

arma::Mat<double> A = B.t() * C;

Method 2:

arma::Mat<double> Bt = B.t();
arma::Mat<double> A = Bt * C;

Which one is faster? Method 2, by a factor of about 2.5x! Pre-allocating A before the multiplication doesn't change the time for Method 2. It does speed up Method 1, but Method 1 is still about 2x as slow as Method 2.
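For reference, here is a minimal sketch of how such a comparison can be timed with Armadillo's built-in arma::wall_clock. The sizes match the setup above; the absolute numbers will of course depend on your BLAS and build flags.

#include <armadillo>
#include <iostream>

int main() {
    arma::Mat<double> B(5000, 5000, arma::fill::randu);
    arma::Mat<double> C(5000, 500,  arma::fill::randu);
    arma::wall_clock timer;

    timer.tic();
    arma::Mat<double> A1 = B.t() * C;    // Method 1: transpose inside the expression
    std::cout << "Method 1: " << timer.toc() << " s\n";

    timer.tic();
    arma::Mat<double> Bt = B.t();        // Method 2: materialise the transpose first
    arma::Mat<double> A2 = Bt * C;
    std::cout << "Method 2: " << timer.toc() << " s\n";

    return 0;
}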

This seems bizarre to me, since I would have thought that, absent any expression-template tricks at compile time, the machine code would be almost identical. So why would they have templated it in a way that actually makes it worse? Or am I missing something major?

Storing B.t() in memory as Bt and doing arma::inplace_trans(B) are about equally expensive time-wise. Obviously Bt = B.t() takes more memory, but you have the advantage of keeping both. I made B square so that the number of multiplications is the same as for A = B * C.
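For concreteness, the two alternatives look like this (a sketch; inplace_trans overwrites B, so only one of them would be used at a time):

// Alternative 1: keep both B and its transpose (costs an extra 5000 x 5000 matrix)
arma::Mat<double> Bt = B.t();

// Alternative 2: transpose B in place (no extra matrix, but B itself is replaced)
arma::inplace_trans(B);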

A = B * C --> 6.98 seconds

Bt = B.t(); A = Bt * C --> 7.02 seconds

A = B.t() * C --> 18.6124 seconds, or 14.56 seconds when A is pre-allocated (??)

I went down this rabbit hole to figure out how I should store B to be more efficient, since I can construct it either way, especially once I start extracting rows or columns. But the cost of extracting a row versus a column is actually unobservable at this scale! To be clear, A = B.rows(0, 499) * C is much faster than A = B.cols(0, 499).t() * C. I know they aren't the same mathematically, but if I had constructed B the other way around I was hoping for a performance benefit from accessing contiguous blocks of memory. Even A = B.rows(0, 499) and A = B.cols(0, 499) on their own are almost identical in cost, which came as a surprise to me, but the scope of the question is starting to get too big.
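A minimal sketch of that submatrix comparison, reusing B, C, the headers, and the same wall_clock pattern from the sketch above:

arma::wall_clock timer;

timer.tic();
arma::Mat<double> A_rows = B.rows(0, 499) * C;        // extract rows, multiply directly
std::cout << "rows(0,499) * C:     " << timer.toc() << " s\n";

timer.tic();
arma::Mat<double> A_cols = B.cols(0, 499).t() * C;    // extract columns, transpose, multiply
std::cout << "cols(0,499).t() * C: " << timer.toc() << " s\n";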

PS: I am using OpenBLAS.

Answer:

Hi all, I'm going to answer my own question here; it might be useful to others. The answer, for me, is that I was using a generic OpenBLAS rather than an Intel processor-specific version of BLAS, and I was running in debug mode.

With optimization at compile time and using an Intel processor-specific version of BLAS (see the build sketch after this list):

  1. Bt = B.t() and then A = Bt * C is definitely slower than A = B.t() * C, as we would expect, due to storing the intermediate result.
  2. A = B.t() * C is faster than A = B * C if B is square (I know these don't compute the same thing), but the difference is small, maybe 0-20% for the sizes I am using.
  3. Along similar lines, A = B.rows(0, 499) * C is slower than A = B.cols(0, 499).t() * C.
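For reference, a hypothetical build setup for the two configurations. The exact flags and link lines depend on your toolchain and on how MKL or OpenBLAS is installed, so treat these as illustrative only:

# optimized build: -O2, Armadillo's runtime debug checks disabled, linked against Intel MKL
g++ -O2 -DARMA_NO_DEBUG bench.cpp -o bench -lmkl_rt

# the original (slow) configuration: no optimization, debug checks on, generic OpenBLAS
g++ -O0 bench.cpp -o bench -lopenblas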

The explanation, I believe, is that column access is faster than row access: Armadillo stores matrices in column-major order, so B.t() * C reads columns of both B and C (contiguous memory), whereas B * C reads rows of B and columns of C.
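A minimal illustration of why that matters, assuming nothing beyond Armadillo's documented column-major layout: element (i, j) lives at mem[i + j * n_rows], so walking down a column is a stride-1, cache-friendly traversal, while walking along a row jumps by n_rows doubles each step.

const double* mem = B.memptr();   // pointer to B's column-major storage

double col0 = 0.0, row0 = 0.0;
for (arma::uword i = 0; i < B.n_rows; ++i)
    col0 += mem[i];                // column 0: contiguous, stride 1
for (arma::uword j = 0; j < B.n_cols; ++j)
    row0 += mem[j * B.n_rows];     // row 0: strided by n_rows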

All of this is much faster than hand-written loops. So use BLAS rather than manual C++ loops; that matters far more than worrying about rows versus columns.
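To make that concrete, here is a sketch of the naive alternative. On matrices of this size the triple loop is typically orders of magnitude slower than the single BLAS-backed expression, though the actual ratio depends on the BLAS and the hardware:

// Naive triple-loop version of A = B * C
arma::Mat<double> A_loop(B.n_rows, C.n_cols, arma::fill::zeros);
for (arma::uword i = 0; i < B.n_rows; ++i)
    for (arma::uword j = 0; j < C.n_cols; ++j)
        for (arma::uword k = 0; k < B.n_cols; ++k)
            A_loop(i, j) += B(i, k) * C(k, j);

// The same product via Armadillo, which dispatches to the linked BLAS (dgemm)
arma::Mat<double> A_blas = B * C;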

One anomaly: A = B.rows(0, 499) is still faster than A = B.cols(0, 499). Any ideas on why would be appreciated!

P.S. Tips on handling tensors with more than two dimensions in C++ would also be appreciated. I hate arma::Cube, although I do use it.
