Performance abnormality in homebrew array reduction


The code below implements two array classes, one-dimensional and two-dimensional (column-major order), plus a clock for measuring wall-clock time.

The function of concern is a reduction of the 2d array into a 1d array via a lambda callback, either along the rows or along the columns. In both cases the 2d array is traversed in the same order. However, dropping the row dimension takes almost twice as long as dropping the column dimension, which is unclear to me because the main performance driver should be the traversal of the 2d array.

#include <iostream>
#include <string>
#include <chrono>
#include <cstdlib> // malloc/free
#define i64 long long int
using namespace std;
class hdclock{
private:
  std::chrono::time_point<std::chrono::high_resolution_clock> start;
public:
  void tic(){
    this->start=std::chrono::high_resolution_clock::now();
  };
  double toc(){
    auto end=std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - this->start);
    return((double)duration.count()/1000.0);
  }
};
template<class T> class arr1d;
template<class T> class arr2d;
template<class T>
class base{
protected:
  i64 nelements=0;
  T * val=nullptr;
public:
  base(i64 nelements){
    this->nelements=nelements;
    this->val=(T*)malloc(sizeof(T)*this->nelements);
    for(i64 i=0;i<this->nelements;++i){(*this)(i)=(T)i;}
  }
  virtual ~base(){free(this->val);}
  const T& operator()(i64 i)const{return(this->val[i]);}
  T& operator()(i64 i){return(this->val[i]);}
  const i64& size()const{return(this->nelements);}
};
enum class drop{rows,columns};
template<class T>
class arr1d:public base<T>{
protected:
  i64 d1=0;
public:
  arr1d(i64 d1):base<T>(d1){this->d1=d1;};
  ~arr1d(){};
  template<typename F>
  arr1d& reduction(const arr2d<T> &ii,F f,const drop which);
};
template<class T>
class arr2d:public base<T>{
protected:
  i64 d1=0,d2=0;
public:
  arr2d(i64 d1, i64 d2):base<T>(d1*d2){this->d1=d1;this->d2=d2;};
  ~arr2d(){};
  const T& operator()(i64 i,i64 j)const{return(this->val[j*this->d1+i]);}
  T& operator()(i64 i,i64 j){return(this->val[j*this->d1+i]);}
  const i64& size(i64 i)const{if(i==1){return(d1);}else{return(d2);}}
};
//@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
template<typename T> template<typename F>
arr1d<T>& arr1d<T>::reduction(const arr2d<T> &ii,F f,const drop which){
  switch(which){
  case drop::rows:
    if(this->d1!=ii.size(2)){string msg="err";throw msg;}
    // drop rows: each output element (*this)(i) accumulates an entire column of ii
    for(i64 i=0;i<ii.size(2);++i){
      for(i64 j=0;j<ii.size(1);++j){
        f((*this)(i),ii(j,i));
      }
    }
    break;
  case drop::columns:
    if(this->d1!=ii.size(1)){string msg="err";throw msg;}
    // drop columns: each output element (*this)(j) accumulates one row of ii
    for(i64 i=0;i<ii.size(2);++i){
      for(i64 j=0;j<ii.size(1);++j){
        f((*this)(j),ii(j,i));
      }
    }
    break;
  }
  return *this;
}
int main(){
  arr2d<double> x(70000,70000);
  arr1d<double> y(70000);
  hdclock t;
  try{
    t.tic();
    y.reduction(x,[](double &a, const double &b){a+=b;},drop::columns);
    cout<<t.toc()<<endl;
    for(i64 i=0;i<y.size();++i){y(i)=0.0;}
    t.tic();
    y.reduction(x,[](double &a, const double &b){a+=b;},drop::rows);
    cout<<t.toc()<<endl;
  }catch(string msg){
    cout<<msg<<endl;return(1);
  }
  return(0);
}

Compiled with clang 12.0.1 or g++ 11.1 with the flags -std=c++20 -O3, dropping columns takes 2.2 seconds and dropping rows takes 4.5 seconds (Intel i9-9980HK, 64 GB RAM).

Any suggestions or explanations for the performance difference, and possible ways to speed up the slower case, are highly appreciated.

Thanks and best regards

CodePudding user response:

It's not trivial to pinpoint the exact reasons for this behaviour, as they depend on a lot of different factors. My best advice is to look at the assembly. Intel VTune is also great for understanding what's going on inside the CPU.

I can speculate about two possible reasons for this difference:

  1. Different vectorization behaviour of the compiler. The compiler might have generated efficient vectorized code for one case and not the other. Look at the assembly to check whether that is the case.

  2. Long dependency chains. In the rows case, you're summing into a single output cell at a time, so every addition depends on the result of the previous one. That prevents the CPU from executing those additions in parallel (modern Intel CPUs can do something like 4 independent additions per clock). A sketch of how to break that chain follows below.
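
One way to break the dependency chain is to keep several independent partial sums per output element. Below is a minimal, self-contained sketch of that idea for the drop::rows case; it works on a raw column-major buffer rather than the classes above, and the names sum_column_unrolled and reduce_rows are made up for this illustration.

#include <iostream>
#include <vector>

using i64 = long long int;

// Sum one column (d1 contiguous values) using four independent partial sums,
// so the CPU can overlap the floating-point additions instead of waiting for
// each one to finish before starting the next.
double sum_column_unrolled(const double *col, i64 d1){
  double s0=0.0, s1=0.0, s2=0.0, s3=0.0;
  i64 j=0;
  for(; j+4<=d1; j+=4){            // four independent accumulation chains
    s0+=col[j];
    s1+=col[j+1];
    s2+=col[j+2];
    s3+=col[j+3];
  }
  for(; j<d1; ++j){ s0+=col[j]; }  // remainder
  return (s0+s1)+(s2+s3);          // combine the partial sums once at the end
}

// drop::rows reduction: one output element per column of a column-major d1 x d2 matrix.
// Assumes out[] is zero-initialized, like y in the question.
void reduce_rows(const double *val, i64 d1, i64 d2, double *out){
  for(i64 i=0; i<d2; ++i){
    out[i]+=sum_column_unrolled(val+i*d1, d1);
  }
}

int main(){
  const i64 d1=4, d2=3;
  std::vector<double> val(d1*d2);
  for(i64 k=0; k<d1*d2; ++k){ val[k]=(double)k; }  // 0..11, like the base constructor
  std::vector<double> out(d2, 0.0);
  reduce_rows(val.data(), d1, d2, out.data());
  for(double v : out){ std::cout<<v<<' '; }        // expected: 6 22 38
  std::cout<<'\n';
  return 0;
}

Note that this changes the order in which the floating-point additions are performed, so the result can differ from the strictly sequential sum in the last bits; that reordering is exactly what the compiler is not allowed to do on its own.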

Also, have you tried using -march=native?
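
For reference, a full invocation with that flag might look like this (assuming the source file is named reduction.cpp):

clang++ -std=c++20 -O3 -march=native reduction.cpp -o reduction

-march=native lets the compiler use all vector instruction sets available on the host CPU (e.g. AVX2 on the i9-9980HK), which mainly helps once the loops vectorize at all.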

CodePudding user response:

g++ -O3 -std=c++20 -fopt-info-vec-all gives some insight: it shows that the rows case is not vectorized, but no reason is given.

However, clang++ -O3 -std=c++20 -Rpass-analysis=loop-vectorize is more helpful, reporting: remark: loop not vectorized: cannot prove it is safe to reorder floating-point operations; allow reordering by specifying '#pragma clang loop vectorize(enable)' before the loop or by providing the compiler option '-ffast-math'. [-Rpass-analysis=loop-vectorize] y.reduction(x,[](double &a, const double &b){a+=b;},drop::rows);
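
Following that remark, an alternative to the global -ffast-math switch is to allow the reordering only for the affected loop. A sketch of what that could look like inside the drop::rows branch (clang-specific pragma, placed before the accumulation loop as the remark suggests; untested here):

  case drop::rows:
    if(this->d1!=ii.size(2)){string msg="err";throw msg;}
    for(i64 i=0;i<ii.size(2);++i){
      #pragma clang loop vectorize(enable)
      for(i64 j=0;j<ii.size(1);++j){
        f((*this)(i),ii(j,i));
      }
    }
    break;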

Indeed, adding -ffast-math to the compiler options reverses the ranking: dropping columns now takes 2.25 seconds and dropping rows 1.5 seconds.
