I have been comparing the run times of intrinsics vector reduction, naive vector reduction, and vector reduction using OpenMP pragmas. However, I found the results to differ across these scenarios. The code is as follows (the intrinsics vector reduction is taken from Fastest way to do horizontal SSE vector sum (or other reduction)):
#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
#include <algorithm>
#include <immintrin.h>
inline float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v); // broadcast elements 3,1 to 2,0
    __m128 sums = _mm_add_ps(v, shuf);
    shuf = _mm_movehl_ps(shuf, sums); // high half -> low half
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

float hsum256_ps_avx(__m256 v) {
    __m128 vlow  = _mm256_castps256_ps128(v);
    __m128 vhigh = _mm256_extractf128_ps(v, 1); // high 128
    vlow = _mm_add_ps(vlow, vhigh);             // add the low 128
    return hsum_ps_sse3(vlow);                  // and inline the sse3 version, which is optimal for AVX
    // (no wasted instructions, and all of them are the 4B minimum)
}
void reduceVector_Naive(std::vector<float> values){
    float result = 0;
    for(int i=0; i<int(1e8); i++){
        result += values.at(i);
    }
    printf("Reduction Naive = %f \n", result);
}
void reduceVector_openmp(std::vector<float> values){
    float result = 0;
    #pragma omp simd reduction(+: result)
    for(int i=0; i<int(1e8); i++){
        result += values.at(i);
    }
    printf("Reduction OpenMP = %f \n", result);
}
void reduceVector_intrinsics(std::vector<float> values){
    float result = 0;
    float* data_ptr = values.data();
    for(int i=0; i<1e8; i+=8){
        result += hsum256_ps_avx(_mm256_loadu_ps(data_ptr + i));
    }
    printf("Reduction Intrinsics = %f \n", result);
}
int main(){
    std::vector<float> values;
    for(int i=0; i<1e8; i++){
        values.push_back(1);
    }
    reduceVector_Naive(values);
    reduceVector_openmp(values);
    reduceVector_intrinsics(values);
    // The result should be 1e8 in each case
}
However, my output is as follows -
Reduction Naive = 16777216.000000
Reduction OpenMP = 16777216.000000
Reduction Intrinsics = 100000000.000000
As can be seen, only the intrinsics version calculated it correctly, while the other two ran into precision issues. I am fully aware of the precision issues one might face with float due to rounding, so my question is: why did the intrinsics version get the correct answer even though it, too, is floating-point arithmetic?
I am compiling it as - g++ -mavx2 -march=native -O3 -fopenmp main.cpp
Tried with GCC version 7.5.0 as well as 10.3.0.
TIA
CodePudding user response:
The naïve loop adds 1.0 on each iteration, and it stops adding at 16777216.000000, since a binary32 float does not have enough significant digits beyond that point. See this Q&A: Why does a float variable stop incrementing at 16777216 in C#?
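To see that limit in isolation, here is a minimal standalone check (not part of your code) showing that adding 1.0f to 2^24 leaves the value unchanged under binary32 round-to-nearest:

#include <cstdio>

int main() {
    float x = 16777216.0f;  // 2^24: above this, consecutive floats are 2.0 apart
    float y = x + 1.0f;     // exact result 16777217 is a tie, rounds back down to 16777216
    printf("x     = %.1f\n", x);
    printf("x + 1 = %.1f\n", y);                    // prints 16777216.0
    printf("stuck = %s\n", x == y ? "yes" : "no");  // prints "yes"
}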
When you add a computed horizontal sum instead, each step adds 8.0, so the value at which the accumulator stops growing is about 16777216 * 8 = 134217728; you just don't reach it in your experiment.
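A small check along the same lines (again separate from your code, with the thresholds taken from the reasoning above) illustrates where adding 8.0f stops having an effect:

#include <cstdio>

int main() {
    float a = 16777216.0f;       // 2^24: spacing between adjacent floats here is 2.0
    printf("%.1f\n", a + 8.0f);  // 16777224.0 -- adding 8.0 still works

    float b = 134217728.0f;      // 2^27: spacing between adjacent floats here is 16.0
    printf("%.1f\n", b + 8.0f);  // 134217728.0 -- the +8.0 is lost, accumulation would stall
}

Since your target of 1e8 lies below 134217728, the 8.0-per-step accumulation never hits that wall, which is why the intrinsics version comes out exact for this input.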