Do FMA instructions of different CPUs have different intermediate accuracy? If yes, then how does a compiler maintain floating-point accuracy between different machines?


When I run an FMA-optimized Horner-scheme polynomial computation (for a cosine approximation), it makes 0.161 ulps of error on an FX8150 but only 0.154 ulps on godbolt.org's server, despite the lack of -ffast-math (GCC).

If this is caused by hardware, and if the accuracy differs per hardware, what does a C++ compiler do to maintain floating-point accuracy between different machines?

Is there only a minimum accuracy requirement in programming language specs, so that any CPU vendor can boost precision as high as they want?

Minimal reproducible sample:

#include<iostream>
        // only optimized for [-1,1] input range
        template<typename Type, int Simd>
        inline
        void cosFast(Type * const __restrict__ data, Type * const __restrict__ result) noexcept
        {
            alignas(64)
            Type xSqr[Simd];

            for(int i=0;i<Simd;i++)
            {
                xSqr[i] = data[i]*data[i];
            }
            for(int i=0;i<Simd;i++)
            {
                result[i] = Type(2.425144155360214881511638e-05);
            }
            for(int i=0;i<Simd;i++)
            {
                result[i] = result[i]*xSqr[i] + Type(-0.001388599083010255696990498);
            }
            for(int i=0;i<Simd;i++)
            {
                result[i] = result[i]*xSqr[i] + Type(0.04166657759826541962411284);
            }
            for(int i=0;i<Simd;i++)
            {
                result[i] = result[i]*xSqr[i] + Type(-0.4999999436679569697616898);
            }
            for(int i=0;i<Simd;i++)
            {
                result[i] = result[i]*xSqr[i] + Type(0.9999999821855363180134191);
            }
        }


#include<cstring>
#include<cstdint>
// interprets the bit patterns as unsigned integers; assumes sizeof(T)==4 (float)
template<typename T>
uint32_t GetUlpDifference(T a, T b)
{
    uint32_t aBitValue;
    uint32_t bBitValue;
    std::memcpy(&aBitValue,&a,sizeof(T));
    std::memcpy(&bBitValue,&b,sizeof(T));
    return (aBitValue > bBitValue) ?
           (aBitValue - bBitValue) :
           (bBitValue - aBitValue);
}
#include<vector>
template<typename Type>
float computeULP(std::vector<Type> real, std::vector<Type> approximation)
{
    int ctr = 0;
    Type diffSum = 0;
    for(auto r:real)
    {
        Type diff = GetUlpDifference(r,approximation[ctr++]);
        diffSum += diff;
    }
    return diffSum/ctr;
}

template<typename Type>
float computeMaxULP(std::vector<Type> real, std::vector<Type> approximation)
{
    int ctr = 0;
    Type mx = 0;
    int index = -1;
    Type rr = 0;
    Type aa = 0;
    for(auto r:real)
    {
        Type diff = GetUlpDifference(r,approximation[ctr++]);
        if(mx<diff)
        {
            mx = diff;
            rr=r;
            aa=approximation[ctr-1];
            index = ctr-1;
        }
    }
    std::cout<<"("<<index<<":"<<rr<<"<-->"<<aa<<")";
    return mx;
}
#include<cmath>
void test()
{
    constexpr int n = 8192*64;
    std::vector<float> a(n),b(n),c(n);
    for(int i=0;i<n;i++)
        a[i]=(i-(n/2))/(float)(n/2);

    // approximation
    for(int i=0;i<n;i+=16)
        cosFast<float,16>(a.data()+i,b.data()+i);

    // exact
    for(int i=0;i<n;i++)
        c[i] = std::cos(a[i]);
    
    std::cout<<"avg. ulps: "<<computeULP(b,c)<<std::endl;
    std::cout<<"max. ulps: "<<computeMaxULP(b,c)<<std::endl;
}

int main()
{
    test();
    return 0;
}

Proof that it uses FMA:

https://godbolt.org/z/Y4qYMoxcn

.L23:
    vmovups ymm3, YMMWORD PTR [r12+rax]
    vmovups ymm2, YMMWORD PTR [r12+32+rax]
    vmulps  ymm3, ymm3, ymm3
    vmulps  ymm2, ymm2, ymm2
    vmovaps ymm1, ymm3
    vmovaps ymm0, ymm2
    vfmadd132ps     ymm1, ymm7, ymm8
    vfmadd132ps     ymm0, ymm7, ymm8
    vfmadd132ps     ymm1, ymm6, ymm3
    vfmadd132ps     ymm0, ymm6, ymm2
    vfmadd132ps     ymm1, ymm5, ymm3
    vfmadd132ps     ymm0, ymm5, ymm2
    vfmadd132ps     ymm1, ymm4, ymm3
    vfmadd132ps     ymm0, ymm4, ymm2
    vmovups YMMWORD PTR [r13+0+rax], ymm1
    vmovups YMMWORD PTR [r13+32+rax], ymm0
    add     rax, 64
    cmp     rax, 2097152
    jne     .L23

This instance (I don't know whether it was a Xeon or an Epyc) further improved it, to 0.152 ulps on average.

CodePudding user response:

Regarding the C++ language, there are no strong requirements; it is mostly implementation-defined, as stated in the previous answer pointed out by @Maxpm in the comments.

The main standard for floating-point arithmetic is IEEE-754. It is generally implemented properly by most vendors nowadays (at least nearly all recent mainstream x86-64 CPUs and most mainstream GPUs). Conformance is not required by the C++ standard, but you can check for it with std::numeric_limits<T>::is_iec559.
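
For example, a minimal check (just printing the trait for the types you care about) could look like this:

#include<iostream>
#include<limits>

int main()
{
    // true (1) when the type conforms to IEC 559 / IEEE-754
    std::cout << "float  is IEEE-754: " << std::numeric_limits<float>::is_iec559 << std::endl;
    std::cout << "double is IEEE-754: " << std::numeric_limits<double>::is_iec559 << std::endl;
    return 0;
}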

The IEEE-754 standard requires operations to be correctly computed (i.e. with an error of less than 1 ULP) using the chosen rounding method. Several rounding methods are supported by the norm, but the most common is round to nearest. The standard also requires some operations, like FMA, to be implemented with the same guarantee (the FMA rounds only once, after both the multiply and the add). As a result, you cannot expect results to be computed with a precision better than 1 ULP per operation under this standard (rounding may help to reach 0.5 ULP on average, or even better depending on the actual algorithm used).
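
To make the single rounding of the FMA visible, here is a small sketch (the input is just one hand-picked value where the two roundings of a separate multiply+add lose the low bit):

#include<cmath>
#include<cstdio>

int main()
{
    // (1 + 2^-12)^2 - 1 = 2^-11 + 2^-24 exactly, which needs more than
    // 24 significand bits, so the final rounding matters.
    float a = 1.0f + 0x1.0p-12f;
    float c = -1.0f;

    // rounded twice (after * and after +); compile with -ffp-contract=off
    // to be sure the compiler does not contract this into an FMA itself
    float separate = a * a + c;
    // rounded once, at the very end
    float fused = std::fma(a, a, c);

    std::printf("separate: %a\n", separate); // 0x1p-11
    std::printf("fused:    %a\n", fused);    // 0x1.0008p-11
    return 0;
}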

In practice, the computing units of IEEE-754-compliant hardware use a higher precision internally so as to fulfil the requirement whatever the provided input. Still, when results are stored in a register or in memory, they need to be rounded the way IEEE-754 specifies. On x86-64 processors, SIMD registers like those of SSE, AVX and AVX-512 have a well-known fixed size. Each lane is either 16-bit (half-float), 32-bit (float) or 64-bit (double) for floating-point operations. IEEE-754-compliant rounding should be applied for each instruction. While processors could theoretically implement clever optimizations like fusing two FP instructions into one (as long as the precision stays below 1 ULP), AFAIK none does that yet (though fusing is done for some instructions, like conditional branches).

Differences between IEEE-754 platforms can be due either to the compiler or to the configuration of the FP units by the hardware vendor.

Regarding the compiler, optimizations can increase the precision while staying IEEE-754 compliant. For example, the use of FMA instructions in your code is an optimization that increases the precision of the result, but the compiler is not required to use it on x86-64 platforms (in fact, not all x86-64 processors support FMA). Compilers may use separate multiply and add instructions for various reasons (Clang sometimes does that). A compiler can also pre-compute some constants using a better precision than the target processor (GCC, for example, operates on FP numbers with a much higher precision when generating compile-time constants). Additionally, different rounding methods could be used to compute constants.
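
As an illustration (a sketch; the flag names below are GCC/Clang's, and the "contracted" output assumes the compiler actually fuses), the same multiply+add can give two different correctly-rounded results depending on contraction:

// Build it twice and compare:
//   g++ -O3 -ffp-contract=off  contract.cpp   // separate mul + add
//   g++ -O3 -ffp-contract=fast contract.cpp   // may contract into an FMA
#include<cstdio>

// one Horner step, a candidate for contraction into a vfmadd* instruction
float horner_step(float acc, float xSqr, float coeff)
{
    return acc * xSqr + coeff;
}

int main()
{
    volatile float v = 1.0f + 0x1.0p-12f; // volatile blocks constant folding
    float a = v;
    // typically prints 0x1p-11 without contraction, 0x1.0008p-11 with it
    std::printf("%a\n", horner_step(a, a, -1.0f));
    return 0;
}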

Regarding the hardware vendor, the default rounding mode can change from one platform to another. In your case, the very small difference may be due to that: the rounding mode could be "round to nearest, ties to even" on one platform and "round to nearest, ties away from zero" on the other, resulting in a very small, but visible, difference. You can set the rounding mode using the code provided in this answer. Note also that denormal numbers are sometimes disabled on some platforms because of their very high overhead (see this for more information), although that makes the results non-IEEE-754-compliant. You should check whether this is the case.
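
Here is a minimal sketch of both checks, assuming standard <cfenv> support (FE_UPWARD is only used to make the rounding-mode effect obvious):

#include<cfenv>
#include<cmath>
#include<cstdio>

// some compilers also want: #pragma STDC FENV_ACCESS ON

int main()
{
    // rounding mode: nearbyint rounds according to the current mode
    std::fesetround(FE_TONEAREST);            // the usual default
    std::printf("%g\n", std::nearbyint(2.5)); // 2 (ties to even)
    std::fesetround(FE_UPWARD);
    std::printf("%g\n", std::nearbyint(2.5)); // 3 (toward +infinity)
    std::fesetround(FE_TONEAREST);

    // denormals: with flush-to-zero enabled this prints 0x0p+0
    // instead of a tiny non-zero value
    volatile float tiny = 0x1.0p-140f;        // subnormal float
    std::printf("%a\n", tiny * 0.5f);
    return 0;
}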

To put it shortly, a difference of less than 1 ULP is totally normal between two IEEE-754-compliant platforms, and it is actually quite frequent between very different platforms (e.g. Clang on POWER vs GCC on x86-64).
