IEEE 754 Addition of two 32-bit floating point numbers (-1 and 2^(-50) )

Time:10-30

Consider the following piece of C++ code:

#include <iostream>
#include <cmath>

using namespace std;

int main()
{
    cout.precision(1000000000);
    
    float a,b,c;
    
    a = 1;
    b = -1;
    c = pow(2, -50);
    
    cout << "a = " << a << endl;
    cout << "b = " << b << endl;
    cout << "c = " << c << endl;
    
    float ab = a + b;
    float bc = b + c;
    float abc = ab + c;
    float bca = bc + a;

    cout << "a + b = " << ab << endl;
    cout << "b + c = " << bc << endl;
    cout << "(a + b) + c = " << abc << endl;
    cout << "(b + c) + a = " << bca << endl;

    return 0;
}

Which yields the output:

a = 1
b = -1
c = 8.8817841970012523233890533447265625e-16
a + b = 0
b + c = -1
(a + b) + c = 8.8817841970012523233890533447265625e-16
(b + c) + a = 0

Why is b + c = -1?

I am not getting my head around this effect of the IEEE 754 standard.

To my understanding the exponent ranges from -126 to 127. (8 bit for the biased exponent with a bias of 127.)

So 2^(-50) is representable without an issue, as are 1 and -1. None of them is a subnormal (denormalized) number, if I understand the standard correctly.

But why does the addition -1 + 2^(-50) result in -1, with the smaller number being neglected?

Thanks in advance for any help!

CodePudding user response:

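IEEE 754 single precision uses 1 sign bit, 8 exponent bits and 23 stored mantissa bits, which together with the implicit leading 1 give 24 bits of precision. When performing addition, the mantissa of the smaller number is shifted right to align it with the exponent of the larger one, so 2^-50 is a 1 shifted right by 50 bits relative to 1. This puts it far outside the 24-bit mantissa available for the result, and the sum rounds back to -1. You can check this by repeating your experiment with 2^-24, which is still large enough to change the result, and with 2^-25, which is rounded away under the default round-to-nearest mode.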

CodePudding user response:

You are using float which is (at least) single precision. Use double instead.

And -1 + 9e-16 is within roundoff of -1 in single precision.
