128 bit floating point binary representation error


Let's say we have some 128-bit floating point number, for example x = 2.6 (1.3 × 2^1 in IEEE-754). I put it in a union like this:

union flt {
    long double flt;
    int64_t byte8[OCTALC];   /* OCTALC presumably 2: two 64-bit words for 128 bits */
} d;
d.flt = x;

Then I run this to get its hexadecimal representation in memory:

/* Print each byte at ptr as two hex digits, in memory (address) order. */
void print_bytes(void *ptr, int size)
{
    unsigned char *p = ptr;
    int i;
    for (i = 0; i < size; i++) {
        printf("%02hhX ", p[i]);
    }
    printf("\n");
}

// somewhere in the code
print_bytes(&d.byte8[0], 16);

And I get something like

66 66 66 66 66 66 66 A6 00 40 00 00 00 00 00 00

So I expected one of the leading bits (the left ones) to be 1 (because the exponent of 2.6 is 1), but in fact I see the bits on the right set to 1 (as if the value were treated as big-endian). If I flip the sign, the output changes to:

66 66 66 66 66 66 66 A6 00 C0 00 00 00 00 00 00

So it seems like the sign bit is further to the right than I thought. And if you count the bytes, it looks like only 10 bytes are used and the remaining 6 are truncated or something. I'm trying to find out why this happens. Any help?

CodePudding user response:

You have a number of misconceptions.

First of all, you don't have a 128-bit floating point number. On x86-64, long double is most likely stored in the x86 extended-precision format. This is an 80-bit (10-byte) value, which is padded to 16 bytes. (I suspect this is for alignment purposes.)
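If you want to check what your long double actually is, a quick probe with <float.h> is enough; this is just a sketch, and the commented values are what I'd expect on a typical x86-64 target:

#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("sizeof(long double) = %zu\n", sizeof(long double)); /* typically 16 on x86-64 */
    printf("LDBL_MANT_DIG = %d\n", LDBL_MANT_DIG);              /* 64 for x87 extended precision */
    printf("LDBL_MAX_EXP  = %d\n", LDBL_MAX_EXP);               /* 16384 for x87 extended precision */
    return 0;
}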

And of course, it's going to be in little-endian byte order (since this is x86/x86-64). This doesn't refer to the order of the bits in each byte; it refers to the order of the bytes in the value as a whole.
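A minimal sketch to convince yourself of the byte order, using a 32-bit integer whose bytes are easy to tell apart:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    uint32_t v = 0x11223344;
    unsigned char b[sizeof v];
    memcpy(b, &v, sizeof v);
    /* On a little-endian machine this prints: 44 33 22 11 */
    for (size_t i = 0; i < sizeof v; i++)
        printf("%02X ", b[i]);
    printf("\n");
    return 0;
}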

And finally, the exponent is biased. An exponent of 1 isn't stored as 1. It's stored as 1 + 0x3FFF = 0x4000. This allows for negative exponents.
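Putting the byte order and the bias together, here is a sketch (assuming the x86-64 layout described above, so the byte indices are target-specific) that pulls the sign and exponent out of bytes 8 and 9:

#include <stdio.h>
#include <string.h>

int main(void)
{
    long double x = 2.6L;
    unsigned char b[sizeof x];
    memcpy(b, &x, sizeof x);
    /* Bytes 8..9 (little-endian) hold the 15-bit biased exponent plus the sign bit. */
    unsigned se = b[8] | ((unsigned)b[9] << 8);   /* 0x4000 for 2.6 */
    int sign = se >> 15;                          /* 0 = positive */
    int exponent = (int)(se & 0x7FFF) - 0x3FFF;   /* 0x4000 - 0x3FFF = 1 */
    printf("sign = %d, unbiased exponent = %d\n", sign, exponent);
    return 0;
}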


So we get the following:

66 66 66 66 66 66 66 A6 00 40 00 00 00 00 00 00

Demo on Compiler Explorer

If we remove the padding and reverse the bytes to better match the layout shown on the Wikipedia page, we get

4000A666666666666666

This translates to

 0x1.4CCCCCCCCCCCCCCC × 2^(0x4000-0x3FFF)

(0xA66...6 = 0b1010 0110 0110...0110 ⇒ 0b1.0100 1100 1100...110[0] = 0x1.4CC...C)

or

 1.29999999999999999995663191310057982263970188796520233154296875 × 2^1

Decimal conversion obtained using

perl -Mv5.10 -e'
   use Math::BigFloat;
   Math::BigFloat->div_scale( 1000 );
   say
      Math::BigFloat->from_hex(  "4CCCCCCCCCCCCCCC" ) /
      Math::BigFloat->from_hex( "10000000000000000" )
'

or

perl -Mv5.10 -e'
   use Math::BigFloat;
   Math::BigFloat->div_scale( 1000 );
   say
      Math::BigFloat->from_hex( "A666666666666666" ) /
      Math::BigFloat->from_hex( "8000000000000000" )
'
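If you'd rather check the arithmetic in C, one way (a sketch assuming an x87 long double, so the 64-bit significand is exactly representable) is to rebuild the value from the significand and the unbiased exponent:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Significand as a 64-bit integer (explicit integer bit included),
       unbiased exponent 0x4000 - 0x3FFF = 1. */
    long double sig = 0xA666666666666666p0L;
    long double val = ldexpl(sig, 1 - 63);   /* sig / 2^63 * 2^1 */
    printf("%.21Lg\n", val);                 /* prints the stored value, just below 2.6 */
    return 0;
}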

CodePudding user response:

You've been bamboozled by some very strange aspects of the way extended-precision floating-point is typically implemented in C on Intel architectures. So don't feel too bad. :-)

What you're seeing is that although sizeof(long double) may be 16 (== 128 bits), deep down inside what you're really getting is the 80-bit Intel extended format. It's being padded out with 6 bytes, which in your case happen to be 0. So, yes, the sign bit is "further to the right than you thought".
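A small sketch (again assuming the x86-64 layout, so the byte index 9 is target-specific) that shows the sign bit sitting where the dumps say it does:

#include <stdio.h>
#include <string.h>

int main(void)
{
    long double pos = 2.6L, neg = -2.6L;
    unsigned char bp[sizeof pos], bn[sizeof neg];
    memcpy(bp, &pos, sizeof pos);
    memcpy(bn, &neg, sizeof neg);
    /* Only bytes 0..9 belong to the 80-bit value; bytes 10..15 are padding.
       The sign bit is the top bit of byte 9, hence 0x40 vs 0xC0 in the dumps. */
    printf("byte 9 of  2.6L: %02X\n", bp[9]);   /* 40 */
    printf("byte 9 of -2.6L: %02X\n", bn[9]);   /* C0 */
    return 0;
}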

I see the same thing on my machine, and it's something I've always wondered about. It seems like a real waste, doesn't it? I used to think it was for some kind of compatibility with machines which actually do have 128-bit long doubles. But that can't be it, because this 0-padded 16-byte format is not binary-compatible with true IEEE 128-bit floating point, among other things because the padding is on the wrong end.
