I have the hex literal 400D99999999999A, which is the bit pattern for 3.7 as a double.
How do I write this in C? I saw this page about floating_literal and hex. Maybe it's obvious and I need to sleep, but I'm not seeing how to write the bit pattern as a float. I understand it's supposed to let a person write a more precise fraction, but I'm not sure how to translate a bit pattern to a literal.
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    double d = 0x400D99999999999Ap0;
    printf("%f\n", d); // incorrect
    unsigned long l = 0x400D99999999999A;
    memcpy(&d, &l, 8);
    printf("%f\n", d); // correct, 3.7
    return 0;
}
CodePudding user response:
The value you're trying to use is an IEEE bit pattern. C doesn't support this directly. To get the desired bit pattern, you need to specify the mantissa, as an ordinary hex integer, along with a power-of-two exponent.
In this case, the desired IEEE bit pattern is 400D99999999999A. If you strip off the sign bit and the exponent, you're left with D99999999999A. There's an implied leading 1 bit, so to get the actual mantissa value, that needs to be explicitly added, giving 1D99999999999A. This represents the mantissa as an integer with no fractional part. It then needs to be scaled, in this case by a power-of-two exponent value of -51. So the desired constant is:

double d = 0x1D99999999999Ap-51;

If you plug this into your code, you will get the desired bit pattern of 400D99999999999A.
CodePudding user response:
how to write the bitpattern as a float.

The bit pattern 0x400D99999999999A commonly encodes (alternate encodings exist) the double with a value of about 3.7.*1

double d;
unsigned long l = 0x400D99999999999A;
// Assume same size, same endian
memcpy(&d, &l, 8);
printf("%g\n", d);
// output 3.7

To write the value out using "%a" format, with a hexadecimal significand and a decimal power-of-2 exponent:

printf("%a\n", d);
// output 0x1.d99999999999ap+1

The double constant (not literal) 0x1.d99999999999ap+1 has an explicit 1 bit followed by the lower 52 bits of 0x400D99999999999A, and the exponent of 1 comes from the biased exponent bits (the 11 bits after the sign bit), 0x400, minus the bias of 0x3FF.

Now code can use double d = 0x1.d99999999999ap+1; instead of the memcpy() to initialize d.

*1 The closest double to 3.7 is exactly
3.7000000000000001776356839400250464677810668945312
CodePudding user response:
The following program shows how to interpret a string of bits as a double, using either the native double format or using the IEEE-754 double-precision binary format (binary64).

#include <math.h>
#include <stdint.h>
#include <string.h>

// Create a mask of n bits, in the low bits.
#define Mask(n) (((uint64_t) 1 << (n)) - 1)

/* Given a uint64_t containing 64 bits, this function interprets them in the
   native double format.
*/
double InterpretNativeDouble(uint64_t bits)
{
    double result;
    _Static_assert(sizeof result == sizeof bits, "double must be 64 bits");
    // Copy the bits into a native double.
    memcpy(&result, &bits, sizeof result);
    return result;
}

/* Given a uint64_t containing 64 bits, this function interprets them in the
   IEEE-754 double-precision binary format. (Checking that the native double
   format has sufficient bounds and precision to represent the result is
   omitted. For NaN results, a NaN is returned, but the signaling
   characteristic and the payload bits are not supported.)
*/
double InterpretDouble(uint64_t bits)
{
    /* Set some parameters of the format. (This routine is not fully
       parameterized for all IEEE-754 binary formats; some hardcoded constants
       are used.)
    */
    static const int Emax = 1023;    // Maximum exponent.
    static const int Precision = 53; // Precision (number of digits).

    // Separate the fields in the encoding.
    int SignField = bits >> 63;
    int ExponentField = bits >> 52 & Mask(11);
    uint64_t SignificandField = bits & Mask(52);

    // Interpret the exponent and significand fields.
    int Exponent;
    double Significand;
    switch (ExponentField)
    {
        /* An exponent field of all zero bits indicates a subnormal number,
           for which the exponent is fixed at its minimum and the leading bit
           of the significand is zero. This includes zero, which is not
           classified as a subnormal number but is consistent in the encoding.
        */
        case 0:
            Exponent = 1 - Emax;
            Significand = 0 + ldexp(SignificandField, 1-Precision);
                // ldexp(x, y) computes x * pow(2, y).
            break;

        /* An exponent field of all one bits indicates a NaN or infinity,
           according to whether the significand field is zero or not.
        */
        case Mask(11):
            Exponent = 0;
            Significand = SignificandField ? NAN : INFINITY;
            break;

        /* All other exponent fields indicate normal numbers, for which the
           exponent is encoded with a bias (equal to Emax) and the leading bit
           of the significand is one.
        */
        default:
            Exponent = ExponentField - Emax;
            Significand = 1 + ldexp(SignificandField, 1-Precision);
            break;
    }

    // Combine the exponent and significand.
    Significand = ldexp(Significand, Exponent);

    // Interpret the sign field.
    if (SignField)
        Significand = -Significand;

    return Significand;
}

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t bits = 0x400D99999999999A;

    printf("The bits 0x%" PRIx64 " interpreted as:\n", bits);
    printf("\ta native double represents %.9999g, and\n",
        InterpretNativeDouble(bits));
    printf("\tan IEEE-754 double-precision datum represents %.9999g.\n",
        InterpretDouble(bits));
}
CodePudding user response:
As an IEEE-754 double-precision value, that bit pattern 400D99999999999A actually consists of three parts:

- the first bit, 0, is the sign;
- the next 11 bits, 10000000000 or 0x400, are the exponent; and
- the remaining 52 bits, 0xD99999999999A, are the significand (also known as the "mantissa").

But the exponent has a bias of 1023 (0x3ff), so numerically it's 0x400 - 0x3ff = 1. And the significand is all fractional, and has an implicit 1 bit to its left, so it's really 0x1.D99999999999A.
So the actual number this represents is

0x1.D99999999999A × 2¹

which is about 1.85 × 2, or 3.7.
Or, using C's "hex float" or %a representation, it's 0x1.D99999999999Ap1.
In "hex float" notation, the leading 1 and the decimal point (really a "radix point") are explicit, and the p at the end indicates a power-of-two exponent.
Although the decomposition I've shown here may seem reasonably straightforward, actually writing code to reliably decompose a 64-bit number like 400D99999999999A into its three component parts, and manipulate and recombine them to determine what floating-point value they represent (or even to form an equivalent hex float constant like 0x1.D99999999999Ap1) can be surprisingly tricky. See Eric Postpischil's answer for more of the details.