Home > OS >  How can I know in advance which real numbers would have an imprecise representation using float vari
How can I know in advance which real numbers would have an imprecise representation using float vari

Time:02-15

I know that the number 159.95 cannot be precisely represented using float variables in C.

For example, considering the following piece of code:

#include <stdio.h>
int main()
{
    float x = 159.95;
    printf("%f\n",x);
    return 0;
}

It outputs 159.949997.

I would like to know if there is some way to know in advance which real value (in decimal system) would be represented in an imprecise way like the 159.95 number.

Best regards.

CodePudding user response:

Usually, a float is an IEEE754 binary32 float (this is not guaranteed by spec and may be different on some compilers/systems). This data type specifies a 24-bit significand; this means that if you write the number in binary, it should require no more than 24 bits excluding trailing zeros.

159.95's binary representation is 10011111.11110011001100110011... with repeating 0011 forever, so it requires an infinite number of bits to represent precisely with a binary format.

Other examples:

1073741760 has a binary representation of 111111111111111111111111000000. It has 30 bits in that representation, but only 24 significant bits (since the remainder are trailing zero bits). It has an exact float representation.

1073741761 has a binary representation of 111111111111111111111111000001. It has 30 significant bits and cannot be represented exactly as a float.

0.000000059604644775390625 has a binary representation of 0.000000000000000000000001. It has one significant bit and can be represented exactly.

0.750000059604644775390625 has a binary representation of 0.110000000000000000000001, which is 24 significant bits. It can be represented exactly as a float.

1.000000059604644775390625 has a binary representation of 1.000000000000000000000001, which is 25 significant bits. It cannot be represented exactly as a float.

Another factor (which applies to very large and very small numbers) is that the exponent is limited to the -126 to 127 range. With some handwaving around denormal values and other special cases, this generally allows values ranging from roughly 2-126 to slightly under 2128.

CodePudding user response:

Succinctly, for the format most commonly used for float, a number is exactly representable if and only if it is representable as an integer F times a power of two, 2E such that:

  • the magnitude of F is less than 224, and
  • –149 ≤ E < 105.

More generally, C 2018 5.2.4.2.2 specifies the characteristics of floating-point types. A floating-point number is represented as sbe•sum(fk bk, 1 ≤ kp), where:

  • s is a sign, 1 or −1,
  • b is a fixed base chosen by the C implementation, often 2,
  • e is an exponent, which is an integer between a minimum emin and a maximum emax, chosen by the C implementation,
  • p is the precision, the number of base-b digits in the significand, and
  • fk are digits in base-b, nonnegative integers less than b.

The significand is the fraction portion of the representation, sum(fk bk, 1 ≤ kp). It is written as a sum so that we can express the variable number of digits it may have. (p is a variable set by the C implementation, not by the programmer using the C implementation.) When we write it out a significand in base b, it can be a numeral, such as .0011101010011001010101102 for a 24-bit significand in base 2. Note that, in the this form (and the sum), the significand has all its digits after the radix point.

To make it slightly easier to tell if a number is in this format, we can adjust the scale so the significand is an integer instead of having digits after the radix point: sbep•sum(fk bpk, 1 ≤ kp). This changes the above significand from .0011101010011001010101102 to 0011101010011001010101102. Since it has p digits, it is always a non-negative integer less than bp.

Now we can figure out if a finite number is representable in this format:

  • Get b, p, emin, and emax for the target C implementation. If it uses IEEE-754 binary32 for float, then b is 2, p is 24, emin is −125, and emax is 128. When <float.h> is included, these are defined as FLT_RADIX, FLT_MANT_DIGITS, FLT_MIN_EXP, and FLT_MAX_EXP.
  • Ignore the sign. Write the absolute value of number as a rational number n/d in simplest form. If it is an integer, let d be 1.
  • If d is not a power of b, the number is not representable in the format.
  • If n is a multiple of b greater than or equal to bp, divide it by b and multiply d by d until n is not a multiple or is less than bp.
  • If n is greater than or equal to bp, the number is not representable in the format.
  • Let e be such that d = bep. If emineemax, the number is representable in the format. Otherwise, it is not.

Some floating-point formats might not support subnormal numbers, in which f1 is zero. This is indicated by FLT_HAS_SUBNORM being defined to be zero and would require modifications to the above.

CodePudding user response:

I would like to know if there is some way to know in advance which real value (in decimal system) would be represented in an imprecise way like the 159.95 number.

In general, floating point numbers can only represent numbers whose denominator is a power of 2.

To check if a number can be represented as floating point value (of any floating-point type) at all, take the decimal digits after the decimal point, interpret them as number and check if they can be divided by 5^n while n is the number of digits:

159.95 => 95, 2 digits => 95%(5*5) = 20 => Cannot be represented as floating-point value

Counterexample:

159.625 => 625, 3 digits => 625%(5*5*5) = 0 => Can be represented as floating-point value

You also have to consider the fact that floating-point values only have a limited number of digits after the decimal point:

In principle, 123456789 can be represented by a floating-point value exactly (it is an integer), however float does not have enough bits!

To check if an integer value can be represented by float exactly, divide the number by 2 until the result is odd. If the result is < 2^24, the number can be represented by float exactly.

In the case of a rational number, first do the "divisible by 5^n" check described above. Then multiply the number by 2 until the result is an integer. Check if it is < 2^24.

CodePudding user response:

I would like to know if there is some way to know in advance which real value... would be represented in an imprecise way

The short and only partly facetious answer is... all of them!

There are roughly 2^32 = 4294967296 values of type float. And there are an uncountably infinite number of real numbers. So, for a randomly-chosen real number, the chance that it can be exactly represented as a value of type float is 4294967296/∞, which is 0.

If you use type double, there are approximately 2^64 = 18446744073709551616 of those, so the chance that a randomly-chosen real number can be exactly represented as a double is 18446744073709551616/∞, which is again... 0.

I realize I'm not answering quite the question you asked, but in general, it's usually a bad idea to use binary floating-point types as if they were an exact representation of decimal fractions. Attempts to assume that they're ever an exact representation usually lead to trouble. In general, it's best to assume that floating-point types are an imperfect (approximate) realization of of real numbers, period (that is, without assuming decimal). If you never assume they're exact (which for true real numbers, they virtually never are), you'll never get into trouble in cases where you thought they'd be exact, but they weren't.

[Footnote 1: As Eric P. reminds in a comment, there's no such thing as a "randomly-chosen real number", which is why this is a partially facetious answer.]

[Footnote 2: I now see your comment where you say that you do assume they are all imprecise, but that you would "like to understand the phenomenon in a deeper way", in which case my answer does you no good, but hopefully some of the others do. I can especially commend Martin Rosenau's answer, which goes straight to the heart of the matter: a rational number is representable exactly in base 2 if and only if its reduced denominator is a pure power of 2, or stated another way, has only 2's in its prime factorization. That's why, if you take any number you can actually store in a float or double, and print it back out using %f and enough digits, with a properly-written printf, you'll notice that the numbers always end in things like ...625 or ...375. Binary fractions are like the English rulers still used in the U.S.: everything is halves and quarters and eights and sixteenths and thirty-seconds and sixty-fourths.]

CodePudding user response:

You can decompose it bit by bit and make the sum. I have answered this question here.

The idea is, to decompose a float in sign, exponent, fraction parts and use high precision arithmetic library (example, GNU Multiple Precision Arithmetic Library) to make the sum.

  • Related