Home > Software design >  Floating point multiplication results in infinity even though rounding to nearest
Floating point multiplication results in infinity even though rounding to nearest

Time:02-01

Consider the following C code:

#include <fenv.h>
#include <iostream>
using namespace std;

int main(){
    fesetround(FE_TONEAREST);
    double a = 0x1.efc7f0001p 376;
    double b = -0x1.0fdfdp 961;
    double c = a*b;
    cout << a << " "<< b << " " << c << endl;
}

The output that I see is

2.98077e 113 -2.06992e 289 -inf

I do not understand why c is infinity. My understanding is that whatever the smallest non-infinity floating point value is, it should be closer to the actual value of a*b than -inf as the minimum non-infinity floating point number is finite and any finite number is closer to any other finite number than negative infinity. Why is infinity outputted here?

This was run on 64bit x86 and the assembly uses SSE instructions. It was compiled with -O0 and happens both with clang and gcc.

The result is the minimum finite floating point if the round towards zero mode is used for floating points. I conclude that the issue is rounding related.

CodePudding user response:

Rounding is not the primary issue here. The infinite result is caused by overflow.

This answer follows the rules in IEEE 754-2019. The real-number-arithmetic product of 1.EFC7F000116•2376 and −1.0FDFD961•216 is around −1.07430C8649FEFE816•21335. In normal conditions, floating-point arithmetic produces a result “as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that result…” (IEEE 754-2019 4.3). However, we do not have normal conditions. IEEE 754-2019 7.4 says:

The overflow exception shall be signaled if and only if the destination format’s largest finite number is exceeded in magnitude by what would have been the rounded floating-point result (see 4) were the exponent range unbounded…

In other words, if we rounded the result as if we could have any exponent (so we are just rounding the significand), the result would be −1.07430C8649FF 96416•21338. But the magnitude of that exceeds the largest finite number that double can represent, ±1.FFFFFFFFFFFFF16•21023. Therefore, an overflow exception is signaled, and, since you do not catch the exception, a default result is delivered:

… The default result shall be determined by the rounding-direction attribute and the sign of the intermediate result as follows:

a) roundTiesToEven and roundTiesToAway carry all overflows to ∞ with the sign of the intermediate result…

CodePudding user response:

The behaviour you note (when using the FE_TONEAREST rounding mode) conforms to the IEEE-754 (or the equivalent ISO/IEC/IEEE 60559:2011) Standard. (Your examples imply that your platform uses IEEE-754 representation, as many – if not most – platforms do, these days.)

From this Wikipedia page, footnote #18, which cites IEEE-754 §4.3.1, dealing with the "rounding to nearest" modes:

In the following two rounding-direction attributes, an infinitely precise result with magnitude at least bemax(b-½b(1-p)) shall round to (infinity) with no change in sign.

The 'infinitely precise' result of your a * b calculation does, indeed, have a magnitude greater than the specified value, so the rounding is Standard-conformant.


I can't find a similar IEEE-754 citation for the "round-towards-zero" mode, but this GNU libc manual has this to say:

Round toward zero.
All results are rounded to the largest representable value whose magnitude is less than that of the result. In other words, if the result is negative it is rounded up; if it is positive, it is rounded down.

So, again, when using that mode, rounding to -DBL_MAX is appropriate/conformant.

  • Related