In experimenting with rounding modes, FE_DOWNWARD
( and FE_TOWARDZERO
) apparently fails to form the expected sum of infinity and instead forms DBL_MAX
when adding DBL_MAX
and 1 ULP of DBL_MAX
.
Follows is test code that demos the unexpected sum. Under different rounds modes, it adds to DBL_MAX
values near 0.5 ULP and 1.0 ULP. No problems noted in FE_TONEAREST
and FE_UPWARD
.
Questions:
- Do you agree it is an error?
- Does code form the correct answer on another machine?
- This sadly follows another near
DBL_MAX
problem recently reported, so perhaps my math library is sub-par. Advice on how to report is requested.
Compiler notes:
Invoking: Cygwin C Compiler
gcc -std=c11 -O0 -g3 -pedantic -Wall -Wextra -Wconversion -c -fmessage-length=0 -v -MMD -MP -MF"rand_i.d" -MT"rand_i.o" -o "rand_i.o" "../rand_i.c"
COLLECT_GCC=gcc
Target: x86_64-pc-cygwin
gcc version 11.3.0 (GCC)
GNU C11 (GCC) version 11.3.0 (x86_64-pc-cygwin)
compiled by GNU C version 11.3.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version isl-0.25-GMP
#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)
int main() {
const double max = DBL_MAX;
const double max_ulp = max - nextafter(max, 0);
int mode[] = {FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD};
const char *modes[] = {STRINGIFY(FE_DOWNWARD), STRINGIFY(FE_TONEAREST), //
STRINGIFY(FE_TOWARDZERO), STRINGIFY(FE_UPWARD)};
int n = sizeof mode / sizeof mode[0];
int p = (DBL_MANT_DIG 2)/4;
int P = (LDBL_MANT_DIG 2)/4;
printf("%s:%d\n", STRINGIFY(FLT_EVAL_METHOD), FLT_EVAL_METHOD);
printf("LDBL_MAX :%-24.*La %.*Lg\n", P, LDBL_MAX, LDBL_DECIMAL_DIG, LDBL_MAX);
printf("DBL_MAX :%-24.*a %.*g\n", p, max, DBL_DECIMAL_DIG, max);
printf("DBL_MAX_ULP:%-24.*a %.*g\n", p, max_ulp, DBL_DECIMAL_DIG, max_ulp);
printf("\n");
printf("mode: Addendum SUM (double) SUM (long double)\n");
for (int i = 0; i < n; i ) {
if (fesetround(mode[i])) {
perror("Invalid mode");
return -1;
}
double delta[] = {nextafter(max_ulp / 2, 0), max_ulp / 2, //
nextafter(max_ulp / 2, INFINITY), nextafter(max_ulp, 0), //
max_ulp, nextafter(max_ulp, INFINITY)};
const char *deltas[] = { "0.5 ulp-", "0.5 ulp", "0.5 ulp ", //
"ulp-", "ulp", "ulp "};
int dn = sizeof delta / sizeof delta[0];
for (int d = 0; d < dn; d ) {
double sum = max delta[d];
printf("mode:%-14s %-8s:%-24.*a sum:%-24.*a %-24.*La\n", //
modes[i], deltas[d], p, delta[d], p, sum, P, 0.0L max delta[d]);
}
puts("");
}
}
Output: note 4 unexpected lines.
FLT_EVAL_METHOD:0
LDBL_MAX :0x1.fffffffffffffffep 16383 1.18973149535723176502e 4932
DBL_MAX :0x1.fffffffffffffp 1023 1.7976931348623157e 308
DBL_MAX_ULP:0x1.0000000000000p 971 1.9958403095347198e 292
mode: Addendum SUM (double) SUM (long double)
mode:FE_DOWNWARD 0.5 ulp-:0x1.fffffffffffffp 969 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffff7fep 1023
mode:FE_DOWNWARD 0.5 ulp :0x1.0000000000000p 970 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffff800p 1023
mode:FE_DOWNWARD 0.5 ulp :0x1.0000000000001p 970 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffff800p 1023
mode:FE_DOWNWARD ulp- :0x1.fffffffffffffp 970 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffffffep 1023
mode:FE_DOWNWARD ulp :0x1.0000000000000p 971 sum:0x1.fffffffffffffp 1023 0x1.0000000000000000p 1024
--> inf expected <--
mode:FE_DOWNWARD ulp :0x1.0000000000001p 971 sum:0x1.fffffffffffffp 1023 0x1.0000000000000000p 1024
--> inf expected <--
mode:FE_TONEAREST 0.5 ulp-:0x1.fffffffffffffp 969 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffff800p 1023
mode:FE_TONEAREST 0.5 ulp :0x1.0000000000000p 970 sum:inf 0x1.fffffffffffff800p 1023
mode:FE_TONEAREST 0.5 ulp :0x1.0000000000001p 970 sum:inf 0x1.fffffffffffff800p 1023
mode:FE_TONEAREST ulp- :0x1.fffffffffffffp 970 sum:inf 0x1.0000000000000000p 1024
mode:FE_TONEAREST ulp :0x1.0000000000000p 971 sum:inf 0x1.0000000000000000p 1024
mode:FE_TONEAREST ulp :0x1.0000000000001p 971 sum:inf 0x1.0000000000000000p 1024
mode:FE_TOWARDZERO 0.5 ulp-:0x1.fffffffffffffp 969 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffff7fep 1023
mode:FE_TOWARDZERO 0.5 ulp :0x1.0000000000000p 970 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffff800p 1023
mode:FE_TOWARDZERO 0.5 ulp :0x1.0000000000001p 970 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffff800p 1023
mode:FE_TOWARDZERO ulp- :0x1.fffffffffffffp 970 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffffffep 1023
mode:FE_TOWARDZERO ulp :0x1.0000000000000p 971 sum:0x1.fffffffffffffp 1023 0x1.0000000000000000p 1024
--> inf expected <--
mode:FE_TOWARDZERO ulp :0x1.0000000000001p 971 sum:0x1.fffffffffffffp 1023 0x1.0000000000000000p 1024
--> inf expected <--
mode:FE_UPWARD 0.5 ulp-:0x1.fffffffffffffp 969 sum:inf 0x1.fffffffffffff800p 1023
mode:FE_UPWARD 0.5 ulp :0x1.0000000000000p 970 sum:inf 0x1.fffffffffffff800p 1023
mode:FE_UPWARD 0.5 ulp :0x1.0000000000001p 970 sum:inf 0x1.fffffffffffff802p 1023
mode:FE_UPWARD ulp- :0x1.fffffffffffffp 970 sum:inf 0x1.0000000000000000p 1024
mode:FE_UPWARD ulp :0x1.0000000000000p 971 sum:inf 0x1.0000000000000000p 1024
mode:FE_UPWARD ulp :0x1.0000000000001p 971 sum:inf 0x1.0000000000000002p 1024
CodePudding user response:
Checking for and dealing with overflow happens in two stages. Note that the actual software/circuitry may use a different strategy, but the result will be as if this were the procedure.
- The infinite-precision result of the computation is rounded, with rules based on the current rounding mode, but with no limits on the exponent. During this stage there's no such thing as "infinity", but the numbers can get as big as they need to get.
- If the result is outside the representable range, it is "corrected" to be within the representable range, in a manner based on the rounding mode. For
FE_NEAREST
(the "normal" mode) the number is corrected to infinity. ForFE_TOWARDZERO
, however, it's corrected to/-DBL_MAX
. (For the other rounding modes it depends on sign: rounding away from zero leads to infinity and rounding toward zero leads to/-DBL_MAX
.)
The overflow rules for a given mode are thus reminiscent of the rounding rules for that mode, but not really the same. Arguably FE_NEAREST
is the weird one, since it acts like the (IEEE-754-only) ties-away-from-zero rounding mode under all out-of-range situations, rather than noticing that infinity is way farther away than DBL_MAX
. But the basic behavior for the other modes is to output /-DBL_MAX
when its preferred direction (given the sign) is toward zero, and infinity when its preferred direction is away from zero.
Also note that when the result of stage 1 is a value outside the representable range, an overflow exception will still be emitted even when the result of stage 2 is DBL_MAX
. The overflow doesn't indicate "I made an infinity", but "I had to do stage 2".
CodePudding user response:
@Sneftel idea to use a 2 stage rounding model was useful and well explains why my expected result was amiss and the code was right.
Below, for reference, is augmented code that posts the overflow flag and helps illustrates the models application.
#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)
const char* exceptstr(fexcept_t *flag) {
static char buf[100];
buf[0] = 0;
if (*flag & FE_DIVBYZERO)
strcat(buf, STRINGIFY(FE_DIVBYZERO));
if (*flag & FE_INEXACT)
strcat(buf, STRINGIFY(FE_INEXACT));
if (*flag & FE_INVALID)
strcat(buf, STRINGIFY(FE_INVALID));
if (*flag & FE_OVERFLOW)
strcat(buf, STRINGIFY(FE_OVERFLOW));
if (*flag & FE_UNDERFLOW)
strcat(buf, STRINGIFY(FE_UNDERFLOW));
return buf;
}
int main() {
const double max = DBL_MAX;
const double max_ulp = max - nextafter(max, 0);
int mode[] = {FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD};
const char *modes[] = {STRINGIFY(FE_DOWNWARD), STRINGIFY(FE_TONEAREST), //
STRINGIFY(FE_TOWARDZERO), STRINGIFY(FE_UPWARD)};
int n = sizeof mode / sizeof mode[0];
int p = (DBL_MANT_DIG 2) / 4;
int P = (LDBL_MANT_DIG 2) / 4;
printf("%s:%d\n", STRINGIFY(FLT_EVAL_METHOD), FLT_EVAL_METHOD);
printf("LDBL_MAX :%-24.*La %.*Lg\n", P, LDBL_MAX, LDBL_DECIMAL_DIG,
LDBL_MAX);
printf("DBL_MAX :%-24.*a %.*g\n", p, max, DBL_DECIMAL_DIG, max);
printf("DBL_MAX_ULP:%-24.*a %.*g\n", p, max_ulp, DBL_DECIMAL_DIG, max_ulp);
printf("\n");
printf(
"mode: Addendum SUM (double) SUM (long double) FE\n");
for (int i = 0; i < n; i ) {
if (fesetround(mode[i])) {
perror("Invalid mode");
return -1;
}
double delta[] = {nextafter(max_ulp / 2, 0), max_ulp / 2, //
nextafter(max_ulp / 2, INFINITY), nextafter(max_ulp, 0), //
max_ulp, nextafter(max_ulp, INFINITY)};
const char *deltas[] = {"0.5 ulp-", "0.5 ulp", "0.5 ulp ", //
"ulp-", "ulp", "ulp "};
int dn = sizeof delta / sizeof delta[0];
for (int d = 0; d < dn; d ) {
if (feclearexcept(FE_ALL_EXCEPT)) {
perror("feclearexcept()");
return -1;
}
///////////////////////////////
double sum = max delta[d];
///////////////////////////////
fexcept_t flag;
if (fegetexceptflag(&flag, FE_ALL_EXCEPT)) {
perror("fegetexceptflag()");
return -1;
}
printf(
"mode:%-14s %-8s:%-24.*a sum:%-24.*a %-24.*La %s\n", //
modes[i], deltas[d], p, delta[d], p, sum, P, 1.0L * max delta[d],
exceptstr(&flag));
}
puts("");
}
}
Output (Look to the far right)
FLT_EVAL_METHOD:0
LDBL_MAX :0x1.fffffffffffffffep 16383 1.18973149535723176502e 4932
DBL_MAX :0x1.fffffffffffffp 1023 1.7976931348623157e 308
DBL_MAX_ULP:0x1.0000000000000p 971 1.9958403095347198e 292
mode: Addendum SUM (double) SUM (long double) FE
mode:FE_DOWNWARD 0.5 ulp-:0x1.fffffffffffffp 969 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffff7fep 1023 FE_INEXACT
mode:FE_DOWNWARD 0.5 ulp :0x1.0000000000000p 970 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffff800p 1023 FE_INEXACT
mode:FE_DOWNWARD 0.5 ulp :0x1.0000000000001p 970 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffff800p 1023 FE_INEXACT
mode:FE_DOWNWARD ulp- :0x1.fffffffffffffp 970 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffffffep 1023 FE_INEXACT
mode:FE_DOWNWARD ulp :0x1.0000000000000p 971 sum:0x1.fffffffffffffp 1023 0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW
mode:FE_DOWNWARD ulp :0x1.0000000000001p 971 sum:0x1.fffffffffffffp 1023 0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST 0.5 ulp-:0x1.fffffffffffffp 969 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffff800p 1023 FE_INEXACT
mode:FE_TONEAREST 0.5 ulp :0x1.0000000000000p 970 sum:inf 0x1.fffffffffffff800p 1023 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST 0.5 ulp :0x1.0000000000001p 970 sum:inf 0x1.fffffffffffff800p 1023 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST ulp- :0x1.fffffffffffffp 970 sum:inf 0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST ulp :0x1.0000000000000p 971 sum:inf 0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST ulp :0x1.0000000000001p 971 sum:inf 0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW
mode:FE_TOWARDZERO 0.5 ulp-:0x1.fffffffffffffp 969 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffff7fep 1023 FE_INEXACT
mode:FE_TOWARDZERO 0.5 ulp :0x1.0000000000000p 970 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffff800p 1023 FE_INEXACT
mode:FE_TOWARDZERO 0.5 ulp :0x1.0000000000001p 970 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffff800p 1023 FE_INEXACT
mode:FE_TOWARDZERO ulp- :0x1.fffffffffffffp 970 sum:0x1.fffffffffffffp 1023 0x1.fffffffffffffffep 1023 FE_INEXACT
mode:FE_TOWARDZERO ulp :0x1.0000000000000p 971 sum:0x1.fffffffffffffp 1023 0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW
mode:FE_TOWARDZERO ulp :0x1.0000000000001p 971 sum:0x1.fffffffffffffp 1023 0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD 0.5 ulp-:0x1.fffffffffffffp 969 sum:inf 0x1.fffffffffffff800p 1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD 0.5 ulp :0x1.0000000000000p 970 sum:inf 0x1.fffffffffffff800p 1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD 0.5 ulp :0x1.0000000000001p 970 sum:inf 0x1.fffffffffffff802p 1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD ulp- :0x1.fffffffffffffp 970 sum:inf 0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD ulp :0x1.0000000000000p 971 sum:inf 0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD ulp :0x1.0000000000001p 971 sum:inf 0x1.0000000000000002p 1024 FE_INEXACTFE_OVERFLOW