Why addition error near `DBL_MAX` and infinity?-CodePudding

In experimenting with rounding modes, FE_DOWNWARD ( and FE_TOWARDZERO) apparently fails to form the expected sum of infinity and instead forms DBL_MAX when adding DBL_MAX and 1 ULP of DBL_MAX.

Follows is test code that demos the unexpected sum. Under different rounds modes, it adds to DBL_MAX values near 0.5 ULP and 1.0 ULP. No problems noted in FE_TONEAREST and FE_UPWARD.

Questions:

Do you agree it is an error?
Does code form the correct answer on another machine?
This sadly follows another near DBL_MAX problem recently reported, so perhaps my math library is sub-par. Advice on how to report is requested.

Compiler notes:
Invoking: Cygwin C Compiler
gcc -std=c11 -O0 -g3 -pedantic -Wall -Wextra -Wconversion -c -fmessage-length=0 -v -MMD -MP -MF"rand_i.d" -MT"rand_i.o" -o "rand_i.o" "../rand_i.c"
COLLECT_GCC=gcc
Target: x86_64-pc-cygwin
gcc version 11.3.0 (GCC)
GNU C11 (GCC) version 11.3.0 (x86_64-pc-cygwin)
compiled by GNU C version 11.3.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version isl-0.25-GMP

#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)

int main() {
  const double max = DBL_MAX;
  const double max_ulp = max - nextafter(max, 0);

  int mode[] = {FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD};
  const char *modes[] = {STRINGIFY(FE_DOWNWARD), STRINGIFY(FE_TONEAREST), //
      STRINGIFY(FE_TOWARDZERO), STRINGIFY(FE_UPWARD)};
  int n = sizeof mode / sizeof mode[0];
  int p = (DBL_MANT_DIG   2)/4;
  int P = (LDBL_MANT_DIG   2)/4;
  printf("%s:%d\n", STRINGIFY(FLT_EVAL_METHOD), FLT_EVAL_METHOD);

  printf("LDBL_MAX   :%-24.*La %.*Lg\n", P, LDBL_MAX, LDBL_DECIMAL_DIG, LDBL_MAX);
  printf("DBL_MAX    :%-24.*a %.*g\n", p, max, DBL_DECIMAL_DIG, max);
  printf("DBL_MAX_ULP:%-24.*a %.*g\n", p, max_ulp, DBL_DECIMAL_DIG, max_ulp);
  printf("\n");
  printf("mode:               Addendum                          SUM (double)                 SUM (long double)\n");
  for (int i = 0; i < n; i  ) {
    if (fesetround(mode[i])) {
      perror("Invalid mode");
      return -1;
    }
    double delta[] = {nextafter(max_ulp / 2, 0), max_ulp / 2, //
        nextafter(max_ulp / 2, INFINITY), nextafter(max_ulp, 0), //
        max_ulp, nextafter(max_ulp, INFINITY)};
    const char *deltas[] = { "0.5 ulp-", "0.5 ulp", "0.5 ulp ", //
        "ulp-", "ulp", "ulp "};
    int dn = sizeof delta / sizeof delta[0];
    for (int d = 0; d < dn; d  ) {
      double sum = max   delta[d];
      printf("mode:%-14s %-8s:%-24.*a sum:%-24.*a %-24.*La\n", //
          modes[i],  deltas[d], p, delta[d], p, sum, P, 0.0L   max   delta[d]);
    }
    puts("");
  }
}

Output: note 4 unexpected lines.

FLT_EVAL_METHOD:0
LDBL_MAX   :0x1.fffffffffffffffep 16383 1.18973149535723176502e 4932
DBL_MAX    :0x1.fffffffffffffp 1023  1.7976931348623157e 308
DBL_MAX_ULP:0x1.0000000000000p 971   1.9958403095347198e 292

mode:               Addendum                          SUM (double)                 SUM (long double)
mode:FE_DOWNWARD    0.5 ulp-:0x1.fffffffffffffp 969   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffff7fep 1023
mode:FE_DOWNWARD    0.5 ulp :0x1.0000000000000p 970   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffff800p 1023
mode:FE_DOWNWARD    0.5 ulp :0x1.0000000000001p 970   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffff800p 1023
mode:FE_DOWNWARD    ulp-    :0x1.fffffffffffffp 970   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffffffep 1023
mode:FE_DOWNWARD    ulp     :0x1.0000000000000p 971   sum:0x1.fffffffffffffp 1023  0x1.0000000000000000p 1024
                                                      --> inf expected <--
mode:FE_DOWNWARD    ulp     :0x1.0000000000001p 971   sum:0x1.fffffffffffffp 1023  0x1.0000000000000000p 1024
                                                      --> inf expected <--

mode:FE_TONEAREST   0.5 ulp-:0x1.fffffffffffffp 969   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffff800p 1023
mode:FE_TONEAREST   0.5 ulp :0x1.0000000000000p 970   sum:inf                      0x1.fffffffffffff800p 1023
mode:FE_TONEAREST   0.5 ulp :0x1.0000000000001p 970   sum:inf                      0x1.fffffffffffff800p 1023
mode:FE_TONEAREST   ulp-    :0x1.fffffffffffffp 970   sum:inf                      0x1.0000000000000000p 1024
mode:FE_TONEAREST   ulp     :0x1.0000000000000p 971   sum:inf                      0x1.0000000000000000p 1024
mode:FE_TONEAREST   ulp     :0x1.0000000000001p 971   sum:inf                      0x1.0000000000000000p 1024

mode:FE_TOWARDZERO  0.5 ulp-:0x1.fffffffffffffp 969   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffff7fep 1023
mode:FE_TOWARDZERO  0.5 ulp :0x1.0000000000000p 970   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffff800p 1023
mode:FE_TOWARDZERO  0.5 ulp :0x1.0000000000001p 970   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffff800p 1023
mode:FE_TOWARDZERO  ulp-    :0x1.fffffffffffffp 970   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffffffep 1023
mode:FE_TOWARDZERO  ulp     :0x1.0000000000000p 971   sum:0x1.fffffffffffffp 1023  0x1.0000000000000000p 1024
                                                      --> inf expected <--
mode:FE_TOWARDZERO  ulp     :0x1.0000000000001p 971   sum:0x1.fffffffffffffp 1023  0x1.0000000000000000p 1024
                                                      --> inf expected <--

mode:FE_UPWARD      0.5 ulp-:0x1.fffffffffffffp 969   sum:inf                      0x1.fffffffffffff800p 1023
mode:FE_UPWARD      0.5 ulp :0x1.0000000000000p 970   sum:inf                      0x1.fffffffffffff800p 1023
mode:FE_UPWARD      0.5 ulp :0x1.0000000000001p 970   sum:inf                      0x1.fffffffffffff802p 1023
mode:FE_UPWARD      ulp-    :0x1.fffffffffffffp 970   sum:inf                      0x1.0000000000000000p 1024
mode:FE_UPWARD      ulp     :0x1.0000000000000p 971   sum:inf                      0x1.0000000000000000p 1024
mode:FE_UPWARD      ulp     :0x1.0000000000001p 971   sum:inf                      0x1.0000000000000002p 1024

CodePudding user response：

Checking for and dealing with overflow happens in two stages. Note that the actual software/circuitry may use a different strategy, but the result will be as if this were the procedure.

The infinite-precision result of the computation is rounded, with rules based on the current rounding mode, but with no limits on the exponent. During this stage there's no such thing as "infinity", but the numbers can get as big as they need to get.
If the result is outside the representable range, it is "corrected" to be within the representable range, in a manner based on the rounding mode. For FE_NEAREST (the "normal" mode) the number is corrected to infinity. For FE_TOWARDZERO, however, it's corrected to /-DBL_MAX. (For the other rounding modes it depends on sign: rounding away from zero leads to infinity and rounding toward zero leads to /-DBL_MAX.)

The overflow rules for a given mode are thus reminiscent of the rounding rules for that mode, but not really the same. Arguably FE_NEAREST is the weird one, since it acts like the (IEEE-754-only) ties-away-from-zero rounding mode under all out-of-range situations, rather than noticing that infinity is way farther away than DBL_MAX. But the basic behavior for the other modes is to output /-DBL_MAX when its preferred direction (given the sign) is toward zero, and infinity when its preferred direction is away from zero.

Also note that when the result of stage 1 is a value outside the representable range, an overflow exception will still be emitted even when the result of stage 2 is DBL_MAX. The overflow doesn't indicate "I made an infinity", but "I had to do stage 2".

CodePudding user response：

@Sneftel idea to use a 2 stage rounding model was useful and well explains why my expected result was amiss and the code was right.

Below, for reference, is augmented code that posts the overflow flag and helps illustrates the models application.

#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)

const char* exceptstr(fexcept_t *flag) {
  static char buf[100];
  buf[0] = 0;
  if (*flag & FE_DIVBYZERO)
    strcat(buf, STRINGIFY(FE_DIVBYZERO));
  if (*flag & FE_INEXACT)
    strcat(buf, STRINGIFY(FE_INEXACT));
  if (*flag & FE_INVALID)
    strcat(buf, STRINGIFY(FE_INVALID));
  if (*flag & FE_OVERFLOW)
    strcat(buf, STRINGIFY(FE_OVERFLOW));
  if (*flag & FE_UNDERFLOW)
    strcat(buf, STRINGIFY(FE_UNDERFLOW));
  return buf;
}

int main() {
  const double max = DBL_MAX;
  const double max_ulp = max - nextafter(max, 0);

  int mode[] = {FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD};
  const char *modes[] = {STRINGIFY(FE_DOWNWARD), STRINGIFY(FE_TONEAREST), //
  STRINGIFY(FE_TOWARDZERO), STRINGIFY(FE_UPWARD)};
  int n = sizeof mode / sizeof mode[0];
  int p = (DBL_MANT_DIG   2) / 4;
  int P = (LDBL_MANT_DIG   2) / 4;
  printf("%s:%d\n", STRINGIFY(FLT_EVAL_METHOD), FLT_EVAL_METHOD);

  printf("LDBL_MAX   :%-24.*La %.*Lg\n", P, LDBL_MAX, LDBL_DECIMAL_DIG,
  LDBL_MAX);
  printf("DBL_MAX    :%-24.*a %.*g\n", p, max, DBL_DECIMAL_DIG, max);
  printf("DBL_MAX_ULP:%-24.*a %.*g\n", p, max_ulp, DBL_DECIMAL_DIG, max_ulp);
  printf("\n");
  printf(
      "mode:               Addendum                          SUM (double)                 SUM (long double)          FE\n");
  for (int i = 0; i < n; i  ) {
    if (fesetround(mode[i])) {
      perror("Invalid mode");
      return -1;
    }
    double delta[] = {nextafter(max_ulp / 2, 0), max_ulp / 2, //
    nextafter(max_ulp / 2, INFINITY), nextafter(max_ulp, 0), //
    max_ulp, nextafter(max_ulp, INFINITY)};
    const char *deltas[] = {"0.5 ulp-", "0.5 ulp", "0.5 ulp ", //
        "ulp-", "ulp", "ulp "};
    int dn = sizeof delta / sizeof delta[0];
    for (int d = 0; d < dn; d  ) {
      if (feclearexcept(FE_ALL_EXCEPT)) {
        perror("feclearexcept()");
        return -1;
      }
      ///////////////////////////////
      double sum = max   delta[d];
      ///////////////////////////////
      fexcept_t flag;
      if (fegetexceptflag(&flag, FE_ALL_EXCEPT)) {
        perror("fegetexceptflag()");
        return -1;
      }

      printf(
          "mode:%-14s %-8s:%-24.*a sum:%-24.*a %-24.*La %s\n", //
          modes[i], deltas[d], p, delta[d], p, sum, P, 1.0L * max   delta[d],
          exceptstr(&flag));
    }
    puts("");
  }
}

Output (Look to the far right)

FLT_EVAL_METHOD:0
LDBL_MAX   :0x1.fffffffffffffffep 16383 1.18973149535723176502e 4932
DBL_MAX    :0x1.fffffffffffffp 1023  1.7976931348623157e 308
DBL_MAX_ULP:0x1.0000000000000p 971   1.9958403095347198e 292

mode:               Addendum                          SUM (double)                 SUM (long double)          FE
mode:FE_DOWNWARD    0.5 ulp-:0x1.fffffffffffffp 969   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffff7fep 1023 FE_INEXACT
mode:FE_DOWNWARD    0.5 ulp :0x1.0000000000000p 970   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffff800p 1023 FE_INEXACT
mode:FE_DOWNWARD    0.5 ulp :0x1.0000000000001p 970   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffff800p 1023 FE_INEXACT
mode:FE_DOWNWARD    ulp-    :0x1.fffffffffffffp 970   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffffffep 1023 FE_INEXACT
mode:FE_DOWNWARD    ulp     :0x1.0000000000000p 971   sum:0x1.fffffffffffffp 1023  0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW
mode:FE_DOWNWARD    ulp     :0x1.0000000000001p 971   sum:0x1.fffffffffffffp 1023  0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW

mode:FE_TONEAREST   0.5 ulp-:0x1.fffffffffffffp 969   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffff800p 1023 FE_INEXACT
mode:FE_TONEAREST   0.5 ulp :0x1.0000000000000p 970   sum:inf                      0x1.fffffffffffff800p 1023 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST   0.5 ulp :0x1.0000000000001p 970   sum:inf                      0x1.fffffffffffff800p 1023 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST   ulp-    :0x1.fffffffffffffp 970   sum:inf                      0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST   ulp     :0x1.0000000000000p 971   sum:inf                      0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST   ulp     :0x1.0000000000001p 971   sum:inf                      0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW

mode:FE_TOWARDZERO  0.5 ulp-:0x1.fffffffffffffp 969   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffff7fep 1023 FE_INEXACT
mode:FE_TOWARDZERO  0.5 ulp :0x1.0000000000000p 970   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffff800p 1023 FE_INEXACT
mode:FE_TOWARDZERO  0.5 ulp :0x1.0000000000001p 970   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffff800p 1023 FE_INEXACT
mode:FE_TOWARDZERO  ulp-    :0x1.fffffffffffffp 970   sum:0x1.fffffffffffffp 1023  0x1.fffffffffffffffep 1023 FE_INEXACT
mode:FE_TOWARDZERO  ulp     :0x1.0000000000000p 971   sum:0x1.fffffffffffffp 1023  0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW
mode:FE_TOWARDZERO  ulp     :0x1.0000000000001p 971   sum:0x1.fffffffffffffp 1023  0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW

mode:FE_UPWARD      0.5 ulp-:0x1.fffffffffffffp 969   sum:inf                      0x1.fffffffffffff800p 1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD      0.5 ulp :0x1.0000000000000p 970   sum:inf                      0x1.fffffffffffff800p 1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD      0.5 ulp :0x1.0000000000001p 970   sum:inf                      0x1.fffffffffffff802p 1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD      ulp-    :0x1.fffffffffffffp 970   sum:inf                      0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD      ulp     :0x1.0000000000000p 971   sum:inf                      0x1.0000000000000000p 1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD      ulp     :0x1.0000000000001p 971   sum:inf                      0x1.0000000000000002p 1024 FE_INEXACTFE_OVERFLOW