When multiplying a relatively big floating-point value by two relatively small floating-point values, what

Time:03-25

The question itself is pretty simple. Say I have two variables, big and scale, and all I want to do is calculate:

float res = big * scale * scale;

As you can see, there are two possible orders of evaluation:

// #1
float res = (big * scale) * scale;
// #2
float res = big * (scale * scale);

Because IEEE 754 single-precision arithmetic rounds after each operation, the two lines above can naturally give different results.

Now I have some prior knowledge: big may vary from 0 to ~1000, and scale may vary from 0 to 2^-10. So big is not so "big" that it might "eat the small", and scale is not so "small" that multiplying it by itself underflows. That leaves my question: which order should I adopt to get the smaller error compared with the "real" value?

CodePudding user response:

From a precision standpoint - not much difference.
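One way to check this claim for specific inputs is to compare each order against a double-precision reference. The sketch below is only an illustration: the helper names (order1, order2, reference, rel_err) are made up for this example, and it assumes float is IEEE 754 binary32 and double is binary64, so the double product carries far more precision than either float result.

```c
/* #1: (big * scale) * scale, rounded to float after each step. */
static float order1(float big, float scale) {
    volatile float t = big * scale;   /* volatile store forces rounding to float */
    volatile float r = t * scale;
    return r;
}

/* #2: big * (scale * scale), rounded to float after each step. */
static float order2(float big, float scale) {
    volatile float t = scale * scale;
    volatile float r = big * t;
    return r;
}

/* Reference value computed in double (assumed much more precise than float). */
static double reference(float big, float scale) {
    return (double)big * (double)scale * (double)scale;
}

/* Relative error of a float result against the reference (ref must be non-zero). */
static double rel_err(float got, double ref) {
    double d = ((double)got - ref) / ref;
    return d < 0 ? -d : d;
}
```

Calling rel_err(order1(big, scale), reference(big, scale)) and the same for order2 over the inputs you care about shows how far each order lands from the reference. With two roundings each, both stay within a few units of 2^-24 in relative terms as long as nothing underflows.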

To avoid underflow (the product becoming 0.0), use (big * scale) * scale, since scale * scale may become 0.

I now see "... and the scale is not that "small" to cause underflow when multiply themselves. " - oh well.
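For readers whose inputs do fall outside OP's range, the underflow concern can be sketched as follows. The function names and the values mentioned afterwards are hypothetical and chosen only to trigger the effect, assuming float is IEEE 754 binary32.

```c
/* Multiply big in first: the intermediate stays in float's normal range longer. */
static float mul_big_first(float big, float scale) {
    volatile float t = big * scale;   /* volatile store forces rounding to float */
    volatile float r = t * scale;
    return r;
}

/* Square scale first: the intermediate may underflow to 0 for tiny scale. */
static float mul_scale_first(float big, float scale) {
    volatile float t = scale * scale; /* can round to 0.0f when scale is tiny */
    volatile float r = big * t;
    return r;
}
```

With big = 1.0e30f and scale = 1.0e-25f, squaring scale first gives an intermediate near 1e-50, which is below the smallest float subnormal and rounds to zero, so the final result is 0. Multiplying big in first yields a result near 1e-20.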

CodePudding user response:

Order can make a small difference

Since the concern is precision and not range:

Consider the number of non-zero significant bits in big and scale. As floats, each may have up to 24 significant binary digits.

The idea is to perform the two multiplications while avoiding two roundings. If possible, do an exact multiplication first: barring overflow or underflow, a float product is exact whenever the operands' significant binary digits total at most 24. If big has fewer non-zero significant binary digits than scale, do big * scale first; otherwise do scale * scale first.

OP's question hints that scale may be a power of 2. A power of 2 has only 1 significant binary digit, so if that holds for OP's code, the order is irrelevant.
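The significant-digit reasoning above can be sketched in code. This is a minimal illustration, not part of the original answer: sig_bits and product_is_exact are hypothetical helper names, the code assumes float is IEEE 754 binary32, and the exactness test is only a sufficient condition that ignores overflow and underflow.

```c
#include <stdint.h>
#include <string.h>

/* Count significant binary digits: the width from the leading 1 to the last 1
   of the significand. Returns 0 for zero and -1 for infinity or NaN. */
static int sig_bits(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);              /* assumes binary32 layout */
    uint32_t exp = (u >> 23) & 0xFFu;
    uint32_t man = u & 0x7FFFFFu;
    if (exp == 0xFFu) return -1;           /* inf or NaN */
    if (exp != 0) man |= 0x800000u;        /* implicit leading 1 of normal values */
    if (man == 0) return 0;
    while ((man & 1u) == 0) man >>= 1;     /* drop trailing zero bits */
    int n = 0;
    while (man != 0) { n++; man >>= 1; }   /* count the remaining width */
    return n;
}

/* Sufficient (not necessary) test: a * b needs at most
   sig_bits(a) + sig_bits(b) digits, so it fits in float's 24. */
static int product_is_exact(float a, float b) {
    int na = sig_bits(a), nb = sig_bits(b);
    if (na < 0 || nb < 0) return 0;
    return na + nb <= 24;
}
```

For example, sig_bits(1000.0f) is 7 and sig_bits of 2^-10 is 1, so both big * scale and scale * scale pass the test and either order gives the same result. A value such as 0.1f uses all 24 digits, so multiplying it by itself generally rounds.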

The exponent is irrelevant given OP's "scale is not that "small" to cause underflow when multiply themselves"

Since these are float, check FLT_EVAL_METHOD: the issue may be moot, as intermediate calculations may be carried out in a wider type.
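A small sketch of that check, using the FLT_EVAL_METHOD macro from <float.h> (C99 and later). The helper name eval_method_name and its description strings are made up for this example.

```c
#include <float.h>

/* Map an FLT_EVAL_METHOD value to a short description. */
static const char *eval_method_name(int m) {
    switch (m) {
    case 0:  return "no extra precision: float operations are evaluated as float";
    case 1:  return "float and double operations are evaluated as double";
    case 2:  return "every operation is evaluated as long double";
    default: return "indeterminate or implementation-defined";
    }
}

/* Description for the compiler and target this file is built with. */
static const char *current_eval_method(void) {
    return eval_method_name(FLT_EVAL_METHOD);
}
```

Printing current_eval_method() from main shows what the build uses. With a value of 1 or 2, big * scale * scale written as a single expression may be evaluated without an intermediate rounding to float, in which case the two orders are less likely to differ.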
