The question itself is pretty simple. Say I have two variables, big and scale, and all I want to compute is:
float res = big * scale * scale;
As you can see, there are two possible orders of evaluation:
// #1
float res = (big * scale) * scale;
// #2
float res = big * (scale * scale);
Because each IEEE 754 single-precision multiplication rounds its result, the two lines above can naturally give different results.
Now I have some a priori knowledge: big may vary from 0 to ~1000, and scale may vary from 0 to 2^-10. So big is not so "big" that it might "eat the small", and scale is not so "small" that multiplying it by itself causes underflow. That leaves my question: which order of evaluation should I adopt to get the smaller error compared with the "real" value?
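One way to explore this empirically is to compute both orders in float and compare each against a double-precision reference. The sketch below is only an illustration: the names order1, order2, reference and rel_error are hypothetical helpers, not part of the original question, and it assumes float is IEEE 754 binary32 and double is binary64.

```c
/* Order #1: (big * scale) * scale.
   The volatile temporaries force every intermediate result to be
   rounded to float, even if the platform evaluates in wider precision. */
float order1(float big, float scale) {
    volatile float t = big * scale;
    volatile float r = t * scale;
    return r;
}

/* Order #2: big * (scale * scale). */
float order2(float big, float scale) {
    volatile float t = scale * scale;
    volatile float r = big * t;
    return r;
}

/* Reference value in double. The first product of two 24-bit
   significands fits exactly in double's 53 bits, so only the second
   multiplication rounds, and at far finer precision than float. */
double reference(float big, float scale) {
    return (double)big * (double)scale * (double)scale;
}

/* Relative error of a float result against the double reference
   (absolute error when the reference is exactly zero). */
double rel_error(float approx, double exact) {
    double d = (double)approx - exact;
    if (d < 0) d = -d;
    double mag = exact < 0 ? -exact : exact;
    return mag != 0.0 ? d / mag : d;
}
```

Evaluating rel_error(order1(b, s), reference(b, s)) and the same for order2 over representative inputs shows which order does better on your data. Each order performs two roundings, so in the normal range both stay within roughly 2^-23 relative error.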
CodePudding user response:
From a precision standpoint - not much difference.
To avoid underflow (the product becoming 0.0), use (big * scale) * scale, as scale * scale may become 0.
I now see "... and the scale is not that "small" to cause underflow when multiply themselves. " - oh well.
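For completeness, the underflow concern can be sketched with a scale far smaller than the range the question allows. The value 1e-23f and the function names scale_first and square_first are assumptions chosen purely for illustration; the sketch also assumes IEEE 754 binary32 float with gradual underflow enabled (no flush-to-zero mode).

```c
/* (big * scale) * scale: the first product stays in the normal range. */
float scale_first(float big, float scale) {
    volatile float t = big * scale;
    volatile float r = t * scale;
    return r;
}

/* big * (scale * scale): the square may underflow to 0 before big
   has a chance to bring the magnitude back up. */
float square_first(float big, float scale) {
    volatile float t = scale * scale;
    volatile float r = big * t;
    return r;
}
```

With big = 1000.0f and scale = 1e-23f, the square is about 1e-46, which is below half of the smallest subnormal float (about 1.4e-45) and so rounds to 0, whereas scale_first still returns a nonzero subnormal result. With scale limited to 2^-10 as in the question, the square is no smaller than about 2^-20 for any nonzero scale, so this does not arise.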
CodePudding user response:
Order makes a possible small difference
Since the concern is precision and not range:
Consider the number of non-zero significant bits in big and scale. Each, as a float, may have up to 24 significant binary digits.
The idea is to perform the 2 multiplications while avoiding 2 roundings. If possible, do the multiplication that is exact first. If big has fewer non-zero significant binary digits than scale, then do big * scale first, else scale * scale.
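The rule above can be sketched in code as follows. This is a hedged illustration, not the answerer's own implementation: significant_bits, product_is_exact and scale_twice are hypothetical names, and the bit extraction assumes float is IEEE 754 binary32.

```c
#include <stdint.h>
#include <string.h>

/* Number of significand bits from the leading 1 down to the last
   non-zero bit. Returns 0 for zero, infinity and NaN. */
int significant_bits(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);              /* read the raw bit pattern */
    uint32_t exp  = (u >> 23) & 0xFFu;
    uint32_t frac = u & 0x7FFFFFu;
    if (exp == 0xFFu) return 0;            /* infinity or NaN */
    if (exp != 0) frac |= 0x800000u;       /* implicit leading 1 */
    if (frac == 0) return 0;               /* +0 or -0 */
    while ((frac & 1u) == 0) frac >>= 1;   /* drop trailing zeros */
    int n = 0;
    while (frac != 0) { n++; frac >>= 1; } /* count remaining bits */
    return n;
}

/* Sufficient (not necessary) test for a * b being exact in float,
   ignoring overflow and underflow: a p-bit value times a q-bit value
   needs at most p + q bits. */
int product_is_exact(float a, float b) {
    return significant_bits(a) + significant_bits(b) <= 24;
}

/* Chooses the first multiplication following the rule described above. */
float scale_twice(float big, float scale) {
    if (significant_bits(big) < significant_bits(scale)) {
        volatile float t = big * scale;
        return t * scale;
    } else {
        volatile float t = scale * scale;
        return big * t;
    }
}
```

For example, 1000.0f has 7 significant bits while 0.1f uses all 24, so for those inputs scale_twice multiplies big * scale first.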
OP's question hints that scale may be a power of 2. A power of 2 has only 1 significant binary digit, so if that holds for OP's code, both multiplications are exact and the order is irrelevant.
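A small check of the power-of-2 case might look like the sketch below. The helper names is_power_of_two and orders_agree are hypothetical, and the code assumes float is IEEE 754 binary32 and that no overflow or underflow occurs.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if x is a positive, finite power of two. */
int is_power_of_two(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);              /* read the raw bit pattern */
    uint32_t sign = u >> 31;
    uint32_t exp  = (u >> 23) & 0xFFu;
    uint32_t frac = u & 0x7FFFFFu;
    if (sign || exp == 0xFFu) return 0;    /* negative, infinity or NaN */
    if (exp == 0)                          /* zero or subnormal */
        return frac != 0 && (frac & (frac - 1)) == 0;
    return frac == 0;                      /* only the implicit 1 is set */
}

/* Returns 1 if both evaluation orders give the same float result. */
int orders_agree(float big, float scale) {
    volatile float a  = big * scale;
    volatile float r1 = a * scale;         /* (big * scale) * scale */
    volatile float b  = scale * scale;
    volatile float r2 = big * b;           /* big * (scale * scale) */
    return r1 == r2;
}
```

When is_power_of_two(scale) holds, each multiplication only adjusts the exponent, so orders_agree returns 1 for any big in the question's range.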
The exponent is irrelevant given OP's "scale is not that "small" to cause underflow when multiply themselves"
Since these are float, check FLT_EVAL_METHOD, as the issue may be moot: intermediate calculations may be carried out in a wider type.
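A minimal way to inspect this is sketched below. FLT_EVAL_METHOD is the standard macro from <float.h> (C99 and later); the function names intermediates_may_be_wider and report_flt_eval_method are hypothetical helpers added for illustration.

```c
#include <float.h>
#include <stdio.h>

/* Returns 1 when float intermediates may carry extra precision.
   0: operations are evaluated in the type of the operands.
   1: float and double are evaluated as double.
   2: everything is evaluated as long double.
   Negative values: indeterminable or implementation-defined. */
int intermediates_may_be_wider(int method) {
    return method != 0;
}

/* Prints the evaluation method of the current compiler and target. */
void report_flt_eval_method(void) {
    int method = (int)FLT_EVAL_METHOD;
    printf("FLT_EVAL_METHOD = %d\n", method);
    if (intermediates_may_be_wider(method))
        printf("float products may be evaluated in a wider type\n");
    else
        printf("each float product is rounded to float\n");
}
```

When the method is 1 or 2, the first product of two floats is exact in the wider type regardless of order, so the choice between #1 and #2 matters much less.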