Is it better to perform n additions of a floating-point number or one integer multiplication?


Consider the two cases below:

// Case 1: accumulate by repeated addition
double val { initial_value };
for (int i { 0 }; i < n; ++i) {
    val += step;
    foo(val);
}

// Case 2: recompute the value from the index each iteration
for (int i { 0 }; i < n; ++i) {
    double val = initial_value + i * step;
    foo(val);
}

where n is the number of steps, initial_value and step are predetermined values of type double, and val is a variable used in a subsequent call to the function foo. Which of the cases produces less floating-point error? My guess would be the second one, as it involves only one addition and one multiplication, while the first case accumulates rounding error from all n additions. I am asking this question because I didn't know what to search for. Does there exist a good reference for cases like these?

In practice the variable val is used inside the loop in both cases. I didn't include an example of this, as I'm only interested in the floating-point error.

CodePudding user response:

Option 2 has significantly lower error.

How much? Well, let's assume an initial_value of 0 for simplicity's sake at first. You have 53 significant bits, and how quickly you will see rounding errors depends on how quickly we manage to shift those bits off the far end during addition.

So let's pick step such that the significant bits are ideally all 1s: 0.999999999999999999999999.

Now the rounding error is log2(val/step) bits from the far end of step during each single addition. Not much during the first iteration, but the error becomes noticeable rather quickly.

Pick a huge initial_value and the error can become quite extreme. For initial_value >= pow(2, 53) * step, your first loop even fails to change val at all between iterations.

Your second loop still handles that correctly.
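
For illustration, here is a minimal, self-contained sketch of that extreme case; the concrete values of step, initial_value and n below are mine, chosen only to trigger the effect:

#include <cmath>
#include <iostream>

int main() {
    const double step = 1.0;
    const double initial_value = std::pow(2.0, 53) * step;  // so large that adding one step rounds away
    const int n = 8;

    double val = initial_value;                              // Case 1: repeated addition
    for (int i = 0; i < n; ++i) {
        val += step;                                         // rounds back to initial_value every time
        const double val2 = initial_value + i * step;        // Case 2: i * step eventually registers
        std::cout << (val - initial_value) << ' ' << (val2 - initial_value) << '\n';
    }
}

The first column stays at 0 no matter how many iterations run, while the second column starts moving as soon as i * step is large enough to survive the rounding.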

CodePudding user response:

Considering the comment by supercat (emphasis mine):

The point is that in many scenarios one might want a sequence of values that are uniformly spaced between specified start and end points. Using the second approach would yield values that are as uniformly spaced as possible between the start point and an end value that's near a desired one, but may not quite match.

And the one by Bathsheba:

Both are flawed. You should compute the start and end, then compute each value as a function of those. The problem with the second way is you multiply the error in step. The former accumulates errors.

I'd suggest a couple of alternatives.

  • Since C++20, the Standard Library provides std::lerp, where std::lerp(a, b, t) returns "the linear interpolation between a and b for the parameter t (or extrapolation, when t is outside the range [0,1])".

  • A formula like value = (a * (n - i) + b * i) / n; may result in a more uniform(1) distribution of the intermediate values; both alternatives are sketched below.

(1) Here I tried to test all those approaches for different extremes and numbers of sample points. The program compares the values generated by each algorithm when applied in opposite directions (first from left to right, then from right to left). It shows the average and variance of the sum of the absolute differences between the values of the intermediate points.

Other metrics may yield different results.
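
A minimal sketch of the two alternatives, assuming a C++20 compiler for std::lerp; the endpoints a, b and the count n are arbitrary example values:

#include <cmath>
#include <iostream>

int main() {
    const double a = 1.0;
    const double b = 2.0;
    const int n = 10;

    for (int i = 0; i <= n; ++i) {
        const double t  = static_cast<double>(i) / n;
        const double v1 = std::lerp(a, b, t);          // Standard Library interpolation (C++20)
        const double v2 = (a * (n - i) + b * i) / n;   // formula from the second bullet
        std::cout << v1 << ' ' << v2 << '\n';
    }
}

std::lerp is specified to return exactly a at t == 0 and exactly b at t == 1; the two columns should agree at the interior points to within the last few bits.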

CodePudding user response:

Consider an extreme case. Suppose that initial_value is much larger than step. Much, much larger. So large that initial_value + step == initial_value due to the limits of floating-point representation. However, we do not want this "extreme" case to get too extreme. Put a cap on initial_value, say keep it small enough to have initial_value + (2*step) != initial_value. (Some people might call this putting step between a certain epsilon and half that epsilon, but I would get the terminology mixed up.) Now run through your code.

In the first loop, val will equal initial_value every iteration, as no operation is performed that changes its value. In contrast, the second loop will eventually produce a different value for val, given enough iterations. Therefore the second option, the one that calculates initial_value + i * step, is more accurate in this extreme case.


We should also look at the opposite extremity. Suppose that initial_value is so small relative to step that initial_value + step == step. In this case, initial_value might as well be zero, and the question simplifies to asking if there is a more accurate way to calculate i*step than by multiplying i and step. (If there is, I might want a new compiler.) Therefore, the second option is not worse than the first in this extreme case.


Extreme case analysis is not conclusive, but it often reveals trends. I pushed the calculation to opposite extremes, and the second option varied from definitely better to definitely not worse. I'd be willing to conclude that the second option produces less error.
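
One way to put rough numbers on that conclusion is to compare both loops against a higher-precision reference. This is only a sketch: it assumes long double carries more precision than double on the target platform, and the values of initial_value, step and n are arbitrary.

#include <iostream>

int main() {
    const double initial_value = 1000.0;
    const double step = 0.1;                             // not exactly representable in binary
    const int n = 1000000;

    double accumulated = initial_value;                  // Case 1: repeated addition
    for (int i = 0; i < n; ++i)
        accumulated += step;
    const double computed = initial_value + n * step;    // Case 2: one multiply-add

    // Reference computed in extended precision from the same double inputs.
    const long double exact =
        static_cast<long double>(initial_value) + static_cast<long double>(n) * step;

    std::cout << "error, repeated addition: "
              << static_cast<double>(accumulated - exact) << '\n'
              << "error, multiply-add:      "
              << static_cast<double>(computed - exact) << '\n';
}

The repeated-addition error grows with n, while the multiply-add error stays at the level of one or two roundings, which matches the extreme-case reasoning above.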

Caveats: It might be that the size of the error is negligible and not worth coding around. Also, the question has limited scope, ignoring other considerations (such as where step came from; if it is the result of dividing by n, there might be even better alternatives). Still, in the narrow scenario presented by the question, calculating initial_value + i*step each iteration looks like the way to get minimal numerical error.
