Picking a scale for floating point values in an array, does it matter for precision of arithmetic


I have a large array of floating-point values that vary widely in magnitude. Does rescaling them into [0,1] help with precision (e.g. if I want to perform arithmetic operations on the array)? I can see the smaller values getting truncated if I do so, but on the other hand small values will not contribute much to the absolute error. If I rescale an array of already computed values, I believe this can only make things worse, since it would just introduce additional round-off error. On the other hand, I believe I could decrease the error if the scaling is instead applied at the point where I generate said values.

I am mainly referring to the fact that the absolute distance between consecutive representable values doubles from one interval to the next (i.e. [0.5,1) vs [1,2) vs [2,4), etc.). Am I interpreting this correctly in the current context? I have seen such floating-point errors due to large scaling when trying to render a massively scaled 3D scene versus a less scaled version of it (similar effects occur when a camera in 3D space is too far from the origin, since the absolute distances between representable floats become larger there).
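For concreteness, that doubling can be checked in a few lines of Python (this assumes IEEE 754 doubles and math.ulp, which needs Python 3.9+):

    import math  # math.ulp requires Python 3.9+

    for x in [0.5, 1.0, 2.0, 4.0]:
        print(x, math.ulp(x))   # spacing of representable doubles near x
    # 0.5 1.1102230246251565e-16
    # 1.0 2.220446049250313e-16
    # 2.0 4.440892098500626e-16
    # 4.0 8.881784197001252e-16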

Considering the above, is there an optimal way to choose the scaling factor for an array of values I plan to generate (provided I know what the minimum and maximum will be without scaling)? I was thinking of just generating it so that all values are within [0,1], but I was worried that truncation of the smallest elements may be an issue. Are there known heuristics, based on the largest and smallest elements, for finding a semi-optimal rescaling with respect to precision? As a side note, I am aware of the Kahan summation algorithm and its variants, and I already use it for summing the array. My question is rather whether a choice of scale can help further, or whether it will not matter.
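For reference, the Kahan summation I use is essentially the textbook version; a minimal Python sketch (variable names are only illustrative):

    def kahan_sum(values):
        total = 0.0
        comp = 0.0                    # running compensation for lost low-order bits
        for v in values:
            y = v - comp              # apply the correction carried from the last step
            t = total + y             # big + small: low-order bits of y may be lost here
            comp = (t - total) - y    # algebraically zero; numerically the lost part
            total = t
        return total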

CodePudding user response:

Scaling by powers of two in a binary floating-point format (or, generally, by powers of b in a base-b floating-point format) has no error as long as the results stay within normal exponent bounds. That is, for any x, the result of computing x·b^e has the same significand as x, as long as x·b^e is in the normal range.
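As a quick illustration (a sketch assuming IEEE 754 doubles, so b = 2, with an arbitrary value and scale), math.frexp shows that a power-of-two scale leaves the significand untouched:

    import math

    x = 0.1
    scaled = x * 2.0**40          # exact: only the exponent changes

    print(math.frexp(x))          # (0.8, -3)  -> significand 0.8
    print(math.frexp(scaled))     # (0.8, 37)  -> same significand, shifted exponent
    print(scaled / 2.0**40 == x)  # True: the scaling is exactly reversible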

Again, as long as results stay in the normal range, adding, subtracting, multiplying, and dividing scaled numbers produces results whose significands are identical to those of the same operations on unscaled numbers that stay in the normal range. Any rounding errors that occur in the unscaled operations are identical to the rounding errors in the scaled operations, as adjusted by the scale.

Therefore, scaling numbers by a power of b, performing the same operations, and undoing the scaling will not improve or alter floating-point rounding errors. (Note that multiplications and divisions will affect the scaling, and this can be compensated for either after each operation, after all the operations, or periodically. For example, given X = x*16 and Y = y*16, X*Y would equal x*16*y*16 = x*y*256. So undoing its scaling requires dividing by 256 rather than 16.)
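A small Python check of both points, again assuming IEEE 754 doubles, with values and a power-of-two scale s chosen only for illustration:

    s = 2.0**10                                  # power-of-two scale
    a, b = 0.1, 0.7

    print(a * s + b * s == (a + b) * s)          # True: same rounding, just scaled
    print((a * s) * (b * s) == (a * b) * s * s)  # True: the product carries a factor s*s
    print((a * s) * (b * s) / (s * s) == a * b)  # True: dividing by s*s undoes it exactly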

If other operations are used, the rounding errors may differ. For example, if a square root is performed and the scaling in its operand is not an even power of b, its result will include a scaling that is not an integral power of b, and so the significand must be different from the significand of the corresponding unscaled result, and that allows the rounding errors to be different.
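For example, with IEEE 754 doubles (b = 2) and an arbitrary value:

    import math

    x = 0.1
    even = math.sqrt(x * 4.0)     # scale 4 = 2**2 passes through sqrt exactly
    odd = math.sqrt(x * 2.0)      # scale 2 leaves a residual factor of sqrt(2)

    print(math.frexp(even)[0] == math.frexp(math.sqrt(x))[0])  # True: same significand
    print(math.frexp(odd)[0] == math.frexp(math.sqrt(x))[0])   # False: different significand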

Of course, if sines, cosines, or other trigonometric functions are used on scaled numbers, drastically different results will be obtained, as these functions do not scale in the required way (f(x·s) generally does not equal f(x)·s). However, if the numbers that are being scaled represent points in space, any angles computed between them would be identical in the scaled and unscaled implementations. That is, the computed angles would be free of scaling, and so applying trigonometric functions to them would produce identical results.
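A sketch of the angle point, assuming IEEE 754 doubles, a correctly rounded sqrt, and no overflow or underflow; the angle helper and the sample vectors are only for illustration:

    import math

    def angle(u, v):
        dot = u[0] * v[0] + u[1] * v[1]
        nu = math.sqrt(u[0] * u[0] + u[1] * u[1])
        nv = math.sqrt(v[0] * v[0] + v[1] * v[1])
        return math.acos(dot / (nu * nv))

    u, v = (3.0, 4.0), (1.0, 2.0)
    s = 2.0**20
    us = (u[0] * s, u[1] * s)
    vs = (v[0] * s, v[1] * s)

    print(angle(u, v) == angle(us, vs))  # True: the s factors cancel before acos is applied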

If any intermediate results exceed the normal exponent range in either the scaled or the unscaled computations, then different significands may be produced. This includes the case where the results are subnormal but have not underflowed to zero: subnormal results may have truncated significands, so some information is lost compared to a differently scaled computation that produces a result in the normal range.
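For example, with IEEE 754 doubles (which have gradual underflow), pushing a value into the subnormal range and scaling it back up loses bits, while a scale that stays in the normal range is exactly reversible:

    import math

    x = 0.1
    down = math.ldexp(x, -1060)   # lands deep in the subnormal range: significand bits are dropped
    back = math.ldexp(down, 1060)

    print(back)                   # 0.0999755859375, not 0.1: information was lost
    print(math.ldexp(math.ldexp(x, -40), 40) == x)  # True: stays normal, exact round trip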

An alternative to scaling may be translation. When working with points far from the origin, the coordinates may be large, and the floating-point resolution may be coarse relative to the distances between the points. If the points are translated to near the origin (a fixed amount is subtracted from each coordinate, fixed per dimension), the geometric relationships between them are preserved, but the coordinates land in a region of the floating-point format with finer resolution. This can reduce the floating-point rounding errors that occur.
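A small illustration with made-up coordinates (IEEE 754 doubles): generating the same 0.2 separation far from the origin versus near it gives very different resolution:

    big_a = 1.0e8 + 0.1       # coordinates far from the origin
    big_b = 1.0e8 + 0.3
    print(big_b - big_a)      # about 0.2000000030: spacing of doubles near 1e8 is ~1.5e-8

    small_a = 0.1             # same geometry, generated near the origin instead
    small_b = 0.3
    print(small_b - small_a)  # 0.19999999999999998: full double precision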
