How to reduce the float rounding error when converting it into fixed-point in C++?

I have a float variable that is incremented by 0.1 at each step. I want to convert it into a 16-bit fixed-point value with a 5-bit fractional part. To do that, I have the code snippet below:

#include <iostream>
#include <bitset>
#include <string>

using namespace std;

int main() {
    bitset<16> mybits;
    string mystring;
    float x = 1051.0;
    for (int i = 0; i < 20; i++)
    {
        mybits = bitset<16>(x*32);
        mystring = mybits.to_string<char, string::traits_type, string::allocator_type>();
        cout << x << "\t" << "mystring: " << mystring << '\n';
        x += 0.1;
    }
    return 0;
}

However, the result is this:

1051    mystring: 1000001101100000
1051.1  mystring: 1000001101100011
1051.2  mystring: 1000001101100110
1051.3  mystring: 1000001101101001
1051.4  mystring: 1000001101101100
1051.5  mystring: 1000001101101111
1051.6  mystring: 1000001101110011
1051.7  mystring: 1000001101110110
1051.8  mystring: 1000001101111001
1051.9  mystring: 1000001101111100
1052    mystring: 1000001101111111
1052.1  mystring: 1000001110000011
1052.2  mystring: 1000001110000110
1052.3  mystring: 1000001110001001
1052.4  mystring: 1000001110001100
1052.5  mystring: 1000001110001111
1052.6  mystring: 1000001110010011
1052.7  mystring: 1000001110010110
1052.8  mystring: 1000001110011001
1052.9  mystring: 1000001110011100

There are problems in the fractional part. For example, 1051.5 should be 1000001101110000, not 1000001101101111 (the fractional part is wrong due to the nature of the float variable). There are also problems at 1052.0 and 1052.5. How can I fix this?

CodePudding user response：

How to reduce the float rounding error when converting it into fixed-point in C++?

Rearrange the calculation of the fixed-point encoding so that all arithmetic in it is performed exactly until a single division just before the result is rounded to an integer, as with mybits = bitset<16>(std::round((x*10+i)*32/10));. This will produce correct results until something beyond i = 317,169. (Remove x += 0.1; from the loop; x is used as an unchanging value in this new formula.)
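As a minimal sketch of this rearrangement (the helper names encode_step and encode_step_string are hypothetical, not from the question, and start is assumed to be a whole number such as 1051), the encoding for step i could be computed from the unchanging start value like this:

```cpp
#include <bitset>
#include <cassert>
#include <cmath>
#include <string>

// Hypothetical helper: Q10.5 encoding of start + i/10, where `start`
// is assumed to hold a whole number such as 1051.
std::bitset<16> encode_step(float start, int i) {
    assert(i >= 0);
    // (start*10 + i)*32 is an integer that float can hold exactly here;
    // the division by ten is the only rounding before std::round.
    float scaled = std::round((start * 10 + i) * 32 / 10);
    return std::bitset<16>(static_cast<unsigned long long>(scaled));
}

// Convenience wrapper returning the bit pattern as text.
std::string encode_step_string(float start, int i) {
    return encode_step(start, i).to_string();
}
```

With start = 1051 and i = 5, this yields 1000001101110000, the pattern the question expects for 1051.5.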

The problem stems from the fact that .1 is not representable in a binary-based floating-point format, so the source text 0.1 is converted to 0.1000000000000000055511151231257827021181583404541015625 (when IEEE-754 “double precision” is used for double). Each addition of that to x (in x += 0.1;) rounds the ideal real-arithmetic sum to the nearest value representable in double and, since x is float, rounds that again to the nearest value representable in float (typically the IEEE-754 “single precision” format).
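To see the accumulated error in isolation, here is a small sketch that repeats the question's update and then applies the same truncating conversion as bitset<16>(x*32). The function names accumulate_tenths and truncate_to_q5 are made up for illustration, and IEEE-754 float and double are assumed.

```cpp
#include <cassert>

// Hypothetical helper: repeat the question's `x += 0.1` update `steps` times.
float accumulate_tenths(float start, int steps) {
    assert(steps >= 0);
    float x = start;
    for (int k = 0; k < steps; ++k) {
        // 0.1 is a double literal: the sum is formed in double and then
        // rounded a second time when it is stored back into the float.
        x = static_cast<float>(x + 0.1);
    }
    return x;
}

// Truncating conversion to five fraction bits, as bitset<16>(x*32) does.
unsigned truncate_to_q5(float x) {
    return static_cast<unsigned>(x * 32);
}
```

Under those assumptions, accumulate_tenths(1051.0f, 5) lands slightly below 1051.5, so truncate_to_q5 gives 33647 (binary 1000001101101111) instead of 33648, which matches the output shown in the question.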

The desired value for the fixed-point number in iteration i is 1051 + i/10, converted to a fixed-point encoding with five fraction bits. The encoding of this is (1051 + i/10) • 32 rounded to the nearest integer. So the value we want to compute is round((1051 + i/10) • 32), where “round” is the desired round-to-integer function (such as round-to-nearest-ties-to-even, or round-to-nearest-ties-to-away).

We can write this as a fraction as ((1051•10 + i)•32) / 10. The advantage of this is that (1051•10 + i)•32 is an integer and can be calculated exactly, with either integer or floating-point arithmetic, as long as it stays within the bounds of exact arithmetic. (For the “single precision” format, this means (1051•10 + i)•32 ≤ 2²⁴, so i ≤ 2¹⁹−10,510 = 513,778.)
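Since the numerator is an integer, one option is to skip floating-point entirely. The following is only a sketch under that assumption; encode_step_int is a hypothetical name, and the whole-number part is passed in as an integer instead of a float.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: Q10.5 encoding of whole + i/10 using only integers.
std::uint16_t encode_step_int(std::uint32_t whole, std::uint32_t i) {
    // Exact numerator of the fraction ((whole*10 + i)*32) / 10.
    std::uint32_t numerator = (whole * 10 + i) * 32;
    // Adding half the divisor before dividing rounds to nearest. The
    // numerator is even, so its remainder mod 10 is never 5: no ties.
    std::uint32_t encoded = (numerator + 5) / 10;
    assert(encoded <= 0xFFFFu);  // result must fit the 16-bit encoding
    return static_cast<std::uint16_t>(encoded);
}
```

For example, encode_step_int(1051, 5) gives 33648, which is 1000001101110000 in binary.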

Then the only unwanted rounding is in the division. That division occurs immediately before the desired rounding to an integer, so it is not exacerbated by any other operations. So we can compute the fixed-point encoding as std::round((x*10+i)*32/10) and only be concerned with the rounding error in the division by ten. (To use std::round, include <cmath>. Note that std::round rounds halfway cases away from zero. To use the current floating-point rounding mode, usually round-to-nearest-ties-to-even by default, use std::nearbyint.)
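The difference between the two rounding functions only shows up on halfway cases. A short sketch (the wrapper names are hypothetical, and the second one assumes the rounding mode has not been changed from the default):

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Halfway cases go away from zero, regardless of the rounding mode.
float round_half_away(float v) {
    return std::round(v);
}

// Follows the current rounding mode. This sketch assumes the default,
// round-to-nearest, under which halfway cases go to the even integer.
float round_current_mode(float v) {
    assert(std::fegetround() == FE_TONEAREST);
    return std::nearbyint(v);
}
```

For 2.5, round_half_away returns 3 and round_current_mode returns 2; for 3.5, both return 4.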

A rounding in the division will cause an error in the final result only if it causes a value of (x•10 + i)•32/10 whose fraction portion is not exactly ½ to become a value with a fraction of exactly ½. (The converse, causing a value with a fraction of ½ to become a value with some other fraction, does not occur because a value with a fraction of ½ is exactly representable in binary floating-point, so no rounding occurs. An exception would be if the number were so large it would be beyond the point where any fractions are representable. However, this does not occur for the IEEE-754 “single precision” format unless the value is also overflowing the Q10.5 format.)

Assuming round-to-nearest is in use, any computed result is at most ½ ULP from the real-arithmetic result. (“ULP” stands for “Unit of Least Precision,” the effective position value of the lowest bit in the significand given the exponent scaling.) Therefore, (x•10 + i)•32/10 can round to a value with fraction ½ only if its fraction portion is at most ½ ULP from that value. The nearest the fraction portion of any such quotient can be to ½ without being ½ is 4/10 or 6/10. The distance of these from ½ is 1/10. So as long as 1/10 exceeds ½ ULP, std::round((x*10+i)*32/10) produces the desired result.

For numbers in [2¹⁹, 2²⁰), the ULP of the “single precision” format is 2⁻⁴ = 1/16, which is less than 1/10. Therefore, considering only non-negative i, as long as (x*10+i)*32/10 < 2²⁰, the result is correct. For x = 1051, this gives us (1051•10 + i)•32/10 < 2²⁰ ⇒ i < 317,170.
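The bound can also be checked by brute force. This is a sketch only: count_mismatches is a hypothetical helper that compares the float formula against exact integer arithmetic, and it assumes float expressions are evaluated in IEEE-754 single precision. It compares raw integers, not bitset<16> values, because the encoding stops fitting in 16 bits long before i reaches the bound.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical checker: counts the steps i in [0, limit) for which the
// float formula disagrees with exact integer arithmetic.
std::uint32_t count_mismatches(float start, std::uint32_t limit) {
    assert(limit <= (1u << 24));  // keeps i exactly representable in float
    std::uint64_t whole = static_cast<std::uint64_t>(start);
    std::uint32_t mismatches = 0;
    for (std::uint32_t i = 0; i < limit; ++i) {
        // The formula under test, evaluated in float.
        float viaFloat = std::round((start * 10 + i) * 32 / 10);
        // Reference: exact numerator, then round-to-nearest division.
        std::uint64_t numerator = (whole * 10 + i) * 32;
        std::uint64_t exact = (numerator + 5) / 10;
        if (static_cast<std::uint64_t>(viaFloat) != exact) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

If the reasoning above holds, count_mismatches(1051.0f, 317170) should return 0.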

Thus we can use mybits = bitset<16>(std::round((x*10+i)*32/10)); up until i = 317,169, at least.

CodePudding user response：

One solution is to convert from decimal floating point, or from some other number representation (rationals come to mind). Decimal floats are usually built into gcc, but you'll need another solution if you use a different compiler. Also, the default gcc decimal floats are quite heavy (about 100k in binary size, even if you use them only once). They are also rather inaccurate and slow.