How to convert float to fixed point (higher precision) in C++

I'm trying to implement my own fixed point arithmetic in C++ to (later) do higher precision calculations. I was thinking something like

class FixedPoint
{
int intPart;
unsigned long long fracPart[some number];
}

I think it should work if I - for example for addition - first add two fracPart[some number]'s and if they overflow add 1 to fracPart[some number - 1] and so on.
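A minimal sketch of that carry propagation, assuming two 64-bit fractional limbs and a fractional part that is always non-negative. The names `Fixed`, `kLimbs` and `add` are made up for illustration, and overflow of the integer part is not handled:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: frac[0] holds the bits 2^-1 .. 2^-64,
// frac[1] holds the bits 2^-65 .. 2^-128.
constexpr std::size_t kLimbs = 2;

struct Fixed {
  std::int64_t ipart;
  std::uint64_t frac[kLimbs];
};

// Returns a + b, propagating carries from the least significant limb
// up into the integer part.
Fixed add(const Fixed& a, const Fixed& b)
{
  Fixed r{};
  unsigned carry = 0;
  for (std::size_t i = kLimbs; i-- > 0;) {
    std::uint64_t sum = a.frac[i] + b.frac[i];  // wraps modulo 2^64
    unsigned c1 = sum < a.frac[i];              // did the first add wrap?
    sum += carry;
    unsigned c2 = sum < carry;                  // did adding the carry wrap?
    r.frac[i] = sum;
    carry = c1 | c2;                            // at most one of them is set
  }
  r.ipart = a.ipart + b.ipart + carry;
  return r;
}
```

Unsigned overflow is well defined in C++ (it wraps), which is why the comparison `sum < a.frac[i]` is enough to detect a carry.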

But I'm stuck at converting a double "d" to a class like this. intPart = d of course works. Then doing

double Temp = d - intPart;

gives me the fractional part. But how do I correctly assign this to fracPart[0]? In decimal, if long longs had exactly 20 digits, I could just do Temp * 100000000000000000000, so that 0.14 becomes 14000000000000000000. But if in binary I take the mantissa bits of d (53/54 bits), assign them to fracPart[0] (64 bits), add the hidden bit and shift this left by 13 (or 12 because of the hidden bit), the value is wrong. Nothing I found online so far is helpful...

CodePudding user response：

Forget decimal. Use powers of 2. Your first fractional part should contain the bits with the values 2^-1, 2^-2, ..., 2^-64. The nice thing about floating point is that you can easily scale your values by powers of two. In other words, subtract the integer part, then multiply by 2^64, then take the next integer part, and so on. Something like this should work for you:

#include <cmath>
// using std::floor, std::ldexp
#include <cstdint>
// using std::int64_t, std::uint64_t
#include <cstdio>
// using std::printf


class FixedPoint
{
  std::int64_t ipart;
  std::uint64_t fpart[2];

public:
  explicit FixedPoint(double f) noexcept
  {
    // rounded down so that the fractional part is always positive
    ipart = std::floor(f);
    f -= ipart;
    for(std::uint64_t& fractional: fpart) {
      f = std::ldexp(f, 64);
      fractional = f;
      f -= fractional;
    }
  }
  operator double() const noexcept
  {
    double f = 0.;
    // accumulate from the least significant limb upward,
    // scaling by 2^-64 after each limb
    for(int i = 1; i >= 0; --i) {
      f += fpart[i];
      f = std::ldexp(f, -64);
    }
    f += ipart;
    return f;
  }
};



int main()
{
  double f1 = 123.4567;
  FixedPoint p(f1);
  double f2 = p;
  std::printf("%g = %g\n", f1, f2);
}

Some final thoughts:

I hope you know that there are actual libraries to do this kind of stuff for you? I assume this is just an exercise to get your feet wet with floating point and fixed point. Otherwise cease and desist. ;-)
I switched to std::uint64_t because it is way more convenient to have a standard precision in your data type.
A downside of using uint64_t is that x86_64 has no fast double <-> uint64_t machine instruction (at least before AVX-512). Using uint32_t might actually be faster.
Using int for the integer part but a larger type for the fractional part is pointless. Due to alignment you just waste 32 bits of space in your struct that you could use for a larger range. Either switch both to 32 bits, or use 64 bits for the integer part and keep the size of the fractional part (the size of the whole array) a multiple of 64 bits, e.g. 2 x 32 bits.
Note that std::frexp gives you the exponent of a floating point number and a normalized mantissa in the range [0.5, 1) (or zero). That would allow you to replace the integer part with the exponent for an arbitrarily large range without loss of precision. Of course at this point you are just reimplementing extended precision floating point in software.
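To illustrate the std::frexp remark, here is a small sketch that splits a finite double into a 53-bit integer mantissa and a power-of-two exponent, and puts it back together. The names `Decomposed`, `decompose` and `recompose` are hypothetical, and infinities and NaN are deliberately not handled:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Decomposed {
  std::uint64_t mantissa;  // integer in [2^52, 2^53), or 0 for zero
  int exponent;            // |value| == mantissa * 2^exponent
  bool negative;
};

Decomposed decompose(double d)
{
  assert(std::isfinite(d));                 // assumption: finite input only
  int e = 0;
  double m = std::frexp(std::fabs(d), &e);  // |d| == m * 2^e, m in [0.5, 1) or 0
  // Scaling by 2^53 turns the normalized mantissa into an exact integer.
  auto bits = static_cast<std::uint64_t>(std::ldexp(m, 53));
  return {bits, e - 53, std::signbit(d)};
}

double recompose(const Decomposed& x)
{
  double v = std::ldexp(static_cast<double>(x.mantissa), x.exponent);
  return x.negative ? -v : v;
}
```

This also shows why shifting the raw mantissa bits by a constant amount gives the wrong value: where the bits belong depends on the exponent that std::frexp returns.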

CodePudding user response：

You should

#include <boost/multiprecision.hpp>

Yes, really. Implementing your own bignum can be fun and instructive, but you will mess up. On something you never thought of. There are a lot of little edge cases and things you need to get right, especially when doing non-integer bignums.

Boost gets it right, and makes it oh so easy to use in your own code.

(You can actually include a more specific file for just the MP type you wish to use, assuming you do not want to include the entire MP library.)

BTW, fixed point is an integer that is scaled by some fixed factor. For example, say you are writing software for the banking industry. You would store values as hundredths of a cent, i.e. as an integer equal to the dollar amount times 10000. What you are trying to do is not fixed point.
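As a rough sketch of fixed point in that sense, assuming amounts are stored as a 64-bit count of 1/10000 of a dollar (the names `Money`, `kScale`, `add` and `mul` are invented for this example, and overflow is ignored):

```cpp
#include <cstdint>

// Hypothetical scale: one unit is 1/10000 of a dollar.
constexpr std::int64_t kScale = 10000;

struct Money {
  std::int64_t units;  // dollars * kScale
};

// Addition needs no rescaling: both operands share the same factor.
Money add(Money a, Money b) { return {a.units + b.units}; }

// A product of two scaled values carries the factor twice, so divide
// once to get back to the common scale (this truncates toward zero).
Money mul(Money a, Money b) { return {a.units * b.units / kScale}; }
```

With this scale, 1.25 is stored as 12500 and 3.00 as 30000; their sum is 42500 (4.25) and their product is 37500 (3.75).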

CodePudding user response：

Would using a string be a valid solution in your case? You could overload all the mathematical operations and implement them as you would with pen and paper. It is a lot of work and you may encounter non-trivial efficiency problems, but it may be a viable solution. It is not clear to me whether you are having trouble converting the double to your struct, but this is a minimal working example that uses a string:

#include <cmath>
#include <iostream>
#include <iomanip>
#include <string>
#include <sstream>

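int main()
{
    // const so that prec can be used as an array size in standard C++
    const int prec = 20;
    double a = 1.3235335151561;
    int intpart = static_cast<int>(std::floor(a));
    double tmp = a - intpart;
    std::ostringstream os;
    // std::fixed guarantees exactly prec digits after the decimal point
    os << std::fixed << std::setprecision(prec) << tmp;
    std::string str = os.str();
    std::cout << str << std::endl;
    int frac_part[prec];
    for (int i = 0; i < prec; ++i) {
        // skip the leading "0." and convert each ASCII digit to its value
        frac_part[i] = str[i + 2] - '0';
        std::cout << frac_part[i] << " ";
    }
    std::cout << "\n";
}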

Of course, once you have the string, using the array for the fractional part is superfluous in my opinion.
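To give an idea of what "pen and paper" arithmetic on such strings could look like, here is a sketch of adding two fractional parts digit by digit. It assumes both strings contain only the decimal digits after the "0." and that both values are non-negative; the function name `add_fraction_digits` is made up:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// a and b hold only the digits after the decimal point, e.g. "75" for 0.75.
// Writes the digits of the sum to `sum` and returns the carry (0 or 1)
// that has to be added to the integer part.
int add_fraction_digits(std::string a, std::string b, std::string& sum)
{
  assert(a.find_first_not_of("0123456789") == std::string::npos);
  assert(b.find_first_not_of("0123456789") == std::string::npos);

  // Pad the shorter operand with trailing zeros so the columns line up.
  const std::size_t n = a.size() > b.size() ? a.size() : b.size();
  a.resize(n, '0');
  b.resize(n, '0');

  sum.assign(n, '0');
  int carry = 0;
  for (std::size_t i = n; i-- > 0;) {  // right to left, like on paper
    int digit = (a[i] - '0') + (b[i] - '0') + carry;
    sum[i] = static_cast<char>('0' + digit % 10);
    carry = digit / 10;
  }
  return carry;
}
```

For example, adding "75" and "5" (0.75 + 0.5) leaves "25" in sum and returns a carry of 1, i.e. 1.25.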