Subtract extremely small number from one in C-CodePudding

I need to subtract extremely small double number x from 1 i.e. to calculate 1-x in C for 0<x<1e-16. Because of machine precision restrictions for small enoughf x I will always get 1-x=1. Simple solution is to switch from double to some more precise format like long. But because of some restrictions I can't switch to more precise number formats.

What is the most efficient way to get accurate value of 1-x where x is an extremely small double if I can't use more precise formats and I need to store the result of the subtraction as a double? In practice I would like to avoid percentage errors greater then 1% (between double representation of 1-x and its actual value).

P.S. I am using Rcpp to calculate the quantiles of standard normal distribution via qnorm function. This function is symmetric around 0.5 being much more accurate for values close to 0. Therefore instead of qnorm(1-(1e-30)) I would like to calculate -qnorm(1e-30) but to derive 1e-30 from 1-(1e-30) I need to deal with a precision problem. The restriction on double is due to the fact that as I know it is not safe to use more precise numeric formats in Rcpp. Note that my inputs to qnorm could be sought of exogeneous so I can't to derive 1-x from x durning some preliminary calculations.

CodePudding user response：

Simple solution is to switch from double to some more precise format like long [presumably, double]

~~In that case you have no solution. long double is an alias for double on all modern machines.~~ I stand corrected, gcc and icc still support it, only cl has dropped support for it for a long time.

So you have two solutions, and they're not mutually exclusive:

Use an arbitrary precision library instead of the built-in types. They're orders of magnitude slower, but if that's the best your algorithm can work with then that's that.
Use a better algorithm, or at least rearrange your equation variables, to not have this need in the first place. Use distribution and cancellation rules to avoid the problem entirely. Without a more in depth description of your problem we can't help you, but I can tell you with certainty that double is more than enough to allow us to model airplane AI and flight parameters anywhere in the world.

CodePudding user response：

Rather than resorting to an arbitrary precision solution (which, as others have said, would potentially be extremely slow), you could simply create a class that extends the inherent precision of the double type by a factor of (approximately) two. You would then only need to implement the operations that you actually need: in your case, this may only be subtraction (and possibly addition), which are both reasonably easy to achieve. Such code will still be considerably slower than using native types, but likely much faster than libraries that use unnecessary precision.

Such an implementation is available (as open-source) in the QD_Real class, created some time ago by Yozo Hida (a PhD Student, at the time, I believe).

The linked repository contains a lot of code, much of which is likely unnecessary for your use-case. Below, I have shown an extremely trimmed-down version, which allows creation of data with the required precision, shows an implementation of the required operator-() and a test case.

#include <iostream>

class ddreal {
private:
    static inline double Plus2(double a, double b, double& err) {
        double s = a   b;
        double bb = s - a;
        err = (a - (s - bb))   (b - bb);
        return s;
    }
    static inline void Plus3(double& a, double& b, double& c) {
        double t3, t2, t1 = Plus2(a, b, t2);
        a = Plus2(c, t1, t3);
        b = Plus2(t2, t3, c);
    }
public:
    double x[2];
    ddreal() { x[0] = x[1] = 0.0; }
    ddreal(double hi) { x[0] = hi; x[1] = 0.0; }
    ddreal(double hi, double lo) { x[0] = Plus2(hi, lo, x[1]); }
    ddreal& operator -= (ddreal const& b) {
        double t1, t2, s2;
        x[0] = Plus2(x[0], -b.x[0], s2);
        t1 = Plus2(x[1], -b.x[1], t2);
        x[1] = Plus2(s2, t1, t1);
        t1  = t2;
        Plus3(x[0], x[1], t1);
        return *this;
    }
    inline double toDouble() const { return x[0]   x[1]; }
};

inline ddreal operator-(ddreal const& a, ddreal const& b)
{
    ddreal retval = a;
    return retval -= b;
}

int main()
{
    double sdone{ 1.0 };
    double sdwee{ 1.0e-42 };
    double sdval = sdone - sdwee;
    double sdans = sdone - sdval;
    std::cout << sdans << "\n"; // Gives zero, as expected

    ddreal ddone{ 1.0 };
    ddreal ddwee{ 1.0e-42 };
    ddreal ddval = ddone - ddwee; // Can actually hold 1 - 1.0e42 ...
    ddreal ddans = ddone - ddval;
    std::cout << ddans.toDouble() << "\n"; // Gives 1.0e-42

    ddreal ddalt{ 1.0, -1.0e-42 }; // Alternative initialization ...
    ddreal ddsec = ddone - ddalt;
    std::cout << ddsec.toDouble() << "\n"; // Gives 1.0e-42

    return 0;
}

Note that I have deliberately neglected error-checking and other overheads that would be needed for a more general implementation. Also, the code I have shown has been 'tweaked' to work more optimally on x86/x64 CPUs, so you may need to delve into the code at the linked GitHub, if you need support for other platforms. (However, I think the code I have shown will work for any platform that conforms strictly to the IEEE-754 Standard.)

I have tested this implementation, extensively, in code I use to generate and display the Mandelbrot Set (and related fractals) at very deep zoom levels, where use of the raw double type fails completely.

Note that, though you may be tempted to 'optimize' some of the seemingly pointless operations, doing so will break the system. Also, this must be compiled using the /fp:precise (or /fp:strict) flags (with MSVC), or the equivalent(s) for other compilers; using /fp:fast will break the code, completely.