Difference between NA_real_ and NaN

When I apply .Internal(inspect()) to NA_real_ and NaN, I get:

> .Internal(inspect(NA_real_))
@0x000001e79724d0e0 14 REALSXP g0c1 [REF(2)] (len=1, tl=0) nan
> .Internal(inspect(NaN))
@0x000001e797264a88 14 REALSXP g0c1 [REF(2)] (len=1, tl=0) nan

It seems like their only difference is the memory address.

However, when I coerce NA_real_ and NaN to character, I get:

> as.character(c(NaN, NA_real_))
[1] "NaN" NA

I understand why this is the expected result: NaN has no character counterpart, so it is coerced to the string "NaN", whereas NA_real_ is coerced to NA_character_. But if the two values look identical internally, how can R return different results for them?

Thank you in advance for any suggestions!

CodePudding user response：

First off, remember that NA is an R concept that has no equivalent in C. So, by necessity, NA needs to be represented differently in C. The fact that .Internal(inspect()) does not make this distinction doesn’t mean it isn’t made elsewhere. In fact, it so happens that .Internal(inspect()) uses Rprintf to print the value’s internal double floating point representation. And, indeed, R NAs are encoded as a NaN value in a C floating point type, which is why both values print as "nan".

Secondly, you observe that “their only difference is the memory address.” So what? At least conceptually, distinct memory addresses would be fully sufficient to distinguish NA and NaN; nothing more is required.

But as a matter of fact R distinguishes these values by a different route. This is possible because the IEEE 754 double precision floating point format has multiple different representations of NaN, and R reserves a specific one for NAs:

static double R_ValueOfNA(void)
{
    /* The gcc shipping with Fedora 9 gets this wrong without
     * the volatile declaration. Thanks to Marc Schwartz. */
    volatile ieee_double x;
    x.word[hw] = 0x7ff00000;
    x.word[lw] = 1954;
    return x.value;
}

and:

/* is a value known to be a NaN also an R NA? */
int attribute_hidden R_NaN_is_R_NA(double x)
{
    ieee_double y;
    y.value = x;
    return (y.word[lw] == 1954);
}

int R_IsNA(double x)
{
    return isnan(x) && R_NaN_is_R_NA(x);
}

int R_IsNaN(double x)
{
    return isnan(x) && ! R_NaN_is_R_NA(x);
}

(src/main/arithmetic.c)
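
The same distinction can be observed from R itself. The following is only a minimal sketch: low_word() is a hypothetical helper written for this illustration (it is not part of R), and it assumes that writeBin() copies the double's eight bytes unchanged, which should hold on common platforms.

```r
x <- c(NaN, NA_real_)

is.na(x)    # TRUE TRUE  (is.na() also reports NaN as missing)
is.nan(x)   # TRUE FALSE (is.nan() is TRUE only for a "real" NaN)

# Hypothetical helper: extract the low 32-bit word of a double.
# endian = "big" puts the most significant byte first, so bytes 5 to 8
# hold the low word regardless of the machine's native byte order.
low_word <- function(v) {
  b <- writeBin(v, raw(), endian = "big")
  sum(as.integer(b[5:8]) * 256^(3:0))
}

low_word(NA_real_)       # 1954, the payload R reserves for NA
low_word(NaN) == 1954    # FALSE
```

This mirrors the check in R_NaN_is_R_NA above: both values are NaNs as far as the hardware is concerned, and only the low word tells them apart.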

CodePudding user response：

NA is a statistical or data integrity concept: the idea of a "missing value". E.g., if your data comes from people filling in forms, a bad entry or missing entry would be treated as NA.

NaN is a numerical or computational concept: something that is "not a number". E.g., 0/0 is NaN, because the result of this computation is undefined (but note that 1/0 is Inf, or infinity, and similarly -1/0 is -Inf).
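
A short illustration of both concepts in R; the scores vector is made-up example data, not taken from the question.

```r
# NaN arises from undefined arithmetic.
0 / 0            # NaN
1 / 0            # Inf
-1 / 0           # -Inf
is.nan(0 / 0)    # TRUE

# NA marks a missing observation, e.g. a form field left blank.
scores <- c(1, NA, 3)
mean(scores)                 # NA, because one value is unknown
mean(scores, na.rm = TRUE)   # 2, computed from the values that are present
```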

The way that R handles these concepts internally isn't something that you should ever be concerned about.