As we learn from this answer, there's a substantial performance increase when using anyNA() over any(is.na()) to detect whether a vector has at least one NA element. This makes sense, as the algorithm of anyNA() stops after the first NA value it finds, whereas any(is.na()) has to first run is.na() over the entire vector.
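The gap is easy to see with a quick sketch (assuming the microbenchmark package is installed), putting an NA at the very front of a long vector so the short-circuiting path exits immediately:

```r
# Illustrative sketch: anyNA() can stop at the first NA it sees,
# while any(is.na(x)) must build the full logical vector first.
x <- rep(0, 1e6)
x[1] <- NA  # NA right at the start: best case for short-circuiting

microbenchmark::microbenchmark(
  anyNA(x),
  any(is.na(x))
)
```

With the NA at the front, anyNA() should come out far ahead; moving the NA to the end of the vector narrows the gap, since both approaches then have to touch every element.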
By contrast, I want to know whether a vector has at least one non-NA value. This means I'm looking for an implementation that stops after the first non-NA value it encounters. Yes, I can use any(!is.na()), but then is.na() still has to run over the entire vector first.
Is there a performant opposite equivalent to anyNA(), i.e., "anyNonNA()"?
CodePudding user response:
I'm not aware of a native function that stops if it comes across a non-NA value, but we can write a simple one using Rcpp:
Rcpp::cppFunction("bool any_NonNA(NumericVector v) {
  for (R_xlen_t i = 0; i < v.length(); ++i) {
    if (!Rcpp::traits::is_na<REALSXP>(v[i])) return true;
  }
  return false;
}")
This creates an R function called any_NonNA
which does what we need. Let's test it on a large vector of 100,000 NA values:
test <- rep(NA, 1e5)
any_NonNA(test)
#> [1] FALSE
any(!is.na(test))
#> [1] FALSE
Now let's make the first element a non-NA:
test[1] <- 1
any_NonNA(test)
#> [1] TRUE
any(!is.na(test))
#> [1] TRUE
So it gives the correct result, but is it faster?
In this example it should stop after the first element, so it ought to be much quicker. A head-to-head comparison confirms this:
microbenchmark::microbenchmark(
baseR = any(!is.na(test)),
Rcpp = any_NonNA(test)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> baseR 275.1 525.0 670.948 533.05 568.7 13029.9 100 b
#> Rcpp 1.6 2.1 4.319 3.30 5.1 33.7 100 a
As expected, this is a couple of orders of magnitude faster. What about if our first non-NA value is mid-way through the vector?
test[1] <- NA
test[50000] <- 1
microbenchmark::microbenchmark(
baseR = any(!is.na(test)),
Rcpp = any_NonNA(test)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> baseR 332.1 579.35 810.948 597.95 624.40 12010.4 100 b
#> Rcpp 299.4 300.70 311.516 305.10 309.25 370.1 100 a
Still faster, but not by much.
If we put our non-NA value at the end we shouldn't see much difference:
test[50000] <- NA
test[100000] <- 1
microbenchmark::microbenchmark(
baseR = any(!is.na(test)),
Rcpp = any_NonNA(test)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> baseR 395.6 631.65 827.173 642.6 663.8 11357.0 100 a
#> Rcpp 596.3 602.25 608.011 605.8 612.6 632.6 100 a
So this does indeed look to be faster than the base R solution (at least for large vectors).
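For completeness, if compiling C++ is not an option, one way to approximate the short-circuit behaviour in pure R is to scan the vector in fixed-size chunks, so that is.na() only ever runs over a small slice at a time. A rough sketch (the chunk size here is arbitrary, and the subsetting adds copy overhead, so this won't match the Rcpp version):

```r
# Pure-R sketch: early exit by scanning in chunks, so is.na() never
# covers the whole vector when a non-NA value appears early.
anyNonNA_chunked <- function(x, chunk = 10000L) {
  n <- length(x)
  i <- 1L
  while (i <= n) {
    j <- min(i + chunk - 1L, n)
    if (any(!is.na(x[i:j]))) return(TRUE)
    i <- j + 1L
  }
  FALSE
}
```

In the worst case (all NA) this still visits every element, but when a non-NA value appears early it only pays for the chunks up to that point.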
CodePudding user response:
anyNA() seems to be a contribution by Google. I think checking whether there are any NAs is far more common than the opposite, which justifies the existence of that "special" function.
Here is my attempt, for numeric vectors only:
anyNonNA <- Rcpp::cppFunction(
  'bool anyNonNA(NumericVector x) {
     for (double i : x) if (!Rcpp::NumericVector::is_na(i)) return true;
     return false;
   }'
)
var <- rep(NA_real_, 1e7)
any(!is.na(var)) #FALSE
anyNonNA(var) #FALSE
var[5e6] <- 0
any(!is.na(var)) #TRUE
anyNonNA(var) #TRUE
microbenchmark::microbenchmark(any(!is.na(var)))
#Unit: milliseconds
# expr min lq mean median uq max neval
# any(!is.na(var)) 41.1922 46.6087 55.57655 59.1408 61.87265 74.4424 100
microbenchmark::microbenchmark(anyNonNA(var))
#Unit: milliseconds
# expr min lq mean median uq max neval
# anyNonNA(var) 10.6333 10.71325 11.05704 10.8553 11.2082 14.871 100
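Both answers compile a numeric-only helper. If you also need this for other vector types, one possible extension is to dispatch on the R type at the C++ level. A hedged sketch (numeric, integer, and character shown; other types could be added the same way, and this has not been exercised beyond these cases):

```r
# Sketch of a type-generic variant: accept any SEXP and branch on
# TYPEOF(), still returning as soon as a non-NA element is found.
anyNonNA_generic <- Rcpp::cppFunction('
bool anyNonNA_generic(SEXP x) {
  switch (TYPEOF(x)) {
  case REALSXP: {
    Rcpp::NumericVector v(x);
    for (double d : v) if (!Rcpp::NumericVector::is_na(d)) return true;
    return false;
  }
  case INTSXP: {
    Rcpp::IntegerVector v(x);
    for (int i : v) if (!Rcpp::IntegerVector::is_na(i)) return true;
    return false;
  }
  case STRSXP: {
    R_xlen_t n = XLENGTH(x);
    for (R_xlen_t i = 0; i < n; ++i)
      if (STRING_ELT(x, i) != NA_STRING) return true;
    return false;
  }
  default: Rcpp::stop("unsupported vector type");
  }
}')
```

The per-type loops mirror the numeric versions above; the character branch uses the C API (STRING_ELT and NA_STRING) directly, since NA for character vectors is a sentinel string rather than a NaN payload.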