As we learn from this answer, there's a substantial performance increase when using anyNA() over any(is.na()) to detect whether a vector has at least one NA element. This makes sense, as the algorithm of anyNA() stops after the first NA value it finds, whereas any(is.na()) has to first run is.na() over the entire vector.
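The gap is easy to see with a quick sketch (assuming the microbenchmark package is installed), putting an NA at the very front of a long vector so the short-circuiting path exits immediately:

```r
# Illustrative sketch: anyNA() can stop at the first NA it sees,
# while any(is.na(x)) must build the full logical vector first.
x <- rep(0, 1e6)
x[1] <- NA  # NA right at the start: best case for short-circuiting

microbenchmark::microbenchmark(
  anyNA(x),
  any(is.na(x))
)
```

With the NA at the front, anyNA() should come out far ahead; moving the NA to the end of the vector narrows the gap, since both approaches then have to touch every element.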
By contrast, I want to know whether a vector has at least one non-NA value. This means I'm looking for an implementation that stops after the first non-NA value it encounters. Yes, I can use any(!is.na()), but then is.na() still has to run over the entire vector first.
Is there a performant opposite equivalent to anyNA(), i.e., "anyNonNA()"?
CodePudding user response:
I'm not aware of a native function that stops if it comes across a non-NA value, but we can write a simple one using Rcpp:
Rcpp::cppFunction("bool any_NonNA(NumericVector v) {
  for (R_xlen_t i = 0; i < v.length(); ++i) {
    if (!Rcpp::traits::is_na<REALSXP>(v[i])) return true;
  }
  return false;
}")
This creates an R function called any_NonNA
which does what we need. Let's test it on a large vector of 100,000 NA values:
test <- rep(NA, 1e5)
any_NonNA(test)
#> [1] FALSE
any(!is.na(test))
#> [1] FALSE
Now let's make the first element a non-NA:
test[1] <- 1
any_NonNA(test)
#> [1] TRUE
any(!is.na(test))
#> [1] TRUE
So it gives the correct result, but is it faster?
In this example it should stop after the first element, so it ought to be much quicker. A head-to-head comparison confirms this:
microbenchmark::microbenchmark(
baseR = any(!is.na(test)),
Rcpp = any_NonNA(test)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> baseR 275.1 525.0 670.948 533.05 568.7 13029.9 100 b
#> Rcpp 1.6 2.1 4.319 3.30 5.1 33.7 100 a
As expected, this is a couple of orders of magnitude faster. What about if our first non-NA value is mid-way through the vector?
test[1] <- NA
test[50000] <- 1
microbenchmark::microbenchmark(
baseR = any(!is.na(test)),
Rcpp = any_NonNA(test)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> baseR 332.1 579.35 810.948 597.95 624.40 12010.4 100 b
#> Rcpp 299.4 300.70 311.516 305.10 309.25 370.1 100 a
Still faster, but not by much.
If we put our non-NA value at the end we shouldn't see much difference:
test[50000] <- NA
test[100000] <- 1
microbenchmark::microbenchmark(
baseR = any(!is.na(test)),
Rcpp = any_NonNA(test)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> baseR 395.6 631.65 827.173 642.6 663.8 11357.0 100 a
#> Rcpp 596.3 602.25 608.011 605.8 612.6 632.6 100 a
So this does indeed look to be faster than the base R solution (at least for large vectors).
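For completeness, if compiling C++ is not an option, one way to approximate the short-circuit behaviour in pure R is to scan the vector in fixed-size chunks, so that is.na() only ever runs over a small slice at a time. A rough sketch (the chunk size here is arbitrary, and the subsetting adds copy overhead, so this won't match the Rcpp version):

```r
# Pure-R sketch: early exit by scanning in chunks, so is.na() never
# covers the whole vector when a non-NA value appears early.
anyNonNA_chunked <- function(x, chunk = 10000L) {
  n <- length(x)
  i <- 1L
  while (i <= n) {
    j <- min(i + chunk - 1L, n)
    if (any(!is.na(x[i:j]))) return(TRUE)
    i <- j + 1L
  }
  FALSE
}
```

In the worst case (all NA) this still visits every element, but when a non-NA value appears early it only pays for the chunks up to that point.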
CodePudding user response:
anyNA() seems to be a contribution by Google. I think checking whether there are any NAs is far more common than the opposite, which justifies the existence of that "special" function.
Here is my attempt, for numeric vectors only:
anyNonNA <- Rcpp::cppFunction(
  'bool anyNonNA(NumericVector x) {
     for (double i : x) if (!Rcpp::NumericVector::is_na(i)) return true;
     return false;
   }'
)
var <- rep(NA_real_, 1e7)
any(!is.na(var)) #FALSE
anyNonNA(var) #FALSE
var[5e6] <- 0
any(!is.na(var)) #TRUE
anyNonNA(var) #TRUE
microbenchmark::microbenchmark(any(!is.na(var)))
#Unit: milliseconds
# expr min lq mean median uq max neval
# any(!is.na(var)) 41.1922 46.6087 55.57655 59.1408 61.87265 74.4424 100
microbenchmark::microbenchmark(anyNonNA(var))
#Unit: milliseconds
# expr min lq mean median uq max neval
# anyNonNA(var) 10.6333 10.71325 11.05704 10.8553 11.2082 14.871 100
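Both answers compile a numeric-only helper. If you also need this for other vector types, one possible extension is to dispatch on the R type at the C++ level. A hedged sketch (numeric, integer, and character shown; other types could be added the same way, and this has not been exercised beyond these cases):

```r
# Sketch of a type-generic variant: accept any SEXP and branch on
# TYPEOF(), still returning as soon as a non-NA element is found.
anyNonNA_generic <- Rcpp::cppFunction('
bool anyNonNA_generic(SEXP x) {
  switch (TYPEOF(x)) {
  case REALSXP: {
    Rcpp::NumericVector v(x);
    for (double d : v) if (!Rcpp::NumericVector::is_na(d)) return true;
    return false;
  }
  case INTSXP: {
    Rcpp::IntegerVector v(x);
    for (int i : v) if (!Rcpp::IntegerVector::is_na(i)) return true;
    return false;
  }
  case STRSXP: {
    R_xlen_t n = XLENGTH(x);
    for (R_xlen_t i = 0; i < n; ++i)
      if (STRING_ELT(x, i) != NA_STRING) return true;
    return false;
  }
  default: Rcpp::stop("unsupported vector type");
  }
}')
```

The per-type loops mirror the numeric versions above; the character branch uses the C API (STRING_ELT and NA_STRING) directly, since NA for character vectors is a sentinel string rather than a NaN payload.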