72

What is the fastest way to detect if a vector has at least 1 NA in R? I've been using:

sum( is.na( data ) ) > 0

But that requires examining each element, coercion, and the sum function.
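To make those three steps concrete, here is a small sketch on a toy vector (the values are made up for illustration): `is.na()` builds a full logical vector, `sum()` coerces it to integers and adds them up, and only then is the comparison made.

```r
data <- c(4, NA, 7, NA)

is.na(data)            # FALSE  TRUE FALSE  TRUE  (one logical per element)
sum(is.na(data))       # 2  (TRUE/FALSE coerced to 1/0 and summed)
sum(is.na(data)) > 0   # TRUE
```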

A5C1D2H2I1M1N2O1R2T1
SFun28

6 Answers

71

As of R 3.1.0 anyNA() is the way to do this. On atomic vectors this will stop after the first NA instead of going through the entire vector as would be the case with any(is.na()). Additionally, this avoids creating an intermediate logical vector with is.na that is immediately discarded. Borrowing Joran's example:

x <- y <- runif(1e7)
x[1e4] <- NA
y[1e7] <- NA
microbenchmark::microbenchmark(any(is.na(x)), anyNA(x), any(is.na(y)), anyNA(y), times=10)
# Unit: microseconds
#           expr        min         lq        mean      median         uq
#  any(is.na(x))  13444.674  13509.454  21191.9025  13639.3065  13917.592
#       anyNA(x)      6.840     13.187     13.5283     14.1705     14.774
#  any(is.na(y)) 165030.942 168258.159 178954.6499 169966.1440 197591.168
#       anyNA(y)   7193.784   7285.107   7694.1785   7497.9265   7865.064

Notice how it is substantially faster even when we modify the last value of the vector; this is in part because of the avoidance of the intermediate logical vector.
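As a rough sketch of how `anyNA()` behaves on small inputs (the variable names `v` and `l` are only illustrative and not from the benchmark above), including the `recursive` argument that matters for lists:

```r
v <- c(1, 2, NA, 4)
anyNA(v)                             # TRUE
identical(anyNA(v), any(is.na(v)))   # TRUE: same answer as the older idiom

anyNA(c(1, 2, 3))                    # FALSE
anyNA(c(1, NaN))                     # TRUE: NaN counts as missing, as with is.na()

# For lists, the default only looks at the top level, like is.na() does
l <- list(1, c(2, NA))
anyNA(l)                             # FALSE
anyNA(l, recursive = TRUE)           # TRUE
```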

BrodieG
70

I'm thinking:

any(is.na(data))

should be slightly faster.

Sacha Epskamp
  • 1
    although it still requires iterating through each element. wondering if there's a first() function or something like that that stops once a condition is met – SFun28 Jul 01 '11 at 18:46
  • Not sure, I wouldn't be surprised if `any()` stops after it finds a TRUE. Anyway, the point where `any(...)` becomes too slow to handle is probably past the point where your RAM runs out. – Sacha Epskamp Jul 01 '11 at 18:56
  • 1
    There is also the `all()` function that works as expected btw. Might be useful (not for this problem but in general). – Sacha Epskamp Jul 01 '11 at 19:00
  • 7
    `any` and `all` do stop iterating when they reach a `TRUE` or a `FALSE` respectively; see `checkValues` in http://svn.r-project.org/R/trunk/src/main/logic.c ; the `is.na` still coerces everything though. – Aaron left Stack Overflow Jul 01 '11 at 19:35
  • 2
    Aaron, the remaining cost is the `is.na(data)` which gets computed first, and for *all* elements of data irrespective of whether an early one is in fact NA. We do improve on that with the Rcpp sugar version of `is.na()` (which is implemented in C++ for use via Rcpp). See my answer for more. – Dirk Eddelbuettel Jul 01 '11 at 20:51
17

We mention this in some of our Rcpp presentations and actually have some benchmarks which show a pretty large gain from embedded C++ with Rcpp over the R solution because

  • a vectorised R solution still computes every single element of the vector expression

  • if your goal is to just satisfy any(), then you can abort after the first match -- which is what our Rcpp sugar (in essence: some C++ template magic to make C++ expressions look more like R expressions, see this vignette for more) solution does.

So by getting a compiled specialised solution to work, we do indeed get a fast solution. I should add that while I have not compared this to the solutions offered in this SO question here, I am reasonably confident about the performance.

Edit: The Rcpp package contains examples in the directory sugarPerformance. They show a gain by a factor of several thousand for 'sugar-can-abort-soon' over 'R-computes-full-vector-expression' for any(), but I should add that that case does not involve is.na() but a simple boolean expression.
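The early-exit idea can also be approximated without compiled code. The following is only a sketch in plain R, not the Rcpp sugar implementation: `any_na_chunked` and its `chunk_size` parameter are hypothetical names, and the function simply checks the vector block by block so that it can return as soon as one block contains an NA.

```r
# Hypothetical helper (not part of base R or Rcpp): scan x in blocks
# and return as soon as a block contains an NA.
any_na_chunked <- function(x, chunk_size = 1e5) {
  n <- length(x)
  if (n == 0L) return(FALSE)
  for (s in seq(1, n, by = chunk_size)) {
    e <- min(s + chunk_size - 1, n)
    # is.na() is only evaluated on the current block
    if (any(is.na(x[s:e]))) return(TRUE)
  }
  FALSE
}

x <- runif(1e6)
any_na_chunked(x)   # FALSE: every block is inspected
x[10] <- NA
any_na_chunked(x)   # TRUE: returns after the first block
```

When the vector has no NA at all, this does strictly more work than `any(is.na(x))`, so it only pays off when an early NA is likely.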

Dirk Eddelbuettel
  • Is there a reason why R's `any` computes every single element, rather than stopping at the first instance? – joran Jul 01 '11 at 21:30
  • 3
    R's `any` doesn't know what's inside it; it just evaluates whatever its argument is (all of it) and then applies `any` to it, which does stop at the first `TRUE`, but again, only after evaluating all of its argument. Dirk's Rcpp sugar version of `any` (as I understand it) does know how to evaluate what's inside of it term by term (at least for some expressions, anyway) so it can check each term for TRUE/FALSE as it's evaluated in turn. – Aaron left Stack Overflow Jul 02 '11 at 02:06
  • @Dirk - very cool. It seems that the most efficient way to do this is with embedded C++...or is that cheating since it's not a pure R answer =) thanks for the link to Rcpp! – SFun28 Jul 02 '11 at 04:01
8

One could write a for loop stopping at NA, but the system.time then depends on where the NA is... (if there is none, it takes looooong)

set.seed(1234)
x <- sample(c(1:5, NA), 100000000, replace = TRUE)

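nacount <- function(x) {
  # Walk the vector and stop at the first NA;
  # seq_along() also handles a zero-length x safely
  for (i in seq_along(x)) {
    if (is.na(x[i])) {
      print(TRUE)
      break
    }
  }
}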

system.time(
  nacount(x)
)
[1] TRUE
       user      system     elapsed
       0.14        0.04        0.18 

system.time(
  any(is.na(x))
) 
       user      system     elapsed
       0.28        0.08        0.37 

system.time(
  sum(is.na(x)) > 0
)
       user      system     elapsed
       0.45        0.07        0.53 
EDi
  • 1
    great idea to benchmark against the nacount function! you are right that timing depends on where the first NA is (if any). I repeated your experiment except I placed a single NA at the end of the long vector. here are the results: nacount(x) = 86.14 , any(is.na(x)) = .4, sum(is.na(x)) > 0 = 1.64. nacount (as expected) is horrible in this case. what's more interesting is how much better any(...) is than sum(...)>0 – SFun28 Jul 02 '11 at 03:55
6

Here are some actual times from my (slow) machine for some of the various methods discussed so far:

x <- runif(1e7)
x[1e4] <- NA

> system.time(sum(is.na(x)) > 0)
   user  system elapsed 
  0.065   0.001   0.065 

> system.time(any(is.na(x)))
   user  system elapsed 
  0.035   0.000   0.034

> system.time(match(NA,x))
  user  system elapsed 
 1.824   0.112   1.918

> system.time(NA %in% x)
  user  system elapsed 
 1.828   0.115   1.925 

> system.time(which(is.na(x) == TRUE))
  user  system elapsed 
 0.099   0.029   0.127

It's not surprising that match and %in% are similar, since %in% is implemented using match.
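As a quick illustration of that relationship (a sketch on a toy vector, not the benchmark data): in base R, `%in%` is essentially `match(x, table, nomatch = 0L) > 0L`, so the two calls do the same lookup and differ only in what they return.

```r
x <- c(3.2, NA, 1.5)
match(NA, x)                      # 2: position of the first NA
NA %in% x                         # TRUE
match(NA, x, nomatch = 0L) > 0L   # TRUE: what %in% computes

y <- c(3.2, 1.5)
match(NA, y)                      # NA: no match found
NA %in% y                         # FALSE
```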

joran
  • thanks for pulling this together. I think it shows that any(...) is a fantastic, pure R solution. – SFun28 Jul 02 '11 at 04:02
3

You can try:

d <- c(1,2,3,NA,5,3)

which(is.na(d))  # positions of the NAs; `== TRUE` and arr.ind are redundant for a plain vector
Manuel Ramón