204

I have a huge vector which has a couple of NA values, and I'm trying to find the max value in that vector (the vector is all numbers), but I can't do this because of the NA values.

How can I remove the NA values so that I can compute the max?

Mus
CodeGuy

7 Answers

282

If you look at ?max, you'll see that it has a na.rm argument, set by default to FALSE. (That's the common default for many other R functions, including sum(), mean(), etc.)

Setting na.rm=TRUE does just what you're asking for:

d <- c(1, 100, NA, 10)
max(d, na.rm=TRUE)
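
As a quick illustration of that shared default, the same argument works for sum() and mean() on the vector d above:

sum(d)                  # NA, because one element is missing
sum(d, na.rm = TRUE)    # 111
mean(d, na.rm = TRUE)   # 37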

If you do want to remove all of the NAs, use this idiom instead:

d <- d[!is.na(d)]

A final note: Other functions (e.g. table(), lm(), and sort()) have NA-related arguments that use different names (and offer different options). So if NAs cause you problems in a function call, it's worth checking for a built-in solution among the function's arguments. I've found there's usually one already there.
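
As a quick illustration of those differently named arguments (useNA for table(), na.last for sort(), and na.action for lm()):

x <- c(1, 2, 2, NA)
table(x, useNA = "ifany")                   # table() uses 'useNA' (and 'exclude')
sort(x, na.last = TRUE)                     # sort() uses 'na.last'; NAs are dropped by default
df <- data.frame(y = c(1, 4, 6, 8), x = x)
lm(y ~ x, data = df, na.action = na.omit)   # lm() uses 'na.action'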

Josh O'Brien
  • This is a very bad idea. It fails and gives `-Inf` for a `d` of all NAs. – user3932000 Aug 01 '19 at 23:27
  • @user3932000 Just to be clear for others, your complaint is really about how the base R function `max()` behaves (as, for instance, when doing `max(c(NA, NA)`). Personally, I think its behavior is reasonable; I expect it was constructed that way so that you get the expected result when doing things like `a – Josh O'Brien Aug 02 '19 at 20:23
  • @user3932000 Somewhat tangentially, one of R's many strengths as a data analysis platform is its sophisticated handling of missing data, the result of **much** careful thought on the part of its authors. (If you're interested in the subject, [see here](https://docs.scipy.org/doc/numpy-1.10.0/neps/missing-data.html) for a good discussion of some of the issues involved, from the point of view of programmers who were engaged in incorporating R-like `NA`-handling facilities in Python's excellent **NumPy** package.) – Josh O'Brien Aug 02 '19 at 20:24
  • @user3932000: is that answer really bad? What would you consider the maximum of the null set? – Cliff AB Jan 29 '20 at 21:20
  • @CliffAB It doesn't have a maximum. You can assign the max to be -∞ (and min to be +∞), but that's not always desired or intuitive. Also, when you remove all `NA`s from a vector of `NA`s, you would expect an empty vector, not -∞. – user3932000 Jan 29 '20 at 22:57
  • @user3932000: I suppose one might think they want whatever from `max(NULL)`, but `-Inf` is a very mathematically consistent answer to me – Cliff AB Jan 30 '20 at 00:19
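
To see the behaviour discussed in these comments for yourself, here is a quick illustration of what max() returns when no non-missing values are left:

max(numeric(0))               # -Inf, with a warning
max(c(NA, NA), na.rm = TRUE)  # also -Inf, with a warning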
98

The na.omit function is what a lot of the regression routines use internally:

vec <- 1:1000
vec[runif(200, 1, 1000)] <- NA   # set ~200 random positions to NA (fractional indices are truncated to integers)
max(vec)
#[1] NA
max( na.omit(vec) )
#[1] 1000
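
One detail worth knowing: na.omit() records the positions it dropped in an "na.action" attribute, which you can strip with as.vector() if you want a plain vector. Continuing from vec above:

cleaned <- na.omit(vec)
attr(cleaned, "na.action")   # indices of the removed NAs (class "omit")
as.vector(cleaned)           # plain vector without the attribute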
IRTFM
22

Use discard from purrr (works with lists and vectors).

library(purrr)
discard(v, is.na)

The benefit is that it works well in pipes; alternatively, use the built-in subsetting function `[`:

v %>% discard(is.na)     # the %>% pipe comes from magrittr
v %>% `[`(!is.na(.))

Note that na.omit does not remove NA entries from a list:

> x <- list(a=1, b=2, c=NA)
> na.omit(x)
$a
[1] 1

$b
[1] 2

$c
[1] NA
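
If you'd rather stay in base R, Filter() gives a similar result for a list of length-one elements like x above (a minimal sketch; it assumes every element has length one, since is.na() is applied element-wise):

> Filter(Negate(is.na), x)
$a
[1] 1

$b
[1] 2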
qwr
22

?max shows you that there is an extra parameter na.rm that you can set to TRUE.

Apart from that, if you really want to remove the NAs, just use something like:

myvec[!is.na(myvec)]
Nick Sabbe
16

Just in case someone new to R wants a simplified answer to the original question

How can I remove NA values from a vector?

Here it is:

Assume you have a vector foo as follows:

foo = c(1:10, NA, 20:30)

Running length(foo) gives 22.

nona_foo = foo[!is.na(foo)]

length(nona_foo) is 21, because the single NA value has been removed.

Remember that is.na(foo) returns a logical vector the same length as foo, so indexing foo with the negation of that vector gives you all the elements that are not NA.
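
To make the indexing step concrete, using the same foo:

which(is.na(foo))   # 11 -- the position of the single NA
foo[!is.na(foo)]    # every element except that one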

Scott C Wilson
16

You can call max(vector, na.rm = TRUE). More generally, you can use the na.omit() function.
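
A minimal sketch of both suggestions:

v <- c(3, NA, 7, NA, 1)
max(v, na.rm = TRUE)   # 7
max(na.omit(v))        # 7, after dropping the NAs first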

Michael Hoffman
3

I ran a quick benchmark comparing the two base approaches, and it turns out that x[!is.na(x)] is faster than na.omit. User qwr suggested I try purrr::discard as well; this turned out to be massively slower (though I'll happily take comments on my implementation and test!)

microbenchmark::microbenchmark(
  purrr::map(airquality,function(x) {x[!is.na(x)]}), 
  purrr::map(airquality,na.omit),
  purrr::map(airquality, ~purrr::discard(.x, .p = is.na)),
  times = 1e6)

Unit: microseconds
                                                     expr    min     lq      mean median      uq       max neval cld
 purrr::map(airquality, function(x) {     x[!is.na(x)] })   66.8   75.9  130.5643   86.2  131.80  541125.5 1e+06 a  
                          purrr::map(airquality, na.omit)   95.7  107.4  185.5108  129.3  190.50  534795.5 1e+06  b 
  purrr::map(airquality, ~purrr::discard(.x, .p = is.na)) 3391.7 3648.6 5615.8965 4079.7 6486.45 1121975.4 1e+06   c

For reference, here's the original test of x[!is.na(x)] vs na.omit:

microbenchmark::microbenchmark(
    purrr::map(airquality,function(x) {x[!is.na(x)]}), 
    purrr::map(airquality,na.omit), 
    times = 1000000)


Unit: microseconds
                                              expr  min   lq      mean median    uq      max neval cld
 map(airquality, function(x) {     x[!is.na(x)] }) 53.0 56.6  86.48231   58.1  64.8 414195.2 1e+06  a 
                          map(airquality, na.omit) 85.3 90.4 134.49964   92.5 104.9 348352.8 1e+06   b
jsavn