Remove rows with all or some NAs (missing values) in data.frame

Question

I'd like to remove the lines in this data frame that:

a) contain NAs across all columns. Below is my example data frame.

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   NA
2 ENSG00000199674    0   2    2    2    2
3 ENSG00000221622    0   NA   NA   NA   NA
4 ENSG00000207604    0   NA   NA   1    2
5 ENSG00000207431    0   NA   NA   NA   NA
6 ENSG00000221312    0   1    2    3    2

Basically, I'd like to get a data frame such as the following.

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
6 ENSG00000221312    0   1    2    3    2

b) contain NAs in only some columns, so I can also get this result:

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
4 ENSG00000207604    0   NA   NA   1    2
6 ENSG00000221312    0   1    2    3    2

score 1162 · Accepted Answer · edited Jun 14 '17 at 15:10

Also check complete.cases :

> final[complete.cases(final), ]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

na.omit is nicer for just removing all NA's. complete.cases allows partial selection by including only certain columns of the dataframe:

> final[complete.cases(final[ , 5:6]),]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

Your solution can't work. If you insist on using is.na, then you have to do something like:

> final[rowSums(is.na(final[ , 5:6])) == 0, ]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

but using complete.cases is considerably clearer, and faster.

What is the significance of the trailing comma in `final[complete.cases(final),]`? — hertzsprung, Oct 01 '12 at 11:39
`complete.cases(final)` returns a logical vector marking the rows that contain no `NA`, e.g. `(TRUE, FALSE, TRUE)`. The trailing comma means "all columns": the expression before the comma filters the rows, and leaving the part after the comma empty applies no column filtering, so every column is returned. – Kay, Mar 17 '21 at 17:52
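To make the row/column indexing above concrete, here is a minimal sketch. It assumes the question's data frame is rebuilt by hand as `final`; the variable names `keep`, `all_cols`, `one_col` and `no_comma` are illustrative only.

```r
# Rebuild the example data frame from the question (values typed in by hand)
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

keep <- complete.cases(final)    # one TRUE/FALSE per row
all_cols <- final[keep, ]        # filter rows, keep every column
one_col  <- final[keep, "gene"]  # filter rows, keep only the gene column
no_comma <- final[keep]          # no comma: the vector selects COLUMNS instead
```

Without the comma, a data frame is indexed like a list, so the logical vector picks columns rather than rows. That is usually not what you want here.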

score 287 · Answer 2 · answered Feb 01 '11 at 12:00

Try na.omit(your.data.frame). As for the second question, try posting it as another question (for clarity).
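As a small sketch of what this does, assuming the question's data is rebuilt by hand as `final` (the name `cleaned` is illustrative): `na.omit()` drops every row containing at least one NA and records which rows were dropped in the `na.action` attribute of the result.

```r
# Rebuild the example data frame from the question (values typed in by hand)
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

cleaned <- na.omit(final)                         # keep only rows with no NA at all
dropped <- as.integer(attr(cleaned, "na.action")) # positions of the removed rows
```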

answered Feb 01 '11 at 12:00

Roman Luštrik


score 157 · Answer 3 · edited Mar 07 '19 at 12:25

tidyr has a new function drop_na:

library(tidyr)
df %>% drop_na()
#              gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674    0    2    2    2    2
# 6 ENSG00000221312    0    1    2    3    2
df %>% drop_na(rnor, cfam)
#              gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674    0    2    2    2    2
# 4 ENSG00000207604    0   NA   NA    1    2
# 6 ENSG00000221312    0    1    2    3    2

edited Mar 07 '19 at 12:25

Arthur Yip


answered Aug 16 '16 at 08:49

lukeA


What are the advantages of drop_na() over na.omit()? Faster? – wordsforthewise Oct 11 '20 at 19:41
When I try the command df %>% drop_na(rnor, cfam), I get this error: Error: Can't subset columns that don't exist. x Column `rnor` doesn't exist. Why? – user90 Oct 12 '20 at 09:34
`rnor` is supposed to be a column name in your table – Calum You Mar 03 '21 at 01:40

score 96 · Answer 4 · answered Feb 02 '11 at 21:58

I prefer the following way to check whether rows contain any NAs:

row.has.na <- apply(final, 1, function(x){any(is.na(x))})

This returns a logical vector whose values denote whether a row contains any NA. You can use it to see how many rows you'll have to drop:

sum(row.has.na)

and then drop them:

final.filtered <- final[!row.has.na,]

To filter on NAs in only some of the columns, it becomes a little trickier (for example, you can feed 'final[,5:6]' to 'apply'). Generally, Joris Meys' solution seems to be more elegant.
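The column-subset variant mentioned above could look roughly like this. It is only a sketch: the data frame is rebuilt by hand from the question, and `row.has.na.subset` is an illustrative name. Columns are selected by name here rather than by position `5:6`.

```r
# Rebuild the example data frame from the question (values typed in by hand)
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

# Only look at rnor and cfam when deciding whether a row "has NA"
row.has.na.subset <- apply(final[, c("rnor", "cfam")], 1,
                           function(x) any(is.na(x)))

# Index the FULL data frame with the result, so all columns are returned
final.filtered <- final[!row.has.na.subset, ]
```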

answered Feb 02 '11 at 21:58

donshikin


This is extremely slow. Much slower than e.g. the aforementioned complete.cases() solution. At least, in my case, on xts data. – Dave Jan 17 '19 at 10:53
`rowSums(!is.na(final))` seems better suited than `apply()` – sindri_baldur Feb 08 '19 at 19:24

Pierre L · Answer 5 · 2016-11-18T13:59:11.617

If you want control over how many NAs are valid for each row, try this function. For many survey data sets, too many blank question responses can ruin the results. So they are deleted after a certain threshold. This function will allow you to choose how many NAs the row can have before it's deleted:

delete.na <- function(DF, n=0) {
  DF[rowSums(is.na(DF)) <= n,]
}

By default, it will eliminate all rows that contain any NA:

delete.na(final)
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

Or specify the maximum number of NAs allowed:

delete.na(final, 2)
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

This is the most reliable method to remove rows, when you need at least a number of NAs to remove that row. Helped me a lot! — Gabriel G., Mar 28 '21 at 21:12

score 47 · Answer 6 · answered Nov 05 '13 at 06:30

Another option if you want greater control over how rows are deemed to be invalid is

final <- final[!(is.na(final$rnor)) | !(is.na(final$cfam)),]

Using the above, this:

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   2
2 ENSG00000199674    0   2    2    2    2
3 ENSG00000221622    0   NA   NA   2   NA
4 ENSG00000207604    0   NA   NA   1    2
5 ENSG00000207431    0   NA   NA   NA   NA
6 ENSG00000221312    0   1    2    3    2

Becomes:

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   2
2 ENSG00000199674    0   2    2    2    2
3 ENSG00000221622    0   NA   NA   2   NA
4 ENSG00000207604    0   NA   NA   1    2
6 ENSG00000221312    0   1    2    3    2

...where only row 5 is removed since it is the only row containing NAs for both rnor AND cfam. The boolean logic can then be changed to fit specific requirements.

but how can you use this if you want to check many columns, without typing each one, can you use a range final[,4:100]? — Herman Toothrot, Oct 20 '16 at 10:56
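One way to apply the same "drop the row only if every checked column is NA" logic to a range of columns, without typing each one, is to count NAs with `rowSums()`. This is a sketch on this answer's modified example data, rebuilt by hand; `cols` and `all_missing` are illustrative names, and `3:6` stands in for a wider range such as `4:100`.

```r
# This answer's modified example data (row 1 has cfam = 2, row 3 has rnor = 2)
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, 2, 1, NA, 3),
  cfam = c(2, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

cols <- 3:6  # columns to check (mmul through cfam)

# A row is dropped only if ALL of the checked columns are NA
all_missing <- rowSums(is.na(final[, cols, drop = FALSE])) == length(cols)
result <- final[!all_missing, ]
```

Changing `== length(cols)` to `> 0` would instead drop rows with an NA in any of the checked columns.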

score 42 · Answer 7 · edited Jun 20 '20 at 09:12

If performance is a priority, use `data.table` and `na.omit()` with optional param `cols=`.

na.omit.data.table is the fastest on my benchmark (see below), whether for all columns or for select columns (OP question part 2).

If you don't want to use `data.table`, use `complete.cases()`.

On a vanilla data.frame, complete.cases is faster than na.omit() or tidyr::drop_na(). Notice that na.omit.data.frame does not support cols=.
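For reference, a minimal sketch of the two recommended calls on the question's small example. It assumes the `data.table` package is installed; the data frame is rebuilt by hand, and `dt`, `no_na_anywhere` and `no_na_selected` are illustrative names.

```r
library(data.table)

# Rebuild the example data frame from the question (values typed in by hand)
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)
dt <- as.data.table(final)

# Part 1: drop rows with an NA in any column
no_na_anywhere <- na.omit(dt)

# Part 2: only rnor and cfam are checked for NA (see ?na.omit.data.table)
no_na_selected <- na.omit(dt, cols = c("rnor", "cfam"))
```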

Benchmark result

Here is a comparison of base (blue), dplyr/tidyr (pink), and data.table (yellow) methods for dropping either all or select missing observations, on a notional dataset of 1 million observations of 20 numeric variables with an independent 5% likelihood of being missing, and a subset of 4 variables for part 2.

Your results may vary based on length, width, and sparsity of your particular dataset.

Note log scale on y axis.

Benchmark script

#-------  Adjust these assumptions for your own use case  ------------
row_size   <- 1e6L 
col_size   <- 20    # not including ID column
p_missing  <- 0.05   # likelihood of missing observation (except ID col)
col_subset <- 18:21  # second part of question: filter on select columns

#-------  System info for benchmark  ----------------------------------
R.version # R version 3.4.3 (2017-11-30), platform = x86_64-w64-mingw32
library(data.table); packageVersion('data.table') # 1.10.4.3
library(dplyr);      packageVersion('dplyr')      # 0.7.4
library(tidyr);      packageVersion('tidyr')      # 0.8.0
library(microbenchmark)

#-------  Example dataset using above assumptions  --------------------
fakeData <- function(m, n, p){
  set.seed(123)
  m <-  matrix(runif(m*n), nrow=m, ncol=n)
  m[m<p] <- NA
  return(m)
}
df <- cbind( data.frame(id = paste0('ID',seq(row_size)), 
                        stringsAsFactors = FALSE),
             data.frame(fakeData(row_size, col_size, p_missing) )
             )
dt <- data.table(df)

par(las=3, mfcol=c(1,2), mar=c(22,4,1,1)+0.1)
boxplot(
  microbenchmark(
    df[complete.cases(df), ],
    na.omit(df),
    df %>% drop_na,
    dt[complete.cases(dt), ],
    na.omit(dt)
  ), xlab='', 
  main = 'Performance: Drop any NA observation',
  col=c(rep('lightblue',2),'salmon',rep('beige',2))
)
boxplot(
  microbenchmark(
    df[complete.cases(df[,col_subset]), ],
    #na.omit(df), # col subset not supported in na.omit.data.frame
    df %>% drop_na(col_subset),
    dt[complete.cases(dt[,col_subset,with=FALSE]), ],
    na.omit(dt, cols=col_subset) # see ?na.omit.data.table
  ), xlab='', 
  main = 'Performance: Drop NA obs. in select cols',
  col=c('lightblue','salmon',rep('beige',2))
)

score 23 · Answer 8 · answered Apr 12 '17 at 05:44

Using the dplyr package, we can filter out NAs as follows:

dplyr::filter(df,  !is.na(columnname))

answered Apr 12 '17 at 05:44

Raminsu


This performs about 10.000 times slower than `drop_na()` – Zimano Feb 21 '20 at 15:49

@Zimano Maybe true, but for multiple variables `drop_na` uses "any" logic and `filter` uses "all" logic. So if you need more flexibility in the expression, filter has more possibilities. – jiggunjer Jul 26 '20 at 09:33
@jiggunjer That's absolutely true! It really depends on what you're trying to achieve :) – Zimano Jul 30 '20 at 12:10

Leo · Answer 9 · 2014-09-19T14:39:08.327

This will return the rows that have at least ONE non-NA value.

final[rowSums(is.na(final))<length(final),]

This will return the rows that have at least TWO non-NA values.

final[rowSums(is.na(final))<(length(final)-1),]

edited Sep 19 '14 at 14:39

answered Sep 19 '14 at 12:36

Leo


score 17 · Answer 10 · edited Mar 07 '18 at 14:57

For your first question, I have some code that I am comfortable with for getting rid of all NAs. Thanks to @Gregor for making it simpler.

final[!(rowSums(is.na(final))),]

For the second question, the code is just an alteration of the previous solution.

final[as.logical((rowSums(is.na(final))-5)),]

Note that the 5 is the number of data columns (not counting gene). This eliminates the rows in which all of them are NA, since their rowSums add up to 5 and become zero after the subtraction. This time, as.logical is necessary.

edited Mar 07 '18 at 14:57

C8H10N4O2


answered Feb 09 '16 at 17:52

LegitMe


final[as.logical((rowSums(is.na(final))-ncol(final))),] for a universal answer – Ferroao Feb 22 '17 at 14:02

score 14 · Answer 11 · edited Nov 11 '14 at 22:20

We can also use the subset function for this.

finalData<-subset(data,!(is.na(data["mmul"]) | is.na(data["rnor"])))

This keeps only the rows in which neither mmul nor rnor is NA.

edited Nov 11 '14 at 22:20

Peter Pei Guo


answered Nov 11 '14 at 22:15

Ramya Ural


bschneidr · Answer 12 · 2020-07-24T03:54:45.310

One approach that's both general and yields fairly readable code is to use the filter() function and the across() helper function from the {dplyr} package.

library(dplyr)

vars_to_check <- c("rnor", "cfam")

# Filter a specific list of columns to keep only non-missing entries

df %>% 
  filter(across(one_of(vars_to_check),
                ~ !is.na(.x)))

# Filter all the columns to exclude NA
df %>% 
  filter(across(everything(),
                ~ !is.na(.)))

# Filter only numeric columns
df %>%
  filter(across(where(is.numeric),
                ~ !is.na(.)))

Similarly, there are also the variant functions in the dplyr package (filter_all, filter_at, filter_if) which accomplish the same thing:

library(dplyr)

vars_to_check <- c("rnor", "cfam")

# Filter a specific list of columns to keep only non-missing entries
df %>% 
  filter_at(.vars = vars(one_of(vars_to_check)),
            ~ !is.na(.))

# Filter all the columns to exclude NA
df %>% 
  filter_all(~ !is.na(.))

# Filter only numeric columns
df %>%
  filter_if(is.numeric,
            ~ !is.na(.))

See [here](https://stackoverflow.com/a/62829853/4627134) for another example using `across` — jiggunjer, Jul 26 '20 at 09:40
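As a related sketch, dplyr 1.0.4 and later also provide `if_all()` and `if_any()`, which make the "all" versus "any" logic explicit inside `filter()`. The example assumes such a dplyr version is installed; the data is a hand-typed variant of the question's table in which rows 1 and 3 each have one of the two checked columns filled in, and `kept_all` and `kept_any` are illustrative names.

```r
library(dplyr)

df <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, 2, 1, NA, 3),
  cfam = c(2, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

vars_to_check <- c("rnor", "cfam")

# Keep rows where ALL checked columns are non-missing
kept_all <- df %>%
  filter(if_all(all_of(vars_to_check), ~ !is.na(.x)))

# Keep rows where AT LEAST ONE checked column is non-missing
kept_any <- df %>%
  filter(if_any(all_of(vars_to_check), ~ !is.na(.x)))
```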

Jerry T · Answer 13 · 2016-12-10T18:26:40.800

I am a synthesizer:). Here I combined the answers into one function:

#' keep rows that have a certain number (range) of NAs anywhere/somewhere and delete others
#' @param df a data frame
#' @param col restrict to the columns where you would like to search for NA; eg, 3, c(3), 2:5, "place", c("place","age")
#' \cr default is NULL, search for all columns
#' @param n integer or vector, 0, c(3,5), number/range of NAs allowed.
#' \cr If a number, the exact number of NAs kept
#' \cr Range includes both ends 3<=n<=5
#' \cr Range could be -Inf, Inf
#' @return returns a new df with rows that have NA(s) removed
#' @export
ez.na.keep = function(df, col=NULL, n=0){
    if (!is.null(col)) {
        # R converts a single row/col to a vector if the parameter col has only one col
        # see https://radfordneal.wordpress.com/2008/08/20/design-flaws-in-r-2-%E2%80%94-dropped-dimensions/#comments
        df.temp = df[,col,drop=FALSE]
    } else {
        df.temp = df
    }

    if (length(n)==1){
        if (n==0) {
            # simply call complete.cases which might be faster
            result = df[complete.cases(df.temp),]
        } else {
            # credit: http://stackoverflow.com/a/30461945/2292993
            log <- apply(df.temp, 2, is.na)
            logindex <- apply(log, 1, function(x) sum(x) == n)
            result = df[logindex, ]
        }
    }

    if (length(n)==2){
        min = n[1]; max = n[2]
        log <- apply(df.temp, 2, is.na)
        logindex <- apply(log, 1, function(x) {sum(x) >= min && sum(x) <= max})
        result = df[logindex, ]
    }

    return(result)
}

score 8 · Answer 14 · answered Mar 15 '17 at 16:51

Assuming dat as your dataframe, the expected output can be achieved using

1. rowSums

> dat[!rowSums((is.na(dat))),]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
6 ENSG00000221312    0   1    2    3    2

2. lapply

> dat[!Reduce('|',lapply(dat,is.na)),]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
6 ENSG00000221312    0   1    2    3    2

score 4 · Answer 15 · answered Feb 22 '18 at 22:19

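delete.dirt <- function(DF, dart=c('NA')) {
  # TRUE for rows that contain none of the unwanted values listed in `dart`.
  # Note: 'NA' here is the literal string; a real missing value (NA) is not matched by %in%.
  clean_rows <- apply(DF, 1, function(r) !any(r %in% dart))
  DF[clean_rows, ]
}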

mydata <- delete.dirt(mydata)

The function above deletes all rows of the data frame that contain the string 'NA' in any column and returns the resulting data. To check for multiple values, such as NA and ?, change dart=c('NA') in the function parameters to dart=c('NA', '?').
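If the placeholders are strings such as 'NA' or '?', an alternative sketch is to convert them to real missing values first and then rely on `complete.cases()`. The data frame `mydata` and the names `placeholders` and `cleaned` below are made up for illustration; this is a different technique from `delete.dirt()` above, and it also removes rows with genuine NAs.

```r
# Hypothetical data: "NA" and "?" are placeholder STRINGS; b also has one real NA
mydata <- data.frame(
  a = c("1", "NA", "3", "?"),
  b = c("x", "y", NA, "z"),
  stringsAsFactors = FALSE
)

placeholders <- c("NA", "?")

# Replace placeholder strings with real missing values, column by column
mydata[] <- lapply(mydata, function(col) {
  col[col %in% placeholders] <- NA
  col
})

# Now the usual tools see every kind of missing value
cleaned <- mydata[complete.cases(mydata), ]
```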

score 3 · Answer 16 · edited Apr 17 '20 at 17:57

My guess is that this could be more elegantly solved in this way:

  m <- matrix(1:25, ncol = 5)
  m[c(1, 6, 13, 25)] <- NA
  df <- data.frame(m)
  library(dplyr) 
  df %>%
  filter_all(any_vars(is.na(.)))
  #>   X1 X2 X3 X4 X5
  #> 1 NA NA 11 16 21
  #> 2  3  8 NA 18 23
  #> 3  5 10 15 20 NA

edited Apr 17 '20 at 17:57

Arsen Khachaturyan


answered May 08 '18 at 20:35

Joni Hoppen


this will retain rows with `NA`. I think what the OP wants is: `df %>% filter_all(all_vars(!is.na(.)))` – asifzuba Jun 26 '18 at 07:18
