
I have a data.frame containing some columns where every value is NA. How can I delete them from the data.frame?

Can I use the function

na.omit(...) 

specifying some additional arguments?
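
For context, na.omit() drops rows that contain any NA rather than columns, so it cannot remove all-NA columns regardless of its arguments. A minimal sketch with toy data:

df <- data.frame(x = c(1, NA, 3), all_na = rep(NA, 3))
na.omit(df)
# [1] x      all_na
# <0 rows> (or 0-length row.names)

Every row contains an NA (because of the all-NA column), so all rows are dropped while the empty column survives.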

– Lorenzo Rigamonti
  • Hi there! Please make your post reproducible. Read the post [**how to make a great reproducible example**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on how to do this. Thank you. – Arun Apr 12 '13 at 10:05
  • Does this post help? http://stackoverflow.com/questions/4862178/remove-rows-with-nas-in-data-frame – Arun Apr 12 '13 at 10:06
  • can you post `head(data)`? Do you want to remove corresponding columns or rows? – Nishanth Apr 12 '13 at 10:10
  • @e4e5f4 I want to remove corresponding columns (all the values of the columns I want to remove are NA) – Lorenzo Rigamonti Apr 12 '13 at 10:12
  • Possible duplicate of [Remove columns from dataframe where ALL values are NA](http://stackoverflow.com/questions/2643939/remove-columns-from-dataframe-where-all-values-are-na) – Sam Firke Apr 12 '16 at 15:02

7 Answers


One way of doing it:

df[, colSums(is.na(df)) != nrow(df)]

If a column's count of NAs equals the number of rows, the column must be entirely NA.

Or, similarly:

df[colSums(!is.na(df)) > 0]
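
To see the logic on a small example (toy data with illustrative column names):

df <- data.frame(id = 1:3, empty = rep(NA, 3), val = c(1, NA, 3))
colSums(is.na(df))
#    id empty   val
#     0     3     1
df[, colSums(is.na(df)) != nrow(df)]
#   id val
# 1  1   1
# 2  2  NA
# 3  3   3

Only `empty` has an NA count equal to nrow(df), so it is the only column dropped; `val`, which has just one NA, is kept.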
– Ciarán Tobin
  • How can I delete columns having more than a threshold of NAs, or a percentage of NAs (let's say above 50%)? – discipulus Mar 09 '15 at 07:01
  • @lovedynasty Probably best to submit a separate question, assuming you haven't already since posting your comment. But anyway, you can always do something like `df[, colSums(is.na(df)) < nrow(df) * 0.5]`, i.e. only keep columns that are at least 50% non-NA. – Ciarán Tobin Apr 13 '15 at 11:31
  • People working with a correlation matrix must use `df[, colSums(is.na(df)) != nrow(df) - 1]`, since the diagonal is always `1`. – Boern Oct 01 '15 at 12:54
  • Can use this with the dplyr (version 0.5.0) select_if function as well: `df %>% select_if(colSums(!is.na(.)) > 0)`. – Stefan Avey Nov 29 '16 at 21:58
  • @MadScone it is giving me a syntax error at "," for `df[, colSums(is.na(df)) != nrow(df)]` and a syntax error at "!" in `df[colSums(!is.na(df)) > 0]`. Am I missing something? – Aravind S Jan 29 '20 at 12:58

Here is a dplyr solution:

df %>% select_if(~sum(!is.na(.)) > 0)

Update: The select_if() function is superseded as of dplyr 1.0. Here are two other solutions that use the tidyselect where() function:

df %>% 
  select(
    where(
      ~sum(!is.na(.x)) > 0
    )
  )
df %>% 
  select(
    where(
      ~!all(is.na(.x))
    )
  )
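
Both forms select the same columns; a quick sanity check on toy data (sketch):

library(dplyr)

df <- data.frame(id = 1:3, empty = rep(NA, 3))
identical(
  df %>% select(where(~sum(!is.na(.x)) > 0)),
  df %>% select(where(~!all(is.na(.x))))
)
# [1] TRUE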
– Brad Cannell

Another option is the janitor package:

df <- remove_empty_cols(df)

https://github.com/sfirke/janitor
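
Note that newer janitor releases have deprecated remove_empty_cols() in favour of remove_empty(); a sketch, assuming janitor >= 1.0:

library(janitor)

# drop columns in which every value is NA
df <- remove_empty(df, which = "cols")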

– jsta

It seems like you want to remove ONLY columns with ALL NAs, leaving columns with some rows that do have NAs. I would do this (but I am sure there is an efficient vectorised solution):

#set seed for reproducibility
set.seed(103)
df <- data.frame( id = 1:10 , nas = rep( NA , 10 ) , vals = sample( c( 1:3 , NA ) , 10 , replace = TRUE ) )
df
#      id nas vals
#   1   1  NA   NA
#   2   2  NA    2
#   3   3  NA    1
#   4   4  NA    2
#   5   5  NA    2
#   6   6  NA    3
#   7   7  NA    2
#   8   8  NA    3
#   9   9  NA    3
#   10 10  NA    2

#Use this command to remove columns that are entirely NA values; it will leave columns where only some values are NA
df[ , ! apply( df , 2 , function(x) all(is.na(x)) ) ]
#      id vals
#   1   1   NA
#   2   2    2
#   3   3    1
#   4   4    2
#   5   5    2
#   6   6    3
#   7   7    2
#   8   8    3
#   9   9    3
#   10 10    2

If you find yourself in the situation where you want to remove columns that have any NA values, you can simply change the `all` command above to `any`.
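
For example, on the same toy data (a sketch; note the `drop = FALSE`, which stops R from simplifying a single surviving column to a vector):

df[ , ! apply( df , 2 , function(x) any(is.na(x)) ) , drop = FALSE ]
# only the NA-free column `id` remains, still as a data.frame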

– Simon O'Hanlon

An intuitive one-liner: `df %>% dplyr::select_if(~!all(is.na(.)))`. It keeps only the columns whose elements are not all missing, i.e. it deletes the all-NA columns.

> df <- data.frame( id = 1:10 , nas = rep( NA , 10 ) , vals = sample( c( 1:3 , NA ) , 10 , repl = TRUE ) )

> df %>% glimpse()
Observations: 10
Variables: 3
$ id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ nas  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ vals <int> NA, 1, 1, NA, 1, 1, 1, 2, 3, NA

> df %>% select_if(~!all(is.na(.))) 
   id vals
1   1   NA
2   2    1
3   3    1
4   4   NA
5   5    1
6   6    1
7   7    1
8   8    2
9   9    3
10 10   NA
– SJ9

Another option is `Filter`:

Filter(function(x) !all(is.na(x)), df)

NOTE: Data from @Simon O'Hanlon's post.
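
This works because a data.frame is a list of columns: Filter() applies the predicate to each column and returns the columns that pass, still as a data.frame. A small sketch:

df <- data.frame(id = 1:3, empty = rep(NA, 3))
Filter(function(x) !all(is.na(x)), df)
#   id
# 1  1
# 2  2
# 3  3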

– akrun

Because performance was really important for me, I benchmarked all the functions above.

NOTE: Data from @Simon O'Hanlon's post, but with size 15000 instead of 10.

library(tidyverse)
library(microbenchmark)

set.seed(123)
df <- data.frame(id = 1:15000,
                 nas = rep(NA, 15000), 
                 vals = sample(c(1:3, NA), 15000,
                               repl = TRUE))
df

MadSconeF1 <- function(x) x[, colSums(is.na(x)) != nrow(x)]

MadSconeF2 <- function(x) x[colSums(!is.na(x)) > 0]

BradCannell <- function(x) x %>% select_if(~sum(!is.na(.)) > 0)

SimonOHanlon <- function(x) x[ , !apply(x, 2 ,function(y) all(is.na(y)))]

jsta <- function(x) janitor::remove_empty(x)

SiboJiang <- function(x) x %>% dplyr::select_if(~!all(is.na(.)))

akrun <- function(x) Filter(function(y) !all(is.na(y)), x)

mbm <- microbenchmark(
  "MadSconeF1" = {MadSconeF1(df)},
  "MadSconeF2" = {MadSconeF2(df)},
  "BradCannell" = {BradCannell(df)},
  "SimonOHanlon" = {SimonOHanlon(df)},
  "SiboJiang" = {SiboJiang(df)},
  "jsta" = {jsta(df)}, 
  "akrun" = {akrun(df)},
  times = 1000)

mbm

Results:

Unit: microseconds
         expr    min      lq      mean  median      uq      max neval  cld
   MadSconeF1  154.5  178.35  257.9396  196.05  219.25   5001.0  1000 a   
   MadSconeF2  180.4  209.75  281.2541  226.40  251.05   6322.1  1000 a   
  BradCannell 2579.4 2884.90 3330.3700 3059.45 3379.30  33667.3  1000    d
 SimonOHanlon  511.0  565.00  943.3089  586.45  623.65 210338.4  1000  b  
    SiboJiang 2558.1 2853.05 3377.6702 3010.30 3310.00  89718.0  1000    d
         jsta 1544.8 1652.45 2031.5065 1706.05 1872.65  11594.9  1000   c 
        akrun   93.8  111.60  139.9482  121.90  135.45   3851.2  1000 a


autoplot(mbm)

[plot: autoplot(mbm) — distribution of benchmark timings per function]

mbm %>% 
  tbl_df() %>%
  ggplot(aes(sample = time)) + 
  stat_qq() + 
  stat_qq_line() +
  facet_wrap(~expr, scales = "free")

[plot: QQ plots of the benchmark timings, faceted by function]

– DJV
  • Sometimes the first iteration is JIT compiled, so it has very poor, and not very characteristic, times. I think it's interesting what the larger sample size does to the right tails of the distribution. This is good work. – EngrStudent May 05 '20 at 11:58
  • I ran it once again; I wasn't sure whether I had changed the plot. Regarding the distribution, indeed. I should probably compare different sample sizes when I have the time. – DJV May 05 '20 at 12:27
  • If you QQ-plot (https://ggplot2.tidyverse.org/reference/geom_qq.html) one of the trends, such as "akrun", then I bet there is one point that is very different from the distribution of the rest. The rest represent how long it takes if you run it repeatedly, but that one represents what happens if you run it once. There is an old saying: you can have 20 years of experience, or you can have one year's worth of experience 20 times. – EngrStudent May 05 '20 at 13:00
  • Very nice! I'm surprised by several samples being in the extreme tail. I wonder why those are so much more costly. JIT might explain 1 or 2, but not 20. Condition? Interrupts? Other? Thanks again for the update. – EngrStudent May 05 '20 at 16:56
  • You're welcome, and thank you for the thoughts. I don't know; I actually allowed it to run "freely". – DJV May 05 '20 at 19:59