
I have a data.frame containing some columns where every value is NA. How can I delete them from the data.frame?

Can I use the function

na.omit(...) 

specifying some additional arguments?
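
For context, na.omit() drops rows that contain any NA rather than columns, so it cannot remove all-NA columns regardless of its arguments. A minimal sketch with toy data:

df <- data.frame(x = c(1, NA, 3), all_na = rep(NA, 3))
na.omit(df)
# [1] x      all_na
# <0 rows> (or 0-length row.names)

Every row contains an NA (because of the all-NA column), so all rows are dropped while the empty column survives.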

– Lorenzo Rigamonti
  • Hi there! Please make your post reproducible. Read the post [**how to make a great reproducible example**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on how to do this. Thank you. – Arun Apr 12 '13 at 10:05
  • Does this post help? http://stackoverflow.com/questions/4862178/remove-rows-with-nas-in-data-frame – Arun Apr 12 '13 at 10:06
  • can you post `head(data)`? Do you want to remove corresponding columns or rows? – Nishanth Apr 12 '13 at 10:10
  • @e4e5f4 I want to remove corresponding columns (all the values of the columns I want to remove are NA) – Lorenzo Rigamonti Apr 12 '13 at 10:12
  • Possible duplicate of [Remove columns from dataframe where ALL values are NA](http://stackoverflow.com/questions/2643939/remove-columns-from-dataframe-where-all-values-are-na) – Sam Firke Apr 12 '16 at 15:02

7 Answers


One way of doing it:

df[, colSums(is.na(df)) != nrow(df)]

If a column's count of NAs equals the number of rows, the column must be entirely NA.

Or, similarly:

df[colSums(!is.na(df)) > 0]
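
To see the logic on a small example (toy data with illustrative column names):

df <- data.frame(id = 1:3, empty = rep(NA, 3), val = c(1, NA, 3))
colSums(is.na(df))
#    id empty   val
#     0     3     1
df[, colSums(is.na(df)) != nrow(df)]
#   id val
# 1  1   1
# 2  2  NA
# 3  3   3

Only `empty` has an NA count equal to nrow(df), so it is the only column dropped; `val`, which has just one NA, is kept.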
– Ciarán Tobin
  • How can I delete columns having more than a threshold of NAs, or a percentage of NAs (let's say above 50%)? – discipulus Mar 09 '15 at 07:01
  • @lovedynasty Probably best to submit a separate question, assuming you haven't already since posting your comment. But anyway, you can always do something like `df[, colSums(is.na(df)) < nrow(df) * 0.5]`, i.e. only keep columns that are at least 50% non-NA. – Ciarán Tobin Apr 13 '15 at 11:31
  • People working with a correlation matrix must use `df[, colSums(is.na(df)) != nrow(df) - 1]`, since the diagonal is always `1`. – Boern Oct 01 '15 at 12:54
  • Can use this with the dplyr (version 0.5.0) select_if function as well: `df %>% select_if(colSums(!is.na(.)) > 0)`. – Stefan Avey Nov 29 '16 at 21:58
  • @MadScone it is giving me a syntax error at "," for `df[, colSums(is.na(df)) != nrow(df)]` and a syntax error at "!" in `df[colSums(!is.na(df)) > 0]`. Am I missing something? – Aravind S Jan 29 '20 at 12:58

Here is a dplyr solution:

df %>% select_if(~sum(!is.na(.)) > 0)

Update: The select_if() function is superseded as of dplyr 1.0. Here are two other solutions that use the tidyselect where() function:

df %>% 
  select(
    where(
      ~sum(!is.na(.x)) > 0
    )
  )
df %>% 
  select(
    where(
      ~!all(is.na(.x))
    )
  )
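
Both forms select the same columns; a quick sanity check on toy data (sketch):

library(dplyr)

df <- data.frame(id = 1:3, empty = rep(NA, 3))
identical(
  df %>% select(where(~sum(!is.na(.x)) > 0)),
  df %>% select(where(~!all(is.na(.x))))
)
# [1] TRUE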
– Brad Cannell

Another option is the janitor package:

df <- remove_empty_cols(df)

https://github.com/sfirke/janitor
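
Note that newer janitor releases have deprecated remove_empty_cols() in favour of remove_empty(); a sketch, assuming janitor >= 1.0:

library(janitor)

# drop columns in which every value is NA
df <- remove_empty(df, which = "cols")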

– jsta

It seems like you want to remove ONLY columns with ALL NAs, leaving columns with some rows that do have NAs. I would do this (but I am sure there is an efficient vectorised solution):

#set seed for reproducibility
set.seed(103)
df <- data.frame( id = 1:10 , nas = rep( NA , 10 ) , vals = sample( c( 1:3 , NA ) , 10 , replace = TRUE ) )
df
#      id nas vals
#   1   1  NA   NA
#   2   2  NA    2
#   3   3  NA    1
#   4   4  NA    2
#   5   5  NA    2
#   6   6  NA    3
#   7   7  NA    2
#   8   8  NA    3
#   9   9  NA    3
#   10 10  NA    2

#Use this command to remove columns that are entirely NA values; it will leave columns where only some values are NA
df[ , ! apply( df , 2 , function(x) all(is.na(x)) ) ]
#      id vals
#   1   1   NA
#   2   2    2
#   3   3    1
#   4   4    2
#   5   5    2
#   6   6    3
#   7   7    2
#   8   8    3
#   9   9    3
#   10 10    2

If you find yourself in the situation where you want to remove columns that have any NA values, you can simply change the `all` command above to `any`.
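
For example, on the same toy data (a sketch; note the `drop = FALSE`, which stops R from simplifying a single surviving column to a vector):

df[ , ! apply( df , 2 , function(x) any(is.na(x)) ) , drop = FALSE ]
# only the NA-free column `id` remains, still as a data.frame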

– Simon O'Hanlon

An intuitive one-liner: `df %>% dplyr::select_if(~!all(is.na(.)))`. It keeps only the columns whose elements are not all missing, i.e. it deletes the all-NA columns.

> df <- data.frame( id = 1:10 , nas = rep( NA , 10 ) , vals = sample( c( 1:3 , NA ) , 10 , repl = TRUE ) )

> df %>% glimpse()
Observations: 10
Variables: 3
$ id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ nas  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ vals <int> NA, 1, 1, NA, 1, 1, 1, 2, 3, NA

> df %>% select_if(~!all(is.na(.))) 
   id vals
1   1   NA
2   2    1
3   3    1
4   4   NA
5   5    1
6   6    1
7   7    1
8   8    2
9   9    3
10 10   NA
– SJ9

Another option is `Filter`:

Filter(function(x) !all(is.na(x)), df)

NOTE: Data from @Simon O'Hanlon's post.
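
This works because a data.frame is a list of columns: Filter() applies the predicate to each column and returns the columns that pass, still as a data.frame. A small sketch:

df <- data.frame(id = 1:3, empty = rep(NA, 3))
Filter(function(x) !all(is.na(x)), df)
#   id
# 1  1
# 2  2
# 3  3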

– akrun

Because performance was really important for me, I benchmarked all the functions above.

NOTE: Data from @Simon O'Hanlon's post, but with size 15000 instead of 10.

library(tidyverse)
library(microbenchmark)

set.seed(123)
df <- data.frame(id = 1:15000,
                 nas = rep(NA, 15000), 
                 vals = sample(c(1:3, NA), 15000,
                               repl = TRUE))
df

MadSconeF1 <- function(x) x[, colSums(is.na(x)) != nrow(x)]

MadSconeF2 <- function(x) x[colSums(!is.na(x)) > 0]

BradCannell <- function(x) x %>% select_if(~sum(!is.na(.)) > 0)

SimonOHanlon <- function(x) x[ , !apply(x, 2 ,function(y) all(is.na(y)))]

jsta <- function(x) janitor::remove_empty(x)

SiboJiang <- function(x) x %>% dplyr::select_if(~!all(is.na(.)))

akrun <- function(x) Filter(function(y) !all(is.na(y)), x)

mbm <- microbenchmark(
  "MadSconeF1" = {MadSconeF1(df)},
  "MadSconeF2" = {MadSconeF2(df)},
  "BradCannell" = {BradCannell(df)},
  "SimonOHanlon" = {SimonOHanlon(df)},
  "SiboJiang" = {SiboJiang(df)},
  "jsta" = {jsta(df)}, 
  "akrun" = {akrun(df)},
  times = 1000)

mbm

Results:

Unit: microseconds
         expr    min      lq      mean  median      uq      max neval  cld
   MadSconeF1  154.5  178.35  257.9396  196.05  219.25   5001.0  1000 a   
   MadSconeF2  180.4  209.75  281.2541  226.40  251.05   6322.1  1000 a   
  BradCannell 2579.4 2884.90 3330.3700 3059.45 3379.30  33667.3  1000    d
 SimonOHanlon  511.0  565.00  943.3089  586.45  623.65 210338.4  1000  b  
    SiboJiang 2558.1 2853.05 3377.6702 3010.30 3310.00  89718.0  1000    d
         jsta 1544.8 1652.45 2031.5065 1706.05 1872.65  11594.9  1000   c 
        akrun   93.8  111.60  139.9482  121.90  135.45   3851.2  1000 a


autoplot(mbm)

[plot: autoplot(mbm) — distribution of benchmark timings per function]

mbm %>% 
  tbl_df() %>%
  ggplot(aes(sample = time)) + 
  stat_qq() + 
  stat_qq_line() +
  facet_wrap(~expr, scales = "free")

[plot: QQ plots of the benchmark timings, faceted by function]

– DJV
  • Sometimes the first iteration is JIT compiled, so it has very poor, and not very characteristic, times. I think it's interesting what the larger sample size does to the right tails of the distribution. This is good work. – EngrStudent May 05 '20 at 11:58
  • I ran it once again; I wasn't sure whether I had changed the plot. Regarding the distribution, indeed. I should probably compare different sample sizes when I have the time. – DJV May 05 '20 at 12:27
  • If you QQ-plot (https://ggplot2.tidyverse.org/reference/geom_qq.html) one of the trends, such as "akrun", then I bet there is one point that is very different from the distribution of the rest. The rest represent how long it takes if you run it repeatedly, but that one represents what happens if you run it once. There is an old saying: you can have 20 years of experience, or you can have one year's worth of experience 20 times. – EngrStudent May 05 '20 at 13:00
  • Very nice! I'm surprised by several samples being in the extreme tail. I wonder why those are so much more costly. JIT might explain 1 or 2, but not 20. Condition? Interrupts? Other? Thanks again for the update. – EngrStudent May 05 '20 at 16:56
  • You're welcome, and thank you for the thoughts. I don't know; I actually allowed it to run "freely". – DJV May 05 '20 at 19:59