Elegant way to report missing values in a data.frame

Question

Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck:

for (Var in names(airquality)) {
    missing <- sum(is.na(airquality[,Var]))
    if (missing > 0) {
        print(c(Var,missing))
    }
}

Edit: I'm dealing with data.frames with dozens to hundreds of variables, so it's key that we only report variables with missing values.

@kohske: that was my first thought, but the results are a `table` of characters and you would have to parse out the number of NAs. — Joshua Ulrich, Nov 29 '11 at 20:33
I'm reverting your question, since you posted an answer. If you want to comment on an answer, please do so as a comment to that answer. If questions also contains answers, this becomes very confusing. — Andrie, Nov 29 '11 at 20:48
@Andrie: I disagree with your edit, as a key issue I am facing is reporting only variables with missing values. Furthermore, your rollback removed a change I made to the code. I've edited my question to include this information, and added my modified version of Josh's code to a comment. — Zach, Nov 29 '11 at 20:57
@Zach Your new edit looks fine to me. I'm not averse to adding additional data / requests in a question once it's live, by the way, if this clarifies the question. — Andrie, Nov 29 '11 at 21:01
There are half a million ways to do this, see [CRAN Task View - MissingData](https://cran.r-project.org/web/views/MissingData.html) — zx8754, May 20 '19 at 16:40

Joshua Ulrich · Accepted Answer · 2011-11-29T20:46:57.477

160

Just use sapply

> sapply(airquality, function(x) sum(is.na(x)))
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0

You could also use apply or colSums on the matrix created by is.na()

> apply(is.na(airquality),2,sum)
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0
> colSums(is.na(airquality))
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0

edited Nov 29 '11 at 20:46

answered Nov 29 '11 at 20:29

Joshua Ulrich

163,034
29
321
400

13

I modified your code slightly, to only report missing value: `M 0]` – Zach Nov 29 '11 at 20:54
Thanks! Learned a lot. – Bombyx mori Nov 15 '15 at 04:43
Hi @Joshua Ulrich , thanks a lot for your succinct code. I would like to add a column in the data frame which would show the percentage of na values . Can you please provide some help regarding this ? – DukeLover Jun 17 '17 at 06:25
2

@Zach I use a version of your suggestion for checking that required fields have a value: `M – Anthony Simon Mielniczuk Oct 03 '17 at 20:12
@Joshua adding an option for %s would be ace too! – radek May 30 '19 at 01:03

score 8 · Answer 2 · answered Jan 27 '17 at 02:58

8

We can use map_df with purrr.

library(mice)
library(purrr)

# map_df with purrr
map_df(airquality, function(x) sum(is.na(x)))
# A tibble: 1 × 6
# Ozone Solar.R  Wind  Temp Month   Day
# <int>   <int> <int> <int> <int> <int>
# 1    37       7     0     0     0     0

answered Jan 27 '17 at 02:58

Keiku

5,545
2
28
34

1

What's the advantage of `map_df` over `sapply`? – Zach Jan 27 '17 at 15:08
1

@Zach I think that there is no big difference, but Hadley said not to use sapply() inside a function. See Exceptions and debugging · Advanced R. http://adv-r.had.co.nz/Exceptions-Debugging.html. – Keiku Jan 27 '17 at 17:59
for lazy people like me, you can write the above code in shorter purrr syntax for functions (~) so it looks like this: `map_df( air quality, ~sum(is.na(.) )` – Agile Bean Jan 29 '19 at 09:37
1

@Zach the advantage of `map_df` over `sapply` is only when the result has many rows as the output format of map_df is always a tibble. – Agile Bean Jan 29 '19 at 09:40
1

@Zach: it's better to use `vapply` vs `sapply` in functions because `vapply` gives you a known result structure (that you specify). `sapply` may return an array or a list, depending on the function output. A disadvantage of `map_df` is that you give it a data.frame as input, and it returns a data.frame subclass, not a data.frame. There's no guarantee that tibbles will behave as data.frames do in all necessary cases in the future. – Joshua Ulrich May 30 '19 at 10:09

score 8 · Answer 3 · answered Jul 02 '18 at 07:09

My new favourite for (not too wide) data are methods from excellent naniar package. Not only you get frequencies but also patterns of missingness:

library(naniar)
library(UpSetR)

riskfactors %>%
  as_shadow_upset() %>%
  upset()

It's often useful to see where the missings are in relation to non missing which can be achieved by plotting scatter plot with missings:

ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
 geom_miss_point()

Or for categorical variables:

gg_miss_fct(x = riskfactors, fct = marital)

These examples are from package vignette that lists other interesting visualizations.

Thanks for posting this! There is now a dedicated `gg_miss_upset()` function in the latest release, which will be submitted to CRAN once they return from holiday.http://naniar.njtierney.com/reference/gg_miss_upset.html — Nick Tierney, Sep 10 '18 at 06:41

score 6 · Answer 4 · edited Feb 13 '18 at 02:16

6

summary(airquality)

already gives you this information

The VIM packages also offers some nice missing data plot for data.frame

library("VIM")
aggr(airquality)

edited Feb 13 '18 at 02:16

radek

6,256
7
48
74

answered Nov 10 '16 at 20:08

Steffen Moritz

4,758
11
30
46

Can the VIM package report which specific observations have missing data? – Anthony Simon Mielniczuk Nov 17 '16 at 18:50
don't think so.. but you can get this quite easy (you would have to replace airquality with your own dataframe): res 0,] – Steffen Moritz Nov 17 '16 at 21:17

score 4 · Answer 5 · edited Aug 12 '15 at 15:34

4

More succinct-: sum(is.na(x[1]))

That is

x[1] Look at the first column
is.na() true if it's NA
sum() TRUE is 1, FALSE is 0

edited Aug 12 '15 at 15:34

MichaelChirico

31,197
13
98
169

answered Apr 11 '15 at 13:34

Keith Whittingham

306
2
10

this doesn't answer the original question, which is to find the number of `NA`s for *all* columns in the data – Ben Bolker Apr 11 '15 at 13:37

score 4 · Answer 6 · answered Feb 13 '18 at 02:12

4

Another graphical alternative - plot_missing function from excellent DataExplorer package:

Docs also points out to the fact that you can save this results for additional analysis with missing_data <- plot_missing(data).

answered Feb 13 '18 at 02:12

radek

6,256
7
48
74

The `plot_missing()` function in the `DataExplorer` package is now `PlotMissing()`. – coip Mar 06 '18 at 22:59
1

@coip `PlotMissing()` is deprecated. Please use `plot_missing()` instead. See [#49](https://github.com/boxuancui/DataExplorer/issues/49) for more details. – Boxuan Aug 02 '18 at 13:17

score 2 · Answer 7 · answered Feb 06 '17 at 06:28

Another function that would help you look at missing data would be df_status from funModeling library

library(funModeling)

iris.2 is the iris dataset with some added NAs.You can replace this with your dataset.

df_status(iris.2)

This will give you the number and percentage of NAs in each column.

score 1 · Answer 8 · edited Feb 13 '18 at 02:15

1

I think the Amelia library does a nice job in handling missing data also includes a map for visualizing the missing rows.

install.packages("Amelia")
library(Amelia)
missmap(airquality)

You can also run the following code will return the logic values of na

row.has.na <- apply(training, 1, function(x){any(is.na(x))})

edited Feb 13 '18 at 02:15

radek

6,256
7
48
74

answered Apr 19 '17 at 09:44

drexxx

87
2
11

radek · Answer 9 · 2018-02-13T02:15:41.937

1

For one more graphical solution, visdat package offers vis_miss.

library(visdat)
vis_miss(airquality)

Very similar to Amelia output with a small difference of giving %s on missings out of the box.

edited Feb 13 '18 at 02:15

answered Dec 03 '17 at 23:37

radek

6,256
7
48
74

score 1 · Answer 10 · answered Aug 27 '18 at 04:05

Another graphical and interactive way is to use is.na10 function from heatmaply library:

library(heatmaply)

heatmaply(is.na10(airquality), grid_gap = 1, 
          showticklabels = c(T,F),
            k_col =3, k_row = 3,
            margins = c(55, 30), 
            colors = c("grey80", "grey20"))

Probably won't work well with large datasets..

score 0 · Answer 11 · edited Aug 25 '14 at 06:52

0

If you want to do it for particular column, then you can also use this

length(which(is.na(airquality[1])==T))

edited Aug 25 '14 at 06:52

Jeromy Anglim

30,761
24
109
167

answered Jun 09 '14 at 06:24

Chintak Chhapia

726
7
10

4

You don't need to compare a logical vector to T. You can also count the number of TRUE elements in a logical vector by summing it. – Houshalter Dec 01 '14 at 18:03

score 0 · Answer 12 · answered Apr 05 '19 at 04:11

0

ExPanDaR’s package function prepare_missing_values_graph can be used to explore panel data:

answered Apr 05 '19 at 04:11

radek

6,256
7
48
74

sbha · Answer 13 · 2020-03-16T00:42:39.853

A dplyr solution to get the count could be:

summarise_all(df, ~sum(is.na(.)))

Or to get a percentage:

summarise_all(df, ~(sum(is_missing(.) / nrow(df))))

Maybe also worth noting that missing data can be ugly, inconsistent, and not always coded as NA depending on the source or how it's handled when imported. The following function could be tweaked depending on your data and what you want to consider missing:

is_missing <- function(x){
  missing_strs <- c('', 'null', 'na', 'nan', 'inf', '-inf', '-9', 'unknown', 'missing')
  ifelse((is.na(x) | is.nan(x) | is.infinite(x)), TRUE,
         ifelse(trimws(tolower(x)) %in% missing_strs, TRUE, FALSE))
}

# sample ugly data
df <- data.frame(a = c(NA, '1', '  ', 'missing'),
                 b = c(0, 2, NaN, 4),
                 c = c('NA', 'b', '-9', 'null'),
                 d = 1:4,
                 e = c(1, Inf, -Inf, 0))

# counts:
> summarise_all(df, ~sum(is_missing(.)))
  a b c d e
1 3 1 3 0 2

# percentage:
> summarise_all(df, ~(sum(is_missing(.) / nrow(df))))
     a    b    c d   e
1 0.75 0.25 0.75 0 0.5

Elegant way to report missing values in a data.frame

13 Answers13

Linked