4

I'm having trouble using rowwise() to count the number of NAs in each row. My minimal example:

df <- data.frame(Q1 = c(rep(1, 1), rep(NA, 9)),
                 Q2 = c(rep(2, 2), rep(NA, 8)),
                 Q3 = c(rep(3, 3), rep(NA, 7))
)
df
   Q1 Q2 Q3
1   1  2  3
2  NA  2  3
3  NA NA  3
4  NA NA NA
5  NA NA NA
6  NA NA NA
7  NA NA NA
8  NA NA NA
9  NA NA NA
10 NA NA NA

I would like to create a new column that counts the number of NAs in each row. I can do this very simply by writing

 df$Count_NA <- rowSums(is.na(df))
 df
   Q1 Q2 Q3 Count_NA
1   1  2  3        0
2  NA  2  3        1
3  NA NA  3        2
4  NA NA NA        3
5  NA NA NA        3
6  NA NA NA        3
7  NA NA NA        3
8  NA NA NA        3
9  NA NA NA        3
10 NA NA NA        3

But if I try and do this via dplyr using rowwise(), I get the wrong answer - the column Count_NA has the same number in each row:

df %>%
   rowwise() %>%
   mutate(Count_NA = sum(is.na(.)))
# A tibble: 10 x 4
# Rowwise: 
      Q1    Q2    Q3 Count_NA
   <dbl> <dbl> <dbl>    <int>
 1     1     2     3       24
 2    NA     2     3       24
 3    NA    NA     3       24
 4    NA    NA    NA       24
 5    NA    NA    NA       24
 6    NA    NA    NA       24
 7    NA    NA    NA       24
 8    NA    NA    NA       24
 9    NA    NA    NA       24
10    NA    NA    NA       24

what am I doing wrong, and how do i fix this?

Many thanks in advance

Thomas Philips

Waldi
  • 21,789
  • 5
  • 13
  • 47
Thomas Philips
  • 489
  • 7
  • 15
  • Does this answer your question? [Count NAs per row in dataframe](https://stackoverflow.com/questions/37801338/count-nas-per-row-in-dataframe) – TarJae Apr 18 '21 at 16:58
  • Probable duplicate: [Understanding rowwise and c_across](https://stackoverflow.com/questions/64282839/understanding-rowwise-and-c-across) – Ian Campbell Apr 18 '21 at 19:47

6 Answers6

5

Use cur_data() rather than dot. .[cur_group_id(), ], c(Q1, Q2, Q3), across() or c_across() (or c_across with the argument as per other answer) would also work.

Note that it is best to use ungroup afterwards or else it will retain the memory of the rowwise and you might get unexpected results later on.

df %>%
   rowwise() %>%
   mutate(Count_NA = sum(is.na(cur_data()))) %>%
   ungroup

giving:

# A tibble: 10 x 4
      Q1    Q2    Q3 Count_NA
   <dbl> <dbl> <dbl>    <int>
 1     1     2     3        0
 2    NA     2     3        1
 3    NA    NA     3        2
 4    NA    NA    NA        3
 5    NA    NA    NA        3
 6    NA    NA    NA        3
 7    NA    NA    NA        3
 8    NA    NA    NA        3
 9    NA    NA    NA        3
10    NA    NA    NA        3
G. Grothendieck
  • 211,268
  • 15
  • 177
  • 297
5

One issue is that . here resolves to the whole frame, not just the whole row. Another dplyr method, using c_across:

df %>%
    rowwise() %>%
    mutate(a=sum(is.na(c_across(everything()))))
# # A tibble: 10 x 4
# # Rowwise: 
#       Q1    Q2    Q3     a
#    <dbl> <dbl> <dbl> <int>
#  1     1     2     3     0
#  2    NA     2     3     1
#  3    NA    NA     3     2
#  4    NA    NA    NA     3
#  5    NA    NA    NA     3
#  6    NA    NA    NA     3
#  7    NA    NA    NA     3
#  8    NA    NA    NA     3
#  9    NA    NA    NA     3
# 10    NA    NA    NA     3

The biggest difference I can see between using this and cur_data() is that c_across allows for variable-selection a little more directly, as in c_across(starts_with("Q")). Granted, one could always select(cur_data(),...), so this is a weak argument.

r2evans
  • 77,184
  • 4
  • 55
  • 96
  • Thanks - how does one determine what the . resolves to? – Thomas Philips Apr 18 '21 at 16:40
  • Frankly, I did `func % rowwise() %>% mutate(Count_NA = func(.))`, then looked at `z` to see. It doesn't help resolve the issue, just to identify it. – r2evans Apr 18 '21 at 16:41
  • @G.Grothendieck's answer is canonical in the sense that that answer uses `cur_data()`. This answer is canonical in that it uses `c_across`. Pick and choose. I tend to believe `c_across` was meant for this type of thing, but I'm sure there are counter-examples as well. – r2evans Apr 18 '21 at 16:42
  • @r2evans, FYI, [speed comparison](https://stackoverflow.com/a/67151190/13513328) – Waldi Apr 18 '21 at 17:39
  • @Waldi, I didn't say I *like* `c_across`, just that it's canonically what (I think) is the recommendation for tasks like this. I really tend to avoid row-wise operations in `dplyr` verbs: most of my data tends to have 100K or more rows, so I have been conditioned to avoid it :-) – r2evans Apr 18 '21 at 17:44
  • 1
    @r2evans, I actually like your answer, but I was surprised by the speed comparison. I also work with >100k tables, and left `dplyr` for `data.table`. – Waldi Apr 18 '21 at 17:49
5

baseR answer

df$Count_NA <- apply(df, 1, function(x) sum(is.na(x)))                 

df
   Q1 Q2 Q3 Count_NA
1   1  2  3        0
2  NA  2  3        1
3  NA NA  3        2
4  NA NA NA        3
5  NA NA NA        3
6  NA NA NA        3
7  NA NA NA        3
8  NA NA NA        3
9  NA NA NA        3
10 NA NA NA        3

So can be integrated into dplyr pipe

df %>% mutate(count_NA = apply(., 1, function(x) sum(is.na(x))))

   Q1 Q2 Q3 count_NA
1   1  2  3        0
2  NA  2  3        1
3  NA NA  3        2
4  NA NA NA        3
5  NA NA NA        3
6  NA NA NA        3
7  NA NA NA        3
8  NA NA NA        3
9  NA NA NA        3
10 NA NA NA        3
AnilGoyal
  • 14,820
  • 3
  • 16
  • 30
4

In case future you were interested in a row-wise solution with purrr package functions:

library(purrr)

df %>%
  mutate(Count_NA = pmap(., ~ sum(is.na(c(...)))))


   Q1 Q2 Q3 Count_NA
1   1  2  3        0
2  NA  2  3        1
3  NA NA  3        2
4  NA NA NA        3
5  NA NA NA        3
6  NA NA NA        3
7  NA NA NA        3
8  NA NA NA        3
9  NA NA NA        3
10 NA NA NA        3

Anoushiravan R
  • 5,528
  • 7
  • 24
  • You can use `pmap` for much more complex row-wise operations with multiple variables: in case you were interested you can check this: https://stackoverflow.com/questions/66935005/apply-function-to-a-row-in-a-data-frame-using-dplyr – Anoushiravan R Apr 18 '21 at 16:54
  • 1
    FYI, [speed comparison](https://stackoverflow.com/a/67151190/13513328) – Waldi Apr 18 '21 at 17:39
  • @Waldi oh it's an interesting comparison. But I normally prefer to use `pmap` for more complex row-wise operations. It's no denying that `rowSums` is more simple and by far the fastest. – Anoushiravan R Apr 18 '21 at 18:01
  • 1
    I like your solution and I'm also a fond user of `purrr`, but `microbenchmark` is always useful to find the most efficient way. – Waldi Apr 18 '21 at 18:03
  • @Waldi Thank you very much. You are right. Since it's barely a year I started learning R, I put more emphasis on first getting the desired output & then trying alternative solutions. However in this case I was trying to propose a solution different than those already posted. It's also interesting there is a package called `bench` that does the same as `microbenchmark` & I don't yet know how it differs from the latter one. – Anoushiravan R Apr 18 '21 at 18:11
4

rowSums works directly with mutate without rowwise:

df %>% mutate(count_NA = rowSums(is.na(.)))

   Q1 Q2 Q3 count_NA
1   1  2  3        0
2  NA  2  3        1
3  NA NA  3        2
4  NA NA NA        3
5  NA NA NA        3
6  NA NA NA        3
7  NA NA NA        3
8  NA NA NA        3
9  NA NA NA        3
10 NA NA NA        3

Note that your initial solution is by far the fastest one:

microbenchmark::microbenchmark(
  df$Count_NA <- rowSums(is.na(df)),
  df$Count_NA <- apply(df, 1, function(x) sum(is.na(x))),
  df %>% mutate(count_NA = rowSums(is.na(.))),
  df %>%
    mutate(Count_NA = purrr::pmap(., ~ sum(is.na(c(...))))),
  df %>%
    rowwise() %>%
    mutate(a=sum(is.na(c_across(everything())))),
  df %>%
  rowwise() %>%
  mutate(Count_NA = sum(is.na(cur_data()))) %>%
  ungroup
)

Unit: microseconds
                                                                            expr     min       lq
                                               df$Count_NA <- rowSums(is.na(df))    39.8    64.30
                          df$Count_NA <- apply(df, 1, function(x) sum(is.na(x)))  1661.6  1868.40
                                     df %>% mutate(count_NA = rowSums(is.na(.)))  1181.7  1572.80
                   df %>% mutate(Count_NA = purrr::pmap(., ~sum(is.na(c(...)))))  4749.9  5190.35
             df %>% rowwise() %>% mutate(a = sum(is.na(c_across(everything())))) 29124.1 31148.50
 df %>% rowwise() %>% mutate(Count_NA = sum(is.na(cur_data()))) %>%      ungroup 70473.0 73659.70
      mean   median       uq     max neval   cld
    79.033    76.25    88.75   174.0   100 a    
  2082.960  1966.50  2075.75  8777.3   100  b   
  1722.178  1676.20  1791.60  3112.9   100  b   
  5726.549  5396.40  5745.25 28592.1   100   c  
 33567.825 31983.05 33637.00 54676.9   100    d 
 77902.342 76492.85 81199.15 98942.1   100     e
Unit: microseconds
                                                                            expr     min       lq
                                               df$Count_NA <- rowSums(is.na(df))    38.2    44.95
                          df$Count_NA <- apply(df, 1, function(x) sum(is.na(x)))  1584.8  1765.30
                                     df %>% mutate(count_NA = rowSums(is.na(.)))  1247.9  1496.95
                   df %>% mutate(Count_NA = purrr::pmap(., ~sum(is.na(c(...)))))  4614.0  5110.50
 df %>% rowwise() %>% mutate(Count_NA = sum(is.na(cur_data()))) %>%      ungroup 67413.5 70865.45
      mean   median       uq      max neval cld
    71.159    65.85    84.40    162.2   100 a  
  1967.629  1894.45  2093.30   3436.6   100 ab 
  1814.193  1666.25  1895.35   9031.0   100 a  
  5796.483  5380.70  5665.10  15309.7   100  b 
 78309.807 75275.30 79776.40 286964.3   100   c
Waldi
  • 21,789
  • 5
  • 13
  • 47
  • I think I'm most surprised by how inefficient `cur_data()` is. – Ian Campbell Apr 18 '21 at 19:49
  • I started out asking this question because I had the rowSums and the apply() solutions working well, and was trying to move my solution from base R to dplyr. Now that I see your speed comparisons, i'm sticking with base R! That said, I have learned a lot from this discussion. Thank you everyone. – Thomas Philips Apr 19 '21 at 01:36
2

Using dapply

library(collapse)
dapply(df, function(x) sum(is.na(x)), MARGIN = 1)
#[1] 0 1 2 3 3 3 3 3 3 3
akrun
  • 674,427
  • 24
  • 381
  • 486