115

My question involves summing up values across multiple columns of a data frame and creating a new column corresponding to this summation using dplyr. The data entries in the columns are binary(0,1). I am thinking of a row-wise analog of the summarise_each or mutate_each function of dplyr. Below is a minimal example of the data frame:

library(dplyr)
df=data.frame(
  x1=c(1,0,0,NA,0,1,1,NA,0,1),
  x2=c(1,1,NA,1,1,0,NA,NA,0,1),
  x3=c(0,1,0,1,1,0,NA,NA,0,1),
  x4=c(1,0,NA,1,0,0,NA,0,0,1),
  x5=c(1,1,NA,1,1,1,NA,1,0,1))

> df
   x1 x2 x3 x4 x5
1   1  1  0  1  1
2   0  1  1  0  1
3   0 NA  0 NA NA
4  NA  1  1  1  1
5   0  1  1  0  1
6   1  0  0  0  1
7   1 NA NA NA NA
8  NA NA NA  0  1
9   0  0  0  0  0
10  1  1  1  1  1

I could use something like:

df <- df %>% mutate(sumrow= x1 + x2 + x3 + x4 + x5)

but this would involve writing out the names of each of the columns. I have like 50 columns. In addition, the column names change at different iterations of the loop in which I want to implement this operation so I would like to try avoid having to give any column names.

How can I do that most efficiently? Any assistance would be greatly appreciated.

amo
  • 2,412
  • 3
  • 20
  • 33
  • 12
    Why `dplyr`? Why not just a simple `df$sumrow – David Arenburg Mar 05 '15 at 08:22
  • 7
    You can do both with `dplyr` too as in ```df %>% mutate(sumrow = Reduce(`+`, .))``` or `df %>% mutate(sumrow = rowSums(.))` – David Arenburg Mar 05 '15 at 08:38
  • Yes, the base R `rowSums` function works well. However, trying the suggestions in `dplyr`; `df %>% mutate(sumrow = rowSums(.))` and `df %>% mutate(sumrow = Reduce(`+`, .))` generate errors. I had just thought of fitting the operation within a dplyr chain alongside other preceding and subsequent operations in the chain.@DavidArenburg – amo Mar 05 '15 at 09:10
  • 2
    Update to the latest `dplyr` version and it will work. – David Arenburg Mar 05 '15 at 09:11
  • Related: http://stackoverflow.com/questions/28095526/summarise-over-all-columns – talat Mar 05 '15 at 09:56
  • 1
    Suggestions by David Arenburg worked after updating package dplyr @DavidArenburg – amo Mar 05 '15 at 16:01
  • 1
    @boern David Arenburgs comment was the best answer and most direct solution. Your answer would work but it involves an extra step of replacing NA values with zero which might not be suitable in some cases. – amo Sep 29 '16 at 11:29
  • @amo I disagree: Davids first comment didn't utilize `dplyr` like requested plus requires the exact same step of removing `NA`s by `na.rm = TRUE`. His second comment is `dplyr` but (since the `NA` treatment is missing) it leads to `NA` sums. – Boern Sep 29 '16 at 11:42
  • 1
    @Boern So your are basically saying my comment is invalid because I didn't add `na.rm = TRUE` while your answer with an extra `replace` step and then using my exact comment is somewhat much better? Pretty sneaky (at best) statement and smells like rep wh*ring to me. – David Arenburg Jun 14 '18 at 13:00
  • If you're summing across many rows in `dplyr`, your data is probably not "tidy" and it might be better to reshape or to just use base. – Axeman Jun 15 '18 at 11:16
  • @DavidArenburg Honestly I'm having a hard time to recap, but if you look at the edit history my first post was just `summarise_each(funs(sum))`, a comment to my answer pointed out that this was the right answer, so I improved it. I don't remeber if I even read your comment. Either way, I don't think this discussion is worth our time :) – Boern Sep 14 '18 at 11:47
  • @DavidArenburg I want to select multiple columns (some numeric, some factors) all binary (0,1) to summarize in rows as count and proportion. Is that possible using dplyr? – sar Mar 28 '20 at 23:30

6 Answers6

136

dplyr >= 1.0.0 using across

sum up each row using rowSums (rowwise works for any aggreation, but is slower)

df %>%
   replace(is.na(.), 0) %>%
   mutate(sum = rowSums(across(where(is.numeric))))

sum down each column

df %>%
   summarise(across(everything(), ~ sum(., is.na(.), 0)))

dplyr < 1.0.0

sum up each row

df %>%
   replace(is.na(.), 0) %>%
   mutate(sum = rowSums(.[1:5]))

sum down each column using superseeded summarise_all:

df %>%
   replace(is.na(.), 0) %>%
   summarise_all(funs(sum))
mpalanco
  • 10,839
  • 1
  • 53
  • 63
Boern
  • 5,482
  • 4
  • 44
  • 68
  • 8
    `summarise_each` sums down along each column while what is required is sum along each row – amo Sep 23 '15 at 08:55
  • 1
    I'm trying to achieve the same, but my DF has a column which is a character, hence I cannot sum all the columns. I guess I should modify the `(.[1:5])` part, but unfortunately I am not familiar with the syntax nor I don't know how to look for help on it. Tried with `mutate(sum = rowSums(is.numeric(.)))` but didn't work. – ccamara Jan 11 '17 at 12:07
  • 1
    You could try to also replace all non numeric values with 0 using `replace(!is.numeric(.), 0) %>%` ? – Boern Jan 11 '17 at 12:20
  • Thanks! The problem is that I want to keep those non numeric values, so I'm afraid that if I did understand your suggestion properly it wouldn't be of much use. – ccamara Jan 11 '17 at 18:00
  • 5
    I see. You might want to give `df %>% replace(is.na(.), 0) %>% select_if(is.numeric) %>% summarise_each(funs(sum))` a shot? – Boern Jan 12 '17 at 08:19
  • It also works with a character vector with the names of the columns you want to use, that can be a shortcut or not, depending the situation – Matias Andina Jul 21 '17 at 18:30
  • 2
    Use `summarise_all` instead of `summarise_each` as it has been deprecated. – hmhensen Jun 17 '18 at 23:27
  • 2
    Syntax `mutate(sum = rowSums(.[,-1]))` may come in handy if you don't know how many columns you need to deal with. – Paulo S. Abreu Apr 17 '19 at 17:10
  • Since this question and the answer really got popular - I think it worth update the answer using updated version of `dplyr` @Boern – Sinh Nguyen Jan 30 '21 at 01:59
  • Note that if you use rowwise/ rowSums on a grouped data frame with many many groups, it's worth to `ungroup` first - makes a huge difference in performance https://github.com/tidyverse/dplyr/issues/5819 – tjebo Mar 21 '21 at 13:30
36

If you want to sum certain columns only, I'd use something like this:

library(dplyr)
df=data.frame(
  x1=c(1,0,0,NA,0,1,1,NA,0,1),
  x2=c(1,1,NA,1,1,0,NA,NA,0,1),
  x3=c(0,1,0,1,1,0,NA,NA,0,1),
  x4=c(1,0,NA,1,0,0,NA,0,0,1),
  x5=c(1,1,NA,1,1,1,NA,1,0,1))
df %>% select(x3:x5) %>% rowSums(na.rm=TRUE) -> df$x3x5.total
head(df)

This way you can use dplyr::select's syntax.

Richard DiSalvo
  • 690
  • 9
  • 15
32

I would use regular expression matching to sum over variables with certain pattern names. For example:

df <- df %>% mutate(sum1 = rowSums(.[grep("x[3-5]", names(.))], na.rm = TRUE),
                    sum_all = rowSums(.[grep("x", names(.))], na.rm = TRUE))

This way you can create more than one variable as a sum of certain group of variables of your data frame.

Erick Chacon
  • 843
  • 8
  • 15
  • great solution! I was looking for a specific dplyr function doing this in recent releases, but couln't find – agenis Sep 20 '17 at 14:55
  • This solution is great. If there are columns you do not want to include you simply need to design the grep() statement to select columns matching a specific pattern. – Trenton Hoffman Jun 27 '18 at 20:25
  • 1
    @TrentonHoffman here is the bit deselect columns a specific pattern. just need the `-` sign: `rowSums(.[-grep("x[3-5]", names(.))], na.rm = TRUE)` – alexb523 Apr 11 '19 at 16:27
28

Using reduce() from purrr is slightly faster than rowSums and definately faster than apply, since you avoid iterating over all the rows and just take advantage of the vectorized operations:

library(purrr)
library(dplyr)
iris %>% mutate(Petal = reduce(select(., starts_with("Petal")), `+`))

See this for timings

skd
  • 1,613
  • 1
  • 15
  • 22
  • I like this but how would you do it when you need `na.rm = TRUE` – see24 Mar 18 '20 at 14:08
  • @see24 I'm not sure I know what you mean. This sums vectors a + b + c, all of the same length. Since each vector may or may not have NA in different locations, you cannot ignore them. This would make the vectors unaligned. If you want to remove NA values you have to do it **afterwards** with, for instance, drop_na – skd Mar 19 '20 at 11:31
  • I ended up doing `rowSums(select(., matches("myregex")) , na.rm = TRUE))` because that is what I needed in terms of ignoring NAs. So if the numbers are `sum(NA, 5)` the results is 5. But you said reduce is better than `rowSums` so I was wondering if there is a way to use it in this situation? – see24 Mar 19 '20 at 12:28
  • I see. If you want the sum and to ignore NA values definately the `rowSums` version is probably the best. The main disadvantage is that only `rowSums` and `rowMeans` are available (it is slighly slower than reduce, but not by much). If you need to perform another operation (not the sum) then the `reduce` version is probably the only option. Just avoid using `apply` in this case. – skd Mar 20 '20 at 14:24
24

I encounter this problem often, and the easiest way to do this is to use the apply() function within a mutate command.

library(tidyverse)
df=data.frame(
  x1=c(1,0,0,NA,0,1,1,NA,0,1),
  x2=c(1,1,NA,1,1,0,NA,NA,0,1),
  x3=c(0,1,0,1,1,0,NA,NA,0,1),
  x4=c(1,0,NA,1,0,0,NA,0,0,1),
  x5=c(1,1,NA,1,1,1,NA,1,0,1))

df %>%
  mutate(sum = select(., x1:x5) %>% apply(1, sum, na.rm=TRUE))

Here you could use whatever you want to select the columns using the standard dplyr tricks (e.g. starts_with() or contains()). By doing all the work within a single mutate command, this action can occur anywhere within a dplyr stream of processing steps. Finally, by using the apply() function, you have the flexibility to use whatever summary you need, including your own purpose built summarization function.

Alternatively, if the idea of using a non-tidyverse function is unappealing, then you could gather up the columns, summarize them and finally join the result back to the original data frame.

df <- df %>% mutate( id = 1:n() )   # Need some ID column for this to work

df <- df %>%
  group_by(id) %>%
  gather('Key', 'value', starts_with('x')) %>%
  summarise( Key.Sum = sum(value) ) %>%
  left_join( df, . )

Here I used the starts_with() function to select the columns and calculated the sum and you can do whatever you want with NA values. The downside to this approach is that while it is pretty flexible, it doesn't really fit into a dplyr stream of data cleaning steps.

  • 3
    Seems silly to use `apply` when this is what `rowSums` was designed for. – zacdav Feb 26 '18 at 23:10
  • 7
    In this case `rowSums` works really well as does `rowMeans`, but I always felt a little weird wondering about "What if the thing I need to calculate isn't a sum or a mean?" However, 99% of the time I have to do something like this, it is either a sum or a mean, so maybe the extra bit of flexibility in using the general `apply` function isn't warrented. – Derek Sonderegger Feb 26 '18 at 23:17
20

dplyr >= 1.0.0

In newer versions of dplyr you can use rowwise() along with c_across to perform row-wise aggregation for functions that do not have specific row-wise variants, but if the row-wise variant exists it should be faster.

Since rowwise() is just a special form of grouping and changes the way verbs work you'll likely want to pipe it to ungroup() after doing your row-wise operation.

To select a range of rows:

df %>%
  dplyr::rowwise() %>% 
  dplyr::mutate(sumrange = sum(dplyr::c_across(x1:x5), na.rm = T))
# %>% dplyr::ungroup() # you'll likely want to ungroup after using rowwise()

To select rows by type:

df %>%
  dplyr::rowwise() %>% 
  dplyr::mutate(sumnumeric = sum(c_across(where(is.numeric)), na.rm = T))
# %>% dplyr::ungroup() # you'll likely want to ungroup after using rowwise()

To select rows by column name:

You can use any number of tidy selection helpers like starts_with, ends_with, contains, etc.

df %>%
    dplyr::rowwise() %>% 
    dplyr::mutate(sum_startswithx = sum(c_across(starts_with("x")), na.rm = T))

rowise() will work for any summary function. However, in your specific case a row-wise variant exists (rowSums) so you can do the following (note the use of across instead), which will be faster:

df %>%
  dplyr::mutate(sumrow = rowSums(dplyr::across(x1:x5), na.rm = T))

For more information see the page on rowwise.


Benchmarking

For this example, the the row-wise variant rowSums takes about half as much time:

library(microbenchmark)

microbenchmark(
  df %>%
    dplyr::rowwise() %>% 
    dplyr::mutate(sumrange = sum(dplyr::c_across(x1:x5), na.rm = T)),
  df %>%
    dplyr::mutate(sumrow = rowSums(dplyr::across(x1:x5), na.rm = T)),
  times = 1000L
)

    min    lq     mean  median      uq     max neval cld
 5.5256 6.256 7.024232 6.58885 7.02325 22.1911  1000   b
 2.7011 3.112 3.661106 3.41070 3.71975 32.6282  1000  a 
LMc
  • 5,865
  • 2
  • 12
  • 31