How to select the rows with maximum values in each group with dplyr?

Question

I would like to select a row with maximum value in each group with dplyr.

Firstly I generate some random data to show my question

set.seed(1)
df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
df$value <- runif(nrow(df))

In plyr, I could use a custom function to select this row.

library(plyr)
ddply(df, .(A, B), function(x) x[which.max(x$value),])

In dplyr, I am using this code to get the maximum value, but not the rows with maximum value (Column C in this case).

library(dplyr)
df %>% group_by(A, B) %>%
    summarise(max = max(value))

How could I achieve this? Thanks for any suggestion.

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.2  plyr_1.8.1

loaded via a namespace (and not attached):
[1] assertthat_0.1.0.99 parallel_3.1.0      Rcpp_0.11.1        
[4] tools_3.1.0

thelatemail · Accepted Answer · 2020-06-22T21:49:54.073

154

Try this:

result <- df %>% 
             group_by(A, B) %>%
             filter(value == max(value)) %>%
             arrange(A,B,C)

Seems to work:

identical(
  as.data.frame(result),
  ddply(df, .(A, B), function(x) x[which.max(x$value),])
)
#[1] TRUE

As pointed out in the comments, slice may be preferred here as per @RoyalITS' answer below if you strictly only want 1 row per group. This answer will return multiple rows if there are multiple with an identical maximum value.

edited Jun 22 '20 at 21:49

answered Jun 16 '14 at 06:10

thelatemail

81,120
12
111
172

16

The results are identical in this case because there are no duplicated maximum values present. Otherwise, the `filter` approach would return all maximum values (rows) per group while the OP's ddply approach with `which.max` would only return one maximum (the first) per group. To replicate that behavior, another option is to use `slice(which.max(value))` in dplyr. – talat Jun 25 '15 at 07:23
According with @talat comment the same result may be obtained using `filter(row_number()==1)`. Remenber to arrange the DF before grouping – Tiziano Nov 21 '20 at 17:50

score 79 · Answer 2 · answered Jun 16 '14 at 06:14

79

You can use top_n

df %>% group_by(A, B) %>% top_n(n=1)

This will rank by the last column (value) and return the top n=1 rows.

Currently, you can't change the this default without causing an error (See https://github.com/hadley/dplyr/issues/426)

answered Jun 16 '14 at 06:14

mnel

105,872
25
248
242

6

note that ties will cause duplicates. – Triamus Apr 21 '18 at 13:33
if your value is not the last column you can use the wt parameter of top_n `df %>% group_by(A,B) %>% top_n(n=1, wt = value)` – squiddle Jun 28 '19 at 07:19

score 77 · Answer 3 · answered Feb 24 '16 at 16:40

77

df %>% group_by(A,B) %>% slice(which.max(value))

answered Feb 24 '16 at 16:40

RoyalTS

7,796
7
45
90

1

`slice_max(.data, order_by, ..., n, prop, with_ties = TRUE)` also now a good option... – Ben May 11 '21 at 17:50

score 12 · Answer 4 · answered Jul 18 '16 at 07:59

This more verbose solution provides greater control on what happens in case of duplicate maximum value (in this example, it will take one of the corresponding rows randomly)

library(dplyr)
df %>% group_by(A, B) %>%
  mutate(the_rank  = rank(-value, ties.method = "random")) %>%
  filter(the_rank == 1) %>% select(-the_rank)

score 2 · Answer 5 · answered Jan 16 '19 at 19:06

More generally, I think you might want to get "top" of the rows that are sorted within a given group.

For the case of where a single value is max'd out, you have essentially sorted by only one column. However, it's often useful to hierarchically sort by multiple columns (for example: a date column and a time-of-day column).

# Answering the question of getting row with max "value".
df %>% 
  # Within each grouping of A and B values.
  group_by( A, B) %>% 
  # Sort rows in descending order by "value" column.
  arrange( desc(value) ) %>% 
  # Pick the top 1 value
  slice(1) %>% 
  # Remember to ungroup in case you want to do further work without grouping.
  ungroup()

# Answering an extension of the question of 
# getting row with the max value of the lowest "C".
df %>% 
  # Within each grouping of A and B values.
  group_by( A, B) %>% 
  # Sort rows in ascending order by C, and then within that by 
  # descending order by "value" column.
  arrange( C, desc(value) ) %>% 
  # Pick the one top row based on the sort
  slice(1) %>% 
  # Remember to ungroup in case you want to do further work without grouping.
  ungroup()

When you `arrange( desc(value) )` then value are grouped? I would like to have there say sum of C. — Przemyslaw Remin, Apr 09 '21 at 15:00
@PrzemyslawRemin, I'm not sure I fully understand where the "sum of C" is or how you mean to use it. In `dplyr` it's nice to separate your steps. You could `df %>% group_by( A, B) %>% mutate( s = sum(C) )` which will put the sum of C within each group as a (repeated) value `s` within each group (each row gets the same sum value within a group). And then you could `ungroup` and use one of the other methods described to filter max of C, such as `... %>% ungroup() %>% slice(which.max(C))`. — Kalin, Apr 20 '21 at 15:31

score 1 · Answer 6 · answered Feb 01 '19 at 14:39

1

For me, it helped to count the number of values per group. Copy the count table into a new object. Then filter for the max of the group based on the first grouping characteristic. For example:

count_table  <- df %>%
                group_by(A, B) %>%
                count() %>%
                arrange(A, desc(n))

count_table %>% 
    group_by(A) %>%
    filter(n == max(n))

or

count_table %>% 
    group_by(A) %>%
    top_n(1, n)

answered Feb 01 '19 at 14:39

ksvrd

11
2

2

If you run this on the data in the question, I think you'll see it's answering the wrong question. The question in the post is about how to find the rows that have the maximum value (the number in the `value` column). This answer ignores the `value` column to find the most common `B` value for each `A`. – Gregor Thomas Feb 01 '19 at 14:49

How to select the rows with maximum values in each group with dplyr?

6 Answers6

Linked

Related