Getting the top values by group

Question

Here's a sample data frame:

d <- data.frame(
  x   = runif(90),
  grp = gl(3, 30)
)

I want the subset of d containing the rows with the top 5 values of x for each value of grp.

Using base-R, my approach would be something like:

ordered <- d[order(d$x, decreasing = TRUE), ]    
splits <- split(ordered, ordered$grp)
heads <- lapply(splits, head)
do.call(rbind, heads)
##              x grp
## 1.19 0.8879631   1
## 1.4  0.8844818   1
## 1.12 0.8596197   1
## 1.26 0.8481809   1
## 1.18 0.8461516   1
## 1.29 0.8317092   1
## 2.31 0.9751049   2
## 2.34 0.9269764   2
## 2.57 0.8964114   2
## 2.58 0.8896466   2
## 2.45 0.8888834   2
## 2.35 0.8706823   2
## 3.74 0.9884852   3
## 3.73 0.9837653   3
## 3.83 0.9375398   3
## 3.64 0.9229036   3
## 3.69 0.8021373   3
## 3.86 0.7418946   3

Using dplyr, I expected this to work:

d %>%
  arrange_(~ desc(x)) %>%
  group_by_(~ grp) %>%
  head(n = 5)

but it only returns the overall top 5 rows.

Swapping head for top_n returns the whole of d.

d %>%
  arrange_(~ desc(x)) %>%
  group_by_(~ grp) %>%
  top_n(n = 5)

How do I get the correct subset?

Henrik · Accepted Answer · 2020-07-30T13:43:25.297

From dplyr 1.0.0, "slice_min() and slice_max() select the rows with the minimum or maximum values of a variable, taking over from the confusing top_n()."

d %>% group_by(grp) %>% slice_max(order_by = x, n = 5)
# # A tibble: 15 x 2
# # Groups:   grp [3]
#     x grp  
# <dbl> <fct>
#  1 0.994 1    
#  2 0.957 1    
#  3 0.955 1    
#  4 0.940 1    
#  5 0.900 1    
#  6 0.963 2    
#  7 0.902 2    
#  8 0.895 2    
#  9 0.858 2    
# 10 0.799 2    
# 11 0.985 3    
# 12 0.893 3    
# 13 0.886 3    
# 14 0.815 3    
# 15 0.812 3

Pre-dplyr 1.0.0 using top_n:

From ?top_n, about the wt argument:

The variable to use for ordering [...] defaults to the last variable in the tbl".

The last variable in your data set is "grp", which is not the variable you wish to rank, and which is why your top_n attempt "returns the whole of d". Thus, if you wish to rank by "x" in your data set, you need to specify wt = x.

d %>%
  group_by(grp) %>%
  top_n(n = 5, wt = x)

Data:

set.seed(123)
d <- data.frame(
  x = runif(90),
  grp = gl(3, 30))

@MatíasGuzmánNaranjo, https://stackoverflow.com/questions/21308436/dplyr-filter-get-rows-with-minimum-of-variable-but-only-the-first-if-multiple — nanselm2, Sep 20 '17 at 22:10

score 41 · Answer 2 · edited Sep 05 '18 at 07:47

Pretty easy with data.table too...

library(data.table)
setorder(setDT(d), -x)[, head(.SD, 5), keyby = grp]

Or

setorder(setDT(d), grp, -x)[, head(.SD, 5), by = grp]

Or (Should be faster for big data set because avoiding calling .SD for each group)

setorder(setDT(d), grp, -x)[, indx := seq_len(.N), by = grp][indx <= 5]

Edit: Here's how dplyr compares to data.table (if anyone's interested)

set.seed(123)
d <- data.frame(
  x   = runif(1e6),
  grp = sample(1e4, 1e6, TRUE))

library(dplyr)
library(microbenchmark)
library(data.table)
dd <- copy(d)

microbenchmark(
  top_n = {d %>%
             group_by(grp) %>%
             top_n(n = 5, wt = x)},
  dohead = {d %>%
              arrange_(~ desc(x)) %>%
              group_by_(~ grp) %>%
              do(head(., n = 5))},
  slice = {d %>%
             arrange_(~ desc(x)) %>%
             group_by_(~ grp) %>%
             slice(1:5)},
  filter = {d %>% 
              arrange(desc(x)) %>%
              group_by(grp) %>%
              filter(row_number() <= 5L)},
  data.table1 = setorder(setDT(dd), -x)[, head(.SD, 5L), keyby = grp],
  data.table2 = setorder(setDT(dd), grp, -x)[, head(.SD, 5L), grp],
  data.table3 = setorder(setDT(dd), grp, -x)[, indx := seq_len(.N), grp][indx <= 5L],
  times = 10,
  unit = "relative"
)


#        expr        min         lq      mean     median        uq       max neval
#       top_n  24.246401  24.492972 16.300391  24.441351 11.749050  7.644748    10
#      dohead 122.891381 120.329722 77.763843 115.621635 54.996588 34.114738    10
#       slice  27.365711  26.839443 17.714303  26.433924 12.628934  7.899619    10
#      filter  27.755171  27.225461 17.936295  26.363739 12.935709  7.969806    10
# data.table1  13.753046  16.631143 10.775278  16.330942  8.359951  5.077140    10
# data.table2  12.047111  11.944557  7.862302  11.653385  5.509432  3.642733    10
# data.table3   1.000000   1.000000  1.000000   1.000000  1.000000  1.000000    10

Adding a marginally faster data.table solution:

set.seed(123L)
d <- data.frame(
    x   = runif(1e8),
    grp = sample(1e4, 1e8, TRUE))
setDT(d)
setorder(d, grp, -x)
dd <- copy(d)

library(microbenchmark)
microbenchmark(
    data.table3 = d[, indx := seq_len(.N), grp][indx <= 5L],
    data.table4 = dd[dd[, .I[seq_len(.N) <= 5L], grp]$V1],
    times = 10L
)

timing output:

Unit: milliseconds
        expr      min       lq     mean   median        uq      max neval
 data.table3 826.2148 865.6334 950.1380 902.1689 1006.1237 1260.129    10
 data.table4 729.3229 783.7000 859.2084 823.1635  966.8239 1014.397    10

Adding another `data.table` method which should be slightly faster: `dt — chinsoon12, Sep 05 '18 at 06:15
@chinsoon12 be my guest. I don't have time to benchmark these solutions again. — David Arenburg, Sep 05 '18 at 07:22
Adding another `data.table` method easier : `setDT(d)[order(-x),x[1:5],keyby = .(grp)]` — Tao Hu, May 03 '19 at 14:39
@TaoHu it's pretty much like the first two solutions. I don't think `:` will beat `head` — David Arenburg, May 03 '19 at 14:43
@DavidArenburg Yeah，I agree with you,I think the most difference is `setorder` faster than `order` — Tao Hu, May 03 '19 at 14:49

score 34 · Answer 3 · edited Oct 30 '19 at 14:03

34

You need to wrap head in a call to do. In the following code, . represents the current group (see description of ... in the do help page).

d %>%
  arrange_(~ desc(x)) %>%
  group_by_(~ grp) %>%
  do(head(., n = 5))

As mentioned by akrun, slice is an alternative.

d %>%
  arrange_(~ desc(x)) %>%
  group_by_(~ grp) %>%
  slice(1:5)

Though I didn't ask this, for completeness, a possible data.table version is (thanks to @Arun for the fix):

setDT(d)[order(-x), head(.SD, 5), by = grp]

edited Oct 30 '19 at 14:03

David Arenburg

87,271
15
123
181

answered Jan 04 '15 at 13:36

Richie Cotton

107,354
40
225
343

1

@akrun Thanks. I didn't know about that function. – Richie Cotton Jan 04 '15 at 13:40
@DavidArenburg Thanks. That's what comes of posting an answer in hurry. I've removed the nonsense. – Richie Cotton Jan 04 '15 at 18:16
2

Richie, FWIW you just need a small addition: `setDT(d)[order(-x), head(.SD, 5L), by=grp]` – Arun Jan 04 '15 at 18:19
This answer is a bit outdated but the second part is the idomatic way if you drop the `~` and use `arrange` and `group_by` instead of `arrange_` and `group_by_` – Moody_Mudskipper Feb 18 '19 at 20:40

talat · Answer 4 · 2015-01-04T14:07:50.390

16

My approach in base R would be:

ordered <- d[order(d$x, decreasing = TRUE), ]
ordered[ave(d$x, d$grp, FUN = seq_along) <= 5L,]

And using dplyr, the approach with slice is probably fastest, but you could also use filter which will likely be faster than using do(head(., 5)):

d %>% 
  arrange(desc(x)) %>%
  group_by(grp) %>%
  filter(row_number() <= 5L)

dplyr benchmark

set.seed(123)
d <- data.frame(
  x   = runif(1e6),
  grp = sample(1e4, 1e6, TRUE))

library(microbenchmark)

microbenchmark(
  top_n = {d %>%
             group_by(grp) %>%
             top_n(n = 5, wt = x)},
  dohead = {d %>%
              arrange_(~ desc(x)) %>%
              group_by_(~ grp) %>%
              do(head(., n = 5))},
  slice = {d %>%
             arrange_(~ desc(x)) %>%
             group_by_(~ grp) %>%
             slice(1:5)},
  filter = {d %>% 
              arrange(desc(x)) %>%
              group_by(grp) %>%
              filter(row_number() <= 5L)},
  times = 10,
  unit = "relative"
)

Unit: relative
   expr       min        lq    median        uq       max neval
  top_n  1.042735  1.075366  1.082113  1.085072  1.000846    10
 dohead 18.663825 19.342854 19.511495 19.840377 17.433518    10
  slice  1.000000  1.000000  1.000000  1.000000  1.000000    10
 filter  1.048556  1.044113  1.042184  1.180474  1.053378    10

edited Jan 04 '15 at 14:07

answered Jan 04 '15 at 13:53

talat

62,625
18
110
141

@akrun `filter` requires an additional function, while your `slice` version doesn't... – David Arenburg Jan 04 '15 at 14:11
1

You know why you didn't add `data.table` here ;) – David Arenburg Jan 04 '15 at 14:30
5

I know it and I can tell you: because the question was asking specifically for a dplyr solution. – talat Jan 04 '15 at 14:37
1

I was just kidding... It is not like you never [did the same](http://stackoverflow.com/questions/25443658/r-add-new-columns-to-a-data-table-containing-many-variables/25443993#25443993) (just on the opposite direciton). – David Arenburg Jan 04 '15 at 14:43
@DavidArenburg, I wasn't saying it's "illegal" or anything the like to provide a data.table answer.. Of course you can do that and provide any benchmark you like :) Btw, the question you linked to is a nice example where dplyr syntax is way more convenient (I know, subjective!) than data.table. – talat Jan 04 '15 at 14:53

score 2 · Answer 5 · answered Sep 09 '17 at 19:43

top_n(n = 1) will still return multiple rows for each group if the ordering variable is not unique within each group. In order to select precisely one occurence for each group, add an unique variable to each row:

set.seed(123)
d <- data.frame(
  x   = runif(90),
  grp = gl(3, 30))

d %>%
  mutate(rn = row_number()) %>% 
  group_by(grp) %>%
  top_n(n = 1, wt = rn)

score 0 · Answer 6 · answered Oct 30 '19 at 07:48

0

One more data.table solution to highlight its concise syntax:

setDT(d)
d[order(-x), .SD[1:5], grp]

answered Oct 30 '19 at 07:48

sindri_baldur

22,360
2
25
48

Getting the top values by group

6 Answers6

Data:

dplyr benchmark

Linked

Related