Top "n" rows of each group using dplyr -- with different number per group

Question

I'll use the built-in chickwts data as an example.

Here's the data; there are 6 feed types.

> head(chickwts)

  weight      feed
1    179 horsebean
2    160 horsebean
3    136 horsebean
4    227 horsebean
5    217 horsebean
6    168 horsebean

> table(chickwts$feed)

   casein horsebean   linseed  meatmeal   soybean sunflower 
       12        10        12        11        14        12

What I want is the top rows by weight for every feed type. However, I need a different number of rows for each feed type. For example:

top_n_feed <-
  c(
    "casein" = 3,
    "horsebean" = 5,
    "linseed" = 3,
    "meatmeal" = 6,
    "soybean" = 3,
    "sunflower" = 2
  )

How can I do this using dplyr?

To get the top n rows of each feed type by weight, I can use the code below, but I'm not sure how to extend it to a different number for each feed type.

chickwts %>%
  group_by(feed) %>% 
  slice_max(order_by = weight, n = 5)

MrFlick · Accepted Answer · 2020-12-02T05:36:01.527

6

This isn't really something that dplyr makes easy. I'd recommend merging in the data and then filtering.


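library(dplyr)

# Look-up table of feed -> number of rows to keep
tibble(feed = names(top_n_feed), topn = top_n_feed) %>% 
  inner_join(chickwts, by = "feed") %>%   # attach topn to every row of its feed
  group_by(feed) %>% 
  arrange(desc(weight), .by_group = TRUE) %>%   # heaviest first within each feed
  filter(row_number() <= topn) %>%   # keep the first topn rows per feed
  select(-topn)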
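
One difference from the slice_max() call in the question: slice_max() keeps ties by default (with_ties = TRUE), while filter(row_number() <= topn) returns exactly topn rows per group. Here is a minimal sketch on a made-up data frame. The name toy and its columns g and w are hypothetical and not part of chickwts.

```r
library(dplyr)

# Hypothetical toy data: two rows tie for second place.
toy <- tibble(g = c("a", "a", "a"), w = c(3, 2, 2))

# slice_max() keeps ties by default, so n = 2 can return 3 rows.
with_ties <- toy %>%
  group_by(g) %>%
  slice_max(order_by = w, n = 2)

# row_number() breaks ties by position, so exactly 2 rows come back.
exact <- toy %>%
  group_by(g) %>%
  arrange(desc(w), .by_group = TRUE) %>%
  filter(row_number() <= 2)

nrow(with_ties)  # 3
nrow(exact)      # 2
```

If you want slice_max() to return exactly n rows as well, pass with_ties = FALSE.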

edited Dec 02 '20 at 05:36

answered Dec 02 '20 at 05:34


2

Great minds and all that... – thelatemail Dec 02 '20 at 05:35
Indeed, missed the part about ordering by weights. This will cover it for sure. – thelatemail Dec 02 '20 at 05:38

score 2 · Answer 2 · answered Dec 02 '20 at 09:49

Any time you have a named list or vector, think purrr::imap. Avoid joins if not required, particularly when working at scale.

library(dplyr)
library(purrr)

top_n_feed <- c(
    "casein" = 3,
    "horsebean" = 5,
    "linseed" = 3,
    "meatmeal" = 6,
    "soybean" = 3,
    "sunflower" = 2
  )

imap_dfr(top_n_feed, ~ filter(chickwts, feed %in% .y) %>% 
           slice_max(order_by = weight, n = .x))

   weight      feed
1     404    casein
2     390    casein
3     379    casein
4     227 horsebean
5     217 horsebean
6     179 horsebean
7     168 horsebean
8     160 horsebean
9     309   linseed
10    271   linseed
11    260   linseed
12    380  meatmeal
13    344  meatmeal
14    325  meatmeal
15    315  meatmeal
16    303  meatmeal
17    263  meatmeal
18    329   soybean
19    327   soybean
20    316   soybean
21    423 sunflower
22    392 sunflower
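
If you are on purrr 1.0.0 or later, imap_dfr() is marked as superseded in favour of imap() followed by list_rbind(). Here is a sketch of the same approach, assuming purrr >= 1.0.0 is installed. The name result is only for illustration.

```r
library(dplyr)
library(purrr)

top_n_feed <- c(casein = 3, horsebean = 5, linseed = 3,
                meatmeal = 6, soybean = 3, sunflower = 2)

# imap() passes each value as .x (the n) and its name as .y (the feed),
# returning a list with one data frame per feed.
result <- imap(top_n_feed, ~ filter(chickwts, feed == .y) %>%
                 slice_max(order_by = weight, n = .x)) %>%
  list_rbind()  # stack the per-feed data frames into one

nrow(result)  # 22, matching the output above
```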

mt1022 · Answer 3 · 2020-12-02T11:27:36.237

1

Another way using split and map2:

library(dplyr)
library(purrr)

chickwts %>%
  filter(feed %in% names(top_n_feed)) %>%
  split(.$feed) %>% 
  map2_dfr(top_n_feed[names(.)], ~slice_max(.x, order_by = weight, n = .y))

edited Dec 02 '20 at 11:27

answered Dec 02 '20 at 05:48


Cool approach -- could you explain the .$feed syntax in `split`? – 876868587 Dec 02 '20 at 09:03
1

This is the syntax for subsetting a column of a data.frame. `.` represents the data.frame passed in by the %>% pipe (the result of the previous step). – mt1022 Dec 02 '20 at 11:25
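
To illustrate the comment above, here is a small sketch showing that split(.$feed) inside a %>% pipe is the same as calling split() with the column written out. The names pieces and pieces2 are only for illustration.

```r
library(dplyr)

# `.` is the data frame piped in; `.$feed` pulls out its feed column.
pieces <- chickwts %>% split(.$feed)

# The same call without the pipe placeholder.
pieces2 <- split(chickwts, chickwts$feed)

identical(pieces, pieces2)  # TRUE
names(pieces)               # one list element per feed type
```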

score 0 · Answer 4 · answered Dec 02 '20 at 06:17

Join top_n_feed onto the chickwts data frame, then select the top n rows for each group.

library(dplyr)

tibble::enframe(top_n_feed, name = 'feed') %>% 
        left_join(chickwts, by = 'feed') %>%
        group_by(feed) %>%
        top_n(first(value), weight)

#   feed      value weight
#   <chr>     <dbl>  <dbl>
# 1 casein        3    390
# 2 casein        3    379
# 3 casein        3    404
# 4 horsebean     5    179
# 5 horsebean     5    160
# 6 horsebean     5    227
# 7 horsebean     5    217
# 8 horsebean     5    168
# 9 linseed       3    309
#10 linseed       3    260
# … with 12 more rows

For some reason I was not able to make slice_sample work for this example.
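
Note that top_n() has been superseded since dplyr 1.0.0. Here is a hedged sketch of the same idea using filter() with min_rank(), which, like top_n(), keeps ties at the cutoff. The name result is only for illustration, and feed is converted to character up front so both sides of the join have the same type.

```r
library(dplyr)

top_n_feed <- c(casein = 3, horsebean = 5, linseed = 3,
                meatmeal = 6, soybean = 3, sunflower = 2)

result <- tibble::enframe(top_n_feed, name = "feed") %>%
  left_join(mutate(chickwts, feed = as.character(feed)), by = "feed") %>%
  group_by(feed) %>%
  # min_rank() gives tied weights the same rank, so ties at the cutoff are kept
  filter(min_rank(desc(weight)) <= first(value)) %>%
  ungroup() %>%
  select(-value)

nrow(result)  # 22 for these cutoffs
```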
