r - select last n occurrences for each group

Question

Situation

I have a data frame df:

df <- structure(list(person = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
    3L, 3L), .Label = c("pA", "pB", "pC"), class = "factor"), date = structure(c(16071, 
    16102, 16130, 16161, 16071, 16102, 16130, 16071, 16102), class = "Date")), .Names = c("person", 
    "date"), row.names = c(NA, -9L), class = "data.frame")

> df
  person       date
1     pA 2014-01-01
2     pA 2014-02-01
3     pA 2014-03-01
4     pA 2014-04-01
5     pB 2014-01-01
6     pB 2014-02-01
7     pB 2014-03-01
8     pC 2014-01-01
9     pC 2014-02-01

Question

How can I select the last 2 (or 'n') entries, ordered by date, for each person, so that I have a resulting data frame df1:

> df1
  person       date
1     pA 2014-03-01
2     pA 2014-04-01
3     pB 2014-02-01
4     pB 2014-03-01
5     pC 2014-01-01
6     pC 2014-02-01

?

I've tried combinations of

library(dplyr)
df1 <- df %>%
    group_by(person) %>% 
    select(tail(df, 2))

with no joy.

score 7 · Accepted Answer · answered Jan 28 '15 at 11:14

7

You can try slice

library(dplyr)
df %>% 
    group_by(person) %>% 
    arrange(date, person) %>%
    slice((n()-1):n())
#  person       date
#1     pA 2014-03-01
#2     pA 2014-04-01
#3     pB 2014-02-01
#4     pB 2014-03-01
#5     pC 2014-01-01
#6     pC 2014-02-01

Or in place of the last step

   do(tail(., 2))

answered Jan 28 '15 at 11:14

akrun

674,427
24
381
486

@tospig `do` is useful, but I would think `slice` to be faster. – akrun Jan 28 '15 at 11:26
1

I am using `dplyr_0.4.0`. which version of dplyr you have ? – akrun Jan 28 '15 at 11:35
Ah yes, I was on `0.2` - have updated now to `0.4.1` and it works. – tospig Jan 28 '15 at 11:38
And `slice` is faster on my actual data set (`1600 rows x 16 variables`) – tospig Jan 28 '15 at 11:40

score 6 · Answer 2 · answered Jan 28 '15 at 11:25

6

Using data.table:

setDT(df)[order(person), tail(.SD, 2L), by=person]
#    person       date
# 1:     pA 2014-03-01
# 2:     pA 2014-04-01
# 3:     pB 2014-02-01
# 4:     pB 2014-03-01
# 5:     pC 2014-01-01
# 6:     pC 2014-02-01

We order by person and then group by person and select the last two rows from the subset of data .SD for each group.

answered Jan 28 '15 at 11:25

Arun

108,644
21
263
366

1

For some reason I've stayed away from learning `data.table` syntax. Maybe I should start to learn. – tospig Jan 28 '15 at 11:28
1

You can start with [these vignettes](https://github.com/Rdatatable/data.table/issues/944). – Arun Jan 28 '15 at 11:41

score 3 · Answer 3 · edited May 23 '17 at 12:06

Since you order the data by person and date (i.e. you want the 2 latest dates per person), you can also use top_n() in dplyr:

df %>% group_by(person) %>% top_n(2, date)
#Source: local data frame [6 x 2]
#Groups: person
#
#  person       date
#1     pA 2014-03-01
#2     pA 2014-04-01
#3     pB 2014-02-01
#4     pB 2014-03-01
#5     pC 2014-01-01
#6     pC 2014-02-01

Or, if you already order it, you could arrange it the other way before using slice:

df %>% arrange(person, desc(date)) %>% group_by(person) %>% slice(1:2)
#Source: local data frame [6 x 2]
#Groups: person
#
#  person       date
#1     pA 2014-04-01
#2     pA 2014-03-01
#3     pB 2014-03-01
#4     pB 2014-02-01
#5     pC 2014-02-01
#6     pC 2014-01-01

See here for a benchmark of a similar question.

r - select last n occurrences for each group

3 Answers3

Related