I have reduced my problem to the following task. Given a data frame with IDs and dates:

set.seed(123)
myids <- sample(c('a001', 'a002', 'a003'), 12, replace = TRUE)
mydates <- as.Date(sample(c("2007-06-22", "2004-02-13", "2007-05-22", "2001-10-10", "2008-05-05", "2004-02-15"), 12, replace = TRUE))
mydf <- data.frame(myids, mydates)

I need to select only the row with the most recent date, for each subject. The result should be:

a001    5/5/08
a002    5/5/08
a003    2/15/04

Does anyone know how to do this?

marcel
  • Try `library(dplyr); mydf %>% group_by(myids) %>% summarise(mydates=format(max(mydates), '%m/%d/%y'))` or if you have many columns `mydf %>% group_by(myids) %>% slice(which.max(mydates))` – akrun Sep 09 '15 at 15:18
  • or `aggregate(mydates~., mydf, max)` – Pierre L Sep 09 '15 at 15:22
  • This is *NOT* a duplicate. OP does not ask specifically for a `dplyr` solution, and there are many other (IMO better) ways to do it. – jlhoward Sep 09 '15 at 15:42

1 Answer

Here's a data.table solution.

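library(data.table)
# setDT() converts mydf to a data.table by reference. Within each myids group,
# .SD is that group's subset; which.max() picks its row with the latest date.
setDT(mydf)[, .SD[which.max(mydates)], keyby = myids]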
#    myids    mydates
# 1:  a001 2008-05-05
# 2:  a002 2008-05-05
# 3:  a003 2004-02-15
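
For comparison, the same "latest row per ID" selection can probably be done in base R without extra packages. The sketch below rebuilds the question's data as a plain data frame, sorts it so each ID's most recent date comes first, and keeps the first row per ID. The names `sorted_df` and `latest` are illustrative only. On tied dates it keeps a single row per ID, as `which.max` does.

```r
# Rebuild the example data from the question as a plain data.frame.
set.seed(123)
myids <- sample(c("a001", "a002", "a003"), 12, replace = TRUE)
mydates <- as.Date(sample(c("2007-06-22", "2004-02-13", "2007-05-22",
                            "2001-10-10", "2008-05-05", "2004-02-15"),
                          12, replace = TRUE))
mydf <- data.frame(myids, mydates)

# Sort by ID, then by date descending, so each ID's latest row comes first.
sorted_df <- mydf[order(mydf$myids, -as.numeric(mydf$mydates)), ]

# duplicated() flags every occurrence of an ID after the first one,
# so negating it keeps exactly one row (the latest) per ID.
latest <- sorted_df[!duplicated(sorted_df$myids), ]
rownames(latest) <- NULL
latest
```

Because the result keeps whole rows, any additional columns in `mydf` would be carried along, which matches the "many columns" case mentioned in the comments.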
jlhoward