Filter dataframe by maximum values in each group

Question

I have a 180,000 x 400 dataframe where the rows correspond to users but every user has exactly two rows.

id   date  ...
1    2012    ...
3    2010    ...
2    2013    ...
2    2014    ...
1    2011    ...
3    2014    ...

I want to subset the data so that only the most recent row for each user is retained (i.e. the row with the highest value for date for each id).

I first tried using which() looping ids with an ifelse() statement in sapply() which was painfully slow (O(n^2) I believe).

Then I tried sorting the df by id and then looping through in increments of two and comparing adjacent dates but this was also slow (I guess because loops in R are hopeless). The comparison of the two dates is the bottleneck as the sort was pretty much instant.

Is there a way to vectorize the comparison?

SOLUTION from Remove duplicates keeping entry with largest absolute value

aa <- df[order(df$id, -df$date), ] #sort by id and reverse of date
aa[!duplicated(aa$id),]

Runs very quickly!!

David Arenburg · Accepted Answer · 2014-12-17T21:44:56.433

29

Here's a simple and fast approach using data.table package

library(data.table)
setDT(df)[, .SD[which.max(date)], id]
#    id date
# 1:  1 2012
# 2:  3 2014
# 3:  2 2014

Or (could be a bit faster because of keyed by

setkey(setDT(df), id)[, .SD[which.max(date)], id]

Or using OPs idea via the data.table package

unique(setorder(setDT(df), id, -date), by = "id")

Or

setorder(setDT(df), id, -date)[!duplicated(id)]

Or base R solution

with(df, tapply(date, id, function(x) x[which.max(x)]))
##    1    2    3 
## 2012 2014 2014

Another way

library(dplyr)
df %>%
  group_by(id) %>%
  filter(date == max(date)) # Will keep all existing columns but allow multiple rows in case of ties
# Source: local data table [3 x 2]
# Groups: id
# 
#   id date
# 1  1 2012
# 2  2 2014
# 3  3 2014

Or

df %>%
  group_by(id) %>%
  slice(which.max(date)) # Will keep all columns but won't return multiple rows in case of ties

Or

df %>%
  group_by(id) %>%
  summarise(max(date)) # Will remove all other columns and wont return multiple rows in case of ties

edited Dec 17 '14 at 21:44

answered Dec 17 '14 at 20:35

David Arenburg

87,271
15
123
181

1

The dplyr options in your answer produce the same output for the small sample data, but for other data each may behave different: `filter` will keep all existing columns but allow multiple rows in case of ties; `slice` will keep all columns but won't return multiple rows in case of ties; and `summarise` will remove all other columns and wont return multiple rows in case of ties. Just as an info for people who see this later on .. – talat Dec 17 '14 at 21:40
@docendodiscimus These are great observations. You are free to add these explanations to my solution as my `dplyr` guru – David Arenburg Dec 17 '14 at 21:41
Well, I added them already in the comments, so not necessary to put it in the answer. :) – talat Dec 17 '14 at 21:42
@docendodiscimus, Ok, added. – David Arenburg Dec 17 '14 at 21:45

talat · Answer 2 · 2014-12-17T21:30:18.663

6

aggregate should also work:

aggregate(date ~ id, df, max)

edited Dec 17 '14 at 21:30

answered Dec 17 '14 at 20:43

talat

62,625
18
110
141

Would be very very slow though for a big data set... – David Arenburg Dec 17 '14 at 20:43
3

actually it takes 1.5 seconds on my crummy laptop @DavidArenburg. Your base solution takes < 1 sec, so cheers to you – rawr Dec 17 '14 at 20:45
@rawr for what data size? – David Arenburg Dec 17 '14 at 20:46
1

`dat data.frame 578922376 552.1 Mb 180000 402` – rawr Dec 17 '14 at 20:46

Filter dataframe by maximum values in each group

2 Answers2

Linked

Related