I have a 180,000 x 400 dataframe where the rows correspond to users but every user has exactly two rows.
id date ...
1 2012 ...
3 2010 ...
2 2013 ...
2 2014 ...
1 2011 ...
3 2014 ...
I want to subset the data so that only the most recent row for each user is retained (i.e. the row with the highest value for date for each id).
I first tried using which()
looping ids
with an ifelse()
statement in sapply()
which was painfully slow (O(n^2)
I believe).
Then I tried sorting the df
by id
and then looping through in increments of two and comparing adjacent dates but this was also slow (I guess because loops in R are hopeless). The comparison of the two dates is the bottleneck as the sort was pretty much instant.
Is there a way to vectorize the comparison?
SOLUTION from Remove duplicates keeping entry with largest absolute value
aa <- df[order(df$id, -df$date), ] #sort by id and reverse of date
aa[!duplicated(aa$id),]
Runs very quickly!!