9
a.2<-sample(1:10,100,replace=T)
b.2<-sample(1:100,100,replace=T)
a.3<-data.frame(a.2,b.2)

r<-sapply(split(a.3,a.2),function(x) which.max(x$b.2))

a.3[r,]

This returns the index within each list element, not the row index for the entire data frame.

I'm trying to return the largest value of b.2 for each subgroup of a.2. How can I do this efficiently?

skaffman
Misha
  • melt(a.3,id=c("a.2"))->h.2; cast(h.2,a.2~.,max) Does the trick in this example, but the computer runs out of memory when I apply it to my original dataset, so it didn't really help me much. – Misha May 12 '10 at 20:15

6 Answers

10

The ddply and ave approaches are both fairly resource-intensive, I think. ave fails by running out of memory for my current problem (67,608 rows, with four columns defining the unique keys). tapply is a handy choice, but what I generally need to do is select the whole rows with the something-est some-value for each unique key (usually defined by more than one column). The best solution I've found is to sort and then use the negation of duplicated to select only the first row for each unique key. For the simple example here:

a <- sample(1:10,100,replace=T)
b <- sample(1:100,100,replace=T)
f <- data.frame(a, b)

sorted <- f[order(f$a, -f$b),]
highs <- sorted[!duplicated(sorted$a),]

I think the performance gains over ave or ddply, at least, are substantial. It is slightly more complicated for multi-column keys, but order will handle a whole bunch of things to sort on and duplicated works on data frames, so it's possible to continue using this approach.
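As a sketch of what the multi-column version might look like (the column names k1, k2, and b are made up for illustration):

```r
# Hypothetical two-column-key version of the same idea
set.seed(1)
f <- data.frame(k1 = sample(1:3, 100, replace = TRUE),
                k2 = sample(1:3, 100, replace = TRUE),
                b  = sample(1:100, 100, replace = TRUE))

# order() takes any number of sort keys; -f$b puts the largest b first
# within each (k1, k2) combination
sorted <- f[order(f$k1, f$k2, -f$b), ]

# duplicated() on a data frame flags repeated key combinations, so
# !duplicated() keeps only the first -- i.e. highest-b -- row per key
highs <- sorted[!duplicated(sorted[, c("k1", "k2")]), ]
```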

Aaron Schumacher
8
library(plyr)
ddply(a.3, "a.2", subset, b.2 == max(b.2))
hadley
  • I tried using the ddply function, but it is painfully slow. I didn't time it, but it lasted a coffee cup and a trip to the bathroom, whilst the ave version took only 0.2 s on my original dataset (210 columns × 16,000 rows). – Misha May 13 '10 at 22:52
  • 1
    That'll be fixed in the next version. But you can't expect to get answers that will work with your data unless you supply a realistic example! – hadley May 14 '10 at 03:04
6
a.2<-sample(1:10,100,replace=T)
b.2<-sample(1:100,100,replace=T)
a.3<-data.frame(a.2,b.2)

The answer by Jonathan Chang gets you what you explicitly asked for, but I'm guessing that you want the actual row from the data frame.

sel <- ave(b.2, a.2, FUN = max) == b.2
a.3[sel,]
John
  • That was much simpler, I must admit... However, the logic behind the == b.2 is beyond me... – Misha May 12 '10 at 23:59
  • ave generates a vector that contains the max of b.2 for every a.2, repeated to the full length of the data frame. Comparing it with == b.2 therefore gives a logical vector with one value per row, TRUE wherever a row holds its group's maximum, and that logical vector selects rows in the data frame. If you want to see how it works, add the result of the ave command to your data frame (e.g. as a.3$b.max) and compare it to the b.2 column. – John May 13 '10 at 02:05
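The mechanics John describes in the comment above can be made visible by attaching the ave result as the b.max column he mentions (a small sketch, reusing the example data):

```r
set.seed(1)
a.2 <- sample(1:10, 100, replace = TRUE)
b.2 <- sample(1:100, 100, replace = TRUE)
a.3 <- data.frame(a.2, b.2)

# ave() returns a vector as long as b.2 in which every row holds its
# group's maximum; comparing it to b.2 is TRUE exactly on the max rows
a.3$b.max <- ave(b.2, a.2, FUN = max)
sel <- a.3$b.max == a.3$b.2
a.3[sel, ]
```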
1
a.2<-sample(1:10,100,replace=T)
b.2<-sample(1:100,100,replace=T)
a.3<-data.frame(a.2,b.2)
m<-split(a.3,a.2)
u <- function(x) {
    # which.max() gives the position within this group; rownames() maps
    # it back to the original row number in a.3
    as.numeric(rownames(x)[which.max(x[, 2])])
}
r <- sapply(m, u)

a.3[r,]

This does the trick, albeit somewhat cumbersomely... but it lets me grab the rows with the groupwise largest values. Any other ideas?

Misha
1
> a.2<-sample(1:10,100,replace=T)
> b.2<-sample(1:100,100,replace=T)
> tapply(b.2, a.2, max)
 1  2  3  4  5  6  7  8  9 10 
99 92 96 97 98 99 94 98 98 96 
Jonathan Chang
0
a.2<-sample(1:10,100,replace=T)
b.2<-sample(1:100,100,replace=T)
a.3<-data.frame(a.2,b.2)

With aggregate, you can get the maximum for each group in one line:

aggregate(a.3, by = list(a.3$a.2), FUN = max)

This produces the following output:

   Group.1 a.2 b.2
1        1   1  96
2        2   2  82
...
8        8   8  85
9        9   9  93
10      10  10  97
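If you need the whole matching rows rather than just the per-group maxima, one possible variation (a sketch, not part of the original answer) is to aggregate b.2 alone via the formula interface and merge the result back:

```r
set.seed(1)
a.2 <- sample(1:10, 100, replace = TRUE)
b.2 <- sample(1:100, 100, replace = TRUE)
a.3 <- data.frame(a.2, b.2)

# Per-group maximum of b.2 only, via the formula interface
maxes <- aggregate(b.2 ~ a.2, data = a.3, FUN = max)

# merge() joins on the shared columns (a.2 and b.2), so only rows whose
# b.2 equals their group's maximum survive
rows <- merge(a.3, maxes)
```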
esel