9
a.2<-sample(1:10,100,replace=T)
b.2<-sample(1:100,100,replace=T)
a.3<-data.frame(a.2,b.2)

r<-sapply(split(a.3,a.2),function(x) which.max(x$b.2))

a.3[r,]

This returns the index within each list element, not the row index for the entire data frame.

I'm trying to return the largest value of b.2 for each subgroup of a.2. How can I do this efficiently?

skaffman
Misha
  • melt(a.3,id=c("a.2"))->h.2; cast(h.2,a.2~.,max) Does the trick in this example, but the computer runs out of memory when I apply it to my original dataset, so it didn't really help me much. – Misha May 12 '10 at 20:15

6 Answers

10

The ddply and ave approaches are both fairly resource-intensive, I think. ave fails by running out of memory for my current problem (67,608 rows, with four columns defining the unique keys). tapply is a handy choice, but what I generally need to do is select the whole rows with the something-est some-value for each unique key (usually defined by more than one column). The best solution I've found is to sort and then use the negation of duplicated to select only the first row for each unique key. For the simple example here:

a <- sample(1:10,100,replace=T)
b <- sample(1:100,100,replace=T)
f <- data.frame(a, b)

sorted <- f[order(f$a, -f$b),]
highs <- sorted[!duplicated(sorted$a),]

I think the performance gains over ave or ddply, at least, are substantial. It is slightly more complicated for multi-column keys, but order will handle a whole bunch of things to sort on and duplicated works on data frames, so it's possible to continue using this approach.
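As a sketch of what the multi-column version might look like (the column names k1, k2, and b are made up for illustration):

```r
# Hypothetical two-column-key version of the same idea
set.seed(1)
f <- data.frame(k1 = sample(1:3, 100, replace = TRUE),
                k2 = sample(1:3, 100, replace = TRUE),
                b  = sample(1:100, 100, replace = TRUE))

# order() takes any number of sort keys; -f$b puts the largest b first
# within each (k1, k2) combination
sorted <- f[order(f$k1, f$k2, -f$b), ]

# duplicated() on a data frame flags repeated key combinations, so
# !duplicated() keeps only the first -- i.e. highest-b -- row per key
highs <- sorted[!duplicated(sorted[, c("k1", "k2")]), ]
```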

Aaron Schumacher
8
library(plyr)
ddply(a.3, "a.2", subset, b.2 == max(b.2))
hadley
  • I tried using the ddply function, but it is painfully slow. I didn't time it, but it lasted a coffee cup and a trip to the bathroom, whilst the ave version took only 0.2 s on my original dataset (210 columns × 16,000 rows). – Misha May 13 '10 at 22:52
  • 1
    That'll be fixed in the next version. But you can't expect to get answers that will work with your data unless you supply a realistic example! – hadley May 14 '10 at 03:04
6
a.2<-sample(1:10,100,replace=T)
b.2<-sample(1:100,100,replace=T)
a.3<-data.frame(a.2,b.2)

The answer by Jonathan Chang gets you what you explicitly asked for, but I'm guessing that you want the actual row from the data frame.

sel <- ave(b.2, a.2, FUN = max) == b.2
a.3[sel,]
John
  • That was much simpler, I must admit... However, the logic behind the == b.2 is beyond me... – Misha May 12 '10 at 23:59
  • ave generates a vector that contains the max of b.2 for every a.2, repeated to the full length of the data frame. Comparing it with == b.2 therefore gives a logical vector with one value per row, TRUE wherever a row holds its group's maximum, and that logical vector selects rows in the data frame. If you want to see how it works, add the result of the ave command to your data frame (e.g. as a.3$b.max) and compare it to the b.2 column. – John May 13 '10 at 02:05
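The mechanics John describes in the comment above can be made visible by attaching the ave result as the b.max column he mentions (a small sketch, reusing the example data):

```r
set.seed(1)
a.2 <- sample(1:10, 100, replace = TRUE)
b.2 <- sample(1:100, 100, replace = TRUE)
a.3 <- data.frame(a.2, b.2)

# ave() returns a vector as long as b.2 in which every row holds its
# group's maximum; comparing it to b.2 is TRUE exactly on the max rows
a.3$b.max <- ave(b.2, a.2, FUN = max)
sel <- a.3$b.max == a.3$b.2
a.3[sel, ]
```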
1
a.2<-sample(1:10,100,replace=T)
b.2<-sample(1:100,100,replace=T)
a.3<-data.frame(a.2,b.2)
m<-split(a.3,a.2)
u <- function(x) {
    # which.max() gives the position within this group; rownames() maps
    # it back to the original row number in a.3
    as.numeric(rownames(x)[which.max(x[, 2])])
}
r <- sapply(m, u)

a.3[r,]

This does the trick, albeit somewhat cumbersomely... but it lets me grab the rows with the groupwise largest values. Any other ideas?

Misha
1
> a.2<-sample(1:10,100,replace=T)
> b.2<-sample(1:100,100,replace=T)
> tapply(b.2, a.2, max)
 1  2  3  4  5  6  7  8  9 10 
99 92 96 97 98 99 94 98 98 96 
Jonathan Chang
0
a.2<-sample(1:10,100,replace=T)
b.2<-sample(1:100,100,replace=T)
a.3<-data.frame(a.2,b.2)

With aggregate, you can get the maximum for each group in one line:

aggregate(a.3, by = list(a.3$a.2), FUN = max)

This produces the following output:

   Group.1 a.2 b.2
1        1   1  96
2        2   2  82
...
8        8   8  85
9        9   9  93
10      10  10  97
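If you need the whole matching rows rather than just the per-group maxima, one possible variation (a sketch, not part of the original answer) is to aggregate b.2 alone via the formula interface and merge the result back:

```r
set.seed(1)
a.2 <- sample(1:10, 100, replace = TRUE)
b.2 <- sample(1:100, 100, replace = TRUE)
a.3 <- data.frame(a.2, b.2)

# Per-group maximum of b.2 only, via the formula interface
maxes <- aggregate(b.2 ~ a.2, data = a.3, FUN = max)

# merge() joins on the shared columns (a.2 and b.2), so only rows whose
# b.2 equals their group's maximum survive
rows <- merge(a.3, maxes)
```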
esel