How could I reduce a dataframe in R with aggregate (or similar) to only retain the 100 highest values for each group?

Question

I have a dataframe as such:

probe.id       gene.name   variance       database
A_23_P100002   FAM174B     0.93285966     Database1
A_23_P100013   AP3S2       0.48936044     Database1
...
A_23_P100020   RBPMS2      0.77441359     Database2
A_23_P100072   AVEN        0.36194383     Database2
...

I am interested in reducing this dataframe so that only the 100 genes with the highest variances per database remain. It seems that aggregate could do the job, but I don't have an idea of how to write the function that I would pass to aggregate. I would greatly appreciate any help.

Thank you!

score 2 · Answer 1 · answered Feb 14 '15 at 03:18

There are a lot of ways to skin this cat so you'll get a variety of answers. In base R this one should work pretty well.

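# Rank within each database; negating variance gives the largest value rank 1.
# (order() returns a permutation, not ranks, and extra arguments to ave() are
# treated as grouping variables, so rank() is used here instead.)
r <- ave(-dat$variance, dat$database, FUN = function(x) rank(x, ties.method = "first"))
dat100 <- dat[r <= 100, ]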
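If you prefer something closer in spirit to the aggregate idea from the question, a split-apply-combine version in base R can be sketched as follows. The toy data frame and the name n_keep are made up for illustration (n_keep would be 100 for the real data), so treat this as a minimal sketch rather than a drop-in solution.

```r
# Toy data standing in for the real table (values are invented for illustration).
dat <- data.frame(
  probe.id = paste0("p", 1:6),
  variance = c(0.9, 0.4, 0.7, 0.3, 0.8, 0.5),
  database = rep(c("Database1", "Database2"), each = 3),
  stringsAsFactors = FALSE
)

n_keep <- 2  # hypothetical cutoff; use 100 for the real data

# Split by database, sort each piece by decreasing variance,
# and keep at most n_keep rows from the top of each piece.
top_by_group <- lapply(split(dat, dat$database), function(d) {
  head(d[order(d$variance, decreasing = TRUE), ], n_keep)
})

# Stack the per-group pieces back into a single data frame.
dat_top <- do.call(rbind, top_by_group)
rownames(dat_top) <- NULL
dat_top
```

Because head() is used, a database with fewer than n_keep rows simply contributes all of its rows.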

John

score 1 · Answer 2 · answered Feb 14 '15 at 02:51

Try this with dplyr:

library(dplyr)
myData %>% group_by(database) %>% arrange(desc(variance)) %>% slice(1:100)

Jthorpe

`top_n` may be another option – akrun Feb 14 '15 at 04:23
For more details see http://stackoverflow.com/questions/27766054/getting-the-top-values-by-group-using-dplyr – talat Feb 14 '15 at 08:32
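To make the top_n suggestion concrete: in dplyr 1.0.0 and later, top_n has been superseded by slice_max, so a version along those lines might look like the sketch below. The toy data and the cutoff of 2 are assumptions for illustration (use n = 100 on the real data), and with_ties = FALSE is a choice made here so that each group returns at most n rows.

```r
library(dplyr)

# Toy data standing in for the real table (values are invented for illustration).
dat <- data.frame(
  probe.id = paste0("p", 1:6),
  variance = c(0.9, 0.4, 0.7, 0.3, 0.8, 0.5),
  database = rep(c("Database1", "Database2"), each = 3),
  stringsAsFactors = FALSE
)

# Keep the rows with the largest variance within each database.
dat_top <- dat %>%
  group_by(database) %>%
  slice_max(variance, n = 2, with_ties = FALSE) %>%
  ungroup()

dat_top
```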

score 1 · Answer 3 · answered Feb 14 '15 at 13:42

Try data.table:

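library(data.table)
# assume DF is your data frame
# setDT(DF) converts DF to a data.table by reference; setDF(DF) converts it back to a data frame
# head(.SD, 100) avoids the NA rows that .SD[1:100] yields for groups with fewer than 100 rows
setDT(DF)[order(-variance), head(.SD, 100), by = database]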

KFB
