I have a dataframe as such:

probe.id       gene.name   variance       database
A_23_P100002   FAM174B     0.93285966     Database1
A_23_P100013   AP3S2       0.48936044     Database1
...
A_23_P100020   RBPMS2      0.77441359     Database2
A_23_P100072   AVEN        0.36194383     Database2
...

I would like to reduce this dataframe so that only the 100 genes with the highest variance within each database remain. It seems that aggregate could do the job, but I don't know how to write the function that I would pass to aggregate. I would greatly appreciate any help.

Thank you!

Johnathan

3 Answers

There are a lot of ways to skin this cat so you'll get a variety of answers. In base R this one should work pretty well.

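# Rank within each database (1 = highest variance). Extra arguments to ave() are
# treated as grouping variables, so the ranking goes in an anonymous function.
r <- ave(dat$variance, dat$database, FUN = function(v) rank(-v, ties.method = "first"))
dat100 <- dat[r <= 100, ]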
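
To check a per-group selection like this before running it on the real data, a small reproducible example can help. The sketch below uses made-up toy data (the gene names, the `runif` variances, and the cutoff `n = 2` are assumptions for illustration, not the asker's data) and ranks within each database with `ave` and `rank`:

```r
# Toy data, made up for illustration: 2 databases with 5 genes each
set.seed(1)
dat <- data.frame(
  gene.name = paste0("gene", 1:10),
  variance  = runif(10),
  database  = rep(c("Database1", "Database2"), each = 5)
)

n <- 2  # keep the top n per database (this would be 100 in the question)

# Rank within each database; rank 1 is the highest variance
r <- ave(dat$variance, dat$database,
         FUN = function(v) rank(-v, ties.method = "first"))
top <- dat[r <= n, ]

# Count the rows kept per database
table(top$database)
```

With `ties.method = "first"`, ties are broken by row order, so each database contributes at most `n` rows.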
John

Try this:

library(dplyr)
myData %>% group_by(database) %>% arrange(desc(variance)) %>% slice(1:100)
Jthorpe

Try data.table:

# assume DF is your data frame
setDT(DF)[order(-variance), head(.SD, 100), by = database]  # head() avoids NA rows when a group has fewer than 100
# setDT(DF) is to convert DF to data table which could be reverted back to a data frame using setDF(DF)
KFB