
I'm using the R gbm package for boosted regression on some biological data of dimensions 10,000 × 932, and I want to know the best parameter settings for gbm, in particular n.trees, shrinkage, interaction.depth and n.minobsinnode. Searching online, I found that the caret package can find such parameter settings, but I'm having difficulty using caret together with gbm. How can I use caret to find the optimal combination of the parameters above? I know this might seem like a very typical question, but I've read the caret manual and still have difficulty integrating caret with gbm, especially since I'm very new to both packages.

FXQuantTrader
DOSMarter

2 Answers


Not sure if you found what you were looking for, but I find some of these sheets less than helpful.

If you are using the caret package, the following lists the required tuning parameters:

getModelInfo()$gbm$parameters
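
On a recent caret install, that call returns a small data frame naming the four tunable parameters; the exact labels may vary by caret version, but it looks roughly like this:

          parameter   class                   label
1           n.trees numeric # Boosting Iterations
2 interaction.depth numeric        Max Tree Depth
3         shrinkage numeric             Shrinkage
4    n.minobsinnode numeric Min. Terminal Node Size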

Here are some rules of thumb for running GBM:

  1. interaction.depth: the default is 1, and on most data sets that seems adequate, but on a few I have found that testing odd values up to the max gives better results. The max value I have seen used for this parameter is floor(sqrt(NCOL(training))).
  2. shrinkage: the smaller the number, the better the predictive value, but the more trees required and the greater the computational cost. Testing values on a small subset of data with something like shrinkage = seq(.0005, .05, .0005) can be helpful in defining the ideal value (see the sketch after this list).
  3. n.minobsinnode: the default is 10, and I generally don't mess with it. I have tried c(5, 10, 15, 20) on small sets of data and didn't really see an adequate return for the computational cost.
  4. n.trees: the smaller the shrinkage, the more trees you should use. Start with n.trees = (0:50)*50 and adjust accordingly.
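
Here is a rough sketch of point 2, tuning shrinkage alone on a small random subset before committing to the full grid. The data frame training and its two-class factor Outcome are assumed from the question, and the fixed values for the other parameters are placeholders, not recommendations:

# Hypothetical sketch: tune shrinkage on a small subset first.
# Assumes a data frame `training` with a two-class factor column `Outcome`.
library(caret)
set.seed(1)
sub <- training[sample(nrow(training), 1000), ]  # small subset for speed
shrinkGrid <- expand.grid(shrinkage = seq(.0005, .05, .0005),
                          n.trees = 500,          # held fixed for this pass
                          interaction.depth = 3,  # held fixed for this pass
                          n.minobsinnode = 10)
shrinkFit <- train(Outcome ~ ., data = sub,
                   method = "gbm",
                   trControl = trainControl(method = "cv", number = 5),
                   tuneGrid = shrinkGrid,
                   verbose = FALSE)
shrinkFit$bestTune$shrinkage  # carry this value into the full grid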

Example setup using the caret package:

getModelInfo()$gbm$parameters
library(parallel)
library(doMC)
registerDoMC(cores = 20) # parallel backend for caret; adjust cores to your machine
# Max shrinkage for gbm
nl = nrow(training)
max(0.01, 0.1*min(1, nl/10000))
# Max Value for interaction.depth
floor(sqrt(NCOL(training)))
gbmGrid <- expand.grid(interaction.depth = c(1, 3, 6, 9, 10),
                       n.trees = (0:50)*50,
                       shrinkage = seq(.0005, .05, .0005),
                       n.minobsinnode = 10) # you can also try c(5, 10, 15, 20)

fitControl <- trainControl(method = "repeatedcv",
                       repeats = 5,
                       preProcOptions = list(thresh = 0.95),
                       ## Estimate class probabilities
                       classProbs = TRUE,
                       ## Evaluate performance using
                       ## the following function
                       summaryFunction = twoClassSummary)

# Method + Date + distribution
set.seed(1)
system.time(GBM0604ada <- train(Outcome ~ ., data = training,
            distribution = "adaboost",
            method = "gbm", bag.fraction = 0.5,
            nTrain = round(nrow(training) *.75),
            trControl = fitControl,
            verbose = TRUE,
            tuneGrid = gbmGrid,
            ## Specify which metric to optimize
            metric = "ROC"))

Things can change depending on your data (such as the distribution), but I have found the key is to play with gbmGrid until you get the outcome you are looking for. The settings as they stand would take a long time to run, so modify them as your machine and time allow. To give you a ballpark of the computation involved, I run on a Mac Pro with 12 cores and 64GB of RAM.
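
Once train() finishes, the standard caret accessors are the quickest way to see which combination won. A minimal follow-up, using the GBM0604ada object from above:

GBM0604ada$bestTune  # best parameter combination found
plot(GBM0604ada)     # performance (ROC) across the tuning grid
# Top grid rows ranked by cross-validated ROC
head(GBM0604ada$results[order(-GBM0604ada$results$ROC), ])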

Shanemeister

This paper has a concrete example on page 10: http://www.jstatsoft.org/v28/i05/paper

Basically, one should first create a grid of candidate values for the hyperparameters (like n.trees, interaction.depth and shrinkage), then call the generic train function as usual.
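
A minimal sketch of that workflow, assuming a data frame training with a factor outcome column Outcome (both names are illustrative, not from the paper):

library(caret)
# 1. Grid of candidate hyperparameter values
gbmGrid <- expand.grid(n.trees = c(100, 500, 1000),
                       interaction.depth = c(1, 3, 5),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
# 2. Generic train() call; caret cross-validates over the grid
fit <- train(Outcome ~ ., data = training,
             method = "gbm",
             trControl = trainControl(method = "cv", number = 10),
             tuneGrid = gbmGrid,
             verbose = FALSE)
fit$bestTune  # best combination found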

Nishanth