
I am having a strange problem. I have successfully run this code on my laptop, but when I try to run it on another machine I first get the warning `Distribution not specified, assuming bernoulli ...`, which I expect, but then I get this error: `Error in object$var.levels[[i]] : subscript out of bounds`

library(gbm)
gbm.tmp <- gbm(subxy$presence ~ btyme + stsmi + styma + bathy,
                data=subxy,
                var.monotone=rep(0, length= 4), n.trees=2000, interaction.depth=3,
                n.minobsinnode=10, shrinkage=0.01, bag.fraction=0.5, train.fraction=1,
                verbose=F, cv.folds=10)

Can anybody help? The data structures are exactly the same, same code, same R. I am not even using a subscript here.

EDIT: traceback()

6: predict.gbm(model, newdata = my.data, n.trees = best.iter.cv)
5: predict(model, newdata = my.data, n.trees = best.iter.cv)
4: predict(model, newdata = my.data, n.trees = best.iter.cv)
3: gbmCrossValPredictions(cv.models, cv.folds, cv.group, best.iter.cv, 
       distribution, data[i.train, ], y)
2: gbmCrossVal(cv.folds, nTrain, n.cores, class.stratify.cv, data, 
       x, y, offset, distribution, w, var.monotone, n.trees, interaction.depth, 
       n.minobsinnode, shrinkage, bag.fraction, var.names, response.name, 
       group)
1: gbm(subxy$presence ~ btyme + stsmi + styma + bathy, data = subxy,var.monotone = rep(0, length = 4), n.trees = 2000, interaction.depth = 3, n.minobsinnode = 10, shrinkage = 0.01, bag.fraction = 0.5, train.fraction = 1, verbose = F, cv.folds = 10)

Could it have something to do with the fact that I moved the saved R workspace to another machine?

EDIT 2: OK, so I have updated the gbm package on the machine where the code was working, and now I get the same error there too. So at this point I am thinking that the older gbm package perhaps did not have this check in place, or that the newer version has some problem. I don't understand gbm well enough to say.
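
For what it's worth, here is a quick way to compare the gbm and R versions on the two machines (just a diagnostic sketch; run it on both and compare the output):

packageVersion("gbm")   # version of the installed gbm package
R.version.string        # version of R itself
sessionInfo()           # both of the above, plus all other attached packages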

Herman Toothrot
  • (1) It may not be the source of your problem, but your formula shouldn't use `$`; just do `presence ~ ...`. (2) One thing to check is that both machines have R set up the same way; for instance check `stringsAsFactors`. – joran Sep 05 '13 at 15:53
  • Where is this `subxy` data frame? If it's your own data, then please can you provide some sample data that reproduces the problem. A `traceback()` of where the error occurs would also be useful. – Richie Cotton Sep 05 '13 at 15:54
  • The default distribution for `gbm` is "bernoulli", so if you have an outcome with greater than two levels, wouldn't you expect to throw an error? – IRTFM Sep 05 '13 at 17:57
  • @joran I checked both, and they have no effect on the issue. – Herman Toothrot Sep 05 '13 at 19:48

2 Answers


Just a hunch, since I can't see your data, but I believe that error occurs when you have factor levels in the test set that don't exist in the training set.

This can easily happen when you have a factor variable with a large number of levels, or when one level has a small number of instances.

Since you're using CV folds, it's possible that the holdout set in one of the loops has levels that are foreign to the training data.

I'd suggest either:

A) use model.matrix() to one-hot encode your factor variables (a quick sketch of this is below)

B) keep setting different seeds until you get a CV split where this error doesn't occur.
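
As a rough sketch of what option A) could look like: model.matrix() expands a factor into 0/1 indicator columns, so every fold sees the same set of numeric columns regardless of which levels land where. This uses a made-up data frame df with one factor column x2, not your actual data:

#made-up data with one factor predictor
df = data.frame(y  = rbinom(20, 1, 0.5),
                x1 = rnorm(20),
                x2 = factor(sample(letters[1:4], 20, replace = TRUE)))

#expand x2 into 0/1 indicator columns (the default treatment contrasts
#drop the reference level "a"); [, -1] drops the intercept column
X = model.matrix(~ x1 + x2, data = df)[, -1]
df_encoded = data.frame(y = df$y, X)
head(df_encoded)

#df_encoded is all numeric, so a CV holdout can no longer contain a
#factor level that the training folds have never seen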

EDIT: Yep, with that traceback, your 3rd CV holdout has a factor level in its test set that doesn't exist in its training data, so the predict function sees a foreign value and doesn't know what to do.

EDIT 2: Here's a quick example to show what I mean by "factor levels in the test set that are not in the training set".

#Example data with low occurrences of a factor level:

set.seed(222)
data = data.frame(cbind(y  = sample(0:1, 10, replace = TRUE),
                        x1 = rnorm(10),
                        x2 = as.factor(sample(0:10, 10, replace = TRUE))))
data$x2 = as.factor(data$x2)
data

      y         x1 x2
 [1,] 1 -0.2468959  2
 [2,] 0 -1.2155609  6
 [3,] 0  1.5614051  1
 [4,] 0  0.4273102  5
 [5,] 1 -1.2010235  5
 [6,] 1  1.0524585  8
 [7,] 0 -1.3050636  6
 [8,] 0 -0.6926076  4
 [9,] 1  0.6026489  3
[10,] 0 -0.1977531  7

#CV fold: split the data so a model is trained on 80% of it and then tested against the remaining 20%.  This is a simpler version of what happens inside each of gbm's CV folds.

CV_train_rows = sample(1:10, 8, replace = FALSE) ; CV_test_rows = setdiff(1:10, CV_train_rows)
CV_train = data[CV_train_rows,] ; CV_test = data[CV_test_rows,]

#build a model on the training... 

CV_model = lm(y ~ ., data = CV_train)
summary(CV_model)
#note: when the model was built, it was only fed factor levels (3, 4, 5, 6, 7, 8) for variable x2

CV_test$x2
#in the test set, there are only levels 1 and 2.

#attempt to predict on the test set
predict(CV_model, CV_test)

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
factor x2 has new levels 1, 2
dylanjf
  • Thanks for the answer; it's a bit over my head, and I am not sure I understand all of it. Why does the same function work on the other computer? I never get this error there. It's a bit strange. I don't want to modify the CV parameter. – Herman Toothrot Sep 05 '13 at 20:04
  • please see edit2 in the answer if that makes sense. Thank you – Herman Toothrot Sep 05 '13 at 20:24
  • So I can confirm that by deactivating the CV fold, gbm works. Maybe it's a bug in the package? It was working with the previous version. Any cv.folds value higher than 1 gives this error, so it happens any time CV is used. – Herman Toothrot Sep 06 '13 at 10:25
  • Hi dylanjf, would you be able to share an example of using model.matrix to encode a factor variable, please? – Eugene Yan Apr 04 '15 at 03:02

I encountered the same problem and ended up solving it by changing one of the hidden functions in the gbm package, predict.gbm. This function predicts on the testing set using the gbm object trained on the training set from the cross-validation split.

The problem is that the testing set passed to it should only contain the columns corresponding to the features, so you need to modify the function accordingly.
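
As a rough sketch of the same idea at the user level (not the internal patch itself): keep only the model's predictor columns in the data you pass to predict(). This assumes a fitted gbm object gbm.tmp, as in the question, and a hypothetical data frame newdat that carries extra columns beyond the predictors:

#a fitted gbm object stores its predictor names in $var.names,
#so subset newdat to exactly those columns before predicting
pred = predict(gbm.tmp,
               newdata = newdat[, gbm.tmp$var.names],
               n.trees = gbm.tmp$n.trees)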

Xiyao Long
    "The problem is the passed testing set should only have the columns corresponding to the features, so you should modify the function." Thanks! This tripped me for a long time this morning. – Anirban Mukherjee Jul 25 '17 at 08:11