
I am using rpart to run a regression tree analysis via the caret package, with the oneSE option for the selection function. When I do, I often end up with a model with zero splits, which suggests that a tree with no splits is better than any tree with splits. Should this be happening?

Here's an example:

library(caret)   # provides train() and trainControl()
library(rpart)   # provides printcp() and prune()

# set training controls
tc <- trainControl("repeatedcv", repeats=100, selectionFunction="oneSE", number=10)

# run the model
mod <- train(yvar ~ ., data=dat, method="rpart", trControl=tc)

# it runs.....
# look at the cptable of the final model
printcp(mod$finalModel)

Here's the model output:

> mod
No pre-processing
Resampling: Cross-Validation (10 fold, repeated 100 times) 

Summary of sample sizes: 81, 79, 80, 80, 80, 80, ... 

Resampling results across tuning parameters:

  cp      RMSE   Rsquared  RMSE SD  Rsquared SD
  0.0245  0.128  0.207     0.0559   0.23       
  0.0615  0.127  0.226     0.0553   0.241      
  0.224   0.123  0.193     0.0534   0.195      

RMSE was used to select the optimal model using  the one SE rule.
The final value used for the model was cp = 0.224. 

Here's the output of printcp:

Variables actually used in tree construction:
character(0)

Root node error: 1.4931/89 = 0.016777

n= 89 

       CP nsplit rel error
1 0.22357      0         1

However, if I just run the model directly in rpart, I can see the larger, unpruned tree that was pruned back to the supposedly more parsimonious zero-split model above:

unpruned <- rpart(yvar ~ ., data=dat)
printcp(unpruned)

Regression tree:
rpart(formula = yvar ~ ., data = dat)

Variables actually used in tree construction:
[1] c.n.ratio Fe.ppm    K.ppm     Mg.ppm    NO3.ppm  

Root node error: 1.4931/89 = 0.016777

n= 89 

        CP nsplit rel error xerror    xstd
1 0.223571      0   1.00000 1.0192 0.37045
2 0.061508      2   0.55286 1.1144 0.33607
3 0.024537      3   0.49135 1.1886 0.38081
4 0.010539      4   0.46681 1.1941 0.38055
5 0.010000      6   0.44574 1.2193 0.38000
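
For reference, pruning this tree at the cp value that caret selected reproduces the zero-split final model (using rpart's prune(); 0.224 is the value reported by caret above):

# pruning at cp = 0.224 snips off the first split (its CP of 0.2236 < 0.224),
# leaving only the root, i.e. the zero-split tree
pruned <- prune(unpruned, cp = 0.224)
printcp(pruned)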

Caret (I think) is trying to find the smallest tree whose RMSE is within one standard error of the model with the lowest RMSE. This is similar to the 1-SE approach advocated in Venables and Ripley. In this case it seems to get stuck picking the model with no splits, even though that model has no explanatory power.
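
If I've understood the rule, it can be reproduced roughly as follows (a sketch, not caret's exact internals; I'm assuming mod$results has columns cp, RMSE and RMSESD, that the SE is the resampled SD divided by the square root of the number of resamples, and that "simplest" for rpart means the largest cp):

# apply the one-SE rule by hand to the resampling summary
res <- mod$results
n_resamples <- 10 * 100                           # 10 folds x 100 repeats
best <- which.min(res$RMSE)                       # row with the lowest mean RMSE
cutoff <- res$RMSE[best] + res$RMSESD[best] / sqrt(n_resamples)
candidates <- res[res$RMSE <= cutoff, ]           # models within one SE of the best
max(candidates$cp)                                # simplest candidate = largest cp

In this run the zero-split model (cp = 0.224) also has the lowest mean RMSE (0.123), so even the default "best" rule would pick it.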

Is this right? Is this OK? It seems there should be a rule to prevent selection of a model with no splits.

Guillemot

1 Answer

Try eliminating selectionFunction="oneSE".

That should identify the tuning value (here, the cp) with the smallest possible cross-validated error. In doing so, there is some potential for "optimization bias" from picking the minimum observed RMSE, but I have found that to be small in practice.
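
For example, dropping that argument falls back to the default selection function, "best":

# default selection rule ("best"): pick the cp with the lowest
# cross-validated RMSE instead of applying the one-SE rule
tc <- trainControl(method = "repeatedcv", number = 10, repeats = 100)
mod <- train(yvar ~ ., data = dat, method = "rpart", trControl = tc)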

Max

  • Thanks for your reply, Max. Using the selectionFunction="best" option doesn't really get around the outcome I am running into. Maybe there's another way of asking this: is there a way to get rpart to try more surrogate splits initially so it doesn't get hung up on the initial split? In some cases I can add additional variables to the dataset and get a tree model whose initial split is one of the variables from the original pool that failed to produce a tree. – Guillemot Nov 04 '13 at 18:15