
From my understanding, One-Class SVMs are trained without target/label data.

One answer at *Use of OneClassSVM with GridSearchCV* suggests passing target/label data to GridSearchCV's fit method when the estimator is a OneClassSVM.

How does the GridSearchCV method handle this data?

Does it actually train the OneClassSVM without the Target/label data, and just use the Target/label data for evaluation?

I tried following the GridSearchCV source code, but I couldn't find the answer.

Venkatachalam
user3731622
  • If you have the label data, then why do you want to train a OneClassSVM? – MaximeKan Oct 01 '19 at 02:54
  • To test the OneClassSVM. :) If I train a system with the train data & my test data, the system will learn about the test data. I created a synthetic test case which I would like to use to evaluate how the OneClassSVM would do if it encountered that type of data. – user3731622 Oct 01 '19 at 18:37
  • If the purpose is to test the OneClassSVM, then you can do this without a GridSearchCV, because you would not tune your algo. However, if you also have train labels, then what is the benefit of this as opposed to having a supervised classifier learning from your train labels? – MaximeKan Oct 01 '19 at 23:25
  • The reason to use GridSearchCV is to uncover how well the algorithm could perform on an example hypothetical synthetic test subclass. I can do this writing my own code, but I'd like to utilize scikit-learn framework if possible. The benefit to doing this instead of utilizing a supervised classifier is that I want to train a system to learn things about 1 normal class & perform novelty detection. Again, the test case I have is hypothetical and doesn't include all types of data. – user3731622 Oct 02 '19 at 00:22
  • GridSearchCV is designed to tune parameters so that the model fits the train labels best. If you just want to see "how well the algorithm could perform on an example hypothetical synthetic test subclass", then you don't need GridSearchCV for this. You can just use the adjusted rand score or some other metric to assess this after the training – MaximeKan Oct 03 '19 at 03:03
  • The statement "GridSearchCV is designed to tune parameters so that the model fits the train labels best" is at least not always true. I believe, but could be wrong, that it uses cross-validation, which evaluates on held-out data rather than the training data. In addition, GridSearchCV supports unsupervised learning, which doesn't use training labels during the training process. This is described in the [documentation for GridSearchCV's fit method](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV.fit) – user3731622 Oct 03 '19 at 21:29
  • You are correct, I wrote this too fast. The problem, if you use GridSearchCV, is that it's going to select the "best" model. But "best" relative to what? – MaximeKan Oct 03 '19 at 22:42
  • Your question about the "best" model, relative to what, is also discussed in the scikit-learn documentation in (I think) several places, including the [GridSearchCV documentation](https://scikit-learn.org/0.17/modules/grid_search.html#grid-search). If you're interested in learning about this, I suggest you read the link in this comment and look closely at the discussion of cross-validation and scoring. – user3731622 Oct 04 '19 at 16:48

1 Answer


Does it actually train the OneClassSVM without the Target/label data, and just use the Target/label data for evaluation?

Yes to both.

GridSearchCV does actually send the labels to OneClassSVM in its fit call, but OneClassSVM simply ignores them. In the scikit-learn source, OneClassSVM's fit passes an array of ones to the underlying SVM trainer instead of the given label array y. Parameters like y in fit exist only so that meta-estimators like GridSearchCV can work in a consistent way without worrying about whether an estimator is supervised or unsupervised.
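This is easy to sanity-check directly. A minimal sketch (the fake labels below are arbitrary, invented just for the check): fitting with and without labels should produce identical models.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import OneClassSVM

X, _ = load_iris(return_X_y=True)

# Arbitrary fake labels -- OneClassSVM's fit should ignore them entirely.
fake_y = np.where(np.arange(len(X)) % 2 == 0, -1, 1)

clf_no_y = OneClassSVM(gamma='scale').fit(X)
clf_with_y = OneClassSVM(gamma='scale').fit(X, fake_y)

# Identical predictions => the labels played no role in training.
print(np.array_equal(clf_no_y.predict(X), clf_with_y.predict(X)))  # True
```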

To actually test this, let's first detect outliers using GridSearchCV:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import OneClassSVM
from sklearn.metrics import f1_score, make_scorer

X, y = load_iris(return_X_y=True)
yd = np.where(y == 0, -1, 1)  # class 0 as outliers (-1), the rest as inliers (1)
cv = KFold(n_splits=4, random_state=42, shuffle=True)
model = GridSearchCV(OneClassSVM(), {'gamma': ['scale']}, cv=cv,
                     scoring=make_scorer(f1_score))
model = model.fit(X, yd)
print(model.cv_results_)

Note the split<k>_test_score entries (one per fold) in cv_results_.

Now let's do it manually, without sending labels yd during the fit call:

for train,test in cv.split(X,yd):
    clf = OneClassSVM(gamma='scale').fit(X[train])  #Just features
    print(f1_score(yd[test],clf.predict(X[test])))

Both should yield exactly the same scores.
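Putting the two runs side by side makes the equivalence explicit. A self-contained sketch (assuming a recent scikit-learn, where gamma='scale' is available and the old iid argument is gone) comparing the per-fold scores:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import OneClassSVM
from sklearn.metrics import f1_score, make_scorer

X, y = load_iris(return_X_y=True)
yd = np.where(y == 0, -1, 1)
cv = KFold(n_splits=4, random_state=42, shuffle=True)

# Per-fold scores as computed by GridSearchCV (labels passed to fit)...
model = GridSearchCV(OneClassSVM(), {'gamma': ['scale']},
                     cv=cv, scoring=make_scorer(f1_score)).fit(X, yd)
grid_scores = [model.cv_results_[f'split{i}_test_score'][0] for i in range(4)]

# ...versus scores from fitting manually on features only.
manual_scores = [
    f1_score(yd[test], OneClassSVM(gamma='scale').fit(X[train]).predict(X[test]))
    for train, test in cv.split(X)
]

print(np.allclose(grid_scores, manual_scores))  # True
```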

Shihab Shahriar Khan
  • Great answer! I feel like this should be put as an example in the scikit-learn GridSearchCV and OneClassSVM documents. – user3731622 Oct 11 '19 at 18:31