
My goal is to find the most important features that differentiate two classes, so it makes sense to use one of the many feature selection approaches out there.

But here's my problem: I have a lot of correlated features.

Usually the goal of feature selection would be to eliminate those redundant features. But the features have a semantic meaning, and I want to avoid losing that information.

So if a group of correlated features has strong predictive power for the class variable, I want them all to be identified as important. (Bonus problem: If I include ten correlated features in my model, their resulting weights will end up being only a tenth of their "actual" importance.)

Can you think of a feature selection approach which finds important features even if they show up in groups of correlated features?

AutoMiner
  • If features correlate, why not combine them in some manner, i.e., do some feature engineering? – Drey Mar 22 '17 at 12:19
  • Thank you Drey. Good point! The problem is that my features are not perfectly correlated - neither with each other nor with the class variable. This makes it hard to combine the features in a meaningful way. I tried combining my binary features based on frequent itemsets, but the results were very confusing feature combinations... – AutoMiner Mar 22 '17 at 13:09
  • 1. If you would have perfectly correlated features, that would be bad for your data in more than one way. 2. There is more than one way to engineer features. 3. If you have few features, you could apply PCA with k as the number of features you have. This would transform your data while preserving variance and covariance. Furthermore, it would allow you to (losslessly) transform your PCA data back into the original feature space for interpretation (but this is not feature engineering). 4. Use a bijective transformation on your data... 5. There is a lot - you need to share specifics of your data. – Drey Mar 22 '17 at 13:22
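
To illustrate point 3 of the comment above: a minimal sketch of the PCA round-trip, assuming a numeric feature matrix (mtcars here is only a stand-in for your data):

# Keep all k principal components, then invert the transform to
# recover the original features - the round-trip is lossless.
X <- as.matrix(mtcars[, -1])              # stand-in numeric feature matrix
p <- prcomp(X, center = TRUE, scale. = TRUE)

X_rec <- p$x %*% t(p$rotation)            # undo the rotation
X_rec <- sweep(X_rec, 2, p$scale, "*")    # undo the scaling
X_rec <- sweep(X_rec, 2, p$center, "+")   # undo the centering

max(abs(X - X_rec))                       # ~1e-13, i.e. numerically lossless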

3 Answers


Can you think of a feature selection approach which finds important features even if they show up in groups of correlated features?

Maybe random forest variable importance can help you; I use it to find important features.

library(randomForest)

set.seed(4543)
data(mtcars)
# Grow a large forest and record permutation-based variable importance;
# keep.forest=FALSE saves memory since only the importance scores are needed.
mtcars.rf <- randomForest(mpg ~ ., data=mtcars, ntree=1000, keep.forest=FALSE,
                          importance=TRUE)
# Plots %IncMSE (permutation importance) and IncNodePurity for each predictor.
varImpPlot(mtcars.rf)

I hope it can help you.

BigMOoO

I'd recommend eliminating highly correlated features first, since they are redundant (some related explanation here). You can identify which ones have zero or near-zero variance, and there are methods which identify columns that are linear combinations of the others (hence, they can be safely removed without losing any information). Then rank the remaining features by their predictive power using a typical feature selection technique.
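
One way to sketch that pipeline is with the caret package, whose helpers nearZeroVar, findLinearCombos, and findCorrelation cover exactly these checks (mtcars is only a stand-in for your data):

library(caret)

X <- as.matrix(mtcars[, -1])                 # stand-in numeric feature matrix

# 1. Drop features with zero or near-zero variance.
nzv <- nearZeroVar(X)
if (length(nzv) > 0) X <- X[, -nzv]

# 2. Drop features that are exact linear combinations of others.
combos <- findLinearCombos(X)
if (length(combos$remove) > 0) X <- X[, -combos$remove]

# 3. Drop one feature out of each highly correlated pair (|r| > 0.9).
highCor <- findCorrelation(cor(X), cutoff = 0.9)
if (length(highCor) > 0) X <- X[, -highCor]

colnames(X)   # remaining features, ready for a standard selection technique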

marilena.oita

An all-relevant feature set can be found using the Boruta algorithm. Boruta extends the data with a shuffled copy of every feature (so-called shadow features), trains a random forest, and confirms each real feature whose importance is significantly higher than the best importance achieved by any shadow feature. The importance measure itself is the usual random forest one: the drop in accuracy when the values of a feature are randomly permuted. Because Boruta looks for all relevant features rather than a minimal non-redundant subset, groups of correlated features with real predictive power tend to be retained together. The details can be found in this paper: https://www.jstatsoft.org/article/view/v036i11/v36i11.pdf
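
A minimal sketch with the Boruta R package (iris is only a stand-in for a data set with a class column):

library(Boruta)

set.seed(1)
# Run the shadow-feature procedure; doTrace=0 suppresses progress output.
res <- Boruta(Species ~ ., data = iris, doTrace = 0)
print(res)

# Features confirmed as relevant, excluding those still marked Tentative:
getSelectedAttributes(res, withTentative = FALSE)

# Per-feature importance statistics and final decisions:
attStats(res)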

Uzma