Python - Scikit find variable importance for categorical variables

Question

I'm trying to use scikit-learn in Python for a couple of different classification problems (RF, GBM, etc.). In addition to building models and making predictions, I'd like to see variable importance. I know there is a way to get the importances:

importances = clf.feature_importances_
print(importances)

but how do I get something more refined that connects each importance to its variable name (like summary(gbm) or varImp(randomForest) in R), especially for a categorical variable with multiple levels?

[This example plots feature importance](http://scikit-learn.org/0.13/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py). Could you make it more clear as to what you want ("more refined") - maybe what isn't shown in this example? — AGS, Mar 21 '15 at 20:46

score 4 · Answer 1 · answered May 21 '15 at 16:21

Variable importance (or feature importance) is calculated for every feature you fit the model on. This sketch gives you an idea of how variable names and importances relate:

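import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")
cols = ['hour', 'season', 'holiday', 'workingday', 'weather', 'temp', 'windspeed']
# Stand-in for YourClassifiers(): any estimator exposing feature_importances_ works
clf = RandomForestClassifier()
clf.fit(train[cols], train.targets) # targets/labels

# One importance per fitted column, so the two lengths match
print(len(clf.feature_importances_))
print(len(cols))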

You will see that the two printed lengths are the same: feature_importances_ holds one value per column, in the order the columns were passed to fit. You can therefore zip the two lists together or manipulate them however you wish. If you'd like to show variable importance nicely in a plot, you could use this:

import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(6 * 1.618, 6))
index = np.arange(len(cols))  # one bar position per feature
plt.bar(index, clf.feature_importances_, color='black', alpha=0.5)
plt.xlabel('features')
plt.ylabel('importance')
plt.title('Feature importance')
# Bars are centered on index by default, so put the labels there too
plt.xticks(index, cols)
plt.tight_layout()
plt.show()
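
If you'd rather have a table than a plot, the mapping of names to importances described above can be sketched with a pandas Series. This is only an illustration: the data is synthetic (make_classification), the column names are borrowed from the example, and RandomForestClassifier stands in for whatever classifier you are using.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the column names are only illustrative.
cols = ['hour', 'season', 'holiday', 'workingday', 'weather']
X, y = make_classification(n_samples=200, n_features=len(cols),
                           n_informative=3, n_redundant=0, random_state=0)
train = pd.DataFrame(X, columns=cols)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(train[cols], y)

# Pair each importance with its column name and sort, largest first.
importances = pd.Series(clf.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
print(importances)
```

The Series index carries the names, so sorting, slicing the top few, or calling importances.plot.bar() all keep names and values aligned.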

If you are fitting all columns rather than just the selected few listed in the cols variable, you can get the column/feature/variable names of your data with train.columns.values, and then map that list to the importance list in the same way.
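
For the categorical part of the question, scikit-learn's tree ensembles need numeric input, so a variable with several levels is usually one-hot encoded first, and each level then gets its own importance. One possible way to report a single number per original variable is to sum the importances of its dummy columns. The sketch below assumes synthetic data, pd.get_dummies for the encoding, and a '=' separator chosen so that dummy names can be split back into the original variable name. Summing is a convention rather than something scikit-learn does for you, and impurity-based importances are known to favor features with many distinct values, so treat the totals as a rough guide.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: two numeric columns and two multi-level categoricals.
rng = np.random.RandomState(0)
n = 300
raw = pd.DataFrame({
    'temp': rng.normal(size=n),
    'windspeed': rng.normal(size=n),
    'season': rng.choice(['spring', 'summer', 'fall', 'winter'], size=n),
    'weather': rng.choice(['clear', 'mist', 'rain'], size=n),
})
y = ((raw['temp'] > 0) | (raw['season'] == 'summer')).astype(int)

# One-hot encode; columns become e.g. 'season=summer', 'weather=rain'.
categorical = ['season', 'weather']
X = pd.get_dummies(raw, columns=categorical, prefix_sep='=', dtype=float)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importance per encoded column (one entry per level).
per_column = pd.Series(clf.feature_importances_, index=X.columns)

# Map each encoded column back to its original variable and sum.
parent = [name.split('=')[0] for name in X.columns]
per_variable = per_column.groupby(parent).sum().sort_values(ascending=False)
print(per_variable)
```

per_column still holds the level-by-level breakdown if you want to see which level of season or weather is driving the total.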
