Feature names from OneHotEncoder

Question

I am using OneHotEncoder to encode a few categorical variables (e.g. Sex and AgeGroup). The resulting feature names from the encoder look like 'x0_female', 'x0_male', 'x1_0.0', 'x1_15.0', etc.

>>> import pandas as pd
>>> from sklearn.preprocessing import OneHotEncoder

>>> train_X = pd.DataFrame({'Sex': ['male', 'female'] * 3, 'AgeGroup': [0, 15, 30, 45, 60, 75]})

>>> encoder = OneHotEncoder()
>>> train_X_encoded = encoder.fit_transform(train_X[['Sex', 'AgeGroup']])

>>> encoder.get_feature_names()
array(['x0_female', 'x0_male', 'x1_0.0', 'x1_15.0', 'x1_30.0', 'x1_45.0',
       'x1_60.0', 'x1_75.0'], dtype=object)

Is there a way to tell OneHotEncoder to build the feature names so that the original column name is used as the prefix, e.g. Sex_female, AgeGroup_15.0, similar to what Pandas get_dummies() does?

Thanks.

Scikit-learn deals with arrays rather than DataFrames, so I don't think it stores column names. [This question](https://stackoverflow.com/questions/49433462/python-sklearn-how-to-get-feature-names-after-onehotencoder/51006351) is almost exactly the same as yours. – Josh Friedlander, Feb 07 '19 at 11:28
Possible duplicate of [Python SKLearn: How to Get Feature Names After OneHotEncoder?](https://stackoverflow.com/questions/49433462/python-sklearn-how-to-get-feature-names-after-onehotencoder) – Josh Friedlander, Feb 07 '19 at 11:28
Thank you @JoshFriedlander for sharing the other question. I know that Pandas get_dummies does what I am looking for (I mentioned that in the question). I was interested to know whether there is a way to achieve this using Scikit-learn's OneHotEncoder, so the answers to the other question do not help me. However, your comment _Scikit-learn deals with arrays rather than DataFrames, so I don't think it stores column names._ suggests that it is not possible to achieve the same in Sklearn. – Supratim Haldar, Feb 07 '19 at 11:40
Yes, I think the OP in that question wanted the same as you, and was told that `get_dummies` was the only way to achieve it. – Josh Friedlander, Feb 07 '19 at 11:48
It seems they are actively working on this in early 2019, based on recent commits and discussion. E.g. "A disadvantage of using the ColumnTransformer is that in version 0.20 it is not yet possible to readily find which input columns correspond to which output columns of the column transformer in all cases." https://www.oreilly.com/library/view/introduction-to-machine/9781449369880/ch04.html Also see https://github.com/scikit-learn/scikit-learn/commit/2480368856bdf09d99e96029b867e6e8b4a55920 – Paul, Feb 23 '19 at 17:05

score 38 · Accepted Answer · answered Mar 17 '19 at 12:15

You can pass a list of the original column names to get_feature_names:

encoder.get_feature_names(['Sex', 'AgeGroup'])

will return:

['Sex_female', 'Sex_male', 'AgeGroup_0', 'AgeGroup_15',
 'AgeGroup_30', 'AgeGroup_45', 'AgeGroup_60', 'AgeGroup_75']
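
A note for newer scikit-learn releases: get_feature_names was deprecated in version 1.0 and removed in 1.2 in favour of get_feature_names_out, which accepts the same list of input column names. The following is a minimal sketch, reusing the toy data from the question, that should work on either side of that change. The hasattr check and the variable name feature_names are illustrative choices, not something taken from the answer above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data from the question.
train_X = pd.DataFrame({
    "Sex": ["male", "female"] * 3,
    "AgeGroup": [0, 15, 30, 45, 60, 75],
})
input_columns = ["Sex", "AgeGroup"]

encoder = OneHotEncoder()
encoder.fit(train_X[input_columns])

# Newer releases expose get_feature_names_out; older ones only have
# get_feature_names. Both take the original column names as input.
if hasattr(encoder, "get_feature_names_out"):
    raw_names = encoder.get_feature_names_out(input_columns)
else:
    raw_names = encoder.get_feature_names(input_columns)

# Convert the NumPy array of names into a plain list of Python strings.
feature_names = [str(name) for name in raw_names]
print(feature_names)
```

Because AgeGroup holds integers here, the generated names should follow the 'AgeGroup_0', 'AgeGroup_15' pattern shown in the answer rather than the 'x1_0.0' pattern from the question.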

kabochkov


Great! Thanks @kabochkov :) – Supratim Haldar Mar 17 '19 at 19:44

score 15 · Answer 2 · answered May 17 '19 at 03:15

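# Build a DataFrame whose columns carry the original column names as prefixes.
# Note: fit_transform returns a sparse matrix by default, so train_X_encoded
# must be a dense array for this constructor call to work.
column_name = encoder.get_feature_names(['Sex', 'AgeGroup'])
one_hot_encoded_frame = pd.DataFrame(train_X_encoded, columns=column_name)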

Nursnaaz


This is a better solution to the question, in my opinion. Thanks! – john Nov 18 '19 at 20:40
Thanks for the compliment @john – Nursnaaz Nov 20 '19 at 17:02

score 1 · Answer 3 · answered Aug 19 '20 at 05:29

Thanks for a nice solution, @Nursnaaz. The sparse matrix needs to be converted into a dense matrix first:

column_name = encoder.get_feature_names(['Sex', 'AgeGroup'])
one_hot_encoded_frame = pd.DataFrame(train_X_encoded.todense(), columns=column_name)
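
Putting the pieces of this thread together, here is a self-contained sketch of the whole round trip from the question's toy data to a labelled DataFrame. Two details are my own assumptions rather than part of the answers: toarray() is used instead of todense() so that pandas receives a plain NumPy array, and index=train_X.index is passed so the encoded rows line up with the original frame. The getattr fallback is only there to cope with releases where get_feature_names has been replaced by get_feature_names_out.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data from the question.
train_X = pd.DataFrame({
    "Sex": ["male", "female"] * 3,
    "AgeGroup": [0, 15, 30, 45, 60, 75],
})
input_columns = ["Sex", "AgeGroup"]

encoder = OneHotEncoder()
# With default settings the result is a SciPy sparse matrix.
train_X_encoded = encoder.fit_transform(train_X[input_columns])

# Pick whichever naming method this scikit-learn release provides.
get_names = getattr(encoder, "get_feature_names_out", None) or encoder.get_feature_names
column_name = [str(name) for name in get_names(input_columns)]

# Densify the sparse matrix before handing it to pandas, and keep the
# original index so rows can be joined back onto train_X if needed.
one_hot_encoded_frame = pd.DataFrame(
    train_X_encoded.toarray(),
    columns=column_name,
    index=train_X.index,
)
print(one_hot_encoded_frame.head())
```

For wide encodings the dense conversion can use a lot of memory, so this is best suited to small or moderately sized category sets.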

Swati
