30

I am using OneHotEncoder to encode a few categorical variables (e.g. Sex and AgeGroup). The feature names the encoder produces look like 'x0_female', 'x0_male', 'x1_0.0', 'x1_15.0', etc.

>>> import pandas as pd
>>> train_X = pd.DataFrame({'Sex':['male', 'female']*3, 'AgeGroup':[0,15,30,45,60,75]})

>>> from sklearn.preprocessing import OneHotEncoder
>>> encoder = OneHotEncoder()
>>> train_X_encoded = encoder.fit_transform(train_X[['Sex', 'AgeGroup']])
>>> encoder.get_feature_names()
array(['x0_female', 'x0_male', 'x1_0.0', 'x1_15.0', 'x1_30.0', 'x1_45.0',
       'x1_60.0', 'x1_75.0'], dtype=object)

Is there a way to tell OneHotEncoder to prefix each feature name with the original column name, e.g. Sex_female, AgeGroup_15.0, similar to what pandas get_dummies() does?
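
For reference, here is a minimal sketch of the pandas behaviour I am trying to reproduce. It reuses the same toy data as above; the variable name `dummies` is only illustrative.

```python
import pandas as pd

train_X = pd.DataFrame({'Sex': ['male', 'female'] * 3,
                        'AgeGroup': [0, 15, 30, 45, 60, 75]})

# get_dummies prefixes every new column with the name of the source column.
dummies = pd.get_dummies(train_X, columns=['Sex', 'AgeGroup'])
print(list(dummies.columns))
```

The resulting columns are named Sex_female, Sex_male, AgeGroup_0, AgeGroup_15 and so on, which is the naming I would like from the encoder.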

Supratim Haldar
  • 1
    Thanks. Scikit-learn deals with arrays rather than dfs, so I don't think it stores column names. [This question](https://stackoverflow.com/questions/49433462/python-sklearn-how-to-get-feature-names-after-onehotencoder/51006351) is almost exactly the same as yours. – Josh Friedlander Feb 07 '19 at 11:28
  • Possible duplicate of [Python SKLearn: How to Get Feature Names After OneHotEncoder?](https://stackoverflow.com/questions/49433462/python-sklearn-how-to-get-feature-names-after-onehotencoder) – Josh Friedlander Feb 07 '19 at 11:28
  • 1
    Thank you @JoshFriedlander for sharing the other question. I know that Pandas get_dummies does what I am looking for (I have mentioned that in the question). I was interested to know if there is a way to achieve this using Scikit-learn OneHotEncoder, so the answers to the other question do not help me. However, your comment _Scikit-learn deals with arrays rather than dfs, so I don't think it stores column names._ indicates that it is not possible to achieve the same in Sklearn. – Supratim Haldar Feb 07 '19 at 11:40
  • Yes, I think the OP in that question wanted the same as you, and was told that `get_dummies` was the only way to achieve it – Josh Friedlander Feb 07 '19 at 11:48
  • 3
    It seems they are actively working on this in early 2019, based on recent commits and discussion. E.g. "A disadvantage of using the ColumnTransformer is that in version 0.20 it is not yet possible to readily find which input columns correspond to which output columns of the column transformer in all cases." https://www.oreilly.com/library/view/introduction-to-machine/9781449369880/ch04.html Also see https://github.com/scikit-learn/scikit-learn/commit/2480368856bdf09d99e96029b867e6e8b4a55920 – Paul Feb 23 '19 at 17:05

3 Answers

38

You can pass the list with original column names to get_feature_names:

encoder.get_feature_names(['Sex', 'AgeGroup'])

will return:

['Sex_female', 'Sex_male', 'AgeGroup_0', 'AgeGroup_15',
 'AgeGroup_30', 'AgeGroup_45', 'AgeGroup_60', 'AgeGroup_75']
kabochkov
15
column_name = encoder.get_feature_names(['Sex', 'AgeGroup'])
one_hot_encoded_frame = pd.DataFrame(train_X_encoded, columns=column_name)
Nursnaaz
  • 1,486
  • 17
  • 24
1

Thanks for a nice solution, @Nursnaaz. The sparse matrix needs to be converted into a dense matrix first:

column_name = encoder.get_feature_names(['Sex', 'AgeGroup'])
one_hot_encoded_frame = pd.DataFrame(train_X_encoded.todense(), columns=column_name)
Swati