I'm trying to replace a string column in a Pandas DataFrame with its one-hot encoded equivalent using Scikit-Learn's OneHotEncoder. My code below doesn't work:

from sklearn.preprocessing import OneHotEncoder
# data is a Pandas DataFrame

jobs_encoder = OneHotEncoder()
jobs_encoder.fit(data['Profession'].unique().reshape(1, -1))
data['Profession'] = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))

It produces the following error (strings in the list are omitted):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-91-3a1f568322f5> in <module>()
      3 jobs_encoder = OneHotEncoder()
      4 jobs_encoder.fit(data['Profession'].unique().reshape(1, -1))
----> 5 data['Profession'] = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))

/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in transform(self, X)
    730                                        copy=True)
    731         else:
--> 732             return self._transform_new(X)
    733 
    734     def inverse_transform(self, X):

/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform_new(self, X)
    678         """New implementation assuming categorical input"""
    679         # validation of X happens in _check_X called by _transform
--> 680         X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
    681 
    682         n_samples, n_features = X_int.shape

/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)
    120                     msg = ("Found unknown categories {0} in column {1}"
    121                            " during transform".format(diff, i))
--> 122                     raise ValueError(msg)
    123                 else:
    124                     # Set the problematic rows to an acceptable value and

ValueError: Found unknown categories ['...', ..., '...'] in column 0 during transform

Here's some sample data:

data['Profession'] =

0         unkn
1         safe
2         rece
3         unkn
4         lead
          ... 
111988    indu
111989    seni
111990    mess
111991    seni
111992    proj
Name: Profession, Length: 111993, dtype: object

What exactly am I doing wrong?

dd.
  • Please include the *full* error trace, as well as a sample of your `data['Profession']`. – desertnaut Sep 25 '19 at 14:48
  • OneHotEncoder returns a 2-D array of size `data_length x num_categories`. You cannot assign it to a single column `df['Profession']`. – Quang Hoang Sep 25 '19 at 14:54
  • Follow-up on dd.'s answer: OneHotEncoder can be used on multi-column data, whereas LabelBinarizer and LabelEncoder cannot. https://stackoverflow.com/a/54119850/1582366 – Novice Jun 20 '20 at 15:13

2 Answers

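OneHotEncoder encodes categorical features as a one-hot numeric array. Its transform method returns a sparse matrix if sparse=True, otherwise a 2-D array. You can't assign a 2-D array (or sparse matrix) to a single Pandas Series. You must create a Pandas Series (a column in a Pandas DataFrame) for each category. Also note that fit expects input of shape (n_samples, n_features), so the unique values should be reshaped with reshape(-1, 1) rather than reshape(1, -1); otherwise each value is treated as a separate feature, which leads to the "unknown categories" error.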
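
As a minimal sketch of the shape issue (the toy `data` frame below is an assumption made for illustration, not the asker's data): the encoder expects 2-D input of shape (n_samples, n_features), so a single column is passed as `reshape(-1, 1)`. The output has one column per category, which is why it cannot be stored in a single DataFrame column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data, assumed for illustration only
data = pd.DataFrame({"Profession": ["unkn", "safe", "rece", "unkn", "lead"]})

jobs_encoder = OneHotEncoder()
# One feature (column) with many samples: shape (n_samples, 1)
column = data["Profession"].to_numpy().reshape(-1, 1)
jobs_encoder.fit(column)

# The result is a sparse matrix by default; toarray() makes it dense
encoded = jobs_encoder.transform(column).toarray()

n_categories = len(jobs_encoder.categories_[0])
print(encoded.shape)  # (n_samples, n_categories)
```

Each row of `encoded` contains exactly one 1, in the column of that row's category.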

I would recommend using pandas.get_dummies instead:

import pandas as pd
data = pd.get_dummies(data, prefix=['Profession'], columns=['Profession'], drop_first=True)
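
To make the effect of drop_first concrete, here is a small sketch on toy data (the frame and its `Age` column are hypothetical, added only for illustration). It shows which columns get_dummies produces and which category is dropped.

```python
import pandas as pd

# Toy data, assumed for illustration only; "Age" is a hypothetical extra column
data = pd.DataFrame({
    "Profession": ["unkn", "safe", "rece", "unkn", "lead"],
    "Age": [31, 45, 27, 52, 38],
})

dummies = pd.get_dummies(data, prefix=["Profession"], columns=["Profession"], drop_first=True)

# Categories are sorted alphabetically; drop_first removes the first one ("lead")
print(list(dummies.columns))
```

A row whose profession is the dropped category ends up with all dummy columns set to zero.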

EDIT:

Using Sklearn OneHotEncoder:

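# Fit on a 2-D array of shape (n_samples, 1), not reshape(1, -1)
jobs_encoder.fit(data['Profession'].to_numpy().reshape(-1, 1))
# transform returns a sparse matrix by default, so convert it to a dense array
transformed = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1)).toarray()
# Create a Pandas DataFrame of the one-hot encoded column
# (on scikit-learn versions older than 1.0, use get_feature_names() instead)
ohe_df = pd.DataFrame(transformed, columns=jobs_encoder.get_feature_names_out(), index=data.index)
# Concatenate with the original data and drop the original column
data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)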

Other options: if you are doing hyperparameter tuning with GridSearchCV, it's recommended to use ColumnTransformer and FeatureUnion with Pipeline, or make_column_transformer directly.
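
A rough sketch of the make_column_transformer route might look like the following. The toy data, the `Age` and `Target` columns, and the choice of LogisticRegression are all assumptions for illustration; any estimator could take its place.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data; "Age" and "Target" are hypothetical columns
data = pd.DataFrame({
    "Profession": ["unkn", "safe", "rece", "unkn", "lead", "safe"],
    "Age": [31, 45, 27, 52, 38, 41],
    "Target": [0, 1, 0, 1, 0, 1],
})

# One-hot encode "Profession" and pass the remaining columns through unchanged
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Profession"]),
    remainder="passthrough",
)
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

X = data[["Profession", "Age"]]
y = data["Target"]
model.fit(X, y)
predictions = model.predict(X)
```

Because the encoder lives inside the pipeline, it is fitted only on the data passed to fit, and the whole pipeline can be handed to a search object such as GridSearchCV.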

Amine Benatmane
  • I want to be able to pickle the instance so I can use it on new data in the future; that's why I want to use OneHotEncoder. That can't be done with get_dummies, right? – dd. Sep 25 '19 at 15:34
  • That's right. If you want to use it on new data, you can't use get_dummies. – Abel Paz Jul 26 '20 at 10:55
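
For the pickling use case raised in the comments above, a minimal sketch could look like this. The toy frames and the unseen category "pilot" are assumptions for illustration, and handle_unknown="ignore" is one possible choice for dealing with categories that were not present at fit time.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training data, assumed for illustration only
train = pd.DataFrame({"Profession": ["unkn", "safe", "rece", "lead"]})

# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising
jobs_encoder = OneHotEncoder(handle_unknown="ignore")
jobs_encoder.fit(train[["Profession"]])

# Serialize the fitted encoder (to bytes here; pickle.dump would write to a file)
blob = pickle.dumps(jobs_encoder)

# Later: restore it and encode new data with the same set of columns
restored = pickle.loads(blob)
new_data = pd.DataFrame({"Profession": ["safe", "pilot"]})  # "pilot" was not seen in training
encoded_new = restored.transform(new_data[["Profession"]]).toarray()
```

The restored encoder keeps the categories learned at fit time, so new data is always encoded into the same columns.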

It turned out that Scikit-Learn's LabelBinarizer worked better for converting the data to one-hot encoded format. With help from Amine's solution, my final code is as follows:

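import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Fit the binarizer on the column, then encode it
jobs_encoder = LabelBinarizer()
jobs_encoder.fit(data['Profession'])
transformed = jobs_encoder.transform(data['Profession'])
# Name the columns after the classes (assumes more than two distinct
# professions; with exactly two, LabelBinarizer returns a single column)
ohe_df = pd.DataFrame(transformed, columns=jobs_encoder.classes_, index=data.index)
# Concatenate with the original data and drop the original column
data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)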
dd.