
I'm trying to perform a one hot encoding of a trivial dataset.

data = [['a', 'dog', 'red'],
        ['b', 'cat', 'green']]

What's the best way to preprocess this data using Scikit-Learn?

On first instinct, you'd look towards Scikit-Learn's OneHotEncoder. But the one hot encoder doesn't support strings as features; it only encodes integer features.

So then you would use a LabelEncoder, which encodes the strings into integers. But then you have to apply the label encoder to each of the columns and store every one of these label encoders (as well as the columns they were applied to). And this feels extremely clunky, as the sketch below shows.
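For reference, that clunky approach looks roughly like this (a minimal sketch; one fitted LabelEncoder has to be kept per column so the same mapping can be reused on new data later):

import numpy as np
from sklearn.preprocessing import LabelEncoder

X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])

encoders = {}
X_int = np.empty(X.shape, dtype=int)
for col in range(X.shape[1]):
    # Fit and keep one encoder per column; it's needed again at predict time.
    encoders[col] = LabelEncoder().fit(X[:, col])
    X_int[:, col] = encoders[col].transform(X[:, col])
# X_int now holds integer codes that the integer-only OneHotEncoder can consume.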

So, what's the best way to do it in Scikit-Learn?

Please don't suggest pandas.get_dummies. That's what I generally use nowadays for one hot encodings. However, it's limited by the fact that you can't encode your training and test sets separately.
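To make that limitation concrete, here is a minimal sketch: encoding train and test separately yields mismatched columns whenever a category is absent from one of them:

import pandas as pd

train = pd.DataFrame({'color': ['red', 'green', 'blue']})
test = pd.DataFrame({'color': ['green', 'green']})

pd.get_dummies(train).columns  # color_blue, color_green, color_red
pd.get_dummies(test).columns   # color_green only -- the shapes disagree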

  • Also, pandas.get_dummies binary encoding gets treated as continuous by the decision tree classifier, making it not applicable for that scenario. – vagabond Mar 28 '18 at 18:01

3 Answers


If you are on sklearn > 0.20.dev0 (where OneHotEncoder handles strings directly):

In [11]: import numpy as np
    ...: from sklearn.preprocessing import OneHotEncoder
    ...: cat = OneHotEncoder()
    ...: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
    ...: cat.fit_transform(X).toarray()
    ...: 
Out[11]: array([[1., 0., 0., 1., 0.],
                [0., 1., 0., 0., 1.],
                [1., 0., 0., 1., 0.],
                [0., 0., 1., 0., 1.]])
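Since this OneHotEncoder is an ordinary fitted transformer, it also addresses the train/test concern from the question. A sketch (assuming sklearn >= 0.20; handle_unknown='ignore' makes categories unseen during fit encode as all zeros):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
X_test = np.array([['a', 'fish', 'red']])  # 'fish' never appeared in training

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train)                 # learn the categories on the training set only
enc.transform(X_test).toarray()  # -> array([[1., 0., 0., 0., 0., 1.]])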

If you are on sklearn==0.20.dev0

In [30]: from sklearn.preprocessing import CategoricalEncoder
    ...: cat = CategoricalEncoder()

In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

In [32]: cat.fit_transform(X).toarray()
Out[32]:
array([[1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1.]])

Another way to do it is to use category_encoders.

Here is an example:

% pip install category_encoders
import numpy as np
import category_encoders as ce

le = ce.OneHotEncoder(return_df=False, impute_missing=False, handle_unknown="ignore")
X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
le.fit_transform(X)
array([[1, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 0, 1]])
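Because the encoder is fitted once, the same mapping can be applied to a separately held-out test set. A small sketch continuing the example above:

X_test = np.array([['a', 'cat', 'red']])
le.transform(X_test)  # reuses the column mapping learned from X, so the
                      # one-hot columns line up with the training encoding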
  • The CategoricalEncoder has been merged into the OneHotEncoder, so its functionality is contained there in the current version of sklearn==0.20.dev0. – Kay Wittig Aug 28 '18 at 07:23

Very nice question.

However, in some sense, it is a special case of something that comes up (at least for me) rather often: given sklearn stages applicable to subsets of the X matrix, I'd like to apply (possibly several of) them to the entire matrix. Here, for example, you have a stage which knows how to run on a single column, and you'd like to apply it three times - once per column.

This is a classic case for using the Composite Design Pattern.

Here is a (sketch of a) reusable stage that accepts a dictionary mapping a column index into the transformation to apply to it:

class ColumnApplier(object):
    """Applies a per-column transformer to each configured column of X."""

    def __init__(self, column_stages):
        # column_stages maps a column index to the transformer for that column.
        self._column_stages = column_stages

    def fit(self, X, y=None):
        # Fit each transformer on its own column only.
        for i, k in self._column_stages.items():
            k.fit(X[:, i])

        return self

    def transform(self, X):
        # Copy so the caller's X is left untouched.
        X = X.copy()
        for i, k in self._column_stages.items():
            X[:, i] = k.transform(X[:, i])

        return X

Now, to use it in this context, starting with

import numpy as np
from sklearn import preprocessing

X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
y = np.array([1, 2])

you would just use it to map each column index to the transformation you want:

multi_encoder = \
    ColumnApplier(dict([(i, preprocessing.LabelEncoder()) for i in range(3)]))
multi_encoder.fit(X, None).transform(X)
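One caveat with this sketch: because X here is a NumPy array of strings, the integer codes produced by each LabelEncoder are written back into a string array, so you may want to cast the result with .astype(int) (or allocate a fresh integer array in transform) before feeding it to an estimator.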

Once you develop such a stage (I can't post the one I use), you can use it over and over for various settings.

  • I've created something like this before, to be honest. And it feels clunky. Scikit-Learn should have a class that abstracts this under the hood, just because it's a common design pattern. If it doesn't, then I think a PR for this would be appropriate. – hlin117 Jan 31 '16 at 00:19
  • Your solution gives problems on this data: X = np.array([['cat'],['dog','cat'],['pet','man'],['cat']]) y = [1,2,3,4] – Stepan Yakovenko Mar 24 '17 at 11:07
  • I usually do something similar to this too, but a known downfall is that `LabelEncoder#transform` crashes when seeing strings that didn't appear in training. – ldavid Nov 16 '17 at 23:49
  • Shouldn't a `fit_transform` rectify this? – user2755526 Dec 04 '17 at 21:09

I've faced this problem many times, and I found a solution in this book, on page 100:

We can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:

and the sample code is here:

from sklearn.preprocessing import LabelBinarizer

# 'data' here is a single column of categories (LabelBinarizer works on one
# label column at a time, not on a whole feature matrix)
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(data)
housing_cat_1hot

And as a result, quoting the book: "Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor."
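For example (a small sketch of that option, using a made-up single column since LabelBinarizer expects one categorical column at a time):

import numpy as np
from sklearn.preprocessing import LabelBinarizer

data = np.array(['red', 'green', 'blue', 'green'])
encoder = LabelBinarizer(sparse_output=True)
housing_cat_1hot = encoder.fit_transform(data)  # scipy.sparse matrix
housing_cat_1hot.toarray()
# array([[0, 0, 1],
#        [0, 1, 0],
#        [1, 0, 0],
#        [0, 1, 0]])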

And you can find more about the LabelBinarizer here, in the sklearn official documentation.

  • This fails in Python 3.6 on Windows 10: housing_cat_1hot = encoder.fit_transform(data) raises a Traceback (most recent call last). – Sep 22 '17 at 18:36
  • This is not one hot encoding but dummy encoding. – zipp Dec 05 '17 at 17:46