169

I have a machine learning classification problem where 80% of the variables are categorical. Must I use one-hot encoding if I want to use a classifier for the classification? Can I pass the data to a classifier without the encoding?

I am trying to do the following for feature selection:

  1. I read the train file:

    num_rows_to_read = 10000
    train_small = pd.read_csv("../../dataset/train.csv",   nrows=num_rows_to_read)
    
  2. I change the type of the categorical features to 'category':

    non_categorial_features = ['orig_destination_distance',
                              'srch_adults_cnt',
                              'srch_children_cnt',
                              'srch_rm_cnt',
                              'cnt']
    
    for categorical_feature in list(train_small.columns):
        if categorical_feature not in non_categorial_features:
            train_small[categorical_feature] = train_small[categorical_feature].astype('category')
    
  3. I use one hot encoding:

    train_small_with_dummies = pd.get_dummies(train_small, sparse=True)
    

The problem is that the third step often gets stuck, even though I am using a powerful machine.

Thus, without the one-hot encoding I can't do any feature selection to determine the importance of the features.

What do you recommend?

yatu
avicohen

20 Answers

189

Approach 1: You can use pandas' pd.get_dummies.

Example 1:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
Out[]: 
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
3  1.0  0.0  0.0

Example 2:

The following will transform a given column into one-hot columns. Use the prefix parameter to label the dummy columns when encoding multiple columns.

import pandas as pd
        
df = pd.DataFrame({
          'A':['a','b','a'],
          'B':['b','a','c']
        })
df
Out[]: 
   A  B
0  a  b
1  b  a
2  a  c

# Get one hot encoding of columns B
one_hot = pd.get_dummies(df['B'])
# Drop column B as it is now encoded
df = df.drop('B',axis = 1)
# Join the encoded df
df = df.join(one_hot)
df  
Out[]: 
   A  a  b  c
0  a  0  1  0
1  b  1  0  0
2  a  0  0  1

Approach 2: Use Scikit-learn

Using a OneHotEncoder has the advantage of being able to fit on some training data and then transform on some other data using the same instance. We also have handle_unknown to further control what the encoder does with unseen data.

Given a dataset with three features and four samples, we let the encoder find the maximum value per feature and transform the data to a binary one-hot encoding.

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])   
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
   handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9], dtype=int32)
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

Here is the link for this example: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
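
Note that the constructor arguments and fitted attributes shown above (categorical_features, n_values_, feature_indices_) come from an older scikit-learn and were removed in later releases. A minimal sketch of the same example against a modern API (0.22+), where the fitted categories are exposed via categories_:

from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' emits all-zero columns for unseen categories
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

print(enc.categories_)
# [array([0, 1]), array([0, 1, 2]), array([0, 1, 2, 3])]
print(enc.transform([[0, 1, 1]]).toarray())
# [[1. 0. 0. 1. 0. 0. 1. 0. 0.]]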

yatu
Sayali Sonawane
  • 26
    setting `drop_first=True` with `get_dummies` removes the need to drop the original column separately – OverflowingTheGlass Feb 28 '18 at 15:14
  • 1
    In example 2, is there a way to join the new columns to the dataframe without using join? I'm dealing with a really big dataset and get MemoryError when I try to do that. – J.Dahlgren May 31 '18 at 12:27
  • You can add a new column to the dataframe without using join if df2 has the same number of rows: df["newColname"] = df2["col"] – Sayali Sonawane Jun 01 '18 at 12:49
  • 1
    Using an image for example 2 was evil – villasv Nov 05 '18 at 19:31
  • 15
    @OverflowingTheGlass- drop-first= True does not remove the original column. It drops the first level of the categorical feature so that you end up with k-1 columns instead of k columns, k being the cardinality of the categorical feature. – Garima Jain Feb 28 '19 at 13:50
  • Pandas get_dummies is great, but why does it put the one-hot vector in different columns, when we need it in just one column, in the shape of a vector? – keramat Sep 29 '19 at 13:52
  • @keramat as the name suggests, that comes in handy if someone wants to create indicator (dummy variables) features for modelling. – Sayali Sonawane Sep 30 '19 at 14:49
  • 2
    the df.join() does not work here, it creates more rows... do not know why though. – Chenxi Zeng Oct 06 '19 at 21:10
  • @ChenxiZeng Here, column 'B' is replaced by its indicator variables/one hot encoding columns. Hence, it is creating more columns. Please refer to https://www.statisticssolutions.com/dummy-coding-the-how-and-why/ for more information about dummy variables. – Sayali Sonawane Oct 07 '19 at 08:49
  • I understand that, Sayali. I mean, at my end it creates more rows... I guess it's something with the join method. – Chenxi Zeng Oct 08 '19 at 16:03
  • @ChenxiZeng Technically, the number of rows must remain constant. In the above example, df.shape before and after using pd.get_dummies will be the same. Please check if you are performing some other operations after using it. You can also use print(df.shape) after every operation; that will help you understand which operation is changing the number of rows. – Sayali Sonawane Oct 08 '19 at 22:01
  • df.join() creates more rows for me, so I used pd.concat([alldata, cat_encoded], axis=1) to join the encoded columns with the original dataset – Ajay Bhasy Dec 07 '20 at 13:17
81

It's much easier to use Pandas for basic one-hot encoding. If you're looking for more options, you can use scikit-learn.

For basic one-hot encoding with Pandas you pass your data frame into the get_dummies function.

For example, if I have a dataframe called imdb_movies:

[screenshot of the imdb_movies dataframe]

...and I want to one-hot encode the Rated column, I do this:

pd.get_dummies(imdb_movies.Rated)

[screenshot of the one-hot encoded Rated columns]

This returns a new dataframe with a column for every "level" of rating that exists, along with either a 1 or 0 specifying the presence of that rating for a given observation.

Usually, we want this to be part of the original dataframe. In this case, we attach our new dummy-coded frame onto the original frame using "column binding".

We can column-bind by using Pandas concat function:

rated_dummies = pd.get_dummies(imdb_movies.Rated)
pd.concat([imdb_movies, rated_dummies], axis=1)

[screenshot of the combined dataframe]

We can now run an analysis on our full dataframe.

SIMPLE UTILITY FUNCTION

I would recommend making yourself a utility function to do this quickly:

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    return(res)

Usage:

encode_and_bind(imdb_movies, 'Rated')

Result:

[screenshot of the result]

Also, as per @pmalbu's comment, if you would like the function to remove the original feature_to_encode, then use this version:

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return(res) 

You can encode multiple features at the same time as follows (note that each call's result must feed the next, otherwise only the last feature ends up encoded):

features_to_encode = ['feature_1', 'feature_2', 'feature_3',
                      'feature_4']
res = train_set
for feature in features_to_encode:
    res = encode_and_bind(res, feature)
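
For what it's worth, pd.get_dummies can also handle several columns in one call via its columns parameter; a minimal sketch with hypothetical column names:

import pandas as pd

df = pd.DataFrame({'Rated': ['PG', 'R', 'PG-13', 'R'],
                   'Genre': ['Drama', 'Action', 'Action', 'Comedy'],
                   'Year': [1994, 1999, 2008, 2012]})

# encodes only the listed columns; 'Year' passes through unchanged
encoded = pd.get_dummies(df, columns=['Rated', 'Genre'])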
Harry B
Cybernetic
  • 2
    I would suggest dropping the original feature_to_encode after you concatenate the one hot ended columns with the original dataframe. – pmalbu Feb 01 '19 at 22:58
  • Added this option to answer. Thanks. – Cybernetic Feb 05 '19 at 22:42
  • Would it also work with the 'Genre' variable , i.e. when there are more than one description in the column? Would that still be one hot encoding? Sorry, for asking this here, but I am not sure it deserves (yet) another question. – Sapiens Aug 27 '20 at 22:00
  • @Sapiens Yes, it would still be considered hot encoding, where each level would be the unique genre combination a movie belongs to. Another option is to hot encode each genre a movie belongs to into the encoded vector (so one movie with three genres would have an encoded vector with three 1s and the rest 0s). – Cybernetic Aug 27 '20 at 22:25
31

You can do it with numpy.eye, using the array element selection mechanism:

import numpy as np
nb_classes = 6
data = [[2, 3, 4, 0]]

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]

The return value of indices_to_one_hot(data, nb_classes) is now

array([[[ 0.,  0.,  1.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  1.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  1.,  0.],
        [ 1.,  0.,  0.,  0.,  0.,  0.]]])

The .reshape(-1) is there to make sure you have the right labels format (you might also have [[2], [3], [4], [0]]).
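
A quick sketch showing that the column-vector form mentioned above encodes the same way, thanks to the .reshape(-1):

indices_to_one_hot([[2], [3], [4], [0]], nb_classes)
# same output as indices_to_one_hot(data, nb_classes) above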

Martin Thoma
22

Firstly, the easiest way to one-hot encode: use sklearn.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Secondly, I don't think using pandas for one-hot encoding is that simple (unconfirmed, though):

Creating dummy variables in pandas for python

Lastly, is it necessary for you to one-hot encode? One-hot encoding drastically increases the number of features (one new column per level), which in turn increases the run time of any classifier or anything else you are going to run, especially when each categorical feature has many levels. Instead, you can do dummy coding.

Using dummy encoding usually works well, with much less run time and complexity. A wise prof once told me, 'Less is More'.

Here's the code for my custom encoding function if you want.

from sklearn.preprocessing import LabelEncoder

# Label-encodes (not one-hot) any dataframe column of type category or object.
def dummyEncode(df):
    columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            df[feature] = le.fit_transform(df[feature])
        except Exception:
            print('Error encoding ' + feature)
    return df

EDIT: Comparison to be clearer:

One-hot encoding: convert n levels to n-1 columns.

Index  Animal         Index  cat  mouse
  1     dog             1     0     0
  2     cat       -->   2     1     0
  3    mouse            3     0     1

You can see how this will explode your memory if you have many different types (or levels) in your categorical feature. Keep in mind, this is just ONE column.

Dummy Coding:

Index  Animal         Index  Animal
  1     dog             1      0   
  2     cat       -->   2      1 
  3    mouse            3      2

Convert to numerical representations instead. Greatly saves feature space, at the cost of a bit of accuracy.
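
For concreteness, a minimal sketch of both forms on the Animal column with pandas (as the comments below point out, the single-column form is usually called label encoding):

import pandas as pd

df = pd.DataFrame({'Animal': ['dog', 'cat', 'mouse']})

# n-1 indicator columns; the dropped level follows sorted order,
# so it may differ from the table above
one_hot = pd.get_dummies(df['Animal'], drop_first=True)

# a single column of integer codes (codes also follow sorted order)
codes = df['Animal'].astype('category').cat.codes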

Wboy
  • 1
  • 1. I have a data set in which 80% of the variables are categorical. To my understanding I must use one-hot encoding if I want to use a classifier for this data; otherwise, the classifier won't treat the categorical variables in the correct way. Is there an option not to encode? 2. If I use pd.get_dummies(train_small, sparse=True) with sparse=True - doesn't that solve the memory problem? 3. How should I approach such a problem? – avicohen May 18 '16 at 09:15
  • As I said, there are two options. 1) One hot encode --> convert every level in categorical features to a new column. 2)Dummy coding --> convert every column to numeric representations. I'll edit my answer above to be clearer. But you can just run the function i provided and it should work – Wboy May 18 '16 at 09:51
  • Something is unclear to me: if I use pd.get_dummies, it converts every level in categorical features to a new column. However, you say dummy coding only uses the levels? To my understanding, if I don't use one-hot encoding I lose some of the meaning of the data, and the algorithm does not treat them correctly? – avicohen May 18 '16 at 10:15
  • I think you're misunderstanding me. Levels is the number of unique values in a categorical column. eg, [dog,cat,mouse,dog] --> levels = 3. So if you want to hot encode a column, you will get n-1 columns for n levels. Dummy coding merely converts the levels into numeric representations in the same column. so [dog,cat,mouse,dog] --> [0,1,2,0] – Wboy May 19 '16 at 02:47
  • @Wboy [dog,cat,mouse,dog] --> [0,1,2,0] is just label encoding. It is not one-hot encoding which is achieved via creating dummy feature. In `sk-learn` one-hot encoding is achieved by using `sklearn.preprocessing.OneHotEncoder` class or calling `get_dummies` method on pandas `DataFrame`. – Ranjan Kumar May 19 '16 at 12:24
  • @avicohen Have you tried `get_dummies` method without `sparse` parameter which is by default false? – Ranjan Kumar May 19 '16 at 12:28
  • @RanjanKumar Yes i know, if you check my answer that is what i said. I was explaining dummy coding (label encoding). – Wboy May 19 '16 at 14:31
  • 20
    "at the cost of a bit of accuracy." How can you say "a bit"? Maybe in some cases, but in others, the accuracy could be hurt a lot. This solution results in treating qualitative features as continuous which means your model will not learn from the data properly. – Josh Morel Sep 06 '16 at 12:23
  • 5
    As Josh said above, in your second example you end up telling the model that `mouse > cat > dog` but this is not the case. `get_dummies` is the most straight forward way of transferring categorical variables into model friendly data from my experience (albeit very limited) – Martin O Leary Jan 16 '17 at 20:06
  • I speak with no authority, but I believe that dummy encoding and one-hot encoding are the same thing (synonyms). I think that the dummy coding Wboy shows is not actually dummy encoding. It doesn't have a name other than 'transforming' dog/cat/mouse into a ranked ordinal variable, which distorts the original meaning. Please correct me if I'm wrong. pandas get_dummies() I believe will leave you with a one-hot encoding – user798719 May 05 '17 at 09:06
  • 8
    This solution is very dangerous as pointed out by some other comments. It arbitrarily assigns orders and distances to categorical variables. Doing so reduces model flexibility in a random way. For tree based models, such encoding reduces possible subsetting possibilities. For example, you can only get two possible splittings now [(0), (1,2)] and [(0,1),(2)], and the split [(0,2), (1)] is impossible. The loss is much more significant when the number of categories is high. – Random Certainty Dec 28 '17 at 01:48
20

One hot encoding with pandas is very easy:

import pandas as pd

def one_hot(df, cols):
    """
    @param df pandas DataFrame
    @param cols a list of columns to encode 
    @return a DataFrame with one-hot encoding
    """
    for each in cols:
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df

EDIT:

Another way to one-hot encode, using sklearn's LabelBinarizer:

from sklearn.preprocessing import LabelBinarizer 
label_binarizer = LabelBinarizer()
label_binarizer.fit(all_your_labels_list) # need to be global or remembered to use it later

def one_hot_encode(x):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
    """
    return label_binarizer.transform(x)
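
A usage sketch, with a hypothetical label list standing in for all_your_labels_list:

label_binarizer.fit(['cat', 'dog', 'mouse'])

print(one_hot_encode(['dog', 'cat']))
# [[0 1 0]
#  [1 0 0]]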
Qy Zuo
16

You can use the numpy.eye function.

import numpy as np

def one_hot_encode(x, n_classes):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
     """
    return np.eye(n_classes)[x]

def main():
    labels = [0, 1, 2, 3, 4, 3, 2, 1, 0]  # avoid shadowing the built-in `list`
    n_classes = 5
    one_hot_list = one_hot_encode(labels, n_classes)
    print(one_hot_list)

if __name__ == "__main__":
    main()

Result

D:\Desktop>python test.py
[[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.]]
Dieter
8

Pandas has the built-in function get_dummies to get a one-hot encoding of particular column(s).

A one-liner for one-hot encoding:

df = pd.concat([df, pd.get_dummies(df['column name'], prefix='column name')], axis=1).drop(['column name'], axis=1)
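
A quick sketch with a hypothetical column name:

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red'], 'value': [1, 2, 3]})
df = pd.concat([df, pd.get_dummies(df['color'], prefix='color')], axis=1).drop(['color'], axis=1)
# resulting columns: value, color_green, color_red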
Arshdeep Singh
4

Here is a solution using DictVectorizer and the Pandas DataFrame.to_dict('records') method.

>>> import pandas as pd
>>> X = pd.DataFrame({'income': [100000,110000,90000,30000,14000,50000],
                      'country':['US', 'CAN', 'US', 'CAN', 'MEX', 'US'],
                      'race':['White', 'Black', 'Latino', 'White', 'White', 'Black']
                     })

>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer()
>>> qualitative_features = ['country','race']
>>> X_qual = v.fit_transform(X[qualitative_features].to_dict('records'))
>>> v.vocabulary_
{'country=CAN': 0,
 'country=MEX': 1,
 'country=US': 2,
 'race=Black': 3,
 'race=Latino': 4,
 'race=White': 5}

>>> X_qual.toarray()
array([[ 0.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  1.,  0.,  0.]])
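
One convenience of this approach is that the fitted vectorizer can be reused on new rows with the same vocabulary; a sketch:

>>> X_new = pd.DataFrame({'country': ['US'], 'race': ['Latino']})
>>> v.transform(X_new[qualitative_features].to_dict('records')).toarray()
array([[ 0.,  0.,  1.,  0.,  1.,  0.]])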
SherylHohman
Josh Morel
3

One-hot encoding requires a bit more than converting the values to indicator variables. Typically, an ML process requires you to apply this encoding several times to validation or test data sets, and to apply the model you construct to real-time observed data. You should store the mapping (transform) that was used to construct the model. A good solution would use DictVectorizer or LabelEncoder (followed by get_dummies). Here is a function that you can use:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

def oneHotEncode2(df, le_dict=None):
    # avoid a mutable default argument: a dict populated on the training
    # call would otherwise persist into later calls
    if le_dict is None:
        le_dict = {}
    if not le_dict:
        columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
        train = True
    else:
        columnsToEncode = le_dict.keys()
        train = False

    for feature in columnsToEncode:
        if train:
            le_dict[feature] = LabelEncoder()
        try:
            if train:
                df[feature] = le_dict[feature].fit_transform(df[feature])
            else:
                df[feature] = le_dict[feature].transform(df[feature])

            df = pd.concat([df,
                            pd.get_dummies(df[feature]).rename(columns=lambda x: feature + '_' + str(x))], axis=1)
            df = df.drop(feature, axis=1)
        except Exception:
            print('Error encoding ' + feature)
            df[feature] = df[feature].apply(pd.to_numeric, errors='coerce')
    return (df, le_dict)

This works on a pandas dataframe; for each column it encodes, it creates a mapping and returns it. So you would call it like this:

train_data, le_dict = oneHotEncode2(train_data)

Then on the test data, the call is made by passing the dictionary returned back from training:

test_data, _ = oneHotEncode2(test_data, le_dict)

An equivalent method is to use DictVectorizer. A related post on the same topic is on my blog. I mention it here since it provides some reasoning behind this approach over simply using get_dummies (disclosure: this is my own blog).

Tukeys
3

You can pass the data to the CatBoost classifier without encoding. CatBoost handles categorical variables itself by performing one-hot and expanding-mean target encoding.
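
A minimal sketch, assuming a hypothetical dataframe df with categorical columns, a target y, and the catboost package installed:

from catboost import CatBoostClassifier

# cat_features tells CatBoost which columns to treat as categorical
cat_features = ['color', 'group']  # hypothetical column names
model = CatBoostClassifier(iterations=100, verbose=False)
model.fit(df[cat_features + ['length']], y, cat_features=cat_features)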

Garima Jain
  • True but you have to inform catboost first which features are categorical as the algorithm cannot figure them out by itself. – agcala Oct 25 '20 at 16:55
3

You can do the following as well. Note that for the approach below you don't have to use pd.concat.

import pandas as pd 
# initialise data of lists.
data = {'Color':['Red', 'Yellow', 'Red', 'Yellow'], 'Length':[20.1, 21.1, 19.1, 18.1],
       'Group':[1,2,1,2]} 

# Create DataFrame 
df = pd.DataFrame(data) 

for _c in df.select_dtypes(include=['object']).columns:
    print(_c)
    df[_c]  = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed

You can also change specific columns to categorical. For example, here I am changing Color and Group:

import pandas as pd 
# initialise data of lists.
data = {'Color':['Red', 'Yellow', 'Red', 'Yellow'], 'Length':[20.1, 21.1, 19.1, 18.1],
       'Group':[1,2,1,2]} 

# Create DataFrame 
df = pd.DataFrame(data) 
columns_to_change = list(df.select_dtypes(include=['object']).columns)
columns_to_change.append('Group')
for _c in columns_to_change:
    print(_c)
    df[_c]  = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed
sushmit
2

I know I'm late to this party, but the simplest way to one-hot encode a dataframe in an automated way is to use this function:

import pandas as pd

def hot_encode(df):
    obj_df = df.select_dtypes(include=['object'])
    return pd.get_dummies(df, columns=obj_df.columns).values
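
A usage sketch on a hypothetical frame:

df = pd.DataFrame({'fruit': ['apple', 'pear'], 'n': [3, 4]})
encoded = hot_encode(df)  # NumPy array: n plus fruit_apple / fruit_pear dummies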
Rambatino
1

I used this in my acoustic model; maybe it helps in your model:

import numpy as np

def one_hot_encoding(x, n_out):
    x = x.astype(int)  
    shape = x.shape
    x = x.flatten()
    N = len(x)
    x_categ = np.zeros((N,n_out))
    x_categ[np.arange(N), x] = 1
    return x_categ.reshape((shape)+(n_out,))
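
A usage sketch; the function keeps the input's shape and appends a one-hot axis:

x = np.array([[0, 2], [1, 0]])
one_hot_encoding(x, 3).shape  # (2, 2, 3)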
remykarem
yunus
0

To add to the other answers, let me provide how I did it with a Python 2 function using NumPy:

import numpy as np

def one_hot(y_):
    # Function to encode output labels from number indexes 
    # e.g.: [[5], [0], [3]] --> [[0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]

    y_ = y_.reshape(len(y_))
    n_values = np.max(y_) + 1
    return np.eye(n_values)[np.array(y_, dtype=np.int32)]  # Returns FLOATS

The line n_values = np.max(y_) + 1 could be hard-coded for you to use the good number of neurons in case you use mini-batches for example.
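
For example, a quick sketch of a call:

y = np.array([[5], [0], [3]])
one_hot(y)
# array([[ 0.,  0.,  0.,  0.,  0.,  1.],
#        [ 1.,  0.,  0.,  0.,  0.,  0.],
#        [ 0.,  0.,  0.,  1.,  0.,  0.]])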

Demo project/tutorial where this function has been used: https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition

Guillaume Chevalier
0

This works for me:

pandas.factorize( ['B', 'C', 'D', 'B'] )[0]

Output:

[0, 1, 2, 0]
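
Note that this returns integer codes rather than one-hot columns; if you want actual one-hot vectors, you could, as a sketch, combine it with numpy.eye:

import numpy as np
import pandas as pd

codes, uniques = pd.factorize(['B', 'C', 'D', 'B'])
one_hot = np.eye(len(uniques))[codes]
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.],
#        [1., 0., 0.]])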
scottlittle
0

It can and should be as easy as:

class OneHotEncoder:
    def __init__(self,optionKeys):
        length=len(optionKeys)
        self.__dict__={optionKeys[j]:[0 if i!=j else 1 for i in range(length)] for j in range(length)}

Usage:

ohe=OneHotEncoder(["A","B","C","D"])
print(ohe.A)
print(ohe.D)
Ofek Ron
0

Expanding on @Martin Thoma's answer:

import numpy as np

def one_hot_encode(y):
    """Convert an iterable of indices to one-hot encoded labels."""
    y = y.flatten() # Sometimes not flattened vector is passed e.g (118,1) in these cases
    # the function ends up creating a tensor e.g. (118, 2, 1). flatten removes this issue
    nb_classes = len(np.unique(y)) # get the number of unique classes
    standardised_labels = dict(zip(np.unique(y), np.arange(nb_classes))) # get the class labels as a dictionary
    # which then is standardised. E.g imagine class labels are (4,7,9) if a vector of y containing 4,7 and 9 is
    # directly passed then np.eye(nb_classes)[4] or 7,9 throws an out of index error.
    # standardised labels fixes this issue by returning a dictionary;
    # standardised_labels = {4:0, 7:1, 9:2}. The values of the dictionary are mapped to keys in y array.
    # standardised_labels also removes the error that is raised if the labels are floats. E.g. 1.0; element
    # cannot be called by an integer index e.g y[1.0] - throws an index error.
    targets = np.vectorize(standardised_labels.get)(y) # map the dictionary values to array.
    return np.eye(nb_classes)[targets]
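
A usage sketch with non-contiguous float labels, the case the comments in the function describe:

y = np.array([4.0, 7.0, 9.0, 4.0])
one_hot_encode(y)
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.],
#        [1., 0., 0.]])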
0

Short Answer

Here is a function to do one-hot-encoding without using numpy, pandas, or other packages. It takes a list of integers, booleans, or strings (and perhaps other types too).

import typing


def one_hot_encode(items: list) -> typing.List[list]:
    results = []
    # find the unique items (we want unique items b/c duplicate items will have the same encoding)
    unique_items = list(set(items))
    # sort the unique items
    sorted_items = sorted(unique_items)
    # find how long the list of each item should be
    max_index = len(unique_items)

    for item in items:
        # create a list of zeros the appropriate length
        one_hot_encoded_result = [0 for i in range(0, max_index)]
        # find the index of the item
        one_hot_index = sorted_items.index(item)
        # change the zero at the index from the previous line to a one
        one_hot_encoded_result[one_hot_index] = 1
        # add the result
        results.append(one_hot_encoded_result)

    return results

Example:

one_hot_encode([2, 1, 1, 2, 5, 3])

# [[0, 1, 0, 0],
#  [1, 0, 0, 0],
#  [1, 0, 0, 0],
#  [0, 1, 0, 0],
#  [0, 0, 0, 1],
#  [0, 0, 1, 0]]
one_hot_encode([True, False, True])

# [[0, 1], [1, 0], [0, 1]]
one_hot_encode(['a', 'b', 'c', 'a', 'e'])

# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]

Long(er) Answer

I know there are already a lot of answers to this question, but I noticed two things. First, most of the answers use packages like numpy and/or pandas. And this is a good thing. If you are writing production code, you should probably be using robust, fast algorithms like those provided in the numpy/pandas packages. But, for the sake of education, I think someone should provide an answer which has a transparent algorithm and not just an implementation of someone else's algorithm.

Second, I noticed that many of the answers do not provide a robust implementation of one-hot encoding because they do not meet one of the requirements below. Below are some of the requirements (as I see them) for a useful, accurate, and robust one-hot encoding function:

A one-hot encoding function must:

  • handle list of various types (e.g. integers, strings, floats, etc.) as input
  • handle an input list with duplicates
  • return a list of lists corresponding (in the same order as) to the inputs
  • return a list of lists where each list is as short as possible

I tested many of the answers to this question and most of them fail on one of the requirements above.

Floyd
0

Try this:

!pip install category_encoders
import category_encoders as ce

categorical_columns = [...the list of names of the columns you want to one-hot-encode ...]
encoder = ce.OneHotEncoder(cols=categorical_columns, use_cat_names=True)
df_train_encoded = encoder.fit_transform(df_train_small)

df_train_encoded.head()

The resulting dataframe df_train_encoded is the same as the original, but the categorical features are now replaced with their one-hot-encoded versions.

More information on category_encoders can be found in the package documentation.

Andrea Araldo
-1

Here I tried this approach:

import numpy as np

# converting to one_hot

def one_hot_encoder(value, datal):
    datal[value] = 1
    return datal

def _one_hot_values(labels_data):
    encoded = [0] * len(labels_data)
    for j, i in enumerate(labels_data):
        max_value = [0] * (np.max(labels_data) + 1)
        encoded[j] = one_hot_encoder(i, max_value)
    return np.array(encoded)
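
A usage sketch:

labels = [0, 2, 1, 2]
_one_hot_values(labels)
# array([[1, 0, 0],
#        [0, 0, 1],
#        [0, 1, 0],
#        [0, 0, 1]])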
Aaditya Ura