I'm trying to create dummy variables for my input set, which has the following form: (image: My Input set)

I encoded the categorical data, so my array now has this form: (image: Encoded input set)

Next, I would like to create dummy variables using OneHotEncoder. I know that it used to work this way:

onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()

But the OneHotEncoder class now works a bit differently, and I can't figure out how to adjust it to my dataset so that it behaves the same way. My code:

import numpy as np
import pandas as pd

dataset = pd.DataFrame(
    {'RowNumber': [1, 2, 3, 4, 5],
     'CustomerId': [602, 311, 304, 354, 888],
     'Surname': ['Har', 'Hil', 'Oni', 'Bon', 'Mit'],
     'CreditScore': [619, 608, 502, 699, 850],
     'Geography': ['FR', 'ES', 'FR', 'FR', 'ES'],
     'Gender': ['F', 'F', 'F', 'F', 'F'],
     'Age': [42, 41, 42, 39, 43],
     'Tenure': [2, 1, 8, 0, 2]})

X = dataset.iloc[:, 3:-1].values
y = dataset.iloc[:, -1].values

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X[:, 1] = le.fit_transform(X[:, 1])
X[:, 2] = le.fit_transform(X[:, 2])

# Making dummy variables
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()

Thank you in advance!

user1953384
Harnold
  • Could you provide a sample input in the script in place of `"Churn_Modelling.csv"` to make your code runnable? Also, why do you say "dummy variables"? It looks like you're simply trying to create one-hot encodings of those two columns. – user1953384 Mar 27 '20 at 08:10
  • This is the first 5 rows of my dataset:
    RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
    1,15634602,Hargrave,619,France,Female,42,2,0,1,1,1,101348.88,1
    2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
    3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
    4,15701354,Boni,699,France,Female,39,1,0,2,0,0,93826.63,0
    5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0
    – Harnold Mar 27 '20 at 08:19
  • Would be great if you could define that in the code in your post as `dataset = pd.DataFrame(...)`. – user1953384 Mar 27 '20 at 08:20
  • I'm really sorry, but I don't quite get what you expect from me. I'm pretty new to Python and machine learning, so you need to be really specific. – Harnold Mar 27 '20 at 08:25
  • I've updated your post to illustrate what I meant. – user1953384 Mar 27 '20 at 08:34
  • Does this answer your question? [How can I one hot encode in Python?](https://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python) – user1953384 Mar 27 '20 at 08:34
  • Unfortunately, I don't understand how to implement this in my case. I've tried reading https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html, but I still don't quite get what the input should look like. By the way, thanks for updating my post! I'd figured it out myself after a while, but you were a bit faster :( – Harnold Mar 27 '20 at 08:42

2 Answers

1

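It turns out the API for OneHotEncoder has changed, as the documentation says. The `categorical_features` argument is gone: the encoder now one-hot encodes every column of the 2D array it receives. So instead of a flat sequence of category values, you pass a list of rows, each row being a list of category values (which also lets you generate multiple one-hot encodings in the same call, if needed).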

Does the following work as you expect?

import numpy as np
import pandas as pd

dataset = pd.DataFrame(
    {'RowNumber': [1, 2, 3, 4, 5],
     'CustomerId': [602, 311, 304, 354, 888],
     'Surname': ['Har', 'Hil', 'Oni', 'Bon', 'Mit'],
     'CreditScore': [619, 608, 502, 699, 850],
     'Geography': ['FR', 'ES', 'FR', 'FR', 'ES'],
     'Gender': ['F', 'F', 'F', 'F', 'F'],
     'Age': [42, 41, 42, 39, 43],
     'Tenure': [2, 1, 8, 0, 2]})

X = dataset.iloc[:, 3:-1].values
y = dataset.iloc[:, -1].values

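# Making dummy variables: OneHotEncoder expects a 2D array, so select each
# column with a list index (X[:, [1]]) to keep the samples-by-features shape.
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
X1 = ohe.fit_transform(X[:, [1]]).toarray()  # one-hot columns for Geography
X2 = ohe.fit_transform(X[:, [2]]).toarray()  # one-hot columns for Gender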
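
If you want a single array back, the way the old `categorical_features` call returned one, one option is to wrap the encoder in a `ColumnTransformer`. The following is only a sketch under some assumptions: it needs scikit-learn 0.20 or later, the step name `'onehot'` is an arbitrary label, and the trimmed-down `dataset` is a stand-in for the real one. Note that no `LabelEncoder` step is needed, since `OneHotEncoder` accepts string columns directly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Stand-in data: only the feature columns from the example above.
dataset = pd.DataFrame(
    {'CreditScore': [619, 608, 502, 699, 850],
     'Geography': ['FR', 'ES', 'FR', 'FR', 'ES'],
     'Gender': ['F', 'F', 'F', 'F', 'F'],
     'Age': [42, 41, 42, 39, 43]})
X = dataset.values  # object array; columns 1 and 2 hold strings

# One-hot encode columns 1 and 2 and pass the other columns through unchanged.
# sparse_threshold=0 asks ColumnTransformer to always return a dense array.
ct = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), [1, 2])],
    remainder='passthrough',
    sparse_threshold=0)

# Encoded columns come first, passthrough columns last.
X_encoded = np.asarray(ct.fit_transform(X)).astype(float)
print(X_encoded.shape)
print(X_encoded[0])
```

The encoded columns are placed before the untouched ones, so the column order of `X_encoded` differs from that of `X`.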
user1953384
0

Use pandas.get_dummies() to create dummy variables for a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'Country':['France','Spain','Germany','France','Spain','Germany','Germany'],
                   'Gender':['Male','Female','Male','Female','Male','Male','Female'],
                   'Age':[52,30,38,45,41,55,29]})

df = pd.get_dummies(data = df, columns = ['Country','Gender'])
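
To see what this call produces, it can help to inspect the resulting column names. The sketch below uses a shortened three-row frame as a stand-in for the one above. It also shows `drop_first=True`, an optional `get_dummies` argument that is not part of the answer; it drops the first level of each encoded column, which some models need to avoid perfectly collinear dummy columns.

```python
import pandas as pd

# Shortened stand-in for the frame above.
df = pd.DataFrame({'Country': ['France', 'Spain', 'Germany'],
                   'Gender': ['Male', 'Female', 'Male'],
                   'Age': [52, 30, 38]})

# Untouched columns are kept first, followed by one column per category.
dummies = pd.get_dummies(data=df, columns=['Country', 'Gender'])
print(list(dummies.columns))

# Optional: drop the first (alphabetically sorted) level of each feature.
reduced = pd.get_dummies(data=df, columns=['Country', 'Gender'], drop_first=True)
print(list(reduced.columns))
```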
ManojK
  • I want to use the OneHotEncoder class. I've already done this with pandas and it works well, but thanks anyway! – Harnold Mar 27 '20 at 08:42