
I'm using OneHotEncoder on a column with 5 values, which gave me 5 columns (for Z). That's OK. Now I have another column with 3 values, but I got 2 columns instead of 3 in Z1. What do I need to change in the code so that I get 3 columns in Z1?

Also, I would like an explanation of the OneHotEncoder code: why do I have to use np.hstack here? Thank you very much!

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

X = df.iloc[:, :-1].values
Y = df.iloc[:, -1].values 

labelencoder_X5 = LabelEncoder()
labelencoder_X6 = LabelEncoder()

X[:, 5] = labelencoder_X5.fit_transform(X[:, 5])
X[:, 6] = labelencoder_X6.fit_transform(X[:, 6])

onehotencoder = OneHotEncoder(sparse=False)
Z= onehotencoder.fit_transform(X[:, [5]])
X = np.hstack(( Z, X[:,:5] , X[:,6:])).astype('float')
#handling the dummy variable trap
X = X[:, 1:]

onehotencoder = OneHotEncoder(sparse=False)
Z1= onehotencoder.fit_transform(X[:, [6]])
X = np.hstack(( Z1, X[:,:6] , X[:,7:])).astype('float')
#handling the dummy variable trap
X = X[:, 1:]

1 Answer


You compute Z1 by transforming the column at index 6, but after the previous block that is no longer the sixth column of the original data: when you redefined X with the hstack, you put the dummy variables for the original column 5 first, which shifted all the remaining columns to the right.
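
To make the shift concrete, here is a minimal sketch on made-up data (a toy array, not your df, with eight columns: a 5-category column at index 5 and a 3-category column at index 6). After the first hstack and the X = X[:, 1:] drop, the original column 6 ends up at index 9, so X[:, [6]] one-hot encodes a completely different column:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

n = 10
toy = np.zeros((n, 8), dtype=object)
toy[:, :5] = np.arange(n * 5).reshape(n, 5)    # numeric filler, columns 0-4
toy[:, 5] = np.arange(n) % 5                   # the 5-category column
toy[:, 6] = np.arange(n) % 3                   # the 3-category column
toy[:, 7] = np.arange(n)                       # numeric filler, column 7

# sparse=False as in your code; newer sklearn versions spell it sparse_output=False
Z = OneHotEncoder(sparse=False).fit_transform(toy[:, [5]])   # shape (n, 5)
toy = np.hstack((Z, toy[:, :5], toy[:, 6:])).astype('float')
toy = toy[:, 1:]                               # drop one dummy; 4 dummy columns remain

print(toy[:, 6])   # one of the original numeric columns, not the 3-category one
print(toy[:, 9])   # the original column 6 now lives here (4 dummies + 5 columns before it)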

That said, some more general comments are in order.

  • You do not need to label-encode before one-hot encoding; that requirement is an artifact of old sklearn versions.
  • You can use a single OneHotEncoder and fit_transform both of your categorical columns in one go (see the sketch after this list).
  • You can use the parameter drop='first' to "handle the dummy variable trap" inside OneHotEncoder.
  • You can use ColumnTransformer to avoid manually using hstack.
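
On the second point, a minimal sketch (assuming X here is your original feature array, before any of the encoding blocks): a single encoder can fit_transform both categorical columns at once and return all of their dummy columns side by side. You would still have to hstack that with the remaining columns yourself, which is exactly what ColumnTransformer below takes care of.

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(drop='first', sparse=False)
dummies = ohe.fit_transform(X[:, [5, 6]])   # dummy columns for both features, first level of each dropped
print(ohe.categories_)                      # the levels found in each of the two columns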

That is, I'd suggest instead:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

tfmr = ColumnTransformer(
    transformers=[('ohe', OneHotEncoder(drop='first', sparse=False), [5,6])], 
    remainder='passthrough'
    )
X_preproc = tfmr.fit_transform(X)
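
Note that ColumnTransformer puts the transformed columns first (here the dummy columns for the original columns 5 and 6, minus the dropped levels) and appends the passthrough columns after them, so the column order still changes, just as it did with your manual hstack. On recent sklearn versions, tfmr.get_feature_names_out() will show the resulting layout.
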
Ben Reiniger