10

I would like to apply SMOTE to an unbalanced dataset that contains binary, categorical, and continuous features. Is there a way to apply SMOTE to binary and categorical data?

TTZ

3 Answers

15

As per the documentation, this is now possible with the use of SMOTENC. SMOTE-NC is capable of handling a mix of categorical and continuous features.

Here is the code from the documentation:

    from imblearn.over_sampling import SMOTENC

    # X, y are the feature matrix and labels; columns 0 and 2 of X are categorical
    smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
    X_resampled, y_resampled = smote_nc.fit_resample(X, y)

Azaf Tanveer
  • Hi, do you know what should I use if the dataset has discrete and categorical features? – mjbsgll Oct 31 '19 at 18:20
  • You only need to pass the indexes of the categorical features, as in the second line of the code above. Discrete features are taken care of automatically. – Deepak May 25 '20 at 20:46
2

As per the documentation, SMOTE doesn't support categorical data in Python yet, and it produces continuous outputs.

You can instead employ a workaround where you convert the categorical variables to integers and use SMOTE.

Then use np.round(X_train[categorical_variables]) to convert them back to the respective categorical values.
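The mechanics of this workaround can be sketched in plain Python. This is only an illustration under stated assumptions: the category names, the `interpolate` helper, and the fixed interpolation weight `lam=0.5` are hypothetical (SMOTE draws the weight at random between 0 and 1), and no SMOTE library is called.

```python
# Hypothetical nominal feature, integer-encoded for the workaround.
categories = ["red", "green", "blue"]  # illustrative names only
code_of = {name: i for i, name in enumerate(categories)}


def interpolate(a, b, lam):
    """Return a SMOTE-style synthetic point on the segment between a and b."""
    return [x + lam * (y - x) for x, y in zip(a, b)]


# Two minority samples: [continuous feature, encoded category]
parent_a = [1.0, float(code_of["red"])]
parent_b = [3.0, float(code_of["blue"])]

# lam is fixed here for reproducibility; SMOTE samples it uniformly in [0, 1].
synthetic = interpolate(parent_a, parent_b, lam=0.5)

# Round the categorical column back to an integer code, then decode it.
rounded_code = int(round(synthetic[1]))
decoded = categories[rounded_code]

print(synthetic)  # [2.0, 1.0]
print(decoded)    # green
```

Note that the decoded category depends on the arbitrary order of the integer codes, so rounding is only sensible when the categories have a natural order.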

Mayank
  • 1
    This solution isn't very appropriate for nominal data where order means nothing. Say for example we codify the classes Women's Clothes, Cars, Women's Shoes as 0,1,2. Then, using SMOTE we take 2 samples where one has category 0, and the other has category 2, and we end up interpolating such that the rounded value is 1. The final result would be that we have a generated data sample classified in the 'Car' category whereas the parents belonged to Women's Clothes and Women's Shoes, which is totally meaningless. – Daniel Crane Jun 22 '18 at 09:41
  • 3
    Can we convert the categorical data to a one-hot encoding and then apply SMOTE? – Pragya Aug 05 '18 at 04:04
  • This feature has now been released as SMOTENC, which handles both numerical and categorical variables. – Deepak May 25 '20 at 20:47
2

As of Jan 2018, this has not been implemented in Python. The following is a reference from the team. In fact, they are open to proposals if someone wants to implement it.

For those with an academic interest in this ongoing issue, the paper by Chawla, Bowyer et al. addresses this SMOTE-NC (Nominal Continuous) sampling problem in section 6.1.
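As a rough illustration of the idea described in that section, here is a simplified sketch, not the imbalanced-learn implementation. The function name `smote_nc_sample`, its parameters, and the toy data are all hypothetical. It assumes the minority class is a list of rows, uses the sample standard deviation, and breaks ties in the majority vote arbitrarily.

```python
import math
import random
import statistics
from collections import Counter


def smote_nc_sample(minority, cont_idx, cat_idx, i, k, rng):
    """Generate one synthetic sample from minority[i] (simplified sketch)."""
    # Step 1: median of the standard deviations of the continuous features.
    stds = [statistics.stdev([row[j] for row in minority]) for j in cont_idx]
    med = statistics.median(stds)

    def distance(a, b):
        # Euclidean distance on the continuous features; each differing
        # nominal feature adds med**2 under the square root.
        d2 = sum((a[j] - b[j]) ** 2 for j in cont_idx)
        d2 += sum(med ** 2 for j in cat_idx if a[j] != b[j])
        return math.sqrt(d2)

    # Step 2: k nearest minority neighbours of the base sample.
    base = minority[i]
    others = [row for n, row in enumerate(minority) if n != i]
    neighbours = sorted(others, key=lambda row: distance(base, row))[:k]

    # Step 3a: continuous features are interpolated toward a random neighbour.
    target = rng.choice(neighbours)
    lam = rng.random()
    new = list(base)
    for j in cont_idx:
        new[j] = base[j] + lam * (target[j] - base[j])

    # Step 3b: nominal features take the most frequent value among neighbours.
    for j in cat_idx:
        new[j] = Counter(row[j] for row in neighbours).most_common(1)[0][0]
    return new


# Toy minority class: two continuous columns and one nominal column.
minority = [
    [1.0, 10.0, "a"],
    [1.2, 11.0, "a"],
    [0.9, 10.5, "b"],
    [1.1, 10.2, "a"],
]
sample = smote_nc_sample(
    minority, cont_idx=[0, 1], cat_idx=[2], i=0, k=3, rng=random.Random(0)
)
print(sample)
```

Because the nominal value is chosen by vote rather than by interpolation, the synthetic sample always carries a category that actually occurs among its neighbours.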

Update: This feature has been implemented as of 21 Oct, 2018. The feature request is now closed.

cph_sto
  • The link to the paper is broken, but the latest archived version in the Wayback Machine can be found here: https://web.archive.org/web/20180413091607/https://www.jair.org/media/953/live-953-2037-jair.pdf – lampShadesDrifter May 14 '21 at 07:13