LabelEncoder encodes different values to same value

Question

Summary: sklearn's LabelEncoder encodes different values to the same value. encoder.fit(data) and data_encoded = encoder.transform(data) run without errors, but when I call encoder.inverse_transform(data_encoded) right after the transform, it raises an error:

ValueError: y contains previously unseen labels: [19297]

What I want to do:

I have fairly large data (nearly 1.5 GB), with more to come. For a machine learning task I need to label-encode the data, but I can't do it on my local laptop: the encoder has to see the whole dataset to encode it properly, and the data is too large to process there.

What I have done:

So I loaded the whole dataset into Google Colab, encoded it, and saved the encoders with pickle. Then, on my local PC, I took the newcomer data (which is much smaller), loaded the encoders back, and updated each encoder's classes_ (see Part 1).

Then I transformed the newcomer data and tried to inverse transform it right away to make sure it had been done properly, but that raised an error (see Part 2).

Then I looked up the rows encoded as 19297 with data.loc, checked them against the original data, and realized that the encoder maps different values to the same code (different values -> 19297). Can anyone help with this problem? Thanks a lot.

Part 1:

import pickle

import numpy as np

# Load the encoders that were fitted on the full dataset in Colab
with open("data/encoders/item_id_encoder.pkl", 'rb') as file:
    item_encoder = pickle.load(file)
with open("data/encoders/store_id_encoder.pkl", 'rb') as file:
    store_encoder = pickle.load(file)
with open("data/encoders/week_encoder.pkl", 'rb') as file:
    week_encoder = pickle.load(file)

# Sets for fast membership tests while scanning the newcomer data
item_classes = set(item_encoder.classes_)
store_classes = set(store_encoder.classes_)
week_classes = set(week_encoder.classes_)

# `data` is the newcomer DataFrame; `bar` is a progress bar created elsewhere (not shown)
bar.start()
for row in data.itertuples():
    # Append any label the encoder has not seen yet to the end of classes_
    if row.item_id not in item_classes:
        item_classes.add(row.item_id)
        item_encoder.classes_ = np.append(item_encoder.classes_, row.item_id)
    if row.store_id not in store_classes:
        store_classes.add(row.store_id)
        store_encoder.classes_ = np.append(store_encoder.classes_, row.store_id)
    if row.week not in week_classes:
        week_classes.add(row.week)
        week_encoder.classes_ = np.append(week_encoder.classes_, row.week)
    bar.update(row.Index)
bar.finish()

Part 2:

store_ids = store_encoder.transform(data.store_id)
item_ids = item_encoder.transform(data.item_id)
weeks = week_encoder.transform(data.week)

# this raises the ValueError shown above
item_encoder.inverse_transform(item_ids)

This sort of thing is a regular source of frustration in my experience. I tend to use `pandas.get_dummies` or write my own not-particularly-optimised label encoder. There's a good discussion on the topic here: https://stackoverflow.com/questions/21057621/sklearn-labelencoder-with-never-seen-before-values — ame, Mar 28 '19 at 11:09
@ame It seems like there is no other way than writing own label encoder. I just worry about performance issues because I'm not a python wizard so I'm sure I will try to do everything with bunches of nested for loops :) — emremrah, Mar 28 '19 at 11:15

score 0 · Accepted Answer · answered Mar 29 '19 at 07:07

Well, I have implemented a label encoder with update capability. You can:

Fit an encoder on your data to initialize it. The encoder is saved as a pickle file.
Update an existing encoder with newcomer data.
Transform and inverse transform your data.
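
The three operations above could look roughly like the following. This is a minimal sketch and not the code from the Lencoder repository: the class name UpdatableLabelEncoder and its attributes are hypothetical, the labels are made up, and codes are assigned in first-seen order rather than in sorted order as sklearn's LabelEncoder does.

```python
class UpdatableLabelEncoder:
    """Hypothetical dict-based label encoder that can learn new labels later."""

    def __init__(self):
        self.label_to_code = {}  # label -> integer code
        self.code_to_label = []  # position in the list is the code

    def fit(self, labels):
        """Forget everything and learn the given labels from scratch."""
        self.label_to_code = {}
        self.code_to_label = []
        return self.update(labels)

    def update(self, labels):
        """Give each unseen label the next free code; known labels keep theirs."""
        for label in labels:
            if label not in self.label_to_code:
                self.label_to_code[label] = len(self.code_to_label)
                self.code_to_label.append(label)
        return self

    def transform(self, labels):
        """Map labels to codes; raises KeyError for a label never seen."""
        return [self.label_to_code[label] for label in labels]

    def inverse_transform(self, codes):
        """Map codes back to the original labels."""
        return [self.code_to_label[code] for code in codes]


# Fit on the "full" data, then extend with newcomer data.
encoder = UpdatableLabelEncoder().fit([301, 105, 301, 42])
encoder.update([105, 999, 7])

codes = encoder.transform([42, 999, 7, 301])
round_trip = encoder.inverse_transform(codes)
```

Because a new label always receives the next unused code, updating never changes codes that were already handed out, so data encoded earlier stays valid. An instance holds only a dict and a list, so it should pickle the same way as the encoders in the question.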

See: https://github.com/emremrah/Lencoder
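
As for the likely cause of the original error: as far as I can tell from the scikit-learn source, LabelEncoder keeps classes_ sorted after fit and, for numeric labels, transform looks codes up with np.searchsorted, a binary search that assumes sorted input. Appending newcomers to the end of classes_ breaks that assumption. The sketch below imitates the lookup with the standard library's bisect_left as a stand-in (an assumption made for illustration; it is not sklearn's code) and uses made-up labels.

```python
from bisect import bisect_left


def encode(classes, value):
    """Find a label's code by binary search; only valid if `classes` is sorted."""
    return bisect_left(classes, value)


# Sorted classes, as they would be right after fitting.
fitted = [10, 20, 30, 40]

# A newcomer label appended at the end, as in Part 1 of the question.
appended = fitted + [5]  # [10, 20, 30, 40, 5] is no longer sorted

# Binary search on the unsorted list: 10 and 5 collide on code 0,
# and 40 gets code 5, which is past the end of the 5-element list.
broken_codes = [encode(appended, v) for v in appended]

# Re-sorting restores a valid lookup, but note that existing codes shift
# (10 was code 0 before and becomes code 1).
resorted = sorted(set(appended))  # [5, 10, 20, 30, 40]
fixed_codes = [encode(resorted, v) for v in appended]
```

The out-of-range code presumably plays the role of 19297 in the question: inverse_transform rejects any code that does not correspond to a position in classes_. Re-sorting, for example with np.unique on the combined labels, restores a valid lookup but shifts existing codes, so anything encoded earlier would need to be re-encoded.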
