Predicting missing values with scikit-learn's Imputer module

Question

I am writing a very basic program to predict missing values in a dataset using scikit-learn's Imputer class.

I have made a NumPy array, created an Imputer object with strategy='mean' and performed fit_transform() on the NumPy array.

When I print the array after calling fit_transform(), the 'NaN's remain, and I don't get any predictions.

What am I doing wrong here? How do I go about predicting the missing values?

import numpy as np
from sklearn.preprocessing import Imputer

X = np.array([[23.56],[53.45],['NaN'],[44.44],[77.78],['NaN'],[234.44],[11.33],[79.87]])

print X

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit_transform(X)

print X

That's not generally called prediction, it's called imputation. Unless the missing values are all in the future. — smci, Apr 12 '18 at 22:50

score 27 · Accepted Answer · answered Jul 29 '14 at 14:20

Per the documentation, sklearn.preprocessing.Imputer.fit_transform returns a new array; it doesn't alter the argument array. The minimal fix is therefore to assign the result back to X:

X = imp.fit_transform(X)
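
To illustrate the difference between discarding and keeping the return value, here is a minimal sketch. It assumes a recent scikit-learn release, where Imputer has been replaced by sklearn.impute.SimpleImputer, and it writes the missing entries as np.nan rather than the string 'NaN'. The name X_filled is only illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same data as in the question, with np.nan marking the missing entries
X = np.array([[23.56], [53.45], [np.nan], [44.44], [77.78],
              [np.nan], [234.44], [11.33], [79.87]])

imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# Return value discarded: with the default copy=True, X itself is untouched
imp.fit_transform(X)
print(np.isnan(X).sum())         # 2 entries are still missing

# Keeping the returned array gives the imputed data
X_filled = imp.fit_transform(X)
print(np.isnan(X_filled).sum())  # 0 entries are missing
```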

jonrsharpe


That is working fine, thanks. However, the predicted values for all missing values are coming out to be the same. I took much larger datasets too and still all 'NaN's were being replaced by the same value. What do I need to change in my program? – xennygrimmato Jul 29 '14 at 14:40
These aren't "predicted" values, they're just replacements for missing data. Your strategy is `'mean'`, so it will *"replace missing values using the mean along the axis"*. – jonrsharpe Jul 29 '14 at 14:48
Okay. Which algorithm should I use for predicting the missing values then? – xennygrimmato Jul 29 '14 at 14:49
I don't know - you haven't said how you want them to be replaced. Also, I'm basically just reading the documentation to answer your questions - why don't you have a look there? – jonrsharpe Jul 29 '14 at 14:58
Additionally, you can set `copy=False` in the constructor to do imputation in-place and avoid creating a copy whenever possible. – Gilles Louppe Jul 31 '14 at 07:03
@Rayu You may want to use multiple imputation to do this correctly. See here for more information about doing so using pandas and the very nice port of MICE by Frank Cheng: http://gsocfrankcheng.blogspot.ca/ – Don Apr 15 '16 at 13:55
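
If the goal is a replacement value that depends on the other features in each row, rather than one constant per column, a model-based imputer is one option. The sketch below is an assumption about what might suit the asker, not something proposed in this thread: it uses sklearn.impute.KNNImputer (available in scikit-learn 0.22 and later) on a made-up two-column array, since a single column gives a model nothing to learn from.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical data: column 1 is missing in two rows
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [8.0, 80.0],
    [9.0, np.nan],
    [10.0, 100.0],
])

# Mean imputation fills both gaps with the same column mean
X_mean = SimpleImputer(strategy='mean').fit_transform(X)

# KNN imputation averages the two nearest rows (measured on the
# features that are present), so each gap gets its own value
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_mean[[1, 4], 1])  # identical replacements
print(X_knn[[1, 4], 1])   # different replacements
```

Whether nearest-neighbour averaging is appropriate depends on the data; the number of neighbours here is arbitrary.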

score 7 · Answer 2 · edited Jun 08 '20 at 20:37

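Since scikit-learn 0.20, imputation lives in the sklearn.impute module: SimpleImputer replaces sklearn.preprocessing.Imputer, which was deprecated in 0.20 and removed in 0.22. The imputer is now used like this: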

import numpy as np
from sklearn.impute import SimpleImputer

# Replace each np.nan with the mean of its column
impute = SimpleImputer(missing_values=np.nan, strategy='mean')
impute.fit(X)
X = impute.transform(X)

Things to note:

np.nan is used instead of the string 'NaN'

SimpleImputer has no axis parameter; it always imputes column by column

The variable name is arbitrary: imp or imputer works just as well as impute
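
As a rough illustration of the column-by-column behaviour, the sketch below runs SimpleImputer on a small made-up array with two features. The data and the name X_filled are assumptions for the example; statistics_ is the fitted attribute that holds the per-column fill values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical two-column data with one gap in each column
X = np.array([[1.0, 100.0],
              [np.nan, 200.0],
              [3.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_filled = imputer.fit_transform(X)

# One fill value per column, computed from that column's observed entries
print(imputer.statistics_)  # [  2. 150.]
print(X_filled)
```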

answered Dec 21 '19 at 12:58

msklc


score 2 · Answer 3 · edited Mar 12 '20 at 06:56

Note: because of a change in the sklearn library, 'NaN' has to be replaced with np.nan, as shown below.

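 import numpy as np
 from sklearn.preprocessing import Imputer  # removed in scikit-learn 0.22; newer releases provide sklearn.impute.SimpleImputer

 # Fit on columns 1 and 2 only, then write the imputed values back into X
 imputer = Imputer(missing_values=np.nan, strategy='mean', axis=0)
 imputer = imputer.fit(X[:, 1:3])
 X[:, 1:3] = imputer.transform(X[:, 1:3])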
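
For readers on scikit-learn 0.22 or later, where Imputer is no longer available, the same slice-based pattern can be sketched with SimpleImputer. The three-column array below is invented for illustration (the answer does not show its X), and fit_transform is used in place of separate fit and transform calls.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: column 0 is complete, columns 1 and 2 have gaps
X = np.array([[0.0, 44.0, 72000.0],
              [1.0, np.nan, 48000.0],
              [2.0, 30.0, np.nan],
              [0.0, 38.0, 61000.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Impute only columns 1 and 2 and assign the result back into the slice
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])

print(X)
```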

Shrikant Chaudhari


answered Aug 17 '18 at 18:09

MD SAZID KHAN
