18

I am writing a very basic program to predict missing values in a dataset using scikit-learn's Imputer class.

I have made a NumPy array, created an Imputer object with strategy='mean' and performed fit_transform() on the NumPy array.

When I print the array after performing fit_transform(), the 'NaN's remain, and I don't get any prediction.

What am I doing wrong here? How do I go about predicting the missing values?

import numpy as np
from sklearn.preprocessing import Imputer

X = np.array([[23.56],[53.45],['NaN'],[44.44],[77.78],['NaN'],[234.44],[11.33],[79.87]])

print X

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit_transform(X)

print X
smci
xennygrimmato
  • That's not generally called prediction, it's called imputation. Unless the missing values are all in the future. – smci Apr 12 '18 at 22:50

3 Answers

27

Per the documentation, sklearn.preprocessing.Imputer.fit_transform returns a new array; it doesn't alter the argument array. The minimal fix is therefore to assign the result back:

X = imp.fit_transform(X)
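
A minimal sketch of the difference, assuming a recent scikit-learn in which Imputer has been superseded by sklearn.impute.SimpleImputer (the name X_filled is illustrative, not from the question). Calling fit_transform without keeping the result leaves X untouched, while the returned array holds the imputed values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same data as in the question, with real np.nan markers instead of 'NaN' strings.
X = np.array([[23.56], [53.45], [np.nan], [44.44], [77.78],
              [np.nan], [234.44], [11.33], [79.87]])

imp = SimpleImputer(missing_values=np.nan, strategy='mean')

imp.fit_transform(X)             # return value discarded
print(np.isnan(X).sum())         # X still contains its two NaNs

X_filled = imp.fit_transform(X)  # keep the returned array
print(np.isnan(X_filled).sum())  # no NaNs left in the result
```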
jonrsharpe
  • That is working fine, thanks. However, the predicted values for all missing values are coming out to be the same. I took much larger datasets too and still all 'NaN's were being replaced by the same value. What do I need to change in my program? – xennygrimmato Jul 29 '14 at 14:40
  • These aren't "predicted" values, they're just replacements for missing data. Your strategy is `'mean'`, so it will *"replace missing values using the mean along the axis"*. – jonrsharpe Jul 29 '14 at 14:48
  • Okay. Which algorithm should I use for predicting the missing values then? – xennygrimmato Jul 29 '14 at 14:49
  • I don't know - you haven't said how you want them to be replaced. Also, I'm basically just reading the documentation to answer your questions - why don't you have a look there? – jonrsharpe Jul 29 '14 at 14:58
  • Additionally, you can set `copy=False` in the constructor to do imputation in-place and avoid creating a copy whenever possible. – Gilles Louppe Jul 31 '14 at 07:03
  • @Rayu You may want to use multiple imputation to do this correctly. See here for more information about doing so using pandas and the very nice port of MICE by Frank Cheng: http://gsocfrankcheng.blogspot.ca/ – Don Apr 15 '16 at 13:55
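
To see why every missing entry ends up with the same value under strategy='mean', here is a rough pure-Python sketch of mean imputation on a single column. The helper impute_mean is hypothetical and is not scikit-learn's implementation; it only mirrors the idea described in the comments above.

```python
import math

def impute_mean(column):
    """Replace NaN entries with the mean of the observed entries (illustrative helper)."""
    observed = [v for v in column if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    # Every missing entry receives the same column mean.
    return [mean if math.isnan(v) else v for v in column]

# The values from the question, as a flat list.
column = [23.56, 53.45, float('nan'), 44.44, 77.78,
          float('nan'), 234.44, 11.33, 79.87]
filled = impute_mean(column)
print(filled[2], filled[5])  # both are the mean of the seven observed values
```

Replacements that differ from row to row need a method that draws on other features, such as the multiple-imputation approach mentioned in the last comment.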
7

Since scikit-learn 0.20, imputation lives in the sklearn.impute module, and Imputer has been replaced by SimpleImputer. It can be used like this:

import numpy as np
from sklearn.impute import SimpleImputer

impute = SimpleImputer(missing_values=np.nan, strategy='mean')
impute.fit(X)
X = impute.transform(X)

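Things to note:

np.nan is used instead of the string 'NaN'.

The axis parameter is no longer needed: SimpleImputer does not accept one and always imputes column-wise.

The variable name is arbitrary; imp or imputer works just as well as impute.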
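
Because fit and transform are separate steps, the column means learned from one array can be reused on another. The sketch below assumes scikit-learn 0.20 or later; X_train and X_new are made-up names and data, not part of the question.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: one array to learn the mean from, one to apply it to.
X_train = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_new = np.array([[np.nan], [10.0]])

impute = SimpleImputer(missing_values=np.nan, strategy='mean')
impute.fit(X_train)                     # learns the mean of each column
print(impute.statistics_)               # per-column fill values learned by fit

X_new_filled = impute.transform(X_new)  # NaN in X_new is filled with the X_train mean
print(X_new_filled)
```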

msklc
2

Note: Due to a change in the sklearn library, 'NaN' has to be replaced with np.nan, as shown below.

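import numpy as np
from sklearn.preprocessing import Imputer  # available only in scikit-learn < 0.22

# Assumes X is a float array with at least three columns; columns 1 and 2 are imputed.
imputer = Imputer(missing_values=np.nan, strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])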
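
Imputer was deprecated in scikit-learn 0.20 and removed in 0.22, so on newer versions the same column-wise pattern can be written with SimpleImputer. This is only a sketch with made-up data; it assumes X is a float array with at least three columns.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical 3-column float array; columns 1 and 2 contain missing values.
X = np.array([[1.0, 10.0, np.nan],
              [2.0, np.nan, 6.0],
              [3.0, 30.0, 8.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit and fill only columns 1 and 2, writing the result back into X.
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
print(X)
```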