
I want to create an imputation strategy using the mice function from the mice package. The problem is that I can't seem to find any predict method (or its cousins) for new data in this package.

I want to do something like this:

require(mice)
data(boys)

train_boys <- boys[1:400,]
test_boys <- boys[401:nrow(boys),]

mice_object <- mice(train_boys)
train_complete_boys <- complete(mice_object)

# Here comes a hypothetical method
test_complete_boys <- predict(mice_object, test_boys)

I would like to find an approach that emulates the code above. It is of course possible to run mice separately on the train and test datasets, but that seems logically incorrect: all the information you have is in the train dataset, and observations in the test dataset shouldn't provide information about each other. That's especially true when the observations can be ordered by time of appearance.

One possible approach is to add rows from the test dataset to the train dataset one at a time, rerunning the imputation each time. However, this seems very inelegant.
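For concreteness, the row-by-row idea might look roughly like the sketch below. The helper impute_one_row is a hypothetical name, only the numeric columns of boys are used to keep the example small, and only three test rows are processed because every call reruns mice from scratch. Note that the appended row's observed values still take part in fitting the model, so this only approximates a true predict.

```r
library(mice)

# Assumption: restrict to numeric columns so the default pmm method is used throughout.
num_boys <- boys[, c("age", "hgt", "wgt", "bmi", "hc")]
train_boys <- num_boys[1:400, ]

# Hypothetical helper: append one new row to the training data, impute,
# and return only the completed version of that row.
impute_one_row <- function(train, new_row, ...) {
  combined <- rbind(train, new_row)
  imp <- mice(combined, m = 1, printFlag = FALSE, ...)
  completed <- complete(imp)
  completed[nrow(completed), , drop = FALSE]
}

# Process a few test rows one at a time (each iteration is a full mice run).
test_rows <- 401:403
test_complete_boys <- do.call(
  rbind,
  lapply(test_rows, function(i) {
    impute_one_row(train_boys, num_boys[i, , drop = FALSE], seed = 1)
  })
)
```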

So here is the question:

Is there a method for the mice package that would be similar to the general predict method? If not, what are the possible workarounds?

Thank you!

slamballais
Loiisso

1 Answer


I think it could be logically incorrect to "predict" missing values from another imputed dataset, since the MICE algorithm iteratively builds models that estimate the missing values from the observed values in the given dataset.

In other words, when you run mice_object <- mice(train_boys), the algorithm estimates and imputes the NAs from the relationships between variables in train_boys. Such estimates cannot simply be applied to test_boys, because the relationships between variables in test_boys may differ from those in train_boys. The amount of observed information also differs between the two datasets.

If you believe the relationships between variables are homogeneous across train_boys and test_boys, how about imputing the NAs before splitting the dataset? For example:

mice_object <- mice(boys)
complete_boys <- complete(mice_object)
train_boys <- complete_boys[1:400,]
test_boys <- complete_boys[401:nrow(complete_boys),]
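If you would rather not let the test rows shape the imputation model, one variant of this is the ignore argument of mice(), which marks rows that are still imputed but do not contribute to the model estimates. The following is a minimal sketch, assuming your installed mice version supports ignore. The restriction to numeric columns, the name is_test, and the values of m and seed are choices made for this illustration only.

```r
library(mice)

# Assumption: restrict to numeric columns so the default pmm method is used throughout.
num_boys <- boys[, c("age", "hgt", "wgt", "bmi", "hc")]

# TRUE marks rows that should not inform the imputation model (the test rows).
is_test <- seq_len(nrow(num_boys)) > 400

# Test rows are ignored when estimating the model, but are still imputed.
mice_object <- mice(num_boys, ignore = is_test, m = 5, seed = 1,
                    printFlag = FALSE)

# complete() returns the first of the m completed datasets by default.
complete_boys <- complete(mice_object)
train_complete_boys <- complete_boys[!is_test, ]
test_complete_boys <- complete_boys[is_test, ]
```

As far as I know, multivariate imputation methods in mice do not honour ignore, so this is best checked against the documentation if you change the default methods.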

If you need more information about MICE, you can read "Multiple imputation by chained equations: What is it and how does it work?".

ytu