
Interestingly, I see a lot of different answers about this, both on Stack Overflow and on other sites:

While working on my training data set, I imputed the missing values of a certain column using a decision tree model. So here's my question: is it fair to use ALL available data (training and test) to build a model for imputation (not prediction), or may I only touch the training set when doing this? Also, once I begin work on my test set, must I impute using only my test set data, reuse the same imputation model built on my training set, or can I use all the data available to me to retrain my imputation model?

I would think that, so long as I didn't touch my test set for training the prediction model, using the rest of the data for things like imputation would be fine. But maybe that would be breaking a fundamental rule. Thoughts?
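For concreteness, here's a minimal sketch of the kind of decision-tree imputation I'm describing, fit on the training set only (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

# Toy training data with a column ("age") that has missing entries.
train = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, np.nan],
    "income": [40_000, 52_000, 61_000, 58_000, 47_000, 45_000],
    "tenure": [1, 4, 6, 8, 3, 2],
})

# IterativeImputer models each column with missing values as a function of
# the other columns, here using a decision tree as the estimator.
imputer = IterativeImputer(estimator=DecisionTreeRegressor(max_depth=3),
                           random_state=0)
train_imputed = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
```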

Analysa

3 Answers


Do not use any information from the Test set when doing any processing on your Training set. @Maxim and the answer linked to are correct, but I want to augment the answer.

Imputation attempts to reason from incomplete data to suggest likely values for the missing entries. I think it's helpful to consider the missing values as a form of measurement error (see this article for a useful demonstration of this). As such, there are reasons to believe that the missingness is related to the underlying data generating process. And that process is precisely what you're attempting to replicate (though, of course, imperfectly) with your model.

If you want your model to generalize well -- don't we all! -- then it is best to make sure that whatever processing you do to the training set depends only on the information contained within that set.

I would even suggest that you consider a three-way split: Test, Training, and Validation sets. The Validation set is further culled from the Training set and used to test model fit against "itself" (when tuning hyperparameters). This is, in part, what cross-validation procedures do in sklearn and other pipelines. In this case, I generally conduct the imputation after the CV split -- that is, within each fold -- rather than on the full Training set, since I am attempting to evaluate a model on data the model "knows" (and the holdout data are a proxy for the unknown/future data). But note that I have not seen this suggested as uniformly as maintaining a complete wall between the Test and Training sets.
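To illustrate what I mean by imputing after the CV split, here is a minimal sketch (synthetic data, not your actual problem) that puts the imputer inside an sklearn pipeline, so it is re-fit on each fold's training portion only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan      # knock out ~10% of the entries
y = (rng.random(200) > 0.5).astype(int)

# The imputer lives inside the pipeline, so cross_val_score re-fits it on each
# training fold; the held-out fold never leaks into the imputation statistics.
pipe = make_pipeline(SimpleImputer(strategy="median"),
                     LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```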

Savage Henry
  • So if you use an imputation method like `KNNImputer` or simply the **mean** or **median**, do you fit your imputer on the training set, and then use those exact results on the test and validation sets? Or do you re-fit it on the test and validation? E.g. assume you're using **mean**, and for column `Age` in the training set you got `27`. Do you use this `27` for the missing values in the `Age` column in the test and validation sets as well, or do you re-calculate the mean for them independently? – Alaa M. Feb 26 '21 at 14:07
    If you plan to do imputation of missing data when the model performs in "the wild", then you can use the results of the imputer you fit on the training set when doing testing and validation. The intuition is: the model is fitting data *and* filling in where data is missing, so the imputer built on your training data is the model's best approximation for guessing the missing value. But remember, train the imputer on the training set only, otherwise the imputer is learning from data it should not have "seen". – Savage Henry Feb 27 '21 at 15:26
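To make the comment exchange concrete, here is a minimal sketch with made-up ages (chosen so the training mean happens to come out to 27, as in the example above): fit the imputer on the training set, then apply the same fitted imputer to the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

age_train = np.array([[22.0], [27.0], [np.nan], [35.0], [24.0]])
age_test  = np.array([[np.nan], [41.0], [np.nan]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(age_train)          # learns the training mean: (22+27+35+24)/4 = 27
print(imputer.statistics_)      # -> [27.]

# The SAME fitted imputer fills the test set's gaps with the training mean,
# rather than a mean recomputed from the test data.
print(imputer.transform(age_test))
```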

I would agree with this answer on Cross Validated:

The division between training and test set is an attempt to replicate the situation where you have past information and are building a model which you will test on future, as-yet-unknown information.

The way you preprocess the data may affect model performance, in some cases significantly. Test data is a proxy for samples that you don't yet know. Would you perform the imputation differently if you knew all the future data? If yes, then using the test data is cheating. If not, then using the test data gains you nothing anyway. So it's better not to touch the test data at all until the model is built.

Maxim

The philosophy behind splitting data into training and test sets is to have the opportunity to validate the model on fresh(ish) data, right? So, by using the same imputer on both the train and test sets, you are to some extent spoiling the test data, and this may cause overfitting. You CAN use the same approach to impute the missing data on both sets (in your case, the decision tree); however, you should instantiate two different models and fit each one on its own data.
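To be explicit about what this answer proposes (note that it differs from the answers above, which reuse the training-set imputer on the test set), here is a minimal sketch with synthetic data, using mean imputation in place of the decision tree for brevity:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_test  = np.array([[np.nan], [3.0], [5.0]])

# Same imputation approach, but two separate instances, each fit on its own split.
train_imputer = SimpleImputer(strategy="mean").fit(X_train)
test_imputer  = SimpleImputer(strategy="mean").fit(X_test)

X_train_filled = train_imputer.transform(X_train)   # NaN -> train mean (about 2.33)
X_test_filled  = test_imputer.transform(X_test)     # NaN -> test mean (4.0)
```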

Reza Keshavarz