I see a lot of conflicting answers about this, both on Stack Overflow and on other sites.
While working on my training data set, I imputed the missing values of a certain column using a decision tree model. So here's my question: is it fair to use ALL available data (training and test) to build a model for imputation (not prediction), or may I only touch the training set when doing this? Also, once I begin work on my test set, must I impute using only the test set data, reuse the imputation model built on my training set, or can I retrain the imputation model on all the data available to me?
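To make the "reuse the training-set imputation model" option concrete, here is a minimal sketch of what I mean. It is pure Python, with a one-split regression tree standing in for my actual decision tree. The names `fit_stump` and `impute` and the toy data are made up for illustration and are not my real code or data.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold on x, chosen to minimise squared error."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # Only one distinct x value: fall back to the overall mean on both sides.
        overall = mean(ys)
        return (values[0], overall, overall)
    return best[1:]


def impute(rows, stump):
    """Fill missing y values (None) with the stump's prediction; keep observed values."""
    threshold, left_mean, right_mean = stump
    return [
        (x, (left_mean if x <= threshold else right_mean) if y is None else y)
        for x, y in rows
    ]


# Hypothetical toy data: (fully observed feature, column with gaps).
train = [(1.0, 10.0), (2.0, 12.0), (3.0, None), (8.0, 30.0), (9.0, 32.0), (10.0, None)]
test = [(2.5, None), (9.5, 31.0), (8.5, None)]

# The imputation model is fitted on observed TRAINING rows only.
observed = [(x, y) for x, y in train if y is not None]
stump = fit_stump([x for x, _ in observed], [y for _, y in observed])

# The same fitted model fills the gaps in both sets; test rows never influence it.
train_imputed = impute(train, stump)
test_imputed = impute(test, stump)
```

The alternative I am asking about would fit `stump` on the observed rows of `train + test` instead.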
I would think that, so long as I don't touch my test set when training the prediction model, using all of the data for things like imputation would be fine. But maybe that would break a fundamental rule. Thoughts?