Questions tagged [imputation]

Missing data imputation is the process of replacing missing values with substituted, 'best guess' values. Missing data complicates analysis and can bias results, so imputation is often used to avoid the problems associated with listwise deletion (discarding every observation that has any missing value). Several imputation methods exist, including: filling in a single value, such as the mean, the median, or a specific value chosen from domain expertise; distance-based heuristics such as k-nearest neighbors (kNN); multiple imputation, which draws several stochastic imputations and pools the results; and model-based methods such as expectation-maximization (EM).
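
To make the contrast between listwise deletion and single-value imputation concrete, here is a minimal sketch in plain Python. It assumes missing values are represented as None. The helper names (listwise_delete, impute_column) and the sample numbers are hypothetical, chosen only for illustration; they are not taken from any library.

```python
from statistics import mean, median


def listwise_delete(rows):
    """Drop every row that contains at least one missing value (None)."""
    return [row for row in rows if all(v is not None for v in row)]


def impute_column(values, strategy="mean", fill_value=None):
    """Replace None entries in one column with a single substituted value.

    strategy: "mean", "median", or "constant" (uses fill_value).
    """
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "constant":
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]


# Hypothetical data: one numeric column with two missing entries.
sizes = [8, None, 10, None, 11]
rows = [[8, "black"], [None, "gray"], [10, "blue"]]

complete_rows = listwise_delete(rows)          # rows with a None are dropped
mean_filled = impute_column(sizes, "mean")     # gaps filled with the observed mean
median_filled = impute_column(sizes, "median") # gaps filled with the observed median
zero_filled = impute_column(sizes, "constant", fill_value=0)
```

Single-value imputation keeps every row but understates the variability of the column; kNN, multiple imputation, and EM are the usual next steps when that matters.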

Suggested tag synonym: "missing-data"

671 questions
10
votes
4 answers

Imputer on some Dataframe columns in Python

I am learning how to use Imputer on Python. This is my code: df=pd.DataFrame([["XXL", 8, "black", "class 1", 22], ["L", np.nan, "gray", "class 2", 20], ["XL", 10, "blue", "class 2", 19], ["M", np.nan, "orange", "class 1", 17], ["M", 11, "green",…
Mauro Gentile
  • 1,055
  • 4
  • 21
  • 33
9
votes
3 answers

R: replace NA with item from vector

I am trying to replace some missing values in my data with the average values from a similar group. My data looks like this: X Y 1 x y 2 x y 3 NA y 4 x y And I want it to look like this: X Y 1 x y 2 x y 3 y y 4 x …
gregmacfarlane
  • 1,907
  • 2
  • 22
  • 44
9
votes
0 answers

Use of statsmodels.imputation.mice

I am exploring statsmodels.imputation.mice package to use for imputing missing values. I haven't seen any example of its usage, though, outside of http://www.statsmodels.org. From what I gather, one would create an instance of mice.MICEData and use…
David Makovoz
  • 1,411
  • 2
  • 13
  • 23
8
votes
3 answers

How to transform some columns only with SimpleImputer or equivalent

I am taking my first steps with scikit library and found myself in need of backfilling only some columns in my data frame. I have read carefully the documentation but I still cannot figure out how to achieve this. To make this more specific, let's…
quiet-ranger
  • 403
  • 4
  • 10
8
votes
1 answer

Multiple Imputation of missing and censored data in R

I have a dataset with both missing-at-random (MAR) and censored data. The variables are correlated and I am trying to impute the missing data conditionally so that I can estimate the distribution parameters for a correlated multivariate normal…
chelsea
  • 105
  • 3
7
votes
3 answers

Implementation of sklearn.impute.IterativeImputer

Consider data which contains some nan below: Column-1 Column-2 Column-3 Column-4 Column-5 0 NaN 15.0 63.0 8.0 40.0 1 60.0 51.0 NaN 54.0 31.0 2 15.0 17.0 55.0 80.0 NaN 3 54.0 43.0 70.0 16.0 …
k.ko3n
  • 844
  • 5
  • 17
7
votes
2 answers

Impute missing data with mean by group

I have a categorical variable with three levels (A, B, and C). I also have a continuous variable with some missing values in it. I would like to replace the NA values with the mean of their group. That is, missing observations from group A have to be…
6
votes
3 answers

Generate larger synthetic dataset based on a smaller dataset in Python

I have a dataset with 21000 rows (data samples) and 102 columns (features). I would like to have a larger synthetic dataset generated based on the current dataset, say with 100000 rows, so I can use it for machine learning purposes thereby. I've…
JChat
  • 628
  • 10
  • 27
6
votes
1 answer

Differences between sklearn's SimpleImputer and Imputer

In Python's sklearn library there are two classes that do approximately the same thing: sklearn.preprocessing.Imputer and sklearn.impute.SimpleImputer. The only difference that I found is a "constant" strategy type in SimpleImputer. Is…
MefAldemisov
  • 745
  • 7
  • 19
6
votes
1 answer

How to do forward filling for each group in pandas

I have a dataframe similar to below id A B C D E 1 2 3 4 5 5 1 NaN 4 NaN 6 7 2 3 4 5 6 6 2 NaN NaN 5 4 1 I want to do a null value imputation for columns A, B, C in a forward filling but for each group. That means, I want…
HHH
  • 4,945
  • 14
  • 76
  • 138
6
votes
2 answers

Is there a way to impute missing values in machine learning?

For personal knowledge, I've been trying out different imputation methods other than the mean/median/mode. I was able to try out KNN, MICE, median imputational methods so far. I was told that imputation by clustering method can also be done and my…
uharsha33
  • 225
  • 2
  • 8
6
votes
3 answers

Can I use Train AND Test data for Imputation?

Interestingly, I see a lot of different answers about this both on stackoverflow and other sites: While working on my training data set, I imputed missing values of a certain column using a decision tree model. So here's my question. Is it fair to…
Analysa
  • 71
  • 6
6
votes
3 answers

Error in "missforest" in R

Need help to get around the below error while performing data imputation in R using "missforest" package. > imputed<- missForest(dummy, maxiter = 10, ntree = 100, variablewise = TRUE, + decreasing = TRUE, verbose = TRUE, + …
Sandeep
  • 71
  • 10
6
votes
4 answers

Python - SkLearn Imputer usage

I have the following question: I have a pandas dataframe, in which missing values are marked by the string na. I want to run an Imputer on it to replace the missing values with the mean in the column. According to the sklearn documentation, the…
lte__
  • 5,472
  • 13
  • 55
  • 106
5
votes
2 answers

Testing for missing values in R

I have a time series data set which has some missing values in it. I wish to impute the missing values but I am unsure which method is most appropriate, e.g. linear, spline or stine from the imputeTS package. For the sake of completeness I wish…
TheGoat
  • 1,765
  • 2
  • 16
  • 40