Questions tagged [imputation]

Missing data imputation is the process of replacing missing values with substituted, 'best guess' values. Because missing data complicates analysis and can introduce bias, imputation is often used as an alternative to listwise deletion (discarding every observation that has any missing value). Several families of methods exist: single-value imputation with the mean, the median, or a specific value chosen from domain expertise; distance-based heuristics such as k-nearest neighbours (kNN); multiple imputation, which generates several completed datasets and pools the results; and model-based methods such as Expectation-Maximization (EM).
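As a rough illustration of the difference between listwise deletion and single-value imputation described above, here is a minimal standard-library sketch. The function names (`listwise_delete`, `impute_column`, `impute_rows`), the use of `None` as the missing-value marker, and the sample data are all assumptions made for this example; they are not taken from any particular library.

```python
from statistics import mean, median


def listwise_delete(rows):
    """Drop every row that contains at least one missing value (None)."""
    return [row for row in rows if all(v is not None for v in row)]


def impute_column(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed entries."""
    if strategy not in ("mean", "median"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


def impute_rows(rows, strategy="mean"):
    """Impute each column independently and return the rows again."""
    columns = [impute_column(list(col), strategy) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


# Hypothetical two-column dataset; None marks a missing value.
rows = [
    [1.0, 10.0],
    [None, 20.0],
    [3.0, None],
    [5.0, 60.0],
]

complete = listwise_delete(rows)    # keeps only the two fully observed rows
imputed = impute_rows(rows, "mean") # keeps all four rows, fills with column means
```

Listwise deletion here discards half of the rows, while mean imputation keeps every row at the cost of shrinking the variance of the filled columns. In practice the questions under this tag usually reach for library routines (for example pandas `fillna` or scikit-learn's imputers) rather than hand-written helpers like these.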

Suggested tag synonym: "missing-data"

671 questions

819 votes

22 answers

How do I replace NA values with zeros in an R dataframe?

I have a data frame and some columns have NA values. How do I replace these NA values with zeroes?

r dataframe na missing-data imputation

asked Nov 17 '11 at 03:45

Renato Dinhani

102 votes

10 answers

Pandas: filling missing values by mean in each group

This should be straightforward, but the closest thing I've found is this post: pandas: Filling missing values within a group, and I still can't solve my problem.... Suppose I have the following dataframe df = pd.DataFrame({'value': [1, np.nan,…

python pandas pandas-groupby imputation fillna

asked Nov 13 '13 at 22:43

BlueFeet

votes

10 answers

Impute categorical missing values in scikit-learn

I've got pandas data with some columns of text type. There are some NaN values along with these text columns. What I'm trying to do is to impute those NaN's by sklearn.preprocessing.Imputer (replacing NaN by the most frequent value). The problem is…

python pandas scikit-learn imputation

asked Aug 11 '14 at 09:26

night_bat

votes

12 answers

Replace missing values with column mean

I am not sure how to loop over each column to replace the NA values with the column mean. When I am trying to replace for one column using the following, it works well. Column1[is.na(Column1)] <- round(mean(Column1, na.rm = TRUE)) The code for…

r missing-data imputation

asked Sep 14 '14 at 16:50

Nikita

votes

5 answers

Replace all NA with FALSE in selected columns in R

I have a question similar to this one, but my dataset is a bit bigger: 50 columns, with 1 column as UID and the other columns carrying either TRUE or NA. I want to change all the NA to FALSE, but I don't want to use an explicit loop. Can plyr do the trick?…

r dataframe na missing-data imputation

asked Sep 02 '11 at 03:59

lokheart

votes

3 answers

Predicting missing values with scikit-learn's Imputer module

I am writing a very basic program to predict missing values in a dataset using scikit-learn's Imputer class. I have made a NumPy array, created an Imputer object with strategy='mean' and performed fit_transform() on the NumPy array. When I print…

python numpy scikit-learn prediction imputation

asked Jul 29 '14 at 14:16

xennygrimmato

votes

3 answers

How to replace NA (missing values) in a data frame with neighbouring values

862 2006-05-19 6.241603 5.774208 863 2006-05-20 NA NA 864 2006-05-21 NA NA 865 2006-05-22 6.383929 5.906426 866 2006-05-23 6.782068 6.268758 867 2006-05-24 6.534616 6.013767 868 2006-05-25 6.370312…

r missing-data imputation locf

asked Aug 09 '09 at 23:00

Arun

votes

3 answers

How to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline?

I have a pandas DataFrame that includes a column of text, and I would like to vectorize the text using scikit-learn's CountVectorizer. However, the text includes missing values, and so I would like to impute a constant value before vectorizing. My…

python machine-learning scikit-learn imputation countvectorizer

asked Jul 20 '20 at 17:00

Kevin Markham

votes

4 answers

Data imputation with fancyimpute and pandas

I have a large pandas data frame df. It has quite a few missing values. Dropping row- or column-wise is not an option. Imputing medians, means or the most frequent values is not an option either (hence imputation with pandas and/or scikit unfortunately doesn't…

python python-3.x pandas imputation fancyimpute

asked Jul 21 '17 at 13:42

Rachel

votes

3 answers

Replace missing values with mean - Spark Dataframe

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I…

scala apache-spark dataframe apache-spark-sql imputation

asked Oct 15 '16 at 09:54

Dataminer

votes

4 answers

Missing values in Time Series in python

I have a time series dataframe; the dataframe is quite big and contains some missing values in 2 columns ('Humidity' and 'Pressure'). I would like to impute these missing values in a clever way, for example using the value of the nearest neighbor…

python pandas nan imputation

asked Mar 15 '18 at 20:22

Marco Miglionico

votes

3 answers

how to impute the distance to a value

I'd like to fill missing values with a "row distance" to the nearest non-NA value. In other words, how would I convert column x in this sample dataframe into column y? # x y #1 0 0 #2 NA 1 #3 0 0 #4 NA 1 #5 NA 2 #6 NA 1 #7 0 0 #8 NA…

r imputation

asked Dec 21 '18 at 18:07

Dan Strobridge

votes

1 answer

predict() method for "mice" package

I want to create an imputation strategy using the mice function from the mice package. The problem is I can't seem to find any predict method (or its cousins) for new data in this package. I want to do something like…

r imputation r-mice

asked Feb 02 '15 at 14:54

Loiisso

votes

2 answers

Pandas: How to fill null values with mean of a groupby?

I have a dataset with some missing data that looks like this: id category value 1 A NaN 2 B NaN 3 A 10.5 4 C NaN 5 A 2.0 6 B 1.0 I need to fill in the…

python pandas missing-data imputation

asked Oct 28 '16 at 06:12

sfactor

votes

3 answers

Oversampling: SMOTE for binary and categorical data in Python

I would like to apply SMOTE to an unbalanced dataset which contains binary, categorical and continuous data. Is there a way to apply SMOTE to binary and categorical data?

python-3.x imputation

asked Dec 05 '17 at 14:20

TTZ
