Questions tagged [imputation]

Missing data imputation is the process of replacing missing data with substituted 'best guess' values. Because missing data can complicate analysis and introduce missing-data bias, imputation is often used to avoid the problems associated with listwise deletion (dropping every observation that has any missing value). Several imputation methods exist, including: filling missing values with a single value, such as the mean, the median, or a specific value chosen from domain expertise; distance-based heuristics such as kNN; multiple imputation, which creates several stochastically imputed datasets and pools the results; and model-based methods such as Expectation Maximization (EM).
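To make the contrast between listwise deletion and single-value imputation concrete, here is a minimal sketch in plain Python. The helper names (`is_missing`, `listwise_delete`, `impute_columns`) and the sample rows are illustrative assumptions, not part of any library; in practice one would more likely reach for pandas, scikit-learn, or an R package.

```python
import math
from statistics import mean, median


def is_missing(x):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def listwise_delete(rows):
    """Drop every row that has at least one missing value."""
    return [row for row in rows if not any(is_missing(v) for v in row)]


def impute_columns(rows, strategy=mean):
    """Replace missing entries with a per-column statistic of the observed values."""
    fills = []
    for col in zip(*rows):
        observed = [v for v in col if not is_missing(v)]
        # A column with no observed values cannot be imputed; leave it as None.
        fills.append(strategy(observed) if observed else None)
    return [
        [fills[j] if is_missing(v) else v for j, v in enumerate(row)]
        for row in rows
    ]


# Hypothetical data: two numeric columns, with one gap in each.
rows = [
    [1.0, 10.0],
    [None, 20.0],
    [3.0, float("nan")],
    [5.0, 60.0],
]

deleted = listwise_delete(rows)                        # keeps only complete rows
mean_filled = impute_columns(rows, strategy=mean)      # gaps filled with column means
median_filled = impute_columns(rows, strategy=median)  # gaps filled with column medians
```

Listwise deletion discards half of the rows in this toy example, while single-value imputation keeps all four at the cost of shrinking the variance of the filled columns. That trade-off is one reason multiple imputation and model-based methods exist.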

Suggested tag synonym: "missing-data"

671 questions
819
votes
22 answers

How do I replace NA values with zeros in an R dataframe?

I have a data frame and some columns have NA values. How do I replace these NA values with zeroes?
Renato Dinhani
  • 30,005
  • 49
  • 125
  • 194
102
votes
10 answers

Pandas: filling missing values by mean in each group

This should be straightforward, but the closest thing I've found is this post: pandas: Filling missing values within a group, and I still can't solve my problem.... Suppose I have the following dataframe df = pd.DataFrame({'value': [1, np.nan,…
BlueFeet
  • 1,937
  • 4
  • 16
  • 24
74
votes
10 answers

Impute categorical missing values in scikit-learn

I've got pandas data with some columns of text type. There are some NaN values along with these text columns. What I'm trying to do is to impute those NaN's by sklearn.preprocessing.Imputer (replacing NaN by the most frequent value). The problem is…
night_bat
  • 3,062
  • 5
  • 14
  • 19
53
votes
12 answers

Replace missing values with column mean

I am not sure how to loop over each column to replace the NA values with the column mean. When I am trying to replace for one column using the following, it works well. Column1[is.na(Column1)] <- round(mean(Column1, na.rm = TRUE)) The code for…
Nikita
  • 747
  • 2
  • 8
  • 14
21
votes
5 answers

Replace all NA with FALSE in selected columns in R

I have a question similar to this one, but my dataset is a bit bigger: 50 columns, with 1 column as UID and the other columns carrying either TRUE or NA. I want to change all the NA to FALSE, but I don't want to use an explicit loop. Can plyr do the trick?…
lokheart
  • 20,665
  • 32
  • 86
  • 161
18
votes
3 answers

Predicting missing values with scikit-learn's Imputer module

I am writing a very basic program to predict missing values in a dataset using scikit-learn's Imputer class. I have made a NumPy array, created an Imputer object with strategy='mean' and performed fit_transform() on the NumPy array. When I print…
xennygrimmato
  • 2,157
  • 5
  • 21
  • 41
18
votes
3 answers

How to replace NA (missing values) in a data frame with neighbouring values

862 2006-05-19 6.241603 5.774208
863 2006-05-20 NA       NA
864 2006-05-21 NA       NA
865 2006-05-22 6.383929 5.906426
866 2006-05-23 6.782068 6.268758
867 2006-05-24 6.534616 6.013767
868 2006-05-25 6.370312…
Arun
  • 437
  • 5
  • 12
16
votes
3 answers

How to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline?

I have a pandas DataFrame that includes a column of text, and I would like to vectorize the text using scikit-learn's CountVectorizer. However, the text includes missing values, and so I would like to impute a constant value before vectorizing. My…
16
votes
4 answers

Data imputation with fancyimpute and pandas

I have a large pandas data frame df. It has quite a few missing values. Dropping rows or columns is not an option. Imputing medians, means or the most frequent values is not an option either (hence imputation with pandas and/or scikit unfortunately doesn't…
Rachel
  • 1,627
  • 4
  • 21
  • 51
15
votes
3 answers

Replace missing values with mean - Spark Dataframe

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I…
Dataminer
  • 1,229
  • 3
  • 13
  • 21
13
votes
4 answers

Missing values in Time Series in python

I have a time series dataframe. The dataframe is quite big and contains some missing values in 2 columns ('Humidity' and 'Pressure'). I would like to impute these missing values in a clever way, for example using the value of the nearest neighbor…
Marco Miglionico
  • 913
  • 1
  • 7
  • 25
12
votes
3 answers

how to impute the distance to a value

I'd like to fill missing values with a "row distance" to the nearest non-NA value. In other words, how would I convert column x in this sample dataframe into column y?
#   x  y
#1  0  0
#2  NA 1
#3  0  0
#4  NA 1
#5  NA 2
#6  NA 1
#7  0  0
#8  NA…
12
votes
1 answer

predict() method for "mice" package

I want to create an imputation strategy using the mice function from the mice package. The problem is I can't seem to find any predict method (or its cousins) for new data in this package. I want to do something like…
Loiisso
  • 151
  • 6
11
votes
2 answers

Pandas: How to fill null values with mean of a groupby?

I have a dataset with some missing data that looks like this:
id category value
1  A        NaN
2  B        NaN
3  A        10.5
4  C        NaN
5  A        2.0
6  B        1.0
I need to fill in the…
sfactor
  • 10,810
  • 30
  • 92
  • 146
10
votes
3 answers

Oversampling: SMOTE for binary and categorical data in Python

I would like to apply SMOTE to an unbalanced dataset which contains binary, categorical and continuous data. Is there a way to apply SMOTE to binary and categorical data?
TTZ
  • 673
  • 2
  • 8
  • 18