Questions tagged [missing-data]

For questions relating to missing data problems, which can involve special data structures, algorithms, statistical methods, modeling techniques, and visualization, among other considerations.

When working with data in regular data structures (e.g. tables, matrices, arrays, tensors), some values may be unobserved, corrupted, or not yet observed. Treating such data requires additional annotation to mark which values are missing, as well as methodological care when deciding whether to impute the missing values or how to use the incomplete data in standard analyses. This is a particular problem in data-intensive contexts, such as large statistical analyses of databases.
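As a rough illustration of the annotation and imputation ideas above, here is a minimal Python sketch. The column `readings` and its values are made up for this example, and mean imputation is shown only because it is simple; it is not a recommendation, since it can bias later analyses.

```python
# Hypothetical column of measurements; None marks a value that was not observed.
readings = [2.0, None, 3.5, None, 4.5]

# Annotation: a parallel boolean mask recording which entries are missing.
is_missing = [value is None for value in readings]

# Use only the observed entries to compute a summary statistic.
observed = [value for value in readings if value is not None]
observed_mean = sum(observed) / len(observed)

# One simple treatment: fill each missing entry with the observed mean.
# The mask is kept so later steps still know which values were imputed.
imputed = [
    observed_mean if missing else value
    for value, missing in zip(readings, is_missing)
]
```

Keeping the mask alongside the filled-in column lets downstream code tell imputed values apart from genuinely observed ones.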

Missing data occur in many fields, from survey research to industrial applications. There are many possible missing data mechanisms (reasons why the data are missing). In survey data, for example, values might be missing because of drop-out, or because respondents ran out of time.

Rubin classified missing data into three types:

  1. missing completely at random;
  2. missing at random;
  3. missing not at random.

Note that some statistical analyses are valid only under certain of these classes.
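The three classes can be sketched with a small simulation. This is an illustrative toy in Python, not a standard routine: the function name `simulate_missingness`, the variables `ages` and `incomes`, and the median-based thresholds are all assumptions made for this example. Under missing completely at random, every value has the same chance of being dropped; under missing at random, the chance depends only on a fully observed variable (here, age); under missing not at random, it depends on the value that goes missing (here, income itself).

```python
import random


def simulate_missingness(ages, incomes, mechanism, rate=0.3, seed=0):
    """Return a copy of incomes with some entries replaced by None.

    mechanism is one of "MCAR", "MAR", "MNAR" (labels chosen for this sketch).
    """
    rng = random.Random(seed)
    # Upper median used as a simple, hypothetical threshold.
    age_cut = sorted(ages)[len(ages) // 2]
    income_cut = sorted(incomes)[len(incomes) // 2]

    result = []
    for age, income in zip(ages, incomes):
        if mechanism == "MCAR":
            # Same probability for every row, unrelated to any data.
            p = rate
        elif mechanism == "MAR":
            # Depends only on age, which is always observed.
            p = min(1.0, 2 * rate) if age > age_cut else 0.0
        elif mechanism == "MNAR":
            # Depends on the income value that may itself go missing.
            p = min(1.0, 2 * rate) if income > income_cut else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism!r}")
        result.append(None if rng.random() < p else income)
    return result


# Made-up example data.
ages = [23, 35, 47, 51, 62, 29, 44, 58]
incomes = [30, 52, 61, 75, 48, 33, 90, 67]

mcar_incomes = simulate_missingness(ages, incomes, "MCAR")
mar_incomes = simulate_missingness(ages, incomes, "MAR")
mnar_incomes = simulate_missingness(ages, incomes, "MNAR")
```

In the MNAR case only high incomes can be dropped, so the mean of the remaining observed incomes can never exceed the true mean, and nothing in the observed data alone reveals this.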

2225 questions
931
votes
16 answers

Remove rows with all or some NAs (missing values) in data.frame

I'd like to remove the lines in this data frame that: a) contain NAs across all columns. Below is my example data frame. gene hsap mmul mmus rnor cfam 1 ENSG00000208234 0 NA NA NA NA 2 ENSG00000199674 0 2 2 2 …
Benoit B.
  • 10,466
  • 8
  • 23
  • 29
819
votes
22 answers

How do I replace NA values with zeros in an R dataframe?

I have a data frame and some columns have NA values. How do I replace these NA values with zeroes?
Renato Dinhani
  • 30,005
  • 49
  • 125
  • 194
204
votes
7 answers

Remove NA values from a vector

I have a huge vector which has a couple of NA values, and I'm trying to find the max value in that vector (the vector is all numbers), but I can't do this because of the NA values. How can I remove the NA values so that I can compute the max?
CodeGuy
  • 26,751
  • 71
  • 191
  • 310
99
votes
9 answers

How to lowercase a pandas dataframe string column if it has missing values?

The following code does not work. import pandas as pd import numpy as np df=pd.DataFrame(['ONE','Two', np.nan],columns=['x']) xLower = df["x"].map(lambda x: x.lower()) How should I tweak it to get xLower = ['one','two',np.nan] ? Efficiency is…
P.Escondido
  • 2,553
  • 4
  • 19
  • 26
83
votes
1 answer

str.format() raises KeyError

The following code raises a KeyError exception: addr_list_formatted = [] addr_list_idx = 0 for addr in addr_list: # addr_list is a list addr_list_idx = addr_list_idx + 1 addr_list_formatted.append(""" "{0}" { …
Dor
  • 6,916
  • 4
  • 28
  • 45
82
votes
13 answers

Elegant way to report missing values in a data.frame

Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck: for (Var in names(airquality)) { …
Zach
  • 27,553
  • 31
  • 130
  • 193
66
votes
5 answers

Delete rows with blank values in one particular column

I am working on a large dataset, with some rows with NAs and others with blanks: df <- data.frame(ID = c(1:7), home_pc = c("","CB4 2DT", "NE5 7TH", "BY5 8IB", "DH4 6PB","MP9 7GH","KN4 5GH"), …
KT_1
  • 6,722
  • 12
  • 46
  • 61
65
votes
9 answers

Format string unused named arguments

Let's say I have: action = '{bond}, {james} {bond}'.format(bond='bond', james='james') this will output: 'bond, james bond' Next we have: action = '{bond}, {james} {bond}'.format(bond='bond') this will output: KeyError: 'james' Is there some…
nelsonvarela
  • 2,140
  • 6
  • 24
  • 41
53
votes
12 answers

Replace missing values with column mean

I am not sure how to loop over each column to replace the NA values with the column mean. When I am trying to replace for one column using the following, it works well. Column1[is.na(Column1)] <- round(mean(Column1, na.rm = TRUE)) The code for…
Nikita
  • 747
  • 2
  • 8
  • 14
51
votes
9 answers

Insert rows for missing dates/times

I am new to R but have turned to it to solve a problem with a large data set I am trying to process. Currently I have 4 columns of data (Y values) set against minute-interval timestamps (month/day/year hour:min) (X values) as below: timestamp …
James A
  • 565
  • 2
  • 5
  • 8
51
votes
3 answers

What is the difference between <NA> and NA?

I have a factor named SMOKE with levels "Y" and "N". Missing values were replaced with NA (from the initial level "NULL"). However when I view the factor I get something like this: head(SMOKE) # N N Y Y N # Levels: Y N Why is R displaying NA…
oort
  • 1,710
  • 2
  • 18
  • 27
50
votes
6 answers

Python, Pandas : Return only those rows which have missing values

While working in Pandas in Python... I'm working with a dataset that contains some missing values, and I'd like to return a dataframe which contains only those rows which have missing data. Is there a nice way to do this? (My current method to do…
user2487726
50
votes
1 answer

Include levels of zero count in result of table()

I have a vector 'y' and I count the different values using table: y <- c(0, 0, 1, 3, 4, 4) table(y) # y # 0 1 3 4 # 2 1 1 2 However, I also want the result to include the fact that there are zero 2's and zero 5's. Can I use table() for…
Christopher DuBois
  • 38,442
  • 23
  • 68
  • 91
42
votes
3 answers

Convert NA into a factor level

I have a vector with NA values that I would like to replace by a new factor level NA. a = as.factor(as.character(c(1, 1, 2, 2, 3, NA))) a [1] 1 1 2 2 3 <NA> Levels: 1 2 3 This works, but it seems like a strange way to do it. a =…
marbel
  • 6,933
  • 5
  • 46
  • 65
40
votes
3 answers

Dealing with missing values for correlations calculation

I have a huge matrix with a lot of missing values. I want to get the correlation between variables. 1. Is the solution cor(na.omit(matrix)) better than below? cor(matrix, use = "pairwise.complete.obs") I already have selected only variables…
Delphine
  • 1,023
  • 5
  • 15
  • 22