Questions tagged [missing-data]

For questions relating to missing data problems, which can involve special data structures, algorithms, statistical methods, modeling techniques, and visualization, among other considerations.

When working with data in regular data structures (e.g. tables, matrices, arrays, tensors), some values may be unobserved, corrupted, or not yet observed. Treating such data requires additional annotation to mark which values are missing, as well as methodological care when deciding how to impute the values or use them in standard analyses. This is especially problematic in data-intensive contexts, such as large statistical analyses of databases.
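As a minimal sketch of the annotation and imputation ideas above, assuming Python with NumPy (the tag itself is language-agnostic) and a made-up 3x3 table: missing entries are marked with NaN, a boolean mask records where they are, and column-mean imputation stands in for whatever imputation method is actually appropriate.

```python
import numpy as np

# Hypothetical 3x3 table; np.nan marks entries that were not observed.
data = np.array([
    [1.0, 2.0, np.nan],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
])

# Annotation: a boolean mask that is True wherever a value is missing.
missing_mask = np.isnan(data)

# Column means computed from the observed values only.
col_means = np.nanmean(data, axis=0)

# Simple mean imputation: replace each missing entry with its column mean.
# col_means has shape (3,) and broadcasts across the rows of data.
imputed = np.where(missing_mask, col_means, data)
```

Keeping `missing_mask` alongside `imputed` preserves the information about which values were filled in, so later analyses can still distinguish observed from imputed entries.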

Missing data occur in many fields, from survey data to industrial data. There are many underlying missing data mechanisms (reasons why the data are missing). In survey data, for example, values might be missing because respondents drop out or run out of time.

Rubin classified missing data into three types:

  1. missing completely at random;
  2. missing at random;
  3. missing not at random.

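Note that some statistical analyses are valid only under certain of these mechanisms. For example, a complete-case analysis is generally unbiased only when the data are missing completely at random.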
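The three mechanisms can be illustrated with a small simulation. This is only a sketch under stated assumptions: the variables `age` and `income`, the logistic missingness probabilities, and the sample size are all invented for illustration and do not come from Rubin's work or from any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: age is always observed, income may be missing.
age = rng.normal(50, 10, n)
income = 1000 * age + rng.normal(0, 5000, n)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR: missingness is independent of both age and income.
mcar = rng.random(n) < 0.3

# MAR: missingness depends only on the observed variable (age).
mar = rng.random(n) < logistic((age - 50) / 5)

# MNAR: missingness depends on the unobserved value itself (income).
mnar = rng.random(n) < logistic((income - income.mean()) / income.std())

def observed_mean(values, missing):
    """Mean of the values that remain after dropping the missing entries."""
    return values[~missing].mean()

true_mean = income.mean()
mean_mcar = observed_mean(income, mcar)
mean_mar = observed_mean(income, mar)
mean_mnar = observed_mean(income, mnar)
```

In this setup the naive observed mean stays close to `true_mean` under MCAR but is pulled downward under MAR and MNAR, because higher incomes are more likely to be missing. Under MAR the bias could in principle be corrected using the observed `age`; under MNAR it cannot be corrected from the observed data alone.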

2225 questions
22
votes
3 answers

Fill in missing pandas data with previous non-missing value, grouped by key

I am dealing with pandas DataFrames like this:

       id    x
    0   1   10
    1   1   20
    2   2  100
    3   2  200
    4   1  NaN
    5   2  NaN
    6   1  300
    7   1  NaN

I would like to replace each NaN 'x' with the previous non-NaN 'x' from a row with the same 'id'…
ChrisB
21
votes
5 answers

Replace all NA with FALSE in selected columns in R

I have a question similar to this one, but my dataset is a bit bigger: 50 columns, with 1 column as UID and the other columns carrying either TRUE or NA. I want to change all the NA to FALSE, but I don't want to use an explicit loop. Can plyr do the trick?…
lokheart
21
votes
2 answers

python scikit-learn clustering with missing data

I want to cluster data with missing columns. Doing it manually I would calculate the distance in case of a missing column simply without this column. With scikit-learn, missing data is not possible. There is also no chance to specify a user distance…
Michael Hecht
19
votes
6 answers

How do I handle multiple kinds of missingness in R?

Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:

    0-99  Data
    -1    Question not asked
    -5    Do not know
    -7    Refused to respond
    -9    Module not asked

Stata has a beautiful facility for handling these…
Ari B. Friedman
19
votes
8 answers

How to fill NAs with LOCF by factors in data frame, split by country

I have the following data frame (simplified) with the country variable as a factor and the value variable has missing values:

    country value
    AUT     NA
    AUT     5
    AUT     NA
    AUT     NA
    GER     NA
    GER     NA
    GER     7
    GER     NA
    GER     NA

The…
rp1
18
votes
3 answers

How to subset a data frame by taking only the Non NA values of 2 columns in this data frame

I am trying to subset a data frame by taking the integer values of 2 columns in my data frame: Subs1<-subset(DATA,DATA[,2][!is.na(DATA[,2])] & DATA[,3][!is.na(DATA[,3])]) but it gives me an error: longer object length is not a multiple of shorter…
EnginO
18
votes
3 answers

How to replace NA (missing values) in a data frame with neighbouring values

    862 2006-05-19 6.241603 5.774208
    863 2006-05-20       NA       NA
    864 2006-05-21       NA       NA
    865 2006-05-22 6.383929 5.906426
    866 2006-05-23 6.782068 6.268758
    867 2006-05-24 6.534616 6.013767
    868 2006-05-25 6.370312…
Arun
17
votes
5 answers

java.lang.NoClassDefFoundError: android.support.v7.appcompat.R$styleable

I am using the terminal (not Eclipse). I got the following exception while using the emulator. The debug build and the install both succeed, but the emulator shows "Unfortunately, app has stopped". When I then run $ adb logcat, it displays the following.…
Balakrishnan
17
votes
2 answers

Filling in missing (blanks) in a data table, per category - backwards and forwards

I am working with a large data set of billing records for my clinical practice over 11 years. Quite a few of the rows are missing the referring physician. However, using some rules I can quite easily fill them in but do not know how to implement it…
Farrel
16
votes
3 answers

pandas - merging with missing values

There appears to be a quirk with the pandas merge function. It considers NaN values to be equal, and will merge NaNs with other NaNs: >>> foo = DataFrame([ ['a',1,2], ['b',4,5], ['c',7,8], [np.NaN,10,11] ],…
aensm
16
votes
2 answers

missing value in highcharts line graph results in no line, just points

please take a look at this: http://jsfiddle.net/2rNzr/ var chart = new Highcharts.Chart({ chart: { renderTo: 'container' }, xAxis: { categories: ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct',…
Tony
15
votes
4 answers

Exporting ints with missing values to csv in Pandas

When saving a Pandas DataFrame to csv, some integers are getting converted to floats. It happens where a column of floats has missing values (np.nan). Is there a simple way to avoid it? (Especially in an automatic way - I often deal with many…
Piotr Migdal
14
votes
3 answers

Fill missing dates by group

In my data, there exist observations for some IDs in some months and not for others, e.g. dat <- data.frame(c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4), c(rep(30, 2), rep(25, 5), rep(20, 3)), c('2017-01-01', '2017-02-01', '2017-04-01', '2017-02-01',…
kathystehl
13
votes
3 answers

Dataset in base R with missing values

Are there any examples of datasets in base R that contain missing values? I've been looking through each one in turn and also searched using Google, with nothing so far. library(MASS) data() Edit: I know how to add missing values to a dataset in R, I…
John_dydx
13
votes
2 answers

Replace NaN or missing values with rolling mean or other interpolation

I have a pandas dataframe with monthly data that I want to compute a 12-month moving average for. Data for every month of January is missing, however (NaN), so I am using pd.rolling_mean(data["variable"], 12, center=True) but it just gives me…
Alexis Eggermont