Subsetting R data frame with NAs in index variable

Question

I have a large data frame containing epidemiological data (48232 rows and 74 columns). I read it into R from a .csv file using the argument na.strings="NA". I have several dichotomous variables with YES/NO answers coded 0=YES, 1=NO. These variables also contain NAs. I would like to create a new data frame containing all columns, removing those samples that have Diab=0 but NOT removing those with Diab=NA. I use square brackets for this. When doing so, the dimension of the new data frame is correct; however, all samples that were Diab=NA end up as NA for ALL other dichotomous variables in the new data frame! How do I solve this problem? I have tried to generate a small example:

Diab<-c(0,NA,1,1,1,0,0,NA, NA)
INF<-c(0,1,1,1,1,1,NA, 0,1)
HYP<-c(NA, 0,1,0,NA,1,1,1,1)

a<-data.frame(cbind(Diab, INF, HYP))
dim(a)
table(a$Diab,a$HYP, exclude=NULL, dnn=c("Diab", "HYP"))
#In total 2 persons HYP=0, 5 persons HYP=1, 2 persons HYP=NA. 

b<-a[!a$Diab==0,]
dim(b)
##When removing those Diab=0 I'm expecting to still have 2 persons HYP=0, 
#3 persons HYP=1 and 1 person HYP=NA, but not...

table(b$HYP, exclude=NULL, dnn="HYP")
#6 persons in total but those that were Diab=NA are now turned into HYP=NA??

#The same happens with the INF variable.
table(a$Diab,a$INF, exclude=NULL, dnn=c("Diab", "INF"))
table(b$INF, exclude=NULL, dnn="INF")

I have read this SO question on mysterious NA rows and this mailing list thread on subsetting vs. bracketing, but unfortunately they don't help me even though the problem seems a bit familiar...

I will be extremely happy for any help! Thanks, Charlotta

score 3 · Answer 1 · answered Apr 10 '13 at 11:45

The problem is that you are subsetting on a column that contains NA, so the logical condition you index with also contains NA. You will have to write a subsetting condition that handles the NA values explicitly.

As you've written:

> a$Diab
[1]  0 NA  1  1  1  0  0 NA NA

Which of these values are NOT equal to zero?

> !a$Diab==0
[1] FALSE    NA  TRUE  TRUE  TRUE FALSE FALSE    NA    NA

As you can see, comparing an NA with anything gives NA as the result. In the same way, arithmetic on NA gives NA:

> c(NA,NA,3)+1
[1] NA NA  4

You get the idea. The logical index contains NA, so R cannot decide whether to keep those rows. For each NA in the index it returns a row filled with NA, which is why the Diab=NA samples show up as NA in every column of your subsetted data frame.
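A minimal sketch of that mechanism, using the data from the question. The helper names `x`, `idx` and `n_all_na` are only for illustration. Note that `!a$Diab == 0` is parsed as `!(a$Diab == 0)`, because `!` binds less tightly than `==`.

```r
# Data from the question
Diab <- c(0, NA, 1, 1, 1, 0, 0, NA, NA)
INF  <- c(0, 1, 1, 1, 1, 1, NA, 0, 1)
HYP  <- c(NA, 0, 1, 0, NA, 1, 1, 1, 1)
a <- data.frame(Diab, INF, HYP)

# On a plain vector, an NA in a logical index yields an NA element
x <- c(10, 20, 30)
x[c(TRUE, NA, FALSE)]   # 10 NA

# The same rule applies to data frame rows: each NA in the index
# produces a row in which every column is NA
idx <- !a$Diab == 0
b <- a[idx, ]

# Count the rows of b that are NA in every column
n_all_na <- sum(rowSums(is.na(b)) == ncol(b))
n_all_na                # 3, one for each Diab = NA in the original data
```

So the three Diab=NA samples are not lost or altered in `a`; they are simply replaced by placeholder rows in `b`.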

Solution: either change the NAs to something that you can handle more easily (if needed), or adjust your subsetting condition to account for the NA values. is.na() is a function that can be used for this. So let's select all the rows where Diab is NOT equal to 0 OR is NA:

> a[(a$Diab != 0) | is.na(a$Diab),]
  Diab INF HYP
2   NA   1   0
3    1   1   1
4    1   1   0
5    1   1  NA
8   NA   0   1
9   NA   1   1
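Two other ways to write the same selection, sketched here as possible alternatives rather than as part of the original answer. Both rely on functions that never return NA: `%in%` and `which()`. The object names (`keep_explicit`, `keep_in`, `drop_rows`, `keep_which`) are made up for this example.

```r
# Data from the question
a <- data.frame(
  Diab = c(0, NA, 1, 1, 1, 0, 0, NA, NA),
  INF  = c(0, 1, 1, 1, 1, 1, NA, 0, 1),
  HYP  = c(NA, 0, 1, 0, NA, 1, 1, 1, 1)
)

# Reference: the explicit condition with is.na()
keep_explicit <- a[(a$Diab != 0) | is.na(a$Diab), ]

# Alternative 1: %in% returns FALSE (not NA) for NA %in% 0,
# so negating it keeps the Diab = NA rows
keep_in <- a[!(a$Diab %in% 0), ]

# Alternative 2: remove the unwanted row numbers; which() skips NA.
# The length check matters: a[-integer(0), ] would return zero rows.
drop_rows  <- which(a$Diab == 0)
keep_which <- if (length(drop_rows) > 0) a[-drop_rows, ] else a

# Compare the results
all.equal(keep_explicit, keep_in)
all.equal(keep_explicit, keep_which)
```

Be aware that subset(a, Diab != 0) is not equivalent here: subset() treats an NA condition as FALSE, so it would drop the Diab=NA rows that should be kept.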

For more info regarding missing values, look here.

Thank you so much both of you! – Charlotta Rylander Apr 10 '13 at 11:55

score 0 · Accepted Answer · answered Apr 10 '13 at 11:27

I think this does what you wanted:

> a[(a$Diab != 0) | is.na(a$Diab),]
  Diab INF HYP
2   NA   1   0
3    1   1   1
4    1   1   0
5    1   1  NA
8   NA   0   1
9   NA   1   1

You need to find entries in Diab which are either not equal to zero (!= 0) or missing (is.na). The boolean operator | means OR and works element by element.
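Why the combined condition contains no NA can be seen from how R evaluates OR with missing values. This is a small sketch; the names `not_zero`, `is_missing` and `keep` are only for illustration.

```r
# Three-valued logic: if one side is TRUE the result is TRUE,
# regardless of whether the other side is known
NA | TRUE    # TRUE
NA | FALSE   # NA

Diab <- c(0, NA, 1, 1, 1, 0, 0, NA, NA)

not_zero   <- Diab != 0      # NA wherever Diab is NA
is_missing <- is.na(Diab)    # TRUE wherever Diab is NA, never NA itself
keep       <- not_zero | is_missing

anyNA(not_zero)   # TRUE
anyNA(keep)       # FALSE
which(keep)       # 2 3 4 5 8 9
```

Wherever `not_zero` is NA, `is_missing` is TRUE, so the OR resolves to TRUE and the index used inside the square brackets is free of NA.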

answered Apr 10 '13 at 11:27

Paul Hiemstra

Thank you for your quick answer! That helped a lot! – Charlotta Rylander Apr 10 '13 at 11:56
If this solves your problem, you can press the green tick mark to the left hand side of my answer. This shows everyone you have an answer. – Paul Hiemstra Apr 10 '13 at 12:04
