
I have a large data frame containing epidemiological data (48232 rows and 74 columns). I read it into R from a .csv file with the argument na.strings="NA". I have several dichotomous variables with YES/NO answers coded 0=YES, 1=NO. These variables also contain NAs. I would like to create a new data frame containing all columns, removing those samples that have Diab=0 but NOT removing those with Diab=NA. I use square brackets for this. When doing so, the dimension of the new data frame is correct; however, all samples that were Diab=NA end up as NA for ALL other dichotomous variables in the new data frame! How do I solve this problem? I have tried to generate a small example:

Diab<-c(0,NA,1,1,1,0,0,NA, NA)
INF<-c(0,1,1,1,1,1,NA, 0,1)
HYP<-c(NA, 0,1,0,NA,1,1,1,1)

a<-data.frame(cbind(Diab, INF, HYP))
dim(a)
table(a$Diab,a$HYP, exclude=NULL, dnn=c("Diab", "HYP"))
#In total 2 persons HYP=0, 5 persons HYP=1, 2 persons HYP=NA. 

b<-a[!a$Diab==0,]
dim(b)
##When removing those Diab=0 I'm expecting to still have 2 persons HYP=0, 
#3 persons HYP=1 and 1 person HYP=NA, but not...

table(b$HYP, exclude=NULL, dnn="HYP")
#6 persons in total but those that were Diab=NA are now turned into HYP=NA??

#The same happens with the INF variable.
table(a$Diab,a$INF, exclude=NULL, dnn=c("Diab", "INF"))
table(b$INF, exclude=NULL, dnn="INF")

I have read this SO question on mysterious NA rows and this mailing list thread on subsetting vs. bracketing, but unfortunately they don't help me even though they seem a bit familiar...

I will be extremely happy for any help! Thanks, Charlotta


2 Answers


The problem is that the logical condition you subset with is built from a column that contains NA. You need to write the condition so that it handles NA explicitly.

As you've written:

> a$Diab
[1]  0 NA  1  1  1  0  0 NA NA

Which of these values are NOT equal to zero?

> !a$Diab==0
[1] FALSE    NA  TRUE  TRUE  TRUE FALSE FALSE    NA    NA

As you can see, you get NA as the answer wherever the input is NA. In the same fashion, arithmetic on NA also returns NA:

> c(NA,NA,3)+1
[1] NA NA  4

You get the idea. A logical index that contains NA cannot tell R whether to keep the row, so for each NA in the index R returns a row filled entirely with NA. That is why the samples with Diab=NA show up as NA in every column of your subsetted data frame.
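As a rough illustration of the point above, the following sketch rebuilds the toy data from the question and counts the rows that come back entirely NA. The names idx and na_rows are introduced here for illustration only.

```r
# Same toy data as in the question.
a <- data.frame(Diab = c(0, NA, 1, 1, 1, 0, 0, NA, NA),
                INF  = c(0, 1, 1, 1, 1, 1, NA, 0, 1),
                HYP  = c(NA, 0, 1, 0, NA, 1, 1, 1, 1))

idx <- !(a$Diab == 0)   # NA wherever Diab is NA
b <- a[idx, ]           # each NA in idx yields a row filled with NA

# Rows 2, 8 and 9 of `a` had real INF/HYP values, but the rows they
# produce in `b` are NA in every column.
na_rows <- rowSums(is.na(b)) == ncol(b)
sum(na_rows)
```

The dimension of b looks right (6 rows), which is why the problem is easy to miss until the other columns are tabulated.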

Solution: either recode the NAs as a value that is easier to handle (if needed), or make the subsetting condition account for NA explicitly. is.na() can be used for the latter. So let's select all rows where Diab is NOT equal to 0 OR Diab is NA:

> a[(a$Diab != 0) | is.na(a$Diab),]
  Diab INF HYP
2   NA   1   0
3    1   1   1
4    1   1   0
5    1   1  NA
8   NA   0   1
9   NA   1   1

For more info regarding missing values, look here.
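To check the result against the counts expected in the question (2 persons HYP=0, 3 persons HYP=1, 1 person HYP=NA), a minimal sketch on the same toy data could look like this:

```r
# Same toy data as in the question.
a <- data.frame(Diab = c(0, NA, 1, 1, 1, 0, 0, NA, NA),
                INF  = c(0, 1, 1, 1, 1, 1, NA, 0, 1),
                HYP  = c(NA, 0, 1, 0, NA, 1, 1, 1, 1))

# Keep rows where Diab is not 0, or where Diab is missing.
b <- a[(a$Diab != 0) | is.na(a$Diab), ]

# Tabulate HYP, including missing values.
table(b$HYP, useNA = "ifany", dnn = "HYP")
```

This works because NA | TRUE evaluates to TRUE, so the is.na() term rescues the rows where the comparison alone would have been NA.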

nadizan

I think this does what you wanted:

> a[(a$Diab != 0) | is.na(a$Diab),]
  Diab INF HYP
2   NA   1   0
3    1   1   1
4    1   1   0
5    1   1  NA
8   NA   0   1
9   NA   1   1

You need to find entries in Diab which are either not equal to zero (!= 0) or missing (is.na). The Boolean operator | means OR.
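A possible alternative, offered here as a sketch rather than as part of either answer, is to use %in%, which never returns NA: NA %in% 0 is FALSE, so negating it keeps the NA rows. The names keep and b2 are illustrative.

```r
# Same toy data as in the question.
a <- data.frame(Diab = c(0, NA, 1, 1, 1, 0, 0, NA, NA),
                INF  = c(0, 1, 1, 1, 1, 1, NA, 0, 1),
                HYP  = c(NA, 0, 1, 0, NA, 1, 1, 1, 1))

keep <- !(a$Diab %in% 0)   # TRUE for Diab = 1 and for Diab = NA
b2 <- a[keep, ]

# Same rows as the explicit is.na() version.
identical(b2, a[(a$Diab != 0) | is.na(a$Diab), ])
```

Note that subset(a, Diab != 0) and a[which(a$Diab != 0), ] both drop the rows where the condition is NA, which is the opposite of what the question asks for.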

Paul Hiemstra