I have a large data frame containing epidemiological data (48232 rows and 74 columns). I read it into R from a .csv file with the argument na.strings="NA". I have several dichotomous variables with YES/NO answers coded 0=YES, 1=NO. These variables also contain NAs. I would like to create a new data frame that contains all columns but drops the samples that have Diab=0, while keeping those with Diab=NA. I use square brackets for this. The dimensions of the new data frame are correct; however, all samples that had Diab=NA end up as NA for ALL other dichotomous variables in the new data frame! How do I solve this problem? I have tried to put together a small example:
Diab<-c(0,NA,1,1,1,0,0,NA, NA)
INF<-c(0,1,1,1,1,1,NA, 0,1)
HYP<-c(NA, 0,1,0,NA,1,1,1,1)
a<-data.frame(cbind(Diab, INF, HYP))
dim(a)
table(a$Diab,a$HYP, exclude=NULL, dnn=c("Diab", "HYP"))
#In total 2 persons HYP=0, 5 persons HYP=1, 2 persons HYP=NA.
b<-a[!a$Diab==0,]
dim(b)
##When removing those Diab=0 I'm expecting to still have 2 persons HYP=0,
#3 persons HYP=1 and 1 person HYP=NA, but not...
table(b$HYP, exclude=NULL, dnn="HYP")
#6 persons in total but those that were Diab=NA are now turned into HYP=NA??
#The same happens with the INF variable.
table(a$Diab,a$INF, exclude=NULL, dnn=c("Diab", "INF"))
table(b$INF, exclude=NULL, dnn="INF")
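To narrow this down, here is a minimal sketch that looks at the row index itself, on the same toy data. The names `idx`, `keep` and `b2` are made up for illustration, and the `is.na()` condition rests on the assumption that Diab=NA rows should be kept as they are. It seems that the comparison returns NA wherever Diab is NA, and that an NA in a logical row index produces a row filled with NA:

```r
# Same toy data as above
Diab <- c(0, NA, 1, 1, 1, 0, 0, NA, NA)
INF  <- c(0, 1, 1, 1, 1, 1, NA, 0, 1)
HYP  <- c(NA, 0, 1, 0, NA, 1, 1, 1, 1)
a <- data.frame(Diab, INF, HYP)

# The comparison is NA wherever Diab is NA, and negating NA is still NA
idx <- !a$Diab == 0
print(idx)

# Assumption: rows with Diab = NA should be kept unchanged,
# so NA is turned into an explicit TRUE in the index
keep <- is.na(a$Diab) | a$Diab != 0
b2 <- a[keep, ]

print(dim(b2))
print(table(b2$HYP, useNA = "ifany", dnn = "HYP"))
```

With this index, `b2` has 6 rows and the HYP counts match what I expected (2 with HYP=0, 3 with HYP=1, 1 with HYP=NA), but I am not sure whether this is the recommended way to do it on the full data set.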
I have read this SO question on mysterious NA rows and this mailing list thread on subsetting vs. bracketing, but unfortunately they don't help me, even though the problem seems a bit familiar...
I would be extremely grateful for any help! Thanks, Charlotta