62

I've been encountering what I think is a bug. It's not a big deal, but I'm curious if anyone else has seen this. Unfortunately, my data is confidential, so I have to make up an example, and it's not going to be very helpful.

When subsetting my data, I occassionally get mysterious NA rows that aren't in my original data frame. Even the rownames are NA. EG:

example <- data.frame("var1"=c("A", "B", "A"), "var2"=c("X", "Y", "Z"))
example

  var1 var2
1    A    X
2    B    Y
3    A    Z

then I run:

example[example$var1=="A",]

  var1 var2
1    A    X
3    A    Z
NA<NA> <NA>

Of course, the example above does not actually give you this mysterious NA row; I am adding it here to illustrate the problem I'm having with my data.

Maybe it has to do with the fact that I'm importing my original data set using Google's read.xlsx package and then executing wide to long reshape before subsetting.

Thanks

chrisg
  • 631
  • 2
  • 6
  • 5
  • 7
    While it's impossible to be sure without seeing your data, the problem is almost certainly that some of your indices are greater than the number of rows are in the data. For example, try `example[c(1, 2, 4),]` or `example[c(TRUE, TRUE, FALSE, TRUE),]` using your data frame above. Check the length (if it's boolean) and the maximum (if it's numeric) of the vector you are using to subset the rows. – David Robinson Jan 10 '13 at 15:47
  • 7
    ...and/or some of your indices are `NA` themselves. – joran Jan 10 '13 at 15:48
  • 1
    As David said, we need to know more... but looking at `str(yourdata)` and `summary(yourdata)` will help you out a lot. I have a feeling you have at least one `NA` in your `var` column. Test it: `example – Justin Jan 10 '13 at 15:48
  • 5
    If your code is analogous to this example (of the form `d[d$v == x, ], your problem is indeed almost certainly `NA`s in your column. – David Robinson Jan 10 '13 at 15:53
  • 9
    Answered! I have NAs in the index column. I can't believe I've never come across this before. It's funny to me that R "censors" the data in other columns with NAs (even the row name!) when you hit an NA in your index column. I'm new to posting on StackOverflow so it will take me a minute to figure out how to designate this question answered. – chrisg Jan 10 '13 at 17:17
  • Well I can't figure out how to designate this "answered", so maybe I made that part up. Thanks for the help – chrisg Jan 10 '13 at 17:25
  • 1
    @chrisg enter an answer yourself. – Matthew Lundberg Jan 10 '13 at 17:30

7 Answers7

65

Wrap the condition in which:

df[which(df$number1 < df$number2), ]

How it works:

It returns the row numbers where the condition matches (where the condition is TRUE) and subsets the data frame on those rows accordingly.

Say that:

which(df$number1 < df$number2)

returns row numbers 1, 2, 3, 4 and 5.

As such, writing:

df[which(df$number1 < df$number2), ]

is the same as writing:

df[c(1, 2, 3, 4, 5), ]

Or an even simpler version is:

df[1:5, ]
Mus
  • 5,994
  • 19
  • 75
  • 112
c-urchin
  • 3,895
  • 6
  • 25
  • 28
30

I see this was already answered by the OP, but since his comment is buried deep within the comment section, here's my attempt to fix this issue (at least with my data, which was behaving the same way).

First of all, some sample data:

> df <- data.frame(name = LETTERS[1:10], number1 = 1:10, number2 = c(10:3, NA, NA))
> df
   name number1 number2
1     A       1      10
2     B       2       9
3     C       3       8
4     D       4       7
5     E       5       6
6     F       6       5
7     G       7       4
8     H       8       3
9     I       9      NA
10    J      10      NA

Now for a simple filter:

> df[df$number1 < df$number2, ]
     name number1 number2
1       A       1      10
2       B       2       9
3       C       3       8
4       D       4       7
5       E       5       6
NA   <NA>      NA      NA
NA.1 <NA>      NA      NA

The problem here is that the presence of NAs in the third column causes R to rewrite the whole row as NA. Nonetheless, the data frame dimensions are maintained. Here's my fix, which requires knowledge of which column contains the NAs:

> df[df$number1 < df$number2 & !is.na(df$number2), ]
  name number1 number2
1    A       1      10
2    B       2       9
3    C       3       8
4    D       4       7
5    E       5       6
Waldir Leoncio
  • 9,134
  • 14
  • 68
  • 94
  • 1
    This is how I've always dealt with this issue, but is there a way to combine the !is.na and < into one command? – Nova Mar 04 '16 at 18:46
  • @Nova, I don't think so, since they are two distinct logical tests. I'd love to be proven wrong, though. – Waldir Leoncio Mar 07 '16 at 09:12
  • 3
    Answered above, the which() function may fit that role but it's less than satisfactory. I strongly believe this to be a bug imho and it's unfortunate that this "feature" (NA selection craziness) won't be fixed. – Chris Aug 31 '17 at 11:31
  • This is so helpful for understanding why this happens to me all the time. I agree with others that this is a bug. Hopefully someone on the R Core Team also agrees. – colin Nov 05 '20 at 09:35
  • @colin, I am not so sure it's a bug, nowadays I'd simply call it a result of the design philosophy behind R of not discarding NA values by default. Instead, what R usually does is "oh, there's an NA on this vector, so I'm just going to have the whole thing show as NA because I don't know what the value of NA stands for and how it affects the rest of the vector). Take for instance `mean(c(1, 3, NA))`. R will print NA because it doesn't know what the third value is, so it can't really tell you what the mean is. If the user wants to drop the NA, they have to explicitly set `na.rm=TRUE`. – Waldir Leoncio Nov 05 '20 at 14:04
13

I get the same problem when using code similar to what you posted. Using the function subset()

subset(example,example$var1=="A")

the NA row instead gets excluded.

user3612472
  • 139
  • 1
  • 3
  • 5
    This is helpful, but please beware of the potential problems of using `subset` anywhere other than in an interactive R session. From the function's help page: "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences." – Waldir Leoncio Dec 15 '14 at 18:58
4

Using dplyr:

library(dplyr)
filter(df, number1 < number2)
Victor Yan
  • 2,778
  • 2
  • 20
  • 25
3

I find using %in$ instead of == can solve this issue although I am still wondering why. For example, instead of: df[df$num == 1,] use: df[df$num %in% c(1),] will work.

  • 2020, working in R 3.6.3 and using `df[df$col1 %in% c("Whatever"), ]` bring me a table with no empty NA indexed rows. Whereas using the equal symbol like this: `df[df$col1 == "Whatever", ]` bring back the initial problem. A filtered table with empty rows indexed with NA. – Corina Roca Nov 24 '20 at 17:27
1
   > example <- data.frame("var1"=c("A", NA, "A"), "var2"=c("X", "Y", "Z"))
    > example
      var1 var2
    1    A    X
    2 <NA>    Y
    3    A    Z
    > example[example$var1=="A",]
       var1 var2
    1     A    X
    NA <NA> <NA>
    3     A    Z

Probably this must be your result u are expecting...Try this try using which condition before condition to avoid NA's

  example[which(example$var1=="A"),]
      var1 var2
    1    A    X
    3    A    Z
0

Another cause may be that you get the condition wrong, such as checking if a factor column is equal to a value that is not among its levels. Troubled me for a while.

Jan Šimbera
  • 429
  • 3
  • 14