When I index a vector or data frame in R, I sometimes get an empty vector (e.g. numeric(0), integer(0), or factor(0)) and sometimes get NA.
My guess is that I get NA when the vector or data frame I am working with contains NA.

For example,

iris_test = iris
iris_test$Sepal.Length[1] = NA

iris[iris$Sepal.Length < 0, "Sepal.Length"] # numeric(0)
iris_test[iris_test$Sepal.Length < 0, "Sepal.Length"] # NA

It's intuitive to me that I get numeric(0) when no values match my condition
(no search result --> no element in the resulting vector --> numeric(0)).
However, why do I get NA rather than numeric(0) in the second case?

Ronak Shah
htlee
  • I intentionally used an awkward condition to get NA or an empty vector. – htlee Nov 19 '19 at 05:08
  • There are at least 4 base methods for subsetting data frames: A1) "naked" `"["` with a logical expression; A2) "semi-clothed" `"[,expr]"` using the logical expression of your choice combined with `& !is.na(expr)`; B) the `subset` function, which uses the A2 strategy internally; and C) the `"["` function with a `which()` argument. They each have advantages and disadvantages. This was discussed here and in several cogent comments to it: https://stackoverflow.com/questions/4935479/how-to-combine-multiple-conditions-to-subset-a-data-frame-using-or/4935551?r=SearchResults&s=1|63.8638#4935551 – IRTFM Nov 19 '19 at 16:10
  • @42- Thank you for the information. – htlee Dec 13 '19 at 04:41

2 Answers


Your assumption is roughly correct: you get NA values when there is an NA in the data.

The comparison itself yields NA for the missing element:

iris_test$Sepal.Length < 0
#[1]    NA FALSE FALSE FALSE.....

When you subset a vector with an NA index, it returns NA for that position. See for example,

iris$Sepal.Length[c(1, NA)]
#[1] 5.1  NA 

This is what happens in the second case. In the first case, all the values are FALSE, so you get numeric(0):

iris$Sepal.Length[FALSE]
#numeric(0)
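
To tie this back to the question, here is a minimal sketch of the same effect with a logical index, plus two common ways to treat NA as "no match". The variable names (cond, with_na, dropped_which, dropped_isna) are only illustrative and are not from the original post.

```r
# Assumption: the same modified copy of iris as in the question.
iris_test <- iris
iris_test$Sepal.Length[1] <- NA

# NA for row 1, FALSE for every other row.
cond <- iris_test$Sepal.Length < 0

# A logical NA index picks an "unknown" element, so one NA comes back.
with_na <- iris_test$Sepal.Length[cond]

# which() returns only the positions that are TRUE, so NA is dropped.
dropped_which <- iris_test$Sepal.Length[which(cond)]

# Equivalent idea: explicitly turn NA into FALSE before indexing.
dropped_isna <- iris_test$Sepal.Length[cond & !is.na(cond)]

print(with_na)
print(dropped_which)
print(dropped_isna)
```

Which variant to prefer is a matter of taste; which() is shorter, while the !is.na() form keeps the index logical.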
Ronak Shah
  • Thank you for your reply. Is there an explanation for the NA result (e.g. a deduction from first principles), or is it a basic rule in itself? – htlee Nov 19 '19 at 05:13
  • @HoonTaek yes, it is documented in `?Extract`: `When extracting, a numerical, logical or character NA index picks an unknown element and so returns NA in the corresponding element of a logical, integer, numeric, complex or character result` – Ronak Shah Nov 19 '19 at 05:28

Adding to @Ronak's answer:

The discussion of NA in R for Data Science made NA easy for me to understand. NA stands for "Not Available" and represents an unknown value. According to the book, missing values are "contagious": almost any operation involving an unknown (NA) value will also be unknown. Here are some examples:

# Is unknown greater than 0? Result is unknown (NA).
NA > 0
#[1] NA

# Is unknown less than 0? Result is unknown (NA).
NA < 0
#[1] NA

# Is unknown equal to unknown? Result is unknown (NA).
NA == NA
#[1] NA
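
The word "almost" matters here. As a small sketch (the variable names are only for illustration), R does return a definite answer when the result would be the same whatever the unknown value turned out to be:

```r
# The result does not depend on the unknown value, so it is known.
known_and <- NA & FALSE   # anything AND FALSE is FALSE
known_or  <- NA | TRUE    # anything OR TRUE is TRUE
known_pow <- NA^0         # any number to the power 0 is 1

# The result still depends on the unknown value, so it stays NA.
still_and <- NA & TRUE
still_or  <- NA | FALSE

print(c(known_and, known_or))
print(known_pow)
print(c(still_and, still_or))
```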

Getting back to your data, when you do: iris_test$Sepal.Length[1] = NA, you are assigning the value of iris_test$Sepal.Length[1] as "unknown" (NA).

The question being asked for that element is "Is unknown less than 0?". The answer is unknown, and that is why your subsetting returns NA as output: the value is unknown (NA).

To handle missing values there is a function called is.na(), which I'm sure you're aware of.
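
As a rough sketch of how is.na() can be used with the data from the question (the names n_missing, clean, res and sub are hypothetical, chosen only for this example):

```r
# Assumption: the same modified copy of iris as in the question.
iris_test <- iris
iris_test$Sepal.Length[1] <- NA

# Count how many values are missing.
n_missing <- sum(is.na(iris_test$Sepal.Length))

# Drop the rows with a missing Sepal.Length, then apply the condition.
clean <- iris_test[!is.na(iris_test$Sepal.Length), ]
res <- clean[clean$Sepal.Length < 0, "Sepal.Length"]

# subset() treats an NA condition as "no match", so no NA row appears.
sub <- subset(iris_test, Sepal.Length < 0)

print(n_missing)
print(res)
print(nrow(sub))
```

Whether to remove missing rows first or to let subset() skip them depends on whether the NA rows are needed later in the analysis.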

Hope that adds some insight to your question.

deepseefan
  • Thank you for the great explanation. R should say unknown in response to questions about the unknown. – htlee Nov 19 '19 at 05:55