10

I have a data.table that contains some groups. I operate on each group and some groups return numbers, others return NA. For some reason data.table has trouble putting everything back together. Is this a bug or am I misunderstanding? Here is an example:

library(data.table)
dtb <- data.table(a=1:10)
f <- function(x) {if (x==9) {return(NA)} else { return(x)}}
dtb[,f(a),by=a]

Error in `[.data.table`(dtb, , f(a), by = a) : 
  columns of j don't evaluate to consistent types for each group: result for group 9 has column 1 type 'logical' but expecting type 'integer'

My understanding was that NA is compatible with numbers in R since clearly we can have a data.table that has NA values. I realize I can return NULL and that will work fine but the issue is with NA.

mnel
Alex
  • 2
    possible duplicate of [Why does median trip up data.table (integer versus double)?](http://stackoverflow.com/questions/12125364/why-does-median-trip-up-data-table-integer-versus-double) – Matt Dowle Sep 13 '12 at 07:41
  • 2
    I had a related problem once as well: [Splitting a data.table with the by-operator: functions that return numeric values and/or NAs fail](http://stackoverflow.com/questions/7960798/splitting-a-data-table-with-the-by-operator-functions-that-return-numeric-value) – Christoph_J Sep 13 '12 at 08:57
  • @Alex When the question is about an error message, try searching S.O. for the error message. For example [this search](http://stackoverflow.com/search?q=%5Bdata.table%5D+%22columns+of+j+don%27t+evaluate+to+consistent+types+for+each+group%22&submit=search) returns the 2 links above and a 3rd one too. – Matt Dowle Sep 13 '12 at 10:13
  • thank you for the references! i tried searching for NA in data.table on google but didn't hit much. i'll try searching for the error message next time. appreciate the help. – Alex Sep 13 '12 at 16:30

3 Answers

14

From ?NA

NA is a logical constant of length 1 which contains a missing value indicator. NA can be coerced to any other vector type except raw. There are also constants NA_integer_, NA_real_, NA_complex_ and NA_character_ of the other atomic vector types which support missing values: all of these are reserved words in the R language.
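
As a quick illustration of the passage quoted above, the following base R sketch (no packages assumed) shows the storage type of each NA constant, and how a plain NA is coerced when it is combined with other values:

```r
# Plain NA is logical; the typed constants carry their own storage type.
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# Combining NA with other values coerces it to the common type,
# which is why vectors containing NA normally "just work".
typeof(c(1L, NA))      # "integer"
```

Grouping in data.table compares the type returned for each group separately, so a lone logical NA returned for one group is not coerced this way.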

You will have to specify the correct type for your function to work -

You can coerce within the function to match the type of x (note that we need any() for this to work when a subset has more than one row):

f <- function(x) {if (any(x==9)) {return(as(NA, class(x)))} else { return(x)}}
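
A minimal sketch of this fix applied to the example from the question, assuming the data.table package is installed. The result name `res` is only for illustration, and `V1` is the default name data.table gives to an unnamed j result:

```r
library(data.table)

dtb <- data.table(a = 1:10)

# Return an NA of the same class as the input, so every group
# yields the same column type.
f <- function(x) {
  if (any(x == 9)) as(NA, class(x)) else x
}

res <- dtb[, f(a), by = a]  # no type-consistency error now
res[a == 9]                 # V1 is an integer NA for this group
```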

More data.table*ish* approach

It might make more data.table sense to use set (or :=) to set / replace by reference.

set(dtb, i = which(dtb[,a]==9), j = 'a', value=NA_integer_)

Or := within [ using a vector scan for a==9

dtb[a == 9, a := NA_integer_]

Or := along with a binary search

setkeyv(dtb, 'a')
dtb[J(9), a := NA_integer_] 

Useful to note

If you use the := or set approaches, you don't appear to need to specify the NA type

Both of the following will work:

dtb <- data.table(a=1:10)
setkeyv(dtb,'a')
dtb[a==9,a := NA]

dtb <- data.table(a=1:10)
setkeyv(dtb,'a')
set(dtb, which(dtb[,a] == 9), 'a', NA)

By contrast, combining binary search in i with an untyped NA, i.e. DTc[J(9), a := NA], fails (at least in the data.table version used here), but with a very useful error message that states both the reason and the solution:

Error in `[.data.table`(DTc, J(9), `:=`(a, NA)) : Type of RHS ('logical') must match LHS ('integer'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)


Which is quickest

A comparison on a reasonably large data set, where a is replaced in situ:

Replace in situ

library(data.table)

set.seed(1)
n <- 1e+07
DT <- data.table(a = sample(15, n, replace = TRUE))
setkeyv(DT, "a")
DTa <- copy(DT)
DTb <- copy(DT)
DTc <- copy(DT)
DTd <- copy(DT)
DTe <- copy(DT)

f <- function(x) {
    if (any(x == 9)) {
        return(as(NA, class(x)))
    } else {
        return(x)
    }
}

system.time({DT[a == 9, `:=`(a, NA_integer_)]})
##    user  system elapsed 
##    0.95    0.24    1.20 
system.time({DTa[a == 9, `:=`(a, NA)]})
##    user  system elapsed 
##    0.74    0.17    1.00 
system.time({DTb[J(9), `:=`(a, NA_integer_)]})
##    user  system elapsed 
##    0.02    0.00    0.02 
system.time({set(DTc, which(DTc[, a] == 9), j = "a", value = NA)})
##    user  system elapsed 
##    0.49    0.22    0.67 
system.time({set(DTd, which(DTd[, a] == 9), j = "a", value = NA_integer_)})
##    user  system elapsed 
##    0.54    0.06    0.58 
system.time({DTe[, `:=`(a, f(a)), by = a]})
##    user  system elapsed 
##    0.53    0.12    0.66 
# They are all the same!
all(identical(DT, DTa), identical(DT, DTb), identical(DT, DTc), identical(DT, 
    DTd), identical(DT, DTe))
## [1] TRUE

Unsurprisingly the binary search approach is the fastest

mnel
  • interesting! could you please elaborate a bit? i didn't realize there were different types of `NA` values.. – Alex Sep 13 '12 at 04:58
  • 1
    Look at `?NA` (quoting) *There are also constants NA_integer_, NA_real_, NA_complex_ and NA_character_ * – mnel Sep 13 '12 at 05:07
  • This is a bit silly of R, no? In this toy example you can just do `as(NA, class(x))` but in a situation where you don't know if the result value would be `integer` or `double`, for example, what is one to do – Alex Sep 13 '12 at 05:10
  • If your data is of class `integer`, then it will return `NA_integer_`; if it is a double, then it will be of class `numeric` and return `NA_real_`. Most of the time you don't care, and the more data.tableish approaches don't require you to specify – mnel Sep 13 '12 at 05:18
  • yeh, but actually in my specific case what is happening is things in my `f` are converted to an `xts`, some computations done and an answer returned. so i can't use your suggestion unfortunately. i ended up removing `NA` values from the return value in my use-case. – Alex Sep 13 '12 at 05:21
  • Post a small example of what your actual function and data are then we could approach your problem – mnel Sep 13 '12 at 05:57
  • There's a few vector scans (`==` in `i`) here that appear to be intended as binary search? Also, I voted to close as dup (see other question), but maybe @Alex could update question title instead since `:=` seems appropriate too and there's good info in this answer. – Matt Dowle Sep 13 '12 at 07:52
  • 2
    @mnel, since you're keying `DT` on `a`, maximum speed for this simple example would seem to be achieved using `DT[J(9), a := NA_integer_]`? Perhaps it's a moot point given the OP's previous comment, though. – BenBarnes Sep 13 '12 at 08:32
  • Fixed the binary search issue. I thought I had tried that and it had thrown an error. Binary search wins by far! – mnel Sep 13 '12 at 10:42
0

You can also do something like this:

dtb <- data.table(a=1:10)

mat <- ifelse(dtb == 9,NA,dtb$a)

The above command will give you a matrix, but you can convert it back to a data.table:

new.dtb <- data.table(mat)
new.dtb
     a
 1:   1
 2:   2
 3:   3
 4:   4
 5:   5
 6:   6
 7:   7
 8:   8
 9:  NA
10:  10
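
As a possible variation on the idea above (a sketch, not part of the original answer), applying ifelse to the column vector instead of the whole table avoids the detour through a matrix. The name `dtb2` is only for illustration:

```r
library(data.table)

dtb2 <- data.table(a = 1:10)

# ifelse() on the column vector returns an integer vector here,
# so it can be assigned back to the integer column by reference.
dtb2[, a := ifelse(a == 9, NA, a)]
dtb2
```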

Hope this helps.

user1021713
-1

If you want to assign NAs to many variables, you could use the approach suggested here:

v_1  <- c(0,0,1,2,3,4,4,99)
v_2  <- c(1,2,2,2,3,99,1,0)
dat  <-  data.table(v_1,v_2)

for(n in 1:2) {
  chari <-  paste0(sprintf('v_%s' ,n), ' %in% c(0,99)')
  charj <- sprintf('v_%s := NA_integer_', n)
  dat[eval(parse(text=chari)), eval(parse(text=charj))]
}
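
The same result can probably be reached without eval(parse()) by looping over column names with set(), in the spirit of the first answer. This is a sketch under the assumption that the columns are doubles, as they are in this example, which is why NA_real_ is used. The names `dat2` and `col` are only for illustration:

```r
library(data.table)

v_1  <- c(0, 0, 1, 2, 3, 4, 4, 99)
v_2  <- c(1, 2, 2, 2, 3, 99, 1, 0)
dat2 <- data.table(v_1, v_2)

# For each column, find the rows holding 0 or 99 and overwrite them
# by reference with a missing value of the matching (double) type.
for (col in names(dat2)) {
  set(dat2, i = which(dat2[[col]] %in% c(0, 99)), j = col, value = NA_real_)
}
dat2
```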
sdaza