
I have a data.table called enc.per.day (encounters per day). It has 2403 rows, each giving a date of service (DOS) and the number of patients seen on that day (n). I wanted to see the median number of patients seen on each day of the week.

enc.per.day[,list(patient.encounters=median(n)),by=list(weekdays(DOS))]

That line gives an error:

Error in `[.data.table`(enc.per.day, , list(patient.encounters = median(n)), : columns of j don't evaluate to consistent types for each group: result for group 4 has column 1 type 'integer' but expecting type 'double'

The following all work:

tapply(enc.per.day$n,weekdays(enc.per.day$DOS),median)
enc.per.day[,list(patient.encounters=round(median(n))),by=list(weekdays(DOS))]
enc.per.day[,list(patient.encounters=median(n)+0),by=list(weekdays(DOS))]

What is going on? It took me a long time to figure out why my code would not work.

By the way, the underlying vector enc.per.day$n is of type integer:

storage.mode(enc.per.day$n)

returns "integer". Further, there are no NAs anywhere in the data.table.

Farrel

1 Answer


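TL;DR: wrap median() in as.double().

median() 'trips up' data.table because, even when it is only passed integer vectors, median() sometimes returns an integer value and sometimes returns a double. For an odd-length vector it returns the middle element (which keeps the input's integer type); for an even-length vector it returns the mean of the two middle elements, which is always a double.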

## median of 1:3 is 2, of type "integer" 
typeof(median(1:3))
# [1] "integer"

## median of 1:2 is 1.5, of type "double"
typeof(median(1:2))
# [1] "double"

Reproducing your error message with a minimal example:

library(data.table)
dt <- data.table(patients = c(1:3, 1:2), 
                 weekdays = c("Mon", "Mon", "Mon", "Tue", "Tue"))

dt[,median(patients), by=weekdays]
# Error in `[.data.table`(dt, , median(patients), by = weekdays) : 
#   columns of j don't evaluate to consistent types for each group: 
#   result for group 2 has column 1 type 'double' but expecting type 'integer'

data.table complains because, after inspecting the value of the first group to be processed, it has concluded that, OK, these results are going to be of type "integer". But then right away (or in your case in group 4), it gets passed a value of type "double", which won't fit in its "integer" results vector.
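To find out which groups are responsible for a mismatch like this, one option is to check the type that median() returns within each group using base R only. This is a small diagnostic sketch on the toy data from the example above; the variable names `patients`, `wday`, and `group_types` are made up for illustration.

```r
# Toy data mirroring the minimal example: one odd-sized and one even-sized group
patients <- c(1:3, 1:2)
wday     <- c("Mon", "Mon", "Mon", "Tue", "Tue")

# Split the values by group and record the type median() returns for each group
group_types <- sapply(split(patients, wday), function(v) typeof(median(v)))

group_types
#       Mon       Tue
# "integer"  "double"
```

Any group with an odd number of integer values yields "integer", and any group with an even number yields "double", so mixed group sizes are enough to trigger the inconsistency.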


After the first group has run and data.table knows the type of the result, it allocates a result vector of that type, as long as the number of groups, and then populates it. If it later finds that some groups return more than one item, it grows (i.e., reallocates) that result vector as needed. In most cases, though, data.table's first guess for the final size of the result is right first time (e.g., one result row per group), which keeps it fast.

data.table could instead accumulate results until the end of the group-wise calculations and then perform type conversions if necessary, but that would add performance-degrading overhead; instead, it just reports what happened and lets you fix the problem.
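The allocate-from-the-first-group strategy can be illustrated with a rough base-R sketch. This is not data.table's actual implementation (that lives in C and handles far more cases); `strict_group_apply` is a hypothetical helper that assumes each group returns exactly one value.

```r
# Hypothetical illustration only: NOT how data.table is implemented internally.
# Assumes f() returns a single value for every group.
strict_group_apply <- function(x, g, f) {
  groups <- split(x, g)

  # Run the first group and allocate the result vector from its type
  first <- f(groups[[1]])
  out <- vector(typeof(first), length(groups))
  out[1] <- first

  # Fill in the remaining groups, refusing any value of a different type
  for (i in seq_along(groups)[-1]) {
    val <- f(groups[[i]])
    if (typeof(val) != typeof(out)) {
      stop(sprintf("result for group %d has type '%s' but expecting type '%s'",
                   i, typeof(val), typeof(out)))
    }
    out[i] <- val
  }

  names(out) <- names(groups)
  out
}

patients <- c(1:3, 1:2)
wday     <- c("Mon", "Mon", "Mon", "Tue", "Tue")

# Fails on the second group: the first group fixed the result type as integer
msg <- tryCatch(strict_group_apply(patients, wday, median),
                error = function(e) conditionMessage(e))

# Succeeds once every group is forced to return a double
ok <- strict_group_apply(patients, wday, function(v) as.double(median(v)))
```

Committing to a type after the first group avoids keeping all per-group results around for a later conversion pass, at the cost of requiring the user's function to be type-stable.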

In this case, using as.double(median(X)) instead of median(X) provides a suitable fix.
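Applied to the minimal example above, the fix might look like the following. It assumes the data.table package is installed; the column name `patient.encounters` is borrowed from the question, and `res` is just an illustrative variable name.

```r
library(data.table)

dt <- data.table(patients = c(1:3, 1:2),
                 weekdays = c("Mon", "Mon", "Mon", "Tue", "Tue"))

# Coerce each group's median to double so every group returns the same type
res <- dt[, list(patient.encounters = as.double(median(patients))), by = weekdays]

res$patient.encounters
# [1] 2.0 1.5
```

The same wrapping should carry over to the original call, i.e. as.double(median(n)) grouped by weekdays(DOS).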

(By the way, your version using round() worked because it always returns values of type "double", as you can see by typing typeof(round(median(1:2))); typeof(round(median(1:3))).)

Brandon Bertelsen
Josh O'Brien
  • @Matthew Dowle -- Thanks for adding those details about how **data.table** initializes and allocates space for the results vector. – Josh O'Brien Sep 05 '12 at 16:02
  • Is it possible to have a median of the same type as the values? That is, even for values like 1,1,1,2,2,2,2 it should not give median=1.5 but median=2. – lony Oct 31 '14 at 13:00
  • As an example to the suggestion above, do this DT[ , c(as.double(lapply(.SD,median)) , .N),by=x, .SDcols=c("x", "y", "z")] instead of DT[ , c(lapply(.SD,median) , .N),by=x, .SDcols=c("x", "y", "z")] – Bhoom Suktitipat Oct 26 '15 at 01:17
  • @JoshO'Brien 1. I cannot reproduce this error in `data.table` v 1.10.4.3. patients is integral before `[, (), by=]` and then comes out with `typeof` of double. 2. I have created similar errors by taking the `max` of integral values; surely the max of integral values is integral. I would post a question but am not sure if it will be flagged as a duplicate. 3. It turns out `-Inf` is a double value but not an integer value in R, so there are subtle points here, but I can't articulate them. – AdamO Apr 11 '18 at 17:39
  • @AdamO -- On your (1), I suspect this is related to the fact that **data.table** will now internally optimize a call to `median()`, using what the package authors call `_GForce_`. (See `?datatable.optimize` for the details.) As part of that effort, they must have taken care of the infelicity discussed here. Your number (3) is an interesting observation, and I've got no idea about (2). Cheers. – Josh O'Brien Apr 11 '18 at 17:54
  • @JoshO'Brien Well I have posted what, in earnest, I suspect to be a sequel to this question. [If it is of any interest to you, here it is](https://stackoverflow.com/questions/49781741/aggregation-and-typing-inconsistency-in-data-table). – AdamO Apr 11 '18 at 18:06