Refer to columns by number in := assignment when using data.table

Question

As per Row wise matrix operations in R I would like to apply a row-wise function on a data.table I have. I wish to calculate, per row, the mean of a number of columns in that row. My current attempt is:

columns <- c(1,5,10,15,20) # Actually obtained via grep
my.data.table[,"average" := mean(columns),with=FALSE] # Or...
my.data.table[,average := mean(columns)]

This, unfortunately, simply returns the mean of the 'columns' vector rather than mean of the columns to which they refer. Is there a way to refer to these columns by number?

Here's the average I'm trying to achieve:

key  a b c average
A    5 5 5 5
B    1 2 3 2
C    2 4 9 5

No problem, I appreciate the effort! Let's see if someone else can find a solution. — Ina, May 25 '12 at 14:17

score 4 · Answer 1 · edited Sep 27 '19 at 03:20

Another alternative is to construct the call you'd really like to carry out, and then eval() it within DT[]. This is the strategy described in sections 1.5 and 1.6 of the data.table FAQ (viewed by typing vignette("datatable-faq")).

This approach runs 3-5 times faster than does that involving rowMeans(). (The disparity is due to rowMeans()' initial time-consuming conversion of data.frames to matrices, as Matthew Dowle points out in comments below.)

## Prepare data
library(data.table)
N <- 1000000
DT <- data.table(ID = 1:N,
                 Year1 = rnorm(N),
                 Year2 = rnorm(N),
                 Year3 = rnorm(N),
                 Year4 = rnorm(N))    
x <- c(2, 3, 4, 5)

## Construct the desired expression:   (Year1 + Year2 + Year3 + Year4)/4
addCols <- paste(names(DT)[x], collapse = " + ")
e <- paste("(", addCols, ")/", length(x), sep="")
e <- parse(text=e)[[1]]

## Compare timings
system.time(x2 <- DT[,eval(e)])
#    user  system elapsed 
#    0.11    0.00    0.11 
system.time(x1 <- rowMeans(DT[, ..x]))
#    user  system elapsed 
#    0.53    0.14    0.77 

## Check results
# all.equal(x1,x2)
# [1] TRUE

+10 Nice! See first line of `rowMeans`: `if (is.data.frame(x)) x=as.matrix(x)`. So that's copying into the `matrix` structure first. That tallies with there being a difference between `user` and `elapsed` for `rowMeans`, which you avoid with the direct `eval`. Mult `N` by `10` and then `10` again and difference should expand. — Matt Dowle, May 25 '12 at 16:38
@MatthewDowle - Yep, that's it. Thanks for tracking that down! — Josh O'Brien, May 25 '12 at 16:48
NP. Could you review [my answer](http://stackoverflow.com/questions/1296646/how-to-sort-a-dataframe-by-columns-in-r) to the very top voted `data.frame` question, and give it a start off 0 if it's ok? — Matt Dowle, May 25 '12 at 16:54

score 3 · Accepted Answer · edited Sep 27 '19 at 03:20

3

Here are two possible solutions. They are basically both from the link that you've already provided, so maybe I missed something with this question. Here we go:

Solution 1 (using rowMeans):

library(data.table)
N <- 1000000
my.data.table <- data.table(ID = 1:N,
                            Year1 = rnorm(N),
                            Year2 = rnorm(N),
                            Year3 = rnorm(N),
                            Year4 = rnorm(N))

x <- c(2, 3, 4, 5)
system.time(x1 <- rowMeans(my.data.table[, ..x]))
   user  system elapsed 
   0.08    0.00    0.08

Solution 2: Get it into long format first. I thought this was faster, mainly because of Matthew's comment in the other question that says that data.table is meant for the DT[,mad(variable),by=group] syntax. I think I'm missing something, but don't see what:

library(reshape2)
DT <- as.data.table(melt(as.data.frame(my.data.table), id.var="ID"))
setkey(DT, ID)
system.time(x2 <- DT[, mean(value), by="ID"][[2]])
   user  system elapsed 
  11.28    0.00   11.33 
all.equal(x1, x2)
[1] TRUE

edited Sep 27 '19 at 03:20

MichaelChirico

31,197
13
98
169

answered May 25 '12 at 14:36

Christoph_J

6,428
7
39
55

1

+1 I can't beat 0.08. In this case there isn't any grouping. Grouping where every row is a group isn't grouping really. I agreed `rowMeans` was best (afaik) in the comments in that [other question](http://stackoverflow.com/questions/9564588/row-wise-matrix-operations-in-r), and also mentioned "bare-bones" `.colSums()`,`.rowSums()`,`.colMeans()` and `.rowMeans()` where ultimate speed is required, added in R 2.15.0. – Matt Dowle May 25 '12 at 15:12
OK, good to know I'm not missing anything here. Thanks for the clarification. – Christoph_J May 25 '12 at 15:16
2

@MatthewDowle and Christoph_J -- It looks like I found something that's 3-5 times faster. Will be interested if either of you have insight about why it is so much quicker. – Josh O'Brien May 25 '12 at 16:22

user1317221_G · Answer 3 · 2012-05-25T14:40:51.040

1

ok another go...

would this be ok

x<-1:5
y<-1:5
z<-1:5
xy<-data.table(x,y,z)
id<-c("x","y")
newxy<-rowMeans(xy[, id, with=FALSE])

edited May 25 '12 at 14:40

answered May 25 '12 at 14:02

user1317221_G

13,886
2
46
72

That syntax wouldn't work with data.table and colMeans isn't really applicable here. – Ina May 25 '12 at 14:07
2

Only trouble with editing is that the comments now don't match up. So just to clarify, `rowMeans` does work fine with `data.table`, Ina's comment was about the original answer which did something else. – Matt Dowle May 25 '12 at 15:01

Refer to columns by number in := assignment when using data.table

3 Answers3