1

As per Row wise matrix operations in R I would like to apply a row-wise function on a data.table I have. I wish to calculate, per row, the mean of a number of columns in that row. My current attempt is:

columns <- c(1,5,10,15,20) # Actually obtained via grep
my.data.table[,"average" := mean(columns),with=FALSE] # Or...
my.data.table[,average := mean(columns)]

This, unfortunately, simply returns the mean of the 'columns' vector rather than mean of the columns to which they refer. Is there a way to refer to these columns by number?

Here's the average I'm trying to achieve:

key  a b c average
A    5 5 5 5
B    1 2 3 2
C    2 4 9 5
Community
  • 1
  • 1
Ina
  • 4,030
  • 5
  • 27
  • 42
  • No problem, I appreciate the effort! Let's see if someone else can find a solution. – Ina May 25 '12 at 14:17

3 Answers3

4

Another alternative is to construct the call you'd really like to carry out, and then eval() it within DT[]. This is the strategy described in sections 1.5 and 1.6 of the data.table FAQ (viewed by typing vignette("datatable-faq")).

This approach runs 3-5 times faster than does that involving rowMeans(). (The disparity is due to rowMeans()' initial time-consuming conversion of data.frames to matrices, as Matthew Dowle points out in comments below.)

## Prepare data
library(data.table)
N <- 1000000
DT <- data.table(ID = 1:N,
                 Year1 = rnorm(N),
                 Year2 = rnorm(N),
                 Year3 = rnorm(N),
                 Year4 = rnorm(N))    
x <- c(2, 3, 4, 5)

## Construct the desired expression:   (Year1 + Year2 + Year3 + Year4)/4
addCols <- paste(names(DT)[x], collapse = " + ")
e <- paste("(", addCols, ")/", length(x), sep="")
e <- parse(text=e)[[1]]

## Compare timings
system.time(x2 <- DT[,eval(e)])
#    user  system elapsed 
#    0.11    0.00    0.11 
system.time(x1 <- rowMeans(DT[, ..x]))
#    user  system elapsed 
#    0.53    0.14    0.77 

## Check results
# all.equal(x1,x2)
# [1] TRUE
MichaelChirico
  • 31,197
  • 13
  • 98
  • 169
Josh O'Brien
  • 148,908
  • 25
  • 332
  • 435
  • 2
    +10 Nice! See first line of `rowMeans`: `if (is.data.frame(x)) x=as.matrix(x)`. So that's copying into the `matrix` structure first. That tallies with there being a difference between `user` and `elapsed` for `rowMeans`, which you avoid with the direct `eval`. Mult `N` by `10` and then `10` again and difference should expand. – Matt Dowle May 25 '12 at 16:38
  • @MatthewDowle - Yep, that's it. Thanks for tracking that down! – Josh O'Brien May 25 '12 at 16:48
  • NP. Could you review [my answer](http://stackoverflow.com/questions/1296646/how-to-sort-a-dataframe-by-columns-in-r) to the very top voted `data.frame` question, and give it a start off 0 if it's ok? – Matt Dowle May 25 '12 at 16:54
3

Here are two possible solutions. They are basically both from the link that you've already provided, so maybe I missed something with this question. Here we go:

Solution 1 (using rowMeans):

library(data.table)
N <- 1000000
my.data.table <- data.table(ID = 1:N,
                            Year1 = rnorm(N),
                            Year2 = rnorm(N),
                            Year3 = rnorm(N),
                            Year4 = rnorm(N))

x <- c(2, 3, 4, 5)
system.time(x1 <- rowMeans(my.data.table[, ..x]))
   user  system elapsed 
   0.08    0.00    0.08

Solution 2: Get it into long format first. I thought this was faster, mainly because of Matthew's comment in the other question that says that data.table is meant for the DT[,mad(variable),by=group] syntax. I think I'm missing something, but don't see what:

library(reshape2)
DT <- as.data.table(melt(as.data.frame(my.data.table), id.var="ID"))
setkey(DT, ID)
system.time(x2 <- DT[, mean(value), by="ID"][[2]])
   user  system elapsed 
  11.28    0.00   11.33 
all.equal(x1, x2)
[1] TRUE
MichaelChirico
  • 31,197
  • 13
  • 98
  • 169
Christoph_J
  • 6,428
  • 7
  • 39
  • 55
  • 1
    +1 I can't beat 0.08. In this case there isn't any grouping. Grouping where every row is a group isn't grouping really. I agreed `rowMeans` was best (afaik) in the comments in that [other question](http://stackoverflow.com/questions/9564588/row-wise-matrix-operations-in-r), and also mentioned "bare-bones" `.colSums()`,`.rowSums()`,`.colMeans()` and `.rowMeans()` where ultimate speed is required, added in R 2.15.0. – Matt Dowle May 25 '12 at 15:12
  • OK, good to know I'm not missing anything here. Thanks for the clarification. – Christoph_J May 25 '12 at 15:16
  • 2
    @MatthewDowle and Christoph_J -- It looks like I found something that's 3-5 times faster. Will be interested if either of you have insight about why it is so much quicker. – Josh O'Brien May 25 '12 at 16:22
1

ok another go...

would this be ok

x<-1:5
y<-1:5
z<-1:5
xy<-data.table(x,y,z)
id<-c("x","y")
newxy<-rowMeans(xy[, id, with=FALSE])
user1317221_G
  • 13,886
  • 2
  • 46
  • 72
  • That syntax wouldn't work with data.table and colMeans isn't really applicable here. – Ina May 25 '12 at 14:07
  • 2
    Only trouble with editing is that the comments now don't match up. So just to clarify, `rowMeans` does work fine with `data.table`, Ina's comment was about the original answer which did something else. – Matt Dowle May 25 '12 at 15:01