49

What is the best (fastest) way to implement a sliding window function with the data.table package?

I'm trying to calculate a rolling median but have multiple rows per date (due to 2 additional factors), which I think means that the zoo rollapply function wouldn't work. Here is an example using a naive for loop:

library(data.table)
df <- data.frame(
  id=30000,
  date=rep(as.IDate(as.IDate("2012-01-01")+0:29, origin="1970-01-01"), each=1000),
  factor1=rep(1:5, each=200),
  factor2=1:5,
  value=rnorm(30, 100, 10)
)

dt = data.table(df)
setkeyv(dt, c("date", "factor1", "factor2"))

get_window <- function(date, factor1, factor2) {
  criteria <- data.table(
    date=as.IDate((date - 7):(date - 1), origin="1970-01-01"),
    factor1=as.integer(factor1),
    factor2=as.integer(factor2)
  )
  return(dt[criteria][, value])
}

output <- data.table(unique(dt[, list(date, factor1, factor2)]))[, window_median:=as.numeric(NA)]

for(i in nrow(output):1) {
  print(i)
  output[i, window_median:=median(get_window(date, factor1, factor2))]
}
geneorama
  • 3,057
  • 4
  • 24
  • 33
alan
  • 3,717
  • 6
  • 31
  • 48
  • +1 too. Can you provide more info about data size and times. From your comment to Alan's answer (alan and Alan are different people?), it takes 6.4s (vs 973s for `data.frame`) and you'd like to improve 6.4s further? – Matt Dowle Aug 08 '12 at 14:30
  • Alan and alan are different people :). The dataset has ~650,000 rows. I came up with a solution that works much faster but is very memory intensive. Any ideas on how it can be improved further? – alan Aug 10 '12 at 15:02

4 Answers4

9

data.table doesn't have any special features for rolling windows, currently. Further detail here in my answer to another similar question here :

Is there a fast way to run a rolling regression inside data.table?

Rolling median is interesting. It would need a specialized function to do efficiently (same link as in earlier comment) :

Rolling median algorithm in C

The data.table solutions in the question and answers here are all very inefficient, relative to a proper specialized rollingmedian function (which isn't available for R afaik).

Community
  • 1
  • 1
Matt Dowle
  • 56,107
  • 20
  • 160
  • 217
  • 7
    Can we rise the priority of FR#2185? "Add features/documentation for sliding windows". From my perspective it doesn't need to be any roll sum, mean, etc. It is better to have kind of framework function "rollfun=" or the methods below: I've already tried self joins with roll=30, mult='all', allow.cartesian to achieve it, without success. Also the "n" would be good to accept vector, not only scalar. – jangorecki Jan 08 '14 at 15:28
  • 7
    @MusX Ok I've upped the priority to top. – Matt Dowle Jan 08 '14 at 23:52
4

I managed to get the example down to 1.4s by creating a lagged dataset and doing a huge join.

df <- data.frame(
  id=30000,
  date=rep(as.IDate(as.IDate("2012-01-01")+0:29, origin="1970-01-01"), each=1000),
  factor1=rep(1:5, each=200),
  factor2=1:5,
  value=rnorm(30, 100, 10)
)

dt2 <- data.table(df)
setkeyv(dt, c("date", "factor1", "factor2"))

unique_set <-  data.table(unique(dt[, list(original_date=date, factor1, factor2)]))
output2 <- data.table()
for(i in 1:7) {
  output2 <- rbind(output2, unique_set[, date:=original_date-i])
}    

setkeyv(output2, c("date", "factor1", "factor2"))
output2 <- output2[dt]
output2 <- output2[, median(value), by=c("original_date", "factor1", "factor2")]

That works pretty well on this test dataset but on my real one it fails with 8GB of RAM. I'm going to try moving up to one of the High Memory EC2 instance (with 17, 34 or 68GB RAM) to get it working. Any ideas on how to do this in a less memory intensive way would be appreciated

alan
  • 3,717
  • 6
  • 31
  • 48
  • Just on first glance, the `rbind` inside the `for` will be using too much RAM. Must be a more direct way to do that bit. – Matt Dowle Aug 10 '12 at 15:57
  • [This question](http://stackoverflow.com/questions/1309263/rolling-median-algorithm-in-c) mentions binary search, R and C for rolling median. Looks promising for you to investigate further; i.e., think about the _algorithm_. – Matt Dowle Aug 10 '12 at 16:22
0

This solution works but it takes a while.

df <- data.frame(
  id=30000,
  date=rep(seq.Date(from=as.Date("2012-01-01"),to=as.Date("2012-01-30"),by="d"),each=1000),
  factor1=rep(1:5, each=200),
  factor2=1:5,
  value=rnorm(30, 100, 10)
)

myFun <- function(dff,df){
    median(df$value[df$date>as.Date(dff[2])-8 & df$date<as.Date(dff[2])-1 & df$factor1==dff[3] & df$factor2==dff[4]])
}

week_Med <- apply(df,1,myFun,df=df)

week_Med_df <- cbind(df,week_Med)
Paul Rigor
  • 826
  • 12
  • 22
Alan
  • 2,753
  • 1
  • 12
  • 11
  • 1
    Thanks! It seems to take longer than the for loop though. The timings I'm getting from system.time are 973s for your code 6.4s for the loop. I think the difference must be the use of the data.table package – alan Jul 29 '12 at 17:19
0

I address this in a related thread: https://stackoverflow.com/a/62399700/7115566

I suggest looking into the frollapply function. For instance, see below

library(data.table)
set.seed(17)
dt <- data.table(i = 1:100,
             x = sample(1:10, 100, replace = T),
             y = sample(1:10, 100, replace = T))
dt$index <- dt$x == dt$y
dt[,`:=` (MA = frollapply(index,10,mean)), ]
head(dt,12)