R: Rolling window function with adjustable window and step-size for irregularly spaced observations

Question

Say there is a 2-column data frame with a time or distance column which sequentially increases and an observation column which may have NAs here and there. How can I efficiently use a sliding window function to get some statistic, say a mean, for the observations in a window of duration X (e.g. 5 seconds), slide the window over Y seconds (e.g. 2.5 seconds), repeat... The number of observations in the window is based on the time column, thus both the number of observations per window and the number of observations to slide the window may vary The function should accept any window size up to the number of observations and a step size.

Here is sample data (see "Edit:" for a larger sample set)

set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_
head(dat)
      time   measure
1 1.914806 1.0222694
2 2.937075 0.3490641
3 3.286140        NA
4 4.830448 0.8112979
5 5.641746 0.8773504
6 6.519096 1.2174924

Desired Output for the specific case of a 5 second window, 2.5 second step, first window from -2.5 to 2.5, na.rm=FALSE:

 [1] 1.0222694
 [2]        NA
 [3]        NA
 [4] 1.0126639
 [5] 0.9965048
 [6] 0.9514456
 [7] 1.0518228
 [8]        NA
 [9]        NA
[10]        NA

Explanation: In the desired output the very first window looks for times between -2.5 and 2.5. One observation of measure is in this window, and it is not an NA, thus we get that observation: 1.0222694. The next window is from 0 to 5, and there is an NA in the window, so we get NA. Same for the window from 2.5 to 7.5. The next window is from 5 to 10. There are 5 observations in the window, none are NA. So, we get the average of those 5 observations (i.e. mean(dat[dat$time >5 & dat$time <10,'measure']) )

What I tried: Here is what I tried for the specific case of a window where the step size is 1/2 the window duration:

windo <- 5  # duration in seconds of window

# partition into groups depending on which window(s) an observation falls in
# When step size >= window/2 and < window, need two grouping vectors
leaf1 <- round(ceiling(dat$time/(windo/2))+0.5)
leaf2 <- round(ceiling(dat$time/(windo/2))-0.5) 

l1 <- tapply(dat$measure, leaf1, mean)
l2 <- tapply(dat$measure, leaf2, mean)

as.vector(rbind(l2,l1))

Not flexible, not elegant, not efficient. If step size isn't 1/2 window size, the approach will not work, as is.

Any thoughts on a general solution to this kind of problem? Any solution is acceptable. The faster the better, though I prefer solutions using base R, data.table, Rcpp, and/or parallel computation. In my real data set, there are several millions of observations contained in a list of data frames (max data frame is ~400,000 observations).

Below is a extra info: A larger sample set

Edit: As per request, here is a larger, more realistic example dataset with many more NAs and the minimum time span (~0.03). To be clear, though, the list of data frames contains small ones like the one above, as well as ones like the following and larger:

set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),] 

# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)

Have you seen RcppRoll and its friends? I made a cool function for windowed averaging in [this question](http://stackoverflow.com/questions/18448082/r-data-table-with-rollapply/23533485#23533485); is that similar to what you're after? — bright-star, Jul 18 '14 at 22:33
@TrevorAlexander Thanks for pointing me to `RcppRoll`; I will have a look. As for the function you wrote, the window is based on **number of observations** not **temporal duration**, as far as I can tell, which is not what I'm after. — Jota, Jul 19 '14 at 00:20
Yeah, I think you'll need code like what you have in your question to bin temporal duration into discrete indices. — bright-star, Jul 19 '14 at 00:23
We need a larger realistic sample set: One that has a realistic number of NA's and where the minimum spacing along the time dimension is represented. — IRTFM, Jul 19 '14 at 00:37

Roland · Accepted Answer · 2014-07-24T15:31:45.657

Here is an attempt with Rcpp. The function assumes that data is sorted according to time. More testing would be advisable and adjustments could be made.

#include <Rcpp.h>
using namespace Rcpp;


// [[Rcpp::export]]
NumericVector rollAverage(const NumericVector & times, 
                          NumericVector & vals, 
                          double start,
                          const double winlen, 
                          const double winshift) {
  int n = ceil((max(times) - start) / winshift);
  NumericVector winvals;
  NumericVector means(n);
  int ind1(0), ind2(0);
  for(int i=0; i < n; i++) {
    if (times[0] < (start+winlen)) {
      while((times[ind1] <= start) & 
                (times[ind1+1] <= (start+winlen)) & 
                (ind1 < (times.size() - 1))) {
        ind1++;
      }    

      while((times[ind2+1] <= (start+winlen)) & (ind2 < (times.size() - 1))) {
        ind2++;
      }  

      if (times[ind1] >= start) {
        winvals = vals[seq(ind1, ind2)];
        means[i] = mean(winvals);
      } else {
        means[i] = NA_REAL;
      }
      } else {
        means[i] = NA_REAL;
    }

    start += winshift;    
  }

   return means;
}

Testing it:

set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_

rollAverage(dat$time, dat$measure, -2.5, 5.0, 2.5)
#[1] 1.0222694        NA        NA 1.0126639 0.9965048 0.9514456 1.0518228        NA        NA        NA

With your list of data.frames (using data.table):

set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),] 

# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)

library(data.table)
dat <- lapply(dat, setDT)
for (ind in seq_along(dat)) dat[[ind]][, i := ind]
#possibly there is a way to avoid these copies?

dat <- rbindlist(dat)

system.time(res <- dat[, rollAverage(time, measure, -2.5, 5.0, 2.5), by=i])
#user  system elapsed 
#1.51    0.02    1.54 
print(res)
#           i        V1
#      1:   1 1.0217126
#      2:   1 0.9334415
#      3:   1 0.9609050
#      4:   1 1.0123473
#      5:   1 0.9965922
#     ---              
#6000596: 300 1.1121296
#6000597: 300 0.9984581
#6000598: 300 1.0093060
#6000599: 300        NA
#6000600: 300        NA

Yes, sorry. I forgot to remove this line. I've edited, but can't test right now (will try later today). Hope it still works. — Roland, Jul 24 '14 at 07:21
I've run it on a win machine now, where the compiler complained about `vals` being a constant. So, I changed that, too. Timings are different due to changes in the function and a different CPU speed. — Roland, Jul 24 '14 at 15:00
It works well! It is fast and east-to-use. The downside is that you need to hard code the function you want to use (e.g. `mean` in this case.), afaik. There is an issue when the window occurs entirely before the first time (i.e. see `testdf — Jota, Jul 24 '14 at 15:02
There is probably a way to pass an R function to it and of course it needs some input checking and as you noticed some edge cases need to be fixed (I`ve fixed the one you found). I'll leave the rest to you. — Roland, Jul 24 '14 at 15:32
In general, the more specialized a function the more efficient it can be. If you pass an R function to this you'll pay for that with performance loss. — Roland, Jul 26 '14 at 13:01

James Trimble · Answer 2 · 2014-07-20T13:30:39.317

Here is a function that gives the same result for your small data frame. It's not particularly quick: it takes several seconds to run on one of the larger datasets in your second dat example.

rolling_summary <- function(DF, time_col, fun, window_size, step_size, min_window=min(DF[, time_col])) {
    # time_col is name of time column
    # fun is function to apply to the subsetted data frames
    # min_window is the start time of the earliest window

    times <- DF[, time_col]

    # window_starts is a vector of the windows' minimum times
    window_starts <- seq(from=min_window, to=max(times), by=step_size)

    # The i-th element of window_rows is a vector that tells us the row numbers of
    # the data-frame rows that are present in window i 
    window_rows <- lapply(window_starts, function(x) { which(times>=x & times<x+window_size) })

    window_summaries <- sapply(window_rows, function(w_r) fun(DF[w_r, ]))
    data.frame(start_time=window_starts, end_time=window_starts+window_size, summary=window_summaries)
}

rolling_summary(DF=dat,
                time_col="time",
                fun=function(DF) mean(DF$measure),
                window_size=5,
                step_size=2.5,
                min_window=-2.5)

+1 Very nice. Seems to me (from my interpretation of `Rprof` output) that `lapply(window_starts, function(x) which(times>=x & times — Jota, Jul 22 '14 at 14:40

score 2 · Answer 3 · answered Jul 22 '14 at 16:18

Here are some functions that will give the same output on your first example:

partition <- function(x, window, step = 0){
    a = x[x < step]    
    b = x[x >= step]
    ia = rep(0, length(a))
    ib = cut(b, seq(step, max(b) + window, by = window))    
    c(ia, ib)
}

roll <- function(df, window, step = 0, fun, ...){
    tapply(df$measure, partition(df$time, window, step), fun, ...)
}

roll_steps <- function(df, window, steps, fun, ...){
    X = lapply(steps, roll, df = df, window = window, fun = fun, ...)
    names(X) = steps
    X
}

Output for your first example:

> roll_steps(dat, 5, c(0, 2.5), mean)
$`0`
        1         2         3         4         5 
       NA 1.0126639 0.9514456        NA        NA 

$`2.5`
        0         1         2         3         4 
1.0222694        NA 0.9965048 1.0518228        NA

You can also ignore missing values this way easily:

> roll_steps(dat, 5, c(0, 2.5), mean, na.rm = TRUE)
$`0`
        1         2         3         4         5 
0.7275438 1.0126639 0.9514456 0.9351326       NaN 

$`2.5`
        0         1         2         3         4 
1.0222694 0.8138012 0.9965048 1.0518228 0.6122983

This can also be used for a list of data.frames:

> x = lapply(dat2, roll_steps, 5, c(0, 2.5), mean)

arvi1000 · Answer 4 · 2014-07-22T22:22:03.340

Ok, how about this.

library(data.table)
dat <- data.table(dat)
setkey(dat, time)

# function to compute a given stat over a time window on a given data.table
window_summary <- function(start_tm, window_len, stat_fn, my_dt) {
  pos_vec <- my_dt[, which(time>=start_tm & time<=start_tm+window_len)]
  return(stat_fn(my_dt$measure[pos_vec]))
}

# a vector of window start times
start_vec <- seq(from=-2.5, to=dat$time[nrow(dat)], by=2.5)

# sapply'ing the function above over vector of start times 
# (in this case, getting mean over 5 second windows)
result <- sapply(start_vec, window_summary, 
                 window_len=5, stat_fn=mean, my_dt=dat)

On my machine, it processes the first 20,000 rows of your large dataset in 13.06781 secs; all rows in 51.58614 secs

(I guess that's slower than James' solution, but maybe helpful anyway to see another approach) — arvi1000, Jul 22 '14 at 22:09

score 2 · Answer 5 · edited Aug 04 '14 at 21:47

2

Here's another attempt to use pure data.table approach and its between function.

Have compared Rprof against the above answers (except @Rolands answer) and it seems the most optimized one. Haven't tested for bugs though, but if you"ll like it, I'll expand the answer.

Using your dat from above

library(data.table)
Rollfunc <- function(dat, time, measure, wind = 5, slide = 2.5, FUN = mean, ...){
  temp <- seq.int(-slide, max(dat$time), by = slide)
  temp <- cbind(temp, temp + wind)
  setDT(dat)[, apply(temp, 1, function(x) FUN(measure[between(time, x[1], x[2])], ...))]
}

Rollfunc(dat, time, measure, 5, 2.5)

## [1] 1.0222694        NA        NA 1.0126639 0.9965048 0.9514456 1.0518228        NA        NA
## [10]        NA

You can also specify the functions and its arguments, i.e., for example:

Rollfunc(dat, time, measure, 5, 2.5, max, na.rm = TRUE)

will also work

Edit: I did some benchnarks against @Roland and his method clearly wins (by far), so I would go with the Rcpp aproach

edited Aug 04 '14 at 21:47

Jota

16,103
7
54
89

answered Jul 24 '14 at 21:39

David Arenburg

87,271
15
123
181

How much did it win by? I'm curious because the data.table tends to have some very strong performance. If there was a decent leap in performance to be had outside of "making it c" then I think Hadley Wickam (and folks) would very much be interested in generalizing it and making R win there too. – EngrStudent Jul 19 '17 at 14:55
1

@EngrStudent Please ignore this answer as this is a very old one when I didn't know data.table so well. If you ever see `apply(..., 1, ...)` anywhere near `data.table`- you have my permission to downvote. I think today I would approach this problem by doing something like [this](https://stackoverflow.com/questions/44406040/sum-over-past-window-size-dates-per-group/44407291#44407291) but I'm to lazy to modify this answer afte 3 years passed. – David Arenburg Jul 19 '17 at 14:58

R: Rolling window function with adjustable window and step-size for irregularly spaced observations

5 Answers5

Linked