Say there is a 2-column data frame with a time or distance column which sequentially increases and an observation column which may have NAs here and there. How can I efficiently use a sliding window function to get some statistic, say a mean, for the observations in a window of duration X (e.g. 5 seconds), slide the window over Y seconds (e.g. 2.5 seconds), repeat... The number of observations in the window is based on the time column, thus both the number of observations per window and the number of observations to slide the window may vary The function should accept any window size up to the number of observations and a step size.
Here is sample data (see "Edit:" for a larger sample set)
set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_
head(dat)
time measure
1 1.914806 1.0222694
2 2.937075 0.3490641
3 3.286140 NA
4 4.830448 0.8112979
5 5.641746 0.8773504
6 6.519096 1.2174924
Desired Output for the specific case of a 5 second window, 2.5 second step, first window from -2.5 to 2.5, na.rm=FALSE:
[1] 1.0222694
[2] NA
[3] NA
[4] 1.0126639
[5] 0.9965048
[6] 0.9514456
[7] 1.0518228
[8] NA
[9] NA
[10] NA
Explanation: In the desired output the very first window looks for times between -2.5 and 2.5. One observation of measure is in this window, and it is not an NA, thus we get that observation: 1.0222694. The next window is from 0 to 5, and there is an NA in the window, so we get NA. Same for the window from 2.5 to 7.5. The next window is from 5 to 10. There are 5 observations in the window, none are NA. So, we get the average of those 5 observations (i.e. mean(dat[dat$time >5 & dat$time <10,'measure']) )
What I tried: Here is what I tried for the specific case of a window where the step size is 1/2 the window duration:
windo <- 5 # duration in seconds of window
# partition into groups depending on which window(s) an observation falls in
# When step size >= window/2 and < window, need two grouping vectors
leaf1 <- round(ceiling(dat$time/(windo/2))+0.5)
leaf2 <- round(ceiling(dat$time/(windo/2))-0.5)
l1 <- tapply(dat$measure, leaf1, mean)
l2 <- tapply(dat$measure, leaf2, mean)
as.vector(rbind(l2,l1))
Not flexible, not elegant, not efficient. If step size isn't 1/2 window size, the approach will not work, as is.
Any thoughts on a general solution to this kind of problem? Any solution is acceptable. The faster the better, though I prefer solutions using base R, data.table, Rcpp, and/or parallel computation. In my real data set, there are several millions of observations contained in a list of data frames (max data frame is ~400,000 observations).
Below is a extra info: A larger sample set
Edit: As per request, here is a larger, more realistic example dataset with many more NAs and the minimum time span (~0.03). To be clear, though, the list of data frames contains small ones like the one above, as well as ones like the following and larger:
set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),]
# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)