Taking subset of data to do calculation in data.table

Question

Let say I have this data.table:

df = data.table(date = c(20180101, 20180102, 20180103, 20180104, 20180105, 20180106, 20180107, 20180108, 20180109, 20180110, 20180111, 20180112, 20180113, 20180114, 20180115, 20180116, 20180117, 20180118), value = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18))

I want to do some calculations (e.g. mean) that is using subset of data. For example: In 20180103, the average will be the sum of (yesterday) 20180102 and (today) 20180103 value ((2+3)/2 = 2.5). This is then rolling until the end of the period.

Result is like this:

    date    mean
20180102     1.5
20180103     2.5
20180104     3.5
20180105     4.5
....

Obviously I can write a for loop, subset the data for each iteration then calculate the mean, store the data and output the result. It is deemed too slow using for loop, and using foreach I don't know how to save the result...

The for loop is like:

datelist = df[, .(date)] 

# initialize the object
data = NA
temp = 0
for (i in 2:nrow(datelist)) {
     today = as.numeric(datelist[i])
     yesterday = as.numeric(datelist[i-1])

     temp = df[date >= yesterday & date <= today]

     temp = temp[, .(mean(value))]

     temp = cbind(datelist[i], mean = temp$V1)


     if (is.na(data)[1]){
         data=temp

         } else {
          data=rbind(data,temp)

         }


}

You can see I first subset the data and call it temp then do the calculation (average, use it to do lm, whatever function then stack it into data object)

This is slow and inefficient as I have millions of data point

Is there anyway I can do this in data.table syntax:

result = df[, { data = .SD[date >= yesterday & date <= today]
                mean = mean(data$value)
                list(mean = mean)}, by=.(date)]

I don't know how to express yesterday and today?? so that yesterday will be, in the for loop case, i-1 and today is i?

What i understand when doing by=.(date) is that data.table will look at each date and calculate whatever function you give in. If I can get the value (i.e. i) of which date the data.table is looking at now, then the value (i-1) will be yesterday...

Thanks

Do you want the current mean at any given time or for any given time the mean of "today" and "yesterday"? — niko, Jan 01 '19 at 16:39
Hi Nate, let me write the for loop out and you will understand what I mean. I don't know how to explain it.. — Gabriel, Jan 01 '19 at 16:40
Is this a rolling mean? If so, see e.g. [Improve rolling mean usage in data.table](https://stackoverflow.com/a/53693454/1851712) or [Adaptive moving average - top performance in R](https://stackoverflow.com/a/21368246/1851712) — Henrik, Jan 01 '19 at 16:44
Actually the focus is not on how to calculating mean but how to subset the data in data.table syntax... — Gabriel, Jan 01 '19 at 16:52
This question is similar to mine. They also use for loop, however it is very slow. https://stackoverflow.com/questions/34244660/rolling-regression-with-data-table-update — Gabriel, Jan 01 '19 at 17:23
I was trying to make my question as abstract as possible, hence introduce the mean function. The actual problem I have is I need to run a rolling regression, using the past 3 years of daily data from the firms. I need to write a for loop to extract the past 3 years of daily data EACH DAY to run the regression for each firm in the dataframe. Running the for loop is very time consuming as I need to loop over each day and each firm (as I have 100000 firms each day and 10 years of data. (That is 3.65e8 of data point). Hence the question is precisely: How do I subset a data without a for loop. — Gabriel, Jan 01 '19 at 18:35
So for example assume we only have one firm. I want to utilize the data.table by=.(date) function. So that after using by=.(date), each day I want to make the lambda function inside ' j ' to subset the past 1095 of data point (3 years) then run the regression to get the output. If it is still not understandable I think it is needed to close the question.... Sorry for not explain clearly and thanks for everyone's input... — Gabriel, Jan 01 '19 at 18:51
"_I need to run a rolling regression_" ~~> google "rolling regression data.table R site:stackoverflow.com". Among the first few hits, an answer by the `data.table` author: [Is there a _fast_ way to run a rolling regression inside data.table?](https://stackoverflow.com/a/12157723/1851712) — Henrik, Jan 01 '19 at 19:19
See also `RcppRoll` as described e.g. here [Rolling regressions in R](https://codereview.stackexchange.com/questions/125509/rolling-regressions-in-r) — Henrik, Jan 01 '19 at 19:27

score 3 · Answer 1 · answered Jan 01 '19 at 16:47

3

You can use the shift operator in the data.table j clause:

df[order(date),
   rollmean := (value + shift(value, n = 1, type = "lag"))/2][]

        date value rollmean
 1: 20180101     1       NA
 2: 20180102     2      1.5
 3: 20180103     3      2.5
 4: 20180104     4      3.5
 5: 20180105     5      4.5
 6: 20180106     6      5.5
 7: 20180107     7      6.5
 8: 20180108     8      7.5
 ...

answered Jan 01 '19 at 16:47

zack

4,570
1
17
25

I have think of this answer but if I need to lag 3 years of data then i need to shift 360*3 times... – Gabriel Jan 01 '19 at 17:03

niko · Answer 2 · 2019-01-01T18:29:20.947

Solution

What about something like this

(df$value[-nrow(df)]+df$value[-1] ) / 2
# yields
# [1]  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5

And here creating the data frame

data.table::data.table(date = .subset2(df,1)[-1], 
                       mean = (df$value[-nrow(df)]+df$value[-1] ) / 2)
#        date mean
# 1  20180102  1.5
# 2  20180103  2.5
# 3  20180104  3.5
# 4  20180105  4.5
# 5  20180106  5.5
# ...

with the data you provided.

Benchmarks

Here are some benchmarking figures:

# create a bigger data frame
dfLarge <- data.table::data.table(
  date  = seq(as.Date('1989-01-01'),as.Date('2019-01-01'),1),
  value = 1:10958
)
microbenchmark::microbenchmark(sol = {
  data.table::data.table(date = .subset2(dfLarge,1)[-1], 
                         mean = (dfLarge$value[-nrow(dfLarge)]+dfLarge$value[-1] ) / 2)
})
# Unit: microseconds
#  expr     min      lq     mean  median      uq      max neval
#   sol 367.955 423.203 921.4908 530.781 788.969 22095.85   100

Addendum

If the main topic here isn't the task per se but subsetting efficiently, then specify that to begin with what your aim exactly is (subsetting itself is rather broad a topic, so add detail about the task(s) that need to be done). That way you are more likely to find what you are seeking and other users do not waste any effort.

That being said, here is a link providing some great information about subsetting in R.

Thanks Nate let me try this out! I think subset2 is the function I am after. — Gabriel, Jan 01 '19 at 18:06

score 0 · Answer 3 · answered Jan 01 '19 at 17:27

0

Staying away from for loops you could use a purrr map function like this:

nvals <- nrow(df) # get the number of rows
vals <- df$value # get the value vector
output <- map(1:nvals, function(x) mean(vals[c(x-1, x)])
output <- unlist(output)
df <- cbind(df, output)

The output vector is:

 1.0  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5

Which is I think what you want.

answered Jan 01 '19 at 17:27

SteveM

1,694
2
10
13

You should probably add the package name (`purrr`?) as `map` is not a base function – niko Jan 01 '19 at 18:08

Taking subset of data to do calculation in data.table

3 Answers3

Solution

Benchmarks

Addendum