
I am trying to implement something along the lines of a Moving Average.

In this system, there is no guarantee about how many integers arrive per time period, but I do need to calculate the average for each period. Therefore, I cannot simply slide over the list of integers by count, since that would not be relative to time.

I can keep a record of each value with its associated time. We will have a ton of data running through the system so it is important to 'garbage collect' the old data.

It may also be important to note that I need to save the average to disk at the end of each period. However, there may be some overlap between saving the data to disk and data from a new period arriving.

What are some efficient data structures I can use to store, slide, and garbage collect this type of data?

Kurtis
  • I provided an answer which is really just a guess about your real requirements. If I got it wrong, let me know and I'll delete it. – rici Oct 15 '13 at 17:00
  • Reminds me of [this question](http://stackoverflow.com/questions/18396452/design-a-datastructure-to-return-the-number-of-connections-to-a-web-server-in-la/18396955#18396955) (it should be fairly trivial to apply the answer to this problem). – Bernhard Barker Oct 15 '13 at 17:14
  • @rici - Actually, you nailed it. Thanks for 'reading between the lines'! – Kurtis Oct 15 '13 at 17:23

1 Answer

The description of the problem and the question conflict: what is described is not a moving average, since the average for each time period is distinct. ("I need to compute the average for each period.") So that admits a truly trivial solution:

For each period, maintain a count and a sum of observations.

At the end of the period, compute the average.
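As a rough illustration of that trivial case, here is a minimal Python sketch. The class and method names (`PeriodAverager`, `add`, `close_period`) are made up for the example, and it assumes the caller decides when a period ends.

```python
class PeriodAverager:
    """Keeps only a count and a sum for the current period (hypothetical helper)."""

    def __init__(self):
        self.count = 0
        self.total = 0

    def add(self, value):
        # Record one observation in the current period.
        self.count += 1
        self.total += value

    def close_period(self):
        # Return the period's average (None if the period was empty), then reset.
        average = self.total / self.count if self.count else None
        self.count = 0
        self.total = 0
        return average
```

The value returned by `close_period()` is what would be written to disk. Observations that arrive during the write simply land in the freshly reset counters, so nothing old needs to be retained.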

I suspect that what is actually wanted is something like: Every second (computation period), I want to know the average observation over the past minute (aggregation period).

This can be solved simply with a circular buffer of buckets, each of which represents the value for one computation period. There will be (aggregation period / computation period) such buckets, which is 60 in the example above. Again, each bucket contains a count and a sum. In addition, a current sum/count and a cumulative sum/count are maintained. Each observation is added to the current sum/count.

At the end of each computation period:

  • subtract the sum/count for the (circularly) first period from the cumulative sum/count
  • add the current sum/count to the cumulative sum/count
  • report the average based on the cumulative sum/count
  • replace the values of the first period with the current sum/count
  • clear the current sum/count
  • advance the origin of the circular buffer.
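
The steps above could be sketched in Python roughly as follows. This is only one possible reading of the algorithm: the names (`BucketedMovingAverage`, `end_period`, `num_buckets`) are invented for the example, and it assumes something external, such as a timer, calls `end_period()` once per computation period.

```python
class BucketedMovingAverage:
    """Circular buffer of [sum, count] buckets, one per computation period."""

    def __init__(self, num_buckets):
        # num_buckets = aggregation period / computation period (assumed integral).
        self.buckets = [[0, 0] for _ in range(num_buckets)]
        self.origin = 0      # index of the (circularly) first, i.e. oldest, bucket
        self.cur_sum = 0     # sum/count for the period in progress
        self.cur_count = 0
        self.cum_sum = 0     # sum/count over all buckets in the window
        self.cum_count = 0

    def add(self, value):
        # Each observation is added to the current sum/count.
        self.cur_sum += value
        self.cur_count += 1

    def end_period(self):
        oldest = self.buckets[self.origin]
        # Subtract the oldest period from the cumulative sum/count.
        self.cum_sum -= oldest[0]
        self.cum_count -= oldest[1]
        # Add the current period to the cumulative sum/count.
        self.cum_sum += self.cur_sum
        self.cum_count += self.cur_count
        # Report the average (None if the whole window is empty).
        average = self.cum_sum / self.cum_count if self.cum_count else None
        # Overwrite the oldest bucket with the current period, then clear it.
        oldest[0], oldest[1] = self.cur_sum, self.cur_count
        self.cur_sum = self.cur_count = 0
        # Advance the origin of the circular buffer.
        self.origin = (self.origin + 1) % len(self.buckets)
        return average
```

For a one-minute window reported every second, this would be constructed with `num_buckets=60`. Memory use stays fixed no matter how many observations arrive, because only per-period sums and counts are kept.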

If you really need to be able to compute, at any time at all, the average of the previous observations over some given period, you'd need a more complicated data structure, basically an expandable circular buffer. However, such precise computations are rarely actually necessary, and a bucketed approximation, as per the above algorithm, is usually adequate. It is also much more sustainable over the long term for memory management, since its memory requirements are fixed from the start.
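
For comparison, here is a hedged sketch of the exact variant. It uses Python's `collections.deque` as a stand-in for the expandable circular buffer, and the names (`ExactWindowAverage`, `window`, `average`) are hypothetical. It assumes observations arrive with non-decreasing timestamps and that values are integers, so the running total stays exact.

```python
from collections import deque


class ExactWindowAverage:
    """Keeps every (timestamp, value) pair that is still inside the window."""

    def __init__(self, window):
        self.window = window   # window length, in the same units as the timestamps
        self.items = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0         # running sum of the values currently stored

    def add(self, timestamp, value):
        # Assumes timestamps are non-decreasing across calls.
        self.items.append((timestamp, value))
        self.total += value

    def average(self, now):
        # "Garbage collect" observations that have fallen out of the window.
        while self.items and self.items[0][0] <= now - self.window:
            _, old_value = self.items.popleft()
            self.total -= old_value
        return self.total / len(self.items) if self.items else None
```

Unlike the bucketed version, memory here grows with the number of observations inside the window, which is the trade-off described above.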

rici
  • In terms of actual data structures, a linked list would be easy enough to implement, as you're just constantly adding your new period to the end of it. When you need to "garbage collect" old data, you can simply delete the elements of the linked list up to the desired period. I don't think the list necessarily needs to be circular. – AndyG Oct 15 '13 at 17:03
  • @andyG: you could use a linked list, but the number of periods in the list is constant so there's no need for memory management at all. A circular buffer is a very easy data structure (just use i%n as an index). If you wanted to keep all of the observations, a linked list would be simpler, but the overhead is quite large since the payload size of each node is comparable to a pointer, and you end up with a cache-unfriendly collection of allocated nodes. You could think of the circular buffer as a way to optimize the memory management. – rici Oct 15 '13 at 17:16
  • Thank you for clarifying that this isn't a Moving Average. I think I need to brush up on my terminology a bit. If you have any suggestions for the title so I don't confuse future visitors, please let me know! – Kurtis Oct 15 '13 at 17:17
  • @rici: I was under the impression that OP wanted to hold onto the averages for all periods instead of just the current one. A circular buffer makes sense now. – AndyG Oct 15 '13 at 17:52