
I have data where people vote on items, and I would like to keep, for each item, the average of all votes cast on it so far. You can think of the votes as a stream of constantly incoming numbers. I can compute the average exactly, but to do so I have to store two numbers per item: the current average (or the running total) and the count of votes seen so far. With those I can use

AVG[n+1] = (AVG[n]*count + item)/(count+1)

but this is a pain, since it forces me to store two pieces of data for each item that can be voted on. I know of another approach, called the moving average or iterator average, which works on streaming data but only gives an approximate average:

AVG[n+1] = alpha*(item - AVG[n]) + AVG[n]

where alpha is some small fixed learning rate. Each update moves the average toward the new item by an amount proportional to the difference between that item and the current estimate. This way I only have to store one number (the current average) and can still update it when a new item comes in, at the cost of the result being only an approximation.
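To make the trade-off concrete, here is a minimal Python sketch of both updates. The function names, the sample `votes` list, the choice of `alpha = 0.1`, and seeding the approximate estimate with the first vote are all assumptions made for illustration. The exact update is written in the incremental form `avg + (item - avg) / (count + 1)`, which is algebraically the same as `(avg*count + item)/(count+1)`.

```python
def exact_update(avg, count, item):
    """Exact running mean: needs both the current average and the count."""
    new_count = count + 1
    # Same value as (avg * count + item) / (count + 1), without the product.
    return avg + (item - avg) / new_count, new_count


def approx_update(avg, item, alpha):
    """Approximate update from the question: only the average is stored."""
    return avg + alpha * (item - avg)


votes = [4, 5, 3, 5, 1, 4, 4, 2]  # hypothetical votes for one item

exact, count = 0.0, 0
approx = float(votes[0])  # assumption: seed the estimate with the first vote
for i, vote in enumerate(votes):
    exact, count = exact_update(exact, count, vote)
    if i > 0:
        approx = approx_update(approx, vote, alpha=0.1)

print("exact:", exact, "approximate:", approx)
```

The two results generally differ, and the gap depends on both `alpha` and the order in which the votes arrive, since the approximate update weights recent votes more heavily.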

I would like to know whether there are any known bounds on the error this method introduces. Is there a formula to estimate how far the estimate is from the true average, and how should I choose a good alpha? More info on this in this post and this question.

hackartist
  • There should be `AVG[n]` added to the product in the second equation. – Dialecticus Aug 02 '12 at 12:11
  • Is storing `count` along with `average` so hard?! You just need `average = (average * count + item) / (count + 1); ++count;`. Although you should be careful with the first method, because it can overflow! – Shahbaz Aug 02 '12 at 12:36
  • Also, with the second method, after `n` more numbers, the first number has a coefficient of `(1 - alpha)^n`. (To get an idea, imagine `alpha = 1/2` and `n = 10`: the first number is being divided by 1024.) Therefore, depending on `alpha`, you are really looking at a fixed number of past elements rather than the whole history. Is a windowed average good for you? – Shahbaz Aug 02 '12 at 12:40
  • There is no error bound. The "moving average" effectively ignores items before the `O(1/alpha)` most recent, so with enough samples you can have any error you want. – comingstorm Aug 02 '12 at 18:33
  • Also, I will second @Shahbaz -- your declared motivation for the question is very strange. I'd really like to know: how is keeping a vote count "a pain"? – comingstorm Aug 02 '12 at 18:42
  • All I meant by "a pain" is that it doubles the amount of data that needs to be stored, and I have a lot of items to be voted on. As I am using MySQL to store this data, it also means I need another column, which makes each row wider and thus slightly slower for some operations. – hackartist Aug 02 '12 at 20:02
  • Thanks! I don't know the details of your setup -- but the reason I asked is because it sounded like premature optimization. Best to make it work right, then figure out what is actually costing the most time and space once it's running. Most likely, there will be places you can save a few bytes or cycles that will be less disruptive to your application... – comingstorm Aug 02 '12 at 21:58
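
The weighting that the comments describe can be made explicit by unrolling the approximate update from the question. The following is a sketch of that expansion, not something taken from the thread: `x_k` denotes the k-th incoming item and `AVG[0]` is whatever initial value the estimate starts from (both symbols are notation introduced here).

```latex
\mathrm{AVG}[n] = (1-\alpha)^{n}\,\mathrm{AVG}[0]
  + \alpha \sum_{k=1}^{n} (1-\alpha)^{\,n-k}\, x_k ,
\qquad
(1-\alpha)^{n} + \alpha \sum_{k=1}^{n} (1-\alpha)^{\,n-k} = 1 .
```

For `0 < alpha < 1` the weights are non-negative and sum to 1, so the estimate is a weighted average that always stays within the range of the initial value and the votes. An item that is `j` steps old carries weight `alpha * (1 - alpha)^j`, which decays geometrically; that is why items older than roughly `1/alpha` steps contribute very little, and why the gap to the true average depends on how the older votes differ from the recent ones.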

0 Answers