16

I am doing an exercise from the programming book *A Book on C*. The exercise suggests that, to find the average of a group of numbers, the algorithm:

avg += (x - avg) / i;

is better than:

sum += x;
avg = sum / i;

`x` is a variable used to store the input numbers. The book also suggests that, besides preventing overflow, the first algorithm has some other benefits over the second one. Can anyone help me? Thanks!
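For context, here is roughly how I am running both versions (a minimal sketch of my test program; the `better_avg` name and the `scanf` loop are my own framing, not from the book):

#include <stdio.h>

int main(void)
{
    double x, sum = 0.0, avg = 0.0, better_avg = 0.0;
    int i = 0;

    while (scanf("%lf", &x) == 1) {
        ++i;
        better_avg += (x - better_avg) / i;  /* first algorithm: running average */
        sum += x;
        avg = sum / i;                       /* second algorithm: sum, then divide */
    }
    printf("better_avg = %f, avg = %f\n", better_avg, avg);
    return 0;
}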

Oliver
  • Heh. I happen to have a first edition copy of that book right here at my desk. What chapter is this in? – T.E.D. Jun 03 '11 at 13:29
  • 4
    counter-argument --- the 2nd method, ignoring overflows, is (probably) [faster (for billions of operations)](http://ideone.com/5kHKK) because only 1 division is performed :) – pmg Jun 03 '11 at 13:37
  • @pmg Just an awesome comment. Other than overflow, why should we pick the first method? It isn't clear from the answers posted here. – Algorithmist Jun 03 '11 at 13:40
  • @pmg: Of course, if we're interested in a running average, then neither is faster. – Oliver Charlesworth Jun 03 '11 at 13:47
  • In what way is the first algorithm supposed to be better? – Tomas Aug 15 '11 at 11:46
  • 1
    @pmg Maybe I'm missing something, but how does the first method perform two divisions? – Michael Mior Oct 22 '12 at 23:56
  • 1
    The first method has a loop wrapped around the `+=` expression; the second also has a loop wrapped around the `+=` expression. Thus in the first, there is a division per iteration, but the second features one division only. The reason for using the first is related to numerical stability and avoiding overflows. See the discussion at [How to efficiently calculate a running standard deviation](http://stackoverflow.com/questions/1174984/), especially the Wikipedia pages (and the related, perhaps more relevant, [Arithmetic Mean](http://en.wikipedia.org/wiki/Arithmetic_mean) page). – Jonathan Leffler Oct 23 '12 at 06:51

7 Answers

9

I'm assuming we're talking about floating-point arithmetic here (otherwise the "better" average will be terrible).

In the second method, the intermediate result (`sum`) will tend to grow without bound, which means you'll eventually lose low-end precision. In the first method, the intermediate result should stay at roughly the same magnitude as your input data (assuming your input doesn't have an enormous dynamic range), which means that it will retain precision better.

However, I can imagine that as `i` gets bigger and bigger, the value of `(x - avg) / i` will get less and less accurate (relatively speaking). So it also has its disadvantages.
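A small illustration of the low-end precision loss, not from the book: averaging twenty million copies of 1.0 in single precision (this assumes the compiler really keeps `sum` as a plain IEEE-754 `float` rather than in a wider register):

#include <stdio.h>

int main(void)
{
    const int n = 20000000;
    float x = 1.0f, sum = 0.0f, run_avg = 0.0f;
    int i;

    for (i = 1; i <= n; ++i) {
        sum += x;                      /* stops growing once sum reaches 2^24 = 16777216 */
        run_avg += (x - run_avg) / i;  /* (x - run_avg) is 0, so this stays exactly 1.0 */
    }
    printf("sum-then-divide: %f\n", sum / n);  /* about 0.84 instead of 1.0 */
    printf("running average: %f\n", run_avg);  /* 1.000000 */
    return 0;
}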

Oliver Charlesworth
  • +1 Could you please explain "lose low-end precision"? And which one should we use? – Algorithmist Jun 03 '11 at 13:43
  • 3
    @Algorithmist: Consider the structure of floating-point representation; a *mantissa* and an *exponent*. The mantissa represents the precision, i.e. the significant digits, and there are a fixed number of them. As your numbers grow larger, the exponent will begin to increase, which means that your significant digits begin to move away from the binary point. – Oliver Charlesworth Jun 03 '11 at 13:45
  • As I see it, the problem with `(x - avg) / i` is only partly caused by `i` getting bigger. The `(x - avg)` part itself is also a problem if many of the numbers are close to the average, because subtracting nearby floating-point numbers loses precision. – j_random_hacker Jun 04 '11 at 06:29
4

It is better in the sense that it computes a running average, i.e. you don't need to have all your numbers in advance. You can calculate that as you go, or as numbers become available.
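A minimal sketch of what "calculating as you go" can look like (the struct and function name are my own, not from the book or this answer):

/* Incremental mean: each new sample updates the average in O(1),
   without keeping the samples or their sum around. */
struct running_mean {
    double avg;
    long count;
};

void running_mean_add(struct running_mean *rm, double x)
{
    rm->count += 1;
    rm->avg += (x - rm->avg) / rm->count;
}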

mbatchkarov
  • 2
    And you'll be able to compute each incremental average in constant time as opposed to O(N) – pepsi Jun 03 '11 at 13:31
  • Thanks! It said "In this exercise you are to continue the work you did in the previous exercise. If you run the better_average program taking the input from a file that contains some ordinary numbers, then the first algorithm and the second algorithm seem to produce the identical answer. Find a situation where this is not the case. That is, demonstrate experimentally that the better average really is better, even when sum does not overflow." Could you tell me which situation would make that happen? – Oliver Jun 03 '11 at 13:33
  • 6
    -1: Both of them are capable of calculating a running average. – Oliver Charlesworth Jun 03 '11 at 13:41
1

The second algorithm is faster than the first: it performs roughly one addition per value plus a single division at the end, whereas the first performs a subtraction, a division and an addition on every iteration. But it is true that the first prevents overflow. For example, take this set of 1,250 numbers: 250 values of 4,000,000, 500 values of 1,500,000 and 500 values of 2,000,000. The total sum of all the integers is 2,750,000,000, but the upper bound of a typical 32-bit `int` is 2,147,483,647, so in this case we are dealing with an overflow problem. If you use the first algorithm, you can avoid it.

So I recommend the first algorithm if overflow is likely to occur; otherwise it only adds extra operations. If you decide to use the second anyway, then I recommend using a type with a larger range.
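A rough sketch of that scenario (note that signed 32-bit overflow is undefined behaviour, so the first printed figure is simply whatever garbage the compiler produces):

#include <stdio.h>

int main(void)
{
    int sum = 0, n = 0, i;
    double run_avg = 0.0;

    for (i = 0; i < 1250; ++i) {
        /* 250 values of 4,000,000, then 500 of 1,500,000, then 500 of 2,000,000 */
        int x = (i < 250) ? 4000000 : (i < 750) ? 1500000 : 2000000;
        ++n;
        sum += x;                     /* total is 2,750,000,000: overflows a 32-bit int */
        run_avg += (x - run_avg) / n; /* stays close to the true average, 2,200,000 */
    }
    printf("sum / n         = %d\n", sum / n);
    printf("running average = %f\n", run_avg);
    return 0;
}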

Jesufer Vn
1

I like the second method (summing in a loop and dividing at the end) better, and I can recognise what it is doing much faster than with the first.

The performance differences, if any, are irrelevant.

And, if a sum of values overflows a big enough data type, you'll probably have more problems than calculating an average.

the Tin Man
pmg
  • 2
    The numerical differences may well be relevant. It's exactly this sort of consideration that leads to the Kahan summation algorithm and similar. – Oliver Charlesworth Jun 03 '11 at 13:51
  • +1 Oli: just making sure it's explicitly stated -- the divide after sum method is more reliable (barring overflow) – pmg Jun 03 '11 at 14:00
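For reference, a sketch of the Kahan (compensated) summation mentioned in the comment above; this is not from either answer, just the standard technique with a function name of my own (it also relies on the compiler not reassociating the floating-point operations):

/* Compensated summation: `c` carries the low-order bits that would
   otherwise be lost when a small value is added to a large running sum. */
double kahan_mean(const double *x, int n)
{
    double sum = 0.0, c = 0.0;
    int i;

    for (i = 0; i < n; ++i) {
        double y = x[i] - c;   /* apply the correction from the previous step */
        double t = sum + y;    /* big + small: low-order bits of y may be dropped */
        c = (t - sum) - y;     /* recover what was dropped (algebraically zero) */
        sum = t;
    }
    return n > 0 ? sum / n : 0.0;
}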
1

OK, the answer lies not in overflowing the sum (since that is ruled out here), but, as Oli said, in "losing the low-end precision". If the average of the numbers you are summing is much larger than the distance of each number from the average, the second approach will lose mantissa bits. Since the first approach only looks at the relative values, it doesn't suffer from that problem.

So any list of numbers that are greater than, say, 60 million (for single-precision floating point) but whose values don't vary by more than 10 or so should show you the behavior.

If you are using double-precision floats, the average value should be much higher, or the deltas much lower.
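A sketch of that situation (the values are kept as exact multiples of 4 so that every one is representable as a `float`, per the caution in the comment below; the result assumes the sum really is held in single precision):

#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    float sum = 0.0f, run_avg = 0.0f;
    int i;

    for (i = 1; i <= n; ++i) {
        /* values between 60,000,000 and 60,000,036; the true mean is 60,000,018 */
        float x = 60000000.0f + 4.0f * (float)(i % 10);
        sum += x;                      /* sum grows to ~6e13, far beyond float's 24-bit precision */
        run_avg += (x - run_avg) / i;
    }
    printf("sum-then-divide: %f\n", sum / n);  /* visibly off from 60,000,018 */
    printf("running average: %f\n", run_avg);  /* close to 60,000,018 */
    return 0;
}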

David Winant
  • A note of caution with this: be careful that your values are representable in the precision you pick. Then include enough values in the list. – David Winant Jun 03 '11 at 15:41
0
sum += x;
avg = sum / i;

In the above code, suppose the input contains numbers such as 10000, 20000, ... i.e. values with many digits. Then the value in `sum` may exceed the maximum value of its type. That is not the case in the first algorithm, because it never stores the full sum: each value's contribution is divided by the number of elements before it is accumulated.

Although, given the large data types available in programming languages, this may not be a problem in practice. Hence what the experts say: use a data type appropriate to your application and requirements.

Algorithmist
-3

How about calculating it like this, assuming the ints are in an array?

sum += x[i] / N;
rem += x[i] % N;
avg = sum + rem / N;

If N is large (say 0xFFFFF) and the x[i] are all small, then x[i] / N contributes nothing and everything accumulates in rem, so rem can still climb towards the largest int and an overflow might happen there.

the Tin Man
eye