Fast integer-only standard deviation?

Question

Ages ago I had an extremely fast integer-only standard deviation function (in C) that would return values that were "reasonably" accurate using no division or multiplication, just shifts and adds. I have since lost that code, and Google has been unable to help me find anything like it, and my discrete math skills are a bit too rusty to re-derive it.

In my specific case, I have a list of 14-bit ADC values for which I want to very quickly compute a "close enough" standard deviation on an 8-bit processor that lacks floating point hardware.

Does this ring a bell with anyone?

[This](http://www.strchr.com/standard_deviation_in_one_pass) looks interesting -- single-pass, and only a few multiplications and divisions. — Kerrek SB, Oct 14 '13 at 22:01
@KerrekSB: that satisfies the single pass requirement, but not the integer-only requirement. — Paul R, Oct 14 '13 at 22:04
You can always leave the division-by-N till the end; everything else works integrally. — Kerrek SB, Oct 14 '13 at 22:04
But then sum(x^2) can get very large - each term is 28 bits so you quickly exceed 32 bits, which might get expensive on an 8 bit CPU. OTOH that might still be acceptable, so long as you have integer library support for 64 bit ints. — Paul R, Oct 14 '13 at 22:14
Actually, one-pass isn't a hard requirement if multiple passes are faster overall: The values are in RAM and aren't going anywhere. If it matters, this is on an AVR processor (ATmega328P) with its clock maxed out at 20 MHz. But I will be calculating it every pass, so being able to quickly drop the oldest value and include the newest would be a super-bonus! — BobC, Oct 14 '13 at 23:54
It will vary from 4 to 128, depending on the system dynamics. — BobC, Oct 17 '13 at 19:07

score 7 · Accepted Answer · edited May 23 '17 at 11:45

7

Based on the link that @KerrekSB provided, we can rework the algorithm to use only integer arithmetic.

uint32_t std_var (uint16_t a[], uint16_t n) {
    if (n == 0) return 0;
    uint64_t sum = 0;
    uint64_t sq_sum = 0;
    for(unsigned i = 0; i < n; ++i) {
       uint32_t ai = a[i];
       sum += ai;
       sq_sum += ai * ai;
    }
    uint64_t N = n;
    return (N * sq_sum - sum * sum) / (N * N);
}

To get the standard deviation, take the square root of the result. To implement the integer square root, you can choose one of the multitude of answers provided by:

Looking for an efficient integer square root algorithm for ARM Thumb2

The algorithm does not try to account for any rounding errors during the intermediate calculations, it simply assumes all the values will fit, so no rounding is needed. This is what allows for the formula to be presented in a straightforward manner. Optimized algorithms for single pass variance usually perform intermediate divisions in an attempt to compensate for errors due to cancellation. For example (this is from Wikipedia):

double std_var_stable (uint16_t a[], uint16_t n) {
    if (n == 0) return 0;
    unsigned i;
    double mean = 0;
    double M2 = 0;
    for(i = 0; i < n; ++i) {
       double delta = a[i] - mean;
       mean += delta / (i + 1);
       M2 += delta * (a[i] - mean);
    }
    return M2/n;
}

However, intermediate divisions is not a suitable nor desirable for an integer math only algorithm.

edited May 23 '17 at 11:45

Community

1
1

answered Oct 14 '13 at 23:01

jxh

64,506
7
96
165

My initial intuition was that the integer code would not be able to follow so closely to the traditional floating point version and still be fast enough. The above code is simply beautiful. And the variance is actually better, since I can square my comparison values as constants the compiler will pre-compute. – BobC Oct 14 '13 at 23:59
If I make the internal variables static, I should be able to make this an incremental algorithm by subtracting off the last value and its square from the associated accumulators before including the newest value. – BobC Oct 15 '13 at 00:01
If I keep my list to under 16 16-bit values, I should be good using values over the full input range. And longer lists should work if my values have fewer bits (the ATmega ADC is 10 bits, but I'm summing 16 values read in rapid succession to get 4 more bits per "sample"). – BobC Oct 15 '13 at 00:10
+1 for nice use of `uint32_t` and `n==0`. Good chance OP can use `var*var` and negate the need for the final `sqrt()`. – chux - Reinstate Monica Oct 15 '13 at 00:25
And if the list length is a power of 2, the final division can be replaced by shifts, especially if you first factor out a factor of N to conserve bits. – BobC Oct 15 '13 at 00:26
I marked this as the answer since it will form the basis of my own implementation, despite the fact that it needs further optimizations (e.g., use lists with length 2^n to permit div and mul to be removed, add bit-width parameter, add range overflow check, add continuous/windowed capability, add dynamic list length capability, etc.).. – BobC Oct 16 '13 at 20:27

Fast integer-only standard deviation?

1 Answers1