
I have an array of lists of numbers, e.g.:

[0] (0.01, 0.01, 0.02, 0.04, 0.03)
[1] (0.00, 0.02, 0.02, 0.03, 0.02)
[2] (0.01, 0.02, 0.02, 0.03, 0.02)
     ...
[n] (0.01, 0.00, 0.01, 0.05, 0.03)

What I would like to do is efficiently calculate the mean and standard deviation at each index of a list, across all array elements.

To do the mean, I have been looping through the array and summing the value at a given index of a list. At the end, I divide each value in my "averages list" by n (I am working with a population, not a sample from the population).

To do the standard deviation, I loop through again, now that I have the mean calculated.

I would like to avoid going through the array twice, once for the mean and then once for the SD (after I have a mean).

Is there an efficient method for calculating both values, only going through the array once? Any code in an interpreted language (e.g. Perl or Python) or pseudocode is fine.

Alex Reynolds

15 Answers


The answer is to use Welford's algorithm, which is very clearly defined after the "naive methods" in:

It's more numerically stable than either the two-pass or online simple sum of squares collectors suggested in other responses. The stability only really matters when you have lots of values that are close to each other as they lead to what is known as "catastrophic cancellation" in the floating point literature.

You might also want to brush up on the difference between dividing by the number of samples (N) and N-1 in the variance calculation (squared deviation). Dividing by N-1 leads to an unbiased estimate of variance from the sample, whereas dividing by N on average underestimates variance (because it doesn't take into account the variance between the sample mean and the true mean).
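A minimal one-pass sketch of Welford's update in Python (an illustration, not the author's Java implementation; the `sample` flag chooses the N-1 vs. N divisor discussed above):

```python
import math

def welford(values, sample=True):
    """One-pass mean and standard deviation via Welford's update."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n          # update the running mean
        m2 += delta * (x - mean)   # running sum of squared deviations
    if n < 2:
        return mean, 0.0
    # sample=True divides by N-1 (unbiased); sample=False divides by N.
    return mean, math.sqrt(m2 / (n - 1 if sample else n))
```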

I wrote two blog entries on the topic which go into more details, including how to delete previous values online:

You can also take a look at my Java implementation; the javadoc, source, and unit tests are all online:
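Since the blog entries above mention deleting previous values online, here is a sketch of reversing a single Welford update (my own derivation; worth checking against those posts before relying on it):

```python
def remove_value(x, n, mean, m2):
    """Reverse one Welford update: drop x from the running (count, mean, M2)."""
    n -= 1
    if n == 0:
        return 0, 0.0, 0.0
    # Recover the mean of the remaining n values...
    mean_without = (mean * (n + 1) - x) / n
    # ...and undo the M2 increment the forward update would have added.
    m2 -= (x - mean_without) * (x - mean)
    return n, mean_without, m2
```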

DieterDP
Bob Carpenter

The basic answer is to accumulate the sum of both x (call it 'sum_x1') and x² (call it 'sum_x2') as you go. The value of the standard deviation is then:

stdev = sqrt((sum_x2 / n) - (mean * mean)) 

where

mean = sum_x1 / n

This is the population standard deviation (dividing by 'n'); for the sample standard deviation, divide the sum of squared deviations by 'n - 1' instead: stdev = sqrt((sum_x2 - n * mean * mean) / (n - 1)).

You may need to worry about the numerical stability of taking the difference between two large numbers if you are dealing with large samples. Go to the external references in other answers (Wikipedia, etc) for more information.
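A minimal Python sketch of this accumulator (my illustration; the names sum_x1/sum_x2 follow the answer, and the max() clamp guards against tiny negative variances from floating-point rounding):

```python
import math

def mean_std(values):
    # Accumulate n, the sum of x, and the sum of x**2 in a single pass.
    n = 0
    sum_x1 = 0.0
    sum_x2 = 0.0
    for x in values:
        n += 1
        sum_x1 += x
        sum_x2 += x * x
    mean = sum_x1 / n
    # Population variance; clamp tiny negatives caused by rounding.
    var = max(sum_x2 / n - mean * mean, 0.0)
    return mean, math.sqrt(var)
```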

compie
Jonathan Leffler
  • This is what I was going to suggest. It's the best and fastest way, assuming precision errors are not a problem. – Ray Hidayat Jul 24 '09 at 00:08
  • I decided to go with Welford's Algorithm as it performs more reliably with the same computational overhead. – Alex Reynolds Jul 29 '09 at 23:34
  • This is a simplified version of the answer and may give non-real results depending on the input (i.e., when sum_x2 < sum_x1 * sum_x1). To ensure a valid real result, go with `sd = sqrt(((n * sum_x2) - (sum_x1 * sum_x1)) / (n * (n - 1)))` – Dan Tao Oct 08 '09 at 15:17
  • @Dan: am I missing something? Your expression appears to be different from mine - as in, guaranteed to produce a different result - because you've multiplied sum_x2 by n but not made a compensating multiplication of sum_x1 * sum_x1? – Jonathan Leffler Oct 08 '09 at 16:42
  • @Dan points out a valid issue - the formula above breaks down for x>1 because you end up taking the sqrt of a negative number. The Knuth approach is: sqrt((sum_x2 / n) - (mean * mean)) where mean = (sum_x / n). – G__ Jul 27 '10 at 04:12
  • @flies: The answer has changed since I left that comment 1 year ago and Greg left his over two months ago. The formula used to be sqrt((sum_x2 - sum_x1 * sum_x1) / (n - 1)), which, unless I'm mistaken, was actually incorrect. – Dan Tao Oct 08 '10 at 13:16
  • Dividing by N gives the maximum likelihood estimate of variance, but it's biased to the low side because it uses the sample mean rather than the true mean. Dividing by N - 1 gives you an unbiased estimate of the variance. – Bob Carpenter Mar 24 '18 at 00:24
  • @UriLoya — you've not said anything about how you are calculating the values. However, if you use `int` in C to store the sum of squares, you run into overflow problems with the values you list. – Jonathan Leffler May 05 '20 at 13:29

Here is a literal pure Python translation of the Welford's algorithm implementation from http://www.johndcook.com/standard_deviation.html:

https://github.com/liyanage/python-modules/blob/master/running_stats.py

import math

class RunningStats:

    def __init__(self):
        self.n = 0
        self.old_m = 0
        self.new_m = 0
        self.old_s = 0
        self.new_s = 0

    def clear(self):
        # Only n needs resetting; push() re-initializes the running
        # values on the next call, when n == 1.
        self.n = 0
    
    def push(self, x):
        self.n += 1
    
        if self.n == 1:
            self.old_m = self.new_m = x
            self.old_s = 0
        else:
            self.new_m = self.old_m + (x - self.old_m) / self.n
            self.new_s = self.old_s + (x - self.old_m) * (x - self.new_m)
        
            self.old_m = self.new_m
            self.old_s = self.new_s

    def mean(self):
        return self.new_m if self.n else 0.0

    def variance(self):
        return self.new_s / (self.n - 1) if self.n > 1 else 0.0
    
    def standard_deviation(self):
        return math.sqrt(self.variance())

Usage:

rs = RunningStats()
rs.push(17.0)
rs.push(19.0)
rs.push(24.0)

mean = rs.mean()
variance = rs.variance()
stdev = rs.standard_deviation()

print(f'Mean: {mean}, Variance: {variance}, Std. Dev.: {stdev}')
Marc Liyanage
  • This should be the accepted answer as it's the only one that is both correct and shows the algorithm, with reference to Knuth. – Johan Lundberg May 31 '16 at 20:52
  • To the contributors who recently edited this answer, I had to reject your edit because I think it was incorrect. The edit removed the n == 1 special case in the push method, but I think that that case is required for correct results after the clear() method is used, I suspect you overlooked that. – Marc Liyanage Apr 07 '21 at 16:24

Perhaps not what you were asking, but ... If you use a numpy array, it will do the work for you, efficiently:

from numpy import array

nums = array(((0.01, 0.01, 0.02, 0.04, 0.03),
              (0.00, 0.02, 0.02, 0.03, 0.02),
              (0.01, 0.02, 0.02, 0.03, 0.02),
              (0.01, 0.00, 0.01, 0.05, 0.03)))

print(nums.std(axis=1))
# [ 0.0116619   0.00979796  0.00632456  0.01788854]

print(nums.mean(axis=1))
# [ 0.022  0.018  0.02   0.02 ]

Note that axis=1 computes per-row statistics; for the per-index statistics across rows that the question asks for, use axis=0.

By the way, there's some interesting discussion in this blog post and comments on one-pass methods for computing means and variances:

ars

The Python runstats Module is for just this sort of thing. Install runstats from PyPI:

pip install runstats

Runstats summaries can produce the mean, variance, standard deviation, skewness, and kurtosis in a single pass of data. We can use this to create your "running" version.

from runstats import Statistics

stats = [Statistics() for num in range(len(data[0]))]

for row in data:

    for index, val in enumerate(row):
        stats[index].push(val)

    for index, stat in enumerate(stats):
        print('Index', index, 'mean:', stat.mean())
        print('Index', index, 'standard deviation:', stat.stddev())

Statistics summaries are based on the Knuth and Welford method for computing standard deviation in one pass, as described in The Art of Computer Programming, Vol. 2, p. 232, 3rd edition. The benefit is numerically stable and accurate results.

Disclaimer: I am the author of the Python runstats module.

GrantJ
  • Nice module. It'd be interesting if `Statistics` had a `.pop` method so rolling statistics could also be calculated. – Gustavo Bezerra Sep 07 '16 at 05:13
  • @GustavoBezerra ``runstats`` does not maintain an internal list of values so I'm not sure that's possible. But pull requests are welcome. – GrantJ Sep 08 '16 at 17:03

Statistics::Descriptive is a very decent Perl module for these types of calculations:

#!/usr/bin/perl

use strict; use warnings;

use Statistics::Descriptive qw( :all );

my $data = [
    [ 0.01, 0.01, 0.02, 0.04, 0.03 ],
    [ 0.00, 0.02, 0.02, 0.03, 0.02 ],
    [ 0.01, 0.02, 0.02, 0.03, 0.02 ],
    [ 0.01, 0.00, 0.01, 0.05, 0.03 ],
];

my $stat = Statistics::Descriptive::Full->new;
# You also have the option of using sparse data structures

for my $ref ( @$data ) {
    $stat->add_data( @$ref );
    printf "Running mean: %f\n", $stat->mean;
    printf "Running stdev: %f\n", $stat->standard_deviation;
}
__END__

Output:

C:\Temp> g
Running mean: 0.022000
Running stdev: 0.013038
Running mean: 0.020000
Running stdev: 0.011547
Running mean: 0.020000
Running stdev: 0.010000
Running mean: 0.020000
Running stdev: 0.012566
Sinan Ünür

Have a look at PDL (pronounced "piddle!").

This is the Perl Data Language which is designed for high precision mathematics and scientific computing.

Here is an example using your figures....

use strict;
use warnings;
use feature qw( say );  # needed for say()
use PDL;

my $figs = pdl [
    [0.01, 0.01, 0.02, 0.04, 0.03],
    [0.00, 0.02, 0.02, 0.03, 0.02],
    [0.01, 0.02, 0.02, 0.03, 0.02],
    [0.01, 0.00, 0.01, 0.05, 0.03],
];

my ( $mean, $prms, $median, $min, $max, $adev, $rms ) = statsover( $figs );

say "Mean scores:     ", $mean;
say "Std dev? (adev): ", $adev;
say "Std dev? (prms): ", $prms;
say "Std dev? (rms):  ", $rms;


Which produces:

Mean scores:     [0.022 0.018 0.02 0.02]
Std dev? (adev): [0.0104 0.0072 0.004 0.016]
Std dev? (prms): [0.013038405 0.010954451 0.0070710678 0.02]
Std dev? (rms):  [0.011661904 0.009797959 0.0063245553 0.017888544]


Have a look at PDL::Primitive for more information on the statsover function. This seems to suggest that ADEV is the "standard deviation".

However, it may be PRMS (which Sinan's Statistics::Descriptive example shows) or RMS (which ars's NumPy example shows). I guess one of these three must be right ;-)
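A quick Python cross-check of the first row (my addition, not part of the original answer) suggests which is which: ADEV is the average absolute deviation, PRMS is the sample standard deviation (n - 1 divisor, matching Statistics::Descriptive), and RMS is the population standard deviation (n divisor, matching NumPy's default):

```python
import math

row = [0.01, 0.01, 0.02, 0.04, 0.03]
n = len(row)
mean = sum(row) / n
adev = sum(abs(x - mean) for x in row) / n                     # average absolute deviation
rms = math.sqrt(sum((x - mean) ** 2 for x in row) / n)         # population std
prms = math.sqrt(sum((x - mean) ** 2 for x in row) / (n - 1))  # sample std
print(adev, rms, prms)
```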

For more PDL information have a look at:

AndyG
draegtun

How big is your array? Unless it is zillions of elements long, don't worry about looping through it twice. The code is simple and easily tested.

My preference would be to use the numpy array maths extension to convert your array of arrays into a numpy 2D array and get the standard deviation directly:

>>> x = [ [ 1, 2, 4, 3, 4, 5 ], [ 3, 4, 5, 6, 7, 8 ] ] * 10
>>> import numpy
>>> a = numpy.array(x)
>>> a.std(axis=0) 
array([ 1. ,  1. ,  0.5,  1.5,  1.5,  1.5])
>>> a.mean(axis=0)
array([ 2. ,  3. ,  4.5,  4.5,  5.5,  6.5])

If that's not an option and you need a pure Python solution, keep reading...

If your array is

x = [ 
      [ 1, 2, 4, 3, 4, 5 ],
      [ 3, 4, 5, 6, 7, 8 ],
      ....
]

Then the standard deviation is:

from math import sqrt

d = len(x[0])
n = len(x)
sum_x  = [ sum(v[i]    for v in x) for i in range(d) ]
sum_x2 = [ sum(v[i]**2 for v in x) for i in range(d) ]
std_dev = [ sqrt(sx2/n - (sx/n)**2) for sx, sx2 in zip(sum_x, sum_x2) ]

If you are determined to loop through your array only once, the running sums can be combined.

sum_x  = [ 0 ] * d
sum_x2 = [ 0 ] * d
for v in x:
    for i, t in enumerate(v):
        sum_x[i] += t
        sum_x2[i] += t**2

This isn't nearly as elegant as the list comprehension solution above.

Stephen Simmons
  • I do actually have to deal with zillions of numbers, which is what motivates my need for an efficient solution. Thanks! – Alex Reynolds Jul 24 '09 at 04:33
  • It's not about how big the data set is, it's about how often: I have to do 3500 different standard deviation calculations, over 500 elements each, per second. – PirateApp May 25 '19 at 16:25

You could look at the Wikipedia article on Standard Deviation, in particular the section about Rapid calculation methods.

There's also an article I found that uses Python, you should be able to use the code in it without much change: Subliminal Messages - Running Standard Deviations.

Lasse V. Karlsen

I think this question will help you: Standard deviation

peterdemin
  • +1 @Lasse V. Karlsen's link to Wikipedia's good, but this is the right algorithm I've used... – kenny Jul 24 '09 at 17:44

Here's a "one-liner", spread over multiple lines, in functional programming style:

from functools import reduce

def variance(data, opt=0):
    # The accumulator tuple is (M2, count, running mean); opt=0 divides
    # by n - 1 (sample variance), opt=1 divides by n (population).
    m2, n, _ = reduce(
        lambda acc, x: (
            acc[0] + (x - acc[2]) ** 2 * acc[1] / (acc[1] + 1),
            acc[1] + 1,
            acc[2] + (x - acc[2]) / (acc[1] + 1),
        ),
        data,
        (0.0, 0, 0.0))
    return m2 / (opt + n - 1)
user541686
n = int(input("Enter no. of terms: "))

L = []
for i in range(n):
    L.append(float(input("Enter term: ")))

total = 0
for i in range(n):
    total = total + L[i]
avg = total / n

sumdev = 0
for j in range(n):
    sumdev = sumdev + (L[j] - avg) ** 2

dev = (sumdev / n) ** 0.5
print("Standard deviation is", dev)
J. Steen
Anuraag

As the answer "Does pandas/scipy/numpy provide a cumulative standard deviation function?" describes, the Python pandas module contains a method to calculate the running or cumulative standard deviation. For that you'll have to convert your data into a pandas DataFrame (or a Series if it is 1-D), but there are functions for that.
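For instance, a sketch using `Series.expanding()` on one list of values (my example, not from the answer; pandas computes the sample statistic with ddof=1 by default):

```python
import pandas as pd

s = pd.Series([0.01, 0.01, 0.02, 0.04, 0.03])
running_mean = s.expanding().mean()  # cumulative mean after each element
running_std = s.expanding().std()    # cumulative sample std; NaN for the first element
print(running_mean.tolist())
print(running_std.tolist())
```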

Ramon Crehuet

I like to express the update this way:

def running_update(x, N, mu, var):
    '''
        @arg x: the current data sample
        @arg N : the number of previous samples
        @arg mu: the mean of the previous samples
        @arg var : the variance over the previous samples
        @retval (N+1, mu', var') -- updated count, mean and variance
    '''
    N = N + 1
    rho = 1.0/N
    d = x - mu
    mu += rho*d
    var += rho*((1-rho)*d**2 - var)
    return (N, mu, var)

so that a one-pass function would look like this:

def one_pass(data):
    N = 0
    mu = 0.0
    var = 0.0
    for x in data:
        N = N + 1
        rho = 1.0/N
        d = x - mu
        mu += rho*d
        var += rho*((1-rho)*d**2 - var)
        # could yield here if you want partial results
    return (N, mu, var)

Note that this calculates the sample variance (1/N), not the unbiased estimate of the population variance (which uses a 1/(N-1) normalization factor). Unlike in the other answers, the variable var that tracks the running variance does not grow in proportion to the number of samples. At all times it is just the variance of the set of samples seen so far (there is no final "dividing by n" when getting the variance).

In a class it would look like this:

class RunningMeanVar(object):
    def __init__(self):
        self.N = 0
        self.mu = 0.0
        self.var = 0.0
    def push(self, x):
        self.N = self.N + 1
        rho = 1.0/self.N
        d = x - self.mu
        self.mu += rho*d
        self.var += rho*((1-rho)*d**2 - self.var)
    # reset, accessors etc. can be set up as you see fit
    # reset, accessors etc. can be setup as you see fit

This also works for weighted samples:

def running_update(w, x, N, mu, var):
    '''
        @arg w: the weight of the current sample
        @arg x: the current data sample
        @arg N : the total weight of the previous samples
        @arg mu: the mean of the previous samples
        @arg var : the variance over the previous samples
        @retval (N+w, mu', var') -- updated count, mean and variance
    '''
    N = N + w
    rho = w/N
    d = x - mu
    mu += rho*d
    var += rho*((1-rho)*d**2 - var)
    return (N, mu, var)
Dave

Here is a practical example of how you could implement a running standard deviation with Python and NumPy:

import numpy as np

a = np.arange(1, 10)
s = 0
s2 = 0
for i in range(len(a)):
    s += a[i]
    s2 += a[i] ** 2
    n = i + 1
    m = s / n
    std = np.sqrt((s2 / n) - (m * m))
    print(std, np.std(a[:i + 1]))

This will print out the calculated standard deviation and a check standard deviation calculated with numpy:

0.0 0.0
0.5 0.5
0.8164965809277263 0.816496580927726
1.118033988749895 1.118033988749895
1.4142135623730951 1.4142135623730951
1.707825127659933 1.707825127659933
2.0 2.0
2.29128784747792 2.29128784747792
2.5819888974716116 2.581988897471611

I am just using the formula described in this thread:

stdev = sqrt((sum_x2 / n) - (mean * mean)) 
gil.fernandes