Calculating the mean and standard deviation of data that does not fit in memory using Python

Question

I have a lot of data stored on disk in large arrays. I can't load all of it into memory at once.

How can I calculate the mean and the standard deviation?

@Joni, please read it carefully: in that question the data could be loaded into memory, whereas my question is about the case where we can't load all the data. That's why the questions are not the same. – Shan, Mar 26 '13 at 14:07
@Joni, it does not matter. Many problems can be solved with the same algorithm from different perspectives. Would somebody who wants the mean and variance of a big dataset search for the title "How to efficiently calculate a running standard deviation?" Do you think somebody would write that query, or the one asked in my question? – Shan, Mar 26 '13 at 15:08
Shan, the online algorithm for calculating the mean and variance is the same whether the dataset fits into memory or not. This question already has an answer in the linked question, and in dozens of other questions on this same topic. – Joni, Mar 26 '13 at 16:20
Perhaps this question would serve as a better duplicate: http://stackoverflow.com/questions/5543651/computing-standard-deviation-in-a-stream – Joni, Mar 26 '13 at 16:31

NPE · Accepted Answer · 2013-03-26T14:08:10.813

10

There is a simple online algorithm that computes both the mean and the variance by looking at each datapoint once and using O(1) memory.

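Wikipedia offers code along these lines:

def online_variance(data):
    """Return the sample variance of an iterable in a single pass."""
    n = 0        # number of values seen so far
    mean = 0.0   # running mean (float, so the division below is never integer division)
    M2 = 0.0     # running sum of squared deviations from the current mean

    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        M2 += delta * (x - mean)  # second factor uses the updated mean

    if n < 2:
        return float("nan")  # sample variance is undefined for fewer than two values
    variance = M2 / (n - 1)
    return variance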

This algorithm is also known as Welford's method. Unlike the method suggested in the other answer, it is numerically stable: it avoids the catastrophic cancellation that can occur when two large, nearly equal quantities are subtracted.

Take the square root of the variance to get the standard deviation.
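Since the question asks for both statistics, here is a minimal sketch that returns the mean and the standard deviation from the same single pass. The function name `online_mean_std` is made up for this illustration, and it assumes Python 3 and the sample (n - 1) variance.

```python
import math

def online_mean_std(data):
    """One-pass mean and sample standard deviation (Welford's method)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n == 0:
        raise ValueError("online_mean_std() needs at least one value")
    # Sample variance is undefined for a single value, so report NaN in that case.
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return mean, math.sqrt(variance)

# Any iterable works, including a generator that never holds all values at once.
mean, std = online_mean_std(float(v) for v in range(1, 6))
```

Only `n`, `mean` and `m2` are kept between iterations, so memory use stays constant no matter how many values are consumed.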

edited Mar 26 '13 at 14:08

answered Mar 26 '13 at 13:49

NPE

So I am curious: how does this work if not all data can be in memory at once? Is the assumption that `data` is an iterator that yields the next item until there is no more, and that it pages in more items from disk? – hughdbrown Mar 26 '13 at 15:49
@hughdbrown: The code is just an illustration. For example, `data` could be a generator, or a memory-mapped file. – NPE Mar 26 '13 at 15:53
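To make the generator idea concrete, here is a rough sketch that reads a file in fixed-size chunks and yields one value at a time. The file layout is an assumption, not something stated in the question: a raw binary file of native-endian 64-bit floats. The name `iter_floats` and the `chunk_items` parameter are hypothetical as well.

```python
import array
import os
import tempfile

def iter_floats(path, chunk_items=65536):
    """Yield float64 values from a raw binary file, one chunk in memory at a time."""
    item_size = array.array("d").itemsize  # 8 bytes for a C double
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk_items * item_size)
            if not raw:
                break
            chunk = array.array("d")
            # Assumption: the file length is a multiple of the item size.
            chunk.frombytes(raw)
            for value in chunk:
                yield value

# Demo: write a tiny file, then stream it back two items at a time.
values = array.array("d", [1.0, 2.0, 3.0, 4.0, 5.0])
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    values.tofile(tmp)
    demo_path = tmp.name
streamed = list(iter_floats(demo_path, chunk_items=2))
os.remove(demo_path)
```

The result of `iter_floats(path)` can be passed as the `data` argument of `online_variance`, so at most `chunk_items` values are held in memory at any time.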

BenDundee · Answer 2 · 2013-03-26T13:56:32.307

5

Sounds like a math question. For the mean, you know that you can take the mean of a chunk of data, and then take the mean of the means. If the chunks aren't the same size, you'll have to take a weighted average.
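A small sketch of the weighted average, assuming each chunk is something with a length, such as a list (the name `combined_mean` is only for illustration):

```python
def combined_mean(chunks):
    """Combine per-chunk means into the overall mean, weighting by chunk size."""
    total_count = 0
    weighted_sum = 0.0
    for chunk in chunks:
        n = len(chunk)
        if n == 0:
            continue  # an empty chunk contributes nothing
        chunk_mean = sum(chunk) / n
        weighted_sum += chunk_mean * n  # weight each chunk mean by its size
        total_count += n
    if total_count == 0:
        raise ValueError("combined_mean() needs at least one value")
    return weighted_sum / total_count

# Chunks of unequal size: the unweighted mean of the means would be (2 + 10) / 2 = 6,
# but the true mean of 1, 2, 3, 10 is 4.
overall = combined_mean([[1, 2, 3], [10]])
```

Weighting by chunk size is the same as dividing the total sum by the total count, which is why the chunk boundaries do not affect the result.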

For the standard deviation, you'll have to calculate the variance first. I'd suggest doing this alongside the calculation of the mean. For variance, you have

Var(X) = Avg(X^2) - Avg(X)^2

So compute the average of your data and the average of your data squared. Aggregate them across chunks as above, and then take the difference.

Then the standard deviation is just the square root of the variance.
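Roughly, in code (a sketch only: `chunked_mean_std` is a made-up name, the chunks are assumed to be iterables of numbers, and the result is the population variance, since the formula divides by n rather than n - 1):

```python
import math

def chunked_mean_std(chunks):
    """Mean and population standard deviation via Var(X) = Avg(X^2) - Avg(X)^2."""
    n = 0
    total = 0.0     # running sum of x
    total_sq = 0.0  # running sum of x^2
    for chunk in chunks:
        for x in chunk:
            n += 1
            total += x
            total_sq += x * x
    if n == 0:
        raise ValueError("chunked_mean_std() needs at least one value")
    mean = total / n
    variance = total_sq / n - mean * mean
    # Rounding error can push the difference slightly below zero.
    variance = max(variance, 0.0)
    return mean, math.sqrt(variance)

mean, std = chunked_mean_std([[1, 2], [3, 4, 5]])
```

Accumulating the sums and dividing once at the end is equivalent to taking size-weighted averages of the per-chunk averages.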

Note that you could do the whole thing with iterators, which is probably the most efficient.

edited Mar 26 '13 at 13:56

answered Mar 26 '13 at 13:48

BenDundee

1

The algorithm you suggest for calculating the variance is numerically unstable and the results it produces are inaccurate or even completely wrong for large datasets. See for example http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/, or the Wikipedia page on algorithms for computing variance. – Joni Mar 26 '13 at 14:37
@Joni, that's not quite right. Reading your link, the assumption is that the mean is much greater than the variance, which is not at all surprising, given the difference of squares in the formula. It looks like the difference between the mean and variance has to be several orders of magnitude before this is an issue. – BenDundee Mar 26 '13 at 15:02
On a careful read, you are right. If you calculate the running *average* of X^2 (rather than the sum, which I see often) you won't have problems with big datasets. – Joni Mar 26 '13 at 16:18
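For anyone who wants to see the effect discussed in these comments, here is a hedged comparison on data whose mean is many orders of magnitude larger than its spread. The values and function names are chosen purely for illustration, and both functions compute the population variance so that they are directly comparable.

```python
def naive_variance(data):
    """Population variance via Avg(X^2) - Avg(X)^2."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    mean = total / n
    return total_sq / n - mean * mean

def welford_variance(data):
    """Population variance via Welford's single-pass update."""
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n

# The same four values (4, 7, 13, 16) shifted by 1e9; the exact population variance is 22.5.
shifted = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
naive_result = naive_variance(shifted)
welford_result = welford_variance(shifted)
```

With squares near 1e18, double precision cannot resolve a difference of this size, so `naive_result` may be far from 22.5, while `welford_result` stays accurate. Without the 1e9 shift, both approaches agree.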
