Incrementally compute summary statistics of big array in Python

Question

Imagine you have an array so big that its elements do not all fit into the memory of a computer. How would you compute the mean, median, standard deviation and other summary statistics of this array in Python?

I found this post that explains the mathematics of computing the mean incrementally and also provides a Python function which takes a list or an iterator object. Since one may not always have access to an iterator object, I implemented it as a class which behaves similarly to collections.Counter. But how would one go about computing things like std, min, max, median, skewness, kurtosis, etc.?

The code below is a minimal working example that implements the incremental mean, min and max and shows where the rest would fit:

import numpy as np

class Inc_sumstats(object):
    def __init__(self):
        self.length = 0
        self.mean = 0.0  # float, so the update never uses integer division
        #self.std = 0
        self.min = np.inf
        self.max = -np.inf  # start at -inf so that negative inputs are handled
        #self.median = 0
        #self.skewness = 0
        #self.kurtosis = 0

    def update(self, num):
        self.length += 1
        # new mean = (old mean * old count + new value) / new count
        self.mean = (self.mean * (self.length - 1) + num) / self.length
        #self.std = ...
        self.min = min(self.min, num)
        self.max = max(self.max, num)
        #self.median = ...
        #self.skewness = ...
        #self.kurtosis = ...
        return self

Update:

I'm aware of similar questions on the site, but so far I have found no solution that addresses anything more advanced than the mean. Please link questions or mark as duplicate if I am missing something in my background research.

Since the mean changes when you increment the array, I believe you'll have to compute the std from scratch again. — lordingtar, Mar 30 '16 at 19:08
@Natecat The array could potentially be a massive stream of email character lengths, or any other type of realtime stream of data. Or it could just be stored on a big hard drive. — Ulf Aslak, Mar 30 '16 at 19:10
The min and max should be fairly trivial. For the median, the most effective method is to have two heaps that you keep balanced as described [here](http://stackoverflow.com/questions/15319561/how-to-implement-a-median-heap). The heaps can be kept in memory or saved to the hard disk if need be. The other three I'm not quite sure about. — Jules, Mar 30 '16 at 19:14
@JulesTamagnan You are right, min and max are super simple. I updated the code in the question. — Ulf Aslak, Mar 30 '16 at 19:48
At the moment, it sounds like you don't actually have a question, you just want someone to implement four online statistics formulae for you. — DSM, Mar 30 '16 at 20:05
@DSM Sure that would be great, cause I don't know how. And isn't that the point of SO? I asked this question because I wanted it to exist on SO, so that if someone in the future had the same question they wouldn't have to go through the trouble of asking it, or even worse, implement it for a one-time use and let the code rot in some dusty office computer. I'm not expecting an immediate solution, it's not homework, and maybe it needs a bounty at some point, but I would really love it if there was a solid answer to this question. — Ulf Aslak, Mar 30 '16 at 20:27
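
For reference, the two-heap median suggested in the comments above could be sketched roughly as follows. This is only a sketch: the class name `RunningMedian` is made up for illustration, and note that it keeps every element in the heaps, so it gives an exact median but does not by itself solve the memory problem (the heaps would have to go to disk, as the comment says).

```python
import heapq


class RunningMedian(object):
    """Exact running median with two balanced heaps (illustrative name)."""

    def __init__(self):
        self._low = []   # max-heap of the smaller half, stored as negated values
        self._high = []  # min-heap of the larger half

    def update(self, num):
        # Push onto the lower half, then move its largest element up.
        heapq.heappush(self._low, -num)
        heapq.heappush(self._high, -heapq.heappop(self._low))
        # Rebalance so the lower half is never smaller than the upper half.
        if len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))
        return self

    @property
    def median(self):
        if not self._low:
            raise ValueError("no data seen yet")
        if len(self._low) > len(self._high):
            # Odd count: the middle element sits on top of the lower half.
            return -self._low[0]
        # Even count: average the two middle elements.
        return (-self._low[0] + self._high[0]) / 2.0
```

Each update costs O(log n), and reading the median is O(1). Usage would look like `RunningMedian().update(5).update(1).median`.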

score 2 · Answer 1 · edited May 23 '17 at 11:48

What you're looking for is an online algorithm for order statistics. An online algorithm is kind of like a generator for some statistic; it accumulates data as it's read from memory or disk, so the programmer can handle memory management concerns and still get the correct output.

There's a lot of CS theory behind the implementation of these algorithms, but you can read more about it here: https://en.wikipedia.org/wiki/Selection_algorithm#Online_selection_algorithm

The mathematics is somewhat intuitive, though: your class should update the number of elements and recalculate the mean, min, max, kurtosis, std-dev etc. as a function of the previous values, and return these values as a tuple. I refer you to this question, with an exhaustive answer as to how to construct online statistics algorithms:

"On-line" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis?
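
To make "as a function of the previous values" concrete, here is a minimal sketch of the standard single-pass update for the second, third and fourth central moments (the Welford-style recurrences, as given for example in Wikipedia's "Algorithms for calculating variance"). The class name `OnlineMoments` is hypothetical, and I am assuming the population (biased) definitions and excess kurtosis; adjust the final formulas if you need the sample versions.

```python
import math


class OnlineMoments(object):
    """Single-pass mean, std, skewness and kurtosis (illustrative name)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of (x - mean)**2
        self._m3 = 0.0  # running sum of (x - mean)**3
        self._m4 = 0.0  # running sum of (x - mean)**4

    def update(self, x):
        n1 = self.n
        self.n += 1
        n = self.n
        delta = x - self.mean
        delta_n = delta / float(n)
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # Order matters: m4 needs the old m2 and m3, m3 needs the old m2.
        self._m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
                     + 6 * delta_n2 * self._m2 - 4 * delta_n * self._m3)
        self._m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self._m2
        self._m2 += term1
        return self

    @property
    def std(self):
        # Population std; divide by (n - 1) instead for the sample std.
        return math.sqrt(self._m2 / self.n) if self.n else float("nan")

    @property
    def skewness(self):
        if self._m2 == 0:
            return float("nan")  # undefined for constant or empty data
        return math.sqrt(self.n) * self._m3 / self._m2 ** 1.5

    @property
    def kurtosis(self):
        if self._m2 == 0:
            return float("nan")  # undefined for constant or empty data
        return self.n * self._m4 / (self._m2 * self._m2) - 3.0
```

Only five numbers are stored regardless of how much data is seen, so this part works for arbitrarily long streams. The same is not true for the exact median, which needs either all the data or an approximation.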

So the linked question is basically IT. I updated this question's title so it explicitly asks for an implementation in Python, because otherwise this is clearly a duplicate. Thanks for your direction, will +1 but not accept. — Ulf Aslak, Mar 30 '16 at 19:42
