Imagine you had a really big array whose elements could not all fit into a computer's memory. How would you compute the mean, median, standard deviation, and other summary statistics of this array in Python?
I found this post that explains the mathematics of computing the mean incrementally and also provides a Python function that takes a list or an iterator object. Since one may not always have access to an iterator object, I implemented it as a class that behaves similarly to collections.Counter. But how would one go about computing things like std, min, max, median, skewness, kurtosis, etc.?
The code below is a minimal working example that implements the incremental mean, min and max and shows where the rest would fit:
import numpy as np

class Inc_sumstats(object):
    def __init__(self):
        self.length = 0
        self.mean = 0
        #self.std = 0
        self.min = np.inf
        # Start at -inf (not 0) so that all-negative input still updates max.
        self.max = -np.inf
        #self.median = 0
        #self.skewness = 0
        #self.kurtosis = 0

    def update(self, num):
        # Fold one new value into the running statistics.
        self.length += 1
        self.mean = (self.mean * (self.length - 1) + num) / self.length
        #self.std = ...
        self.min = num if num < self.min else self.min
        self.max = num if num > self.max else self.max
        #self.median = ...
        #self.skewness = ...
        #self.kurtosis = ...
        return self
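To make the gap concrete, here is a minimal sketch of how the commented-out std slot could be filled, assuming Welford's online algorithm for the variance. This is my own illustration rather than something taken from the post mentioned above; the class name RunningMoments and the attribute _m2 are hypothetical, and it returns the population standard deviation (ddof=0).

```python
import math


class RunningMoments:
    """Hypothetical helper: running mean and std via Welford's algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Fold one new value into the running mean and squared-deviation sum.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # The second factor deliberately uses the already-updated mean.
        self._m2 += delta * (x - self.mean)
        return self

    @property
    def variance(self):
        # Population variance (ddof=0); NaN until at least one value is seen.
        return self._m2 / self.n if self.n else math.nan

    @property
    def std(self):
        return math.sqrt(self.variance)


stats = RunningMoments()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
print(stats.mean, stats.std)
```

For this input the printed mean and std should be close to 5.0 and 2.0. The sketch only covers the second moment; the median, skewness and kurtosis slots are still open.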
Update:
I'm aware of similar questions on the site, but so far I have found no solution that addresses anything more advanced than the mean. Please link questions or mark this as a duplicate if I am missing something in my background research.