87

Is there an algorithm to estimate the median, mode, skewness, and/or kurtosis of a set of values, but that does NOT require storing all the values in memory at once?

I'd like to calculate the basic statistics:

  • mean: arithmetic average
  • variance: average of squared deviations from the mean
  • standard deviation: square root of the variance
  • median: value that separates the larger half of the values from the smaller half
  • mode: most frequent value found in the set
  • skewness: measure of the asymmetry of the distribution about its mean
  • kurtosis: measure of how heavy the distribution's tails are, relative to a normal distribution

The basic formulas for calculating any of these are grade-school arithmetic, and I do know them. Many stats libraries implement them, as well.

My problem is the large number (billions) of values in the sets I'm handling: Working in Python, I can't just make a list or hash with billions of elements. Even if I wrote this in C, billion-element arrays aren't too practical.

The data is not sorted. It's produced randomly, on-the-fly, by other processes. The size of each set is highly variable, and the sizes will not be known in advance.

I've already figured out how to handle the mean and variance pretty well, iterating through each value in the set in any order. (Actually, in my case, I take them in the order in which they're generated.) Here's the algorithm I'm using, courtesy http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm:

  • Initialize three variables: count, sum, and sum_of_squares
  • For each value:
    • Increment count.
    • Add the value to sum.
    • Add the square of the value to sum_of_squares.
  • Divide sum by count, storing as the variable mean.
  • Divide sum_of_squares by count, storing as the variable mean_of_squares.
  • Square mean, storing as square_of_mean.
  • Subtract square_of_mean from mean_of_squares, storing as variance.
  • Output mean and variance.

This "on-line" algorithm has weaknesses (e.g., accuracy problems as sum_of_squares quickly grows larger than integer range or float precision), but it basically gives me what I need, without having to store every value in each set.
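The steps above can be sketched in Python like so (a minimal version of the same sum-of-squares approach; Welford's method is the numerically safer variant):

```python
def mean_and_variance(values):
    """One-pass mean/variance via running sums.
    Caution: sum_of_squares can lose precision for large or badly
    scaled data; Welford's method avoids this weakness."""
    count = 0
    total = 0.0
    sum_of_squares = 0.0
    for v in values:
        count += 1
        total += v
        sum_of_squares += v * v
    mean = total / count
    variance = sum_of_squares / count - mean * mean
    return mean, variance
```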

But I don't know whether similar techniques exist for estimating the additional statistics (median, mode, skewness, kurtosis). I could live with a biased estimator, or even a method that compromises accuracy to a certain degree, as long as the memory required to process N values is substantially less than O(N).

Pointing me to an existing stats library will help, too, if the library has functions to calculate one or more of these operations "on-line".

Ryan B. Lynch
  • will the data be passed in sorted, and will you know in advance the number of inputs? – chillysapien Jun 29 '09 at 15:16
  • Useful existing link on StackOverflow: http://stackoverflow.com/questions/895929/how-do-i-determine-the-standard-deviation-stddev-of-a-set-of-values – dmckee --- ex-moderator kitten Jun 29 '09 at 15:45
  • Is that integer data or float data? Do you have a max or min value? – stephan Jun 29 '09 at 15:47
  • dmckee: I'm actually using Welford's Method for the standard deviation. But I don't see anything in that link about mode, median, kurtosis, or skewness... Am I missing something? – Ryan B. Lynch Jun 29 '09 at 15:50
  • stephan: Some data sets are integers, others are floats. The population distribution is pretty close to the normal (Gaussian), so we can establish a confidence interval, but there is no hard range boundary (except x > 0, in some cases). – Ryan B. Lynch Jun 29 '09 at 15:56
  • @Ryan B. Lynch: Ah. Your text describes the naive one-pass variance algorithm, so I assumed you didn't know about Welford's approach. And I can't help on the others, either. Sorry. – dmckee --- ex-moderator kitten Jun 29 '09 at 16:18
  • These are called streaming algorithms, BTW. [ http://scholar.google.com/scholar?&q=streaming+algorithm ] Some of them are easy, some of them can be approximated to within arbitrary epsilon (with high probability), and some are hard even to approximate. – ShreevatsaR Jun 29 '09 at 18:01
  • Since mean, std, kurtosis, and skewness are just moments, I would expect you can calculate them the same way. Some q's: Do you know your data set well enough? Do you expect the statistics to vary in time? If so, and if you are wanting to capture that aspect of the data, this is a much tougher problem. If not, I would expect taking some sort of sample would be fine. (test it out a few times first) – temp2290 Oct 15 '09 at 16:54
  • A C++ implementation by http://stackoverflow.com/users/25188/john-d-cook "Calculating Percentiles in Memory-bound Applications", http://www.codeproject.com/KB/recipes/TailKeeper.aspx See also http://stackoverflow.com/questions/638030/how-to-calculate-or-approximate-the-median-of-a-list-without-storing-the-list – Alessandro Jacopson Jun 07 '12 at 19:03
  • In regards to your request for libraries, I have a Julia package for solving exactly these kinds of problems: https://github.com/joshday/OnlineStats.jl OnlineStats implements online algorithms for the mean, variance, skewness/kurtosis, histograms (from which quantiles can be estimated), and several approximation algorithms for quantiles. – joshday Nov 29 '17 at 15:05

13 Answers

57

I use these incremental/recursive mean and median estimators, which both use constant storage:

mean += eta * (sample - mean)
median += eta * sgn(sample - median)

where eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources you can use something like eta=1/n for the mean estimator, where n is the number of samples seen so far... unfortunately, this does not appear to work for the median estimator.)

This type of incremental mean estimator seems to be used all over the place, e.g. in unsupervised neural network learning rules, but the median version seems much less common, despite its benefits (robustness to outliers). It seems that the median version could be used as a replacement for the mean estimator in many applications.

I would love to see an incremental mode estimator of a similar form...

UPDATE

I just modified the incremental median estimator to estimate arbitrary quantiles. In general, a quantile function (http://en.wikipedia.org/wiki/Quantile_function) tells you the value that divides the data into two fractions: p and 1-p. The following estimates this value incrementally:

quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)

The value p should be within [0,1]. This essentially shifts the sgn() function's symmetrical output {-1,0,1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1-p of the data are less than/greater than the quantile estimate, respectively). Note that for p=0.5, this reduces to the median estimator.
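A minimal Python sketch of the quantile update above (the function name is mine; eta and the starting estimate are tuning assumptions, and p=0.5 recovers the median estimator):

```python
def sgn(x):
    # signum function: returns one of {-1, 0, 1}
    return (x > 0) - (x < 0)

def stream_quantile(samples, p, eta=0.001, estimate=0.0):
    """Incrementally estimate the p-quantile of a stream in constant
    storage; p=0.5 gives the running median estimate."""
    for x in samples:
        estimate += eta * (sgn(x - estimate) + 2.0 * p - 1.0)
    return estimate
```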

Tyler Streeter
  • This median estimator is great. Do you know if there are similar estimators for 0.25/0.75 quantiles? – Gacek May 21 '10 at 14:28
  • @Gacek, sure: split the input stream into Lohalf < median and Hihalf > median, and use running-median on each half. – denis Jun 30 '10 at 15:23
  • @Gacek, yes, see http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=614292&tag=1 – Ian Goodfellow Mar 18 '11 at 21:17
  • @Gacek: I just updated my answer with an incremental method to estimate any quantile, where you can set p to 0.25, 0.75, or *any* value within [0,1]. – Tyler Streeter Sep 19 '11 at 22:00
  • +1. Thanks for posting this. The mean works brilliantly! But I struggle to compute the median using this method. Not sure if I understood you correctly. Could you please advise on this piece of code? http://pastebin.com/8qzusMEg – Viet Nov 03 '11 at 12:23
  • @Viet: You are using eta = 1/count both for the mean and median estimates. Unfortunately, I don't think that works for the median estimate... you might just have to use the non-stationary form (constant eta), and iterate until the median stops changing. If anyone knows of a way to make this median estimate converge in the stationary case (similar to using eta=1/n for the mean estimator), I would be interested. – Tyler Streeter Nov 04 '11 at 18:59
  • Thanks again Tyler! May I ask whether I need to have the array sorted before running the median estimator? – Viet Nov 05 '11 at 02:44
  • @Viet: No, I don't think sorting should make a difference if you're already planning to iterate until the median estimate stops changing. – Tyler Streeter Nov 08 '11 at 00:12
  • This works great for the mean, but I'm not seeing how it produces anything remotely close to the median. Take a sequence of millisecond timestamps, for instance: `[1328083200000, 981014400000, -628444800000, 318240000000, 949392000000]`, which have a median of `318240000000`. This equation shifts the previous median by +/- `eta`, for which the recommended value was `0.001`. That's not going to do anything for big numbers like these, and it might be too big for really small numbers. How would you pick an `eta` that actually gave you the right answer without knowing the answer a priori? – mckamey Mar 22 '12 at 21:33
  • Imagine that the numbers have units, e.g., millimeters. Then it's clear eta (for the estimate of the median) has to have the same units as the measurements, so a generic value like 0.001 simply doesn't make any sense. A seemingly better approach is to set eta from a running estimate of the absolute deviation: for each new value `sample`, update `cumadev += abs(sample-median)`. Then set `eta = 1.5*cumadev/(k*k)`, where `k` is the number of samples seen so far. – tholy Sep 07 '13 at 13:10
  • @tholy could you explain your formula? Why 1.5? Why divide by k^2? – Joseph Garvin Sep 25 '19 at 05:16
  • The `k^2` is easy: `σ = cumadev/k` is the mean absolute deviation, a measure of the spread of individual data points. Updating the median by `σ/k` is just like updating the mean by `newmean = (k*oldmean + newpoint)/(k+1) = oldmean + Δ/(k+1)` where `Δ = newpoint - oldmean`. The 1.5 was probably just me tuning it to be a bit more accurate for some distribution (e.g., uniform or Gaussian, I don't really remember). – tholy Sep 29 '19 at 09:32
54

Skewness and Kurtosis

For on-line algorithms for skewness and kurtosis (along the lines of the variance algorithm), see the parallel algorithms for higher-moment statistics on the same wiki page.
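As a sketch, the higher-moment recurrences from that page can be coded like this (class and method names are mine; only a constant number of accumulators is stored):

```python
import math

class RunningMoments:
    """One-pass accumulation of the first four central moments,
    following the higher-order update formulas on the Wikipedia page
    'Algorithms for calculating variance'."""
    def __init__(self):
        self.n = 0
        self.mean = self.M2 = self.M3 = self.M4 = 0.0

    def push(self, x):
        n1 = self.n
        self.n += 1
        delta = x - self.mean
        delta_n = delta / self.n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        self.M4 += (term1 * delta_n2 * (self.n * self.n - 3 * self.n + 3)
                    + 6 * delta_n2 * self.M2 - 4 * delta_n * self.M3)
        self.M3 += term1 * delta_n * (self.n - 2) - 3 * delta_n * self.M2
        self.M2 += term1

    def variance(self):
        # population variance
        return self.M2 / self.n

    def skewness(self):
        return math.sqrt(self.n) * self.M3 / self.M2 ** 1.5

    def kurtosis(self):
        # excess kurtosis (0 for a normal distribution)
        return self.n * self.M4 / (self.M2 * self.M2) - 3.0
```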

Median

Median is tough without sorted data. If you know how many data points you have, in theory you only need a partial sort, e.g. by using a selection algorithm. However, that doesn't help much with billions of values. I would suggest using frequency counts; see the next section.

Median and Mode with Frequency Counts

If the values are integers, I would count frequencies, probably cutting off the highest and lowest values beyond some point where I am sure they are no longer relevant. For floats (or too many distinct integers), I would create buckets / intervals and then use the same approach as for integers. (Approximate) mode and median calculation then becomes easy, based on the frequency table.
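A sketch of that bucket idea in Python (the bounds `lo`, `hi` and the bucket count `nbins` are assumptions you pick from domain knowledge; results are accurate to within one bucket width):

```python
from collections import Counter

def bucketed_mode_and_median(samples, lo, hi, nbins=1000):
    """Approximate mode and median from fixed-width frequency buckets;
    memory is O(nbins) regardless of the stream length."""
    width = (hi - lo) / nbins
    counts = Counter()
    n = 0
    for x in samples:
        # clamp out-of-range values into the edge buckets
        b = min(nbins - 1, max(0, int((x - lo) / width)))
        counts[b] += 1
        n += 1
    mode_bin = max(counts, key=counts.get)
    seen = 0
    for b in sorted(counts):      # walk bins until half the mass is passed
        seen += counts[b]
        if seen >= n / 2.0:
            median_bin = b
            break
    # report each bucket by its center value
    return lo + (mode_bin + 0.5) * width, lo + (median_bin + 0.5) * width
```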

Normally Distributed Random Variables

If the data is normally distributed, I would use the sample mean, variance, skewness, and kurtosis as maximum likelihood estimators, computed from a small subset. You already know the (on-line) algorithms to calculate those. E.g. read in a couple of hundred thousand or a million data points, until your estimation error gets small enough. Just make sure that you pick randomly from your set (e.g. that you don't introduce a bias by picking the first 100,000 values). The same approach can also be used to estimate the mode and median in the normal case (the sample mean is an estimator for both).
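To draw that random subset from a stream of unknown length without bias, reservoir sampling (Algorithm R) is a standard choice; a minimal sketch:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of
    unknown length, using O(k) memory (Vitter's Algorithm R)."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)           # fill the reservoir first
        else:
            j = random.randrange(i + 1)
            if j < k:
                sample[j] = x          # replace with probability k/(i+1)
    return sample
```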

Further comments

All the algorithms above can be run in parallel (as can many sorting and selection algorithms, e.g. QuickSort and QuickSelect), if this helps.

I have always assumed (with the exception of the section on the normal distribution) that we are talking about sample moments, median, and mode, not estimators for theoretical moments of a known distribution.

In general, sampling the data (i.e. only looking at a sub-set) should be pretty successful given the amount of data, as long as all observations are realizations of the same random variable (have the same distributions) and the moments, mode and median actually exist for this distribution. The last caveat is not innocuous. For example, the mean (and all higher moments) for the Cauchy Distribution do not exist. In this case, the sample mean of a "small" sub-set might be massively off from the sample mean of the whole sample.

stephan
12

I implemented the P-Square Algorithm for Dynamic Calculation of Quantiles and Histograms without Storing Observations in a neat Python module I wrote called LiveStats. It should solve your problem quite effectively. The library supports every statistic that you mention except for mode. I have not yet found a satisfactory solution for mode estimation.

Sean
7

Ryan, I'm afraid you are not doing the mean and variance right... This came up a few weeks ago here. One of the strong points of the online version (which actually goes by the name of Welford's method) is that it is especially accurate and stable; see the discussion here. Another strong point is that you do not need to store the total sum or total sum of squares...

I can't think of any on-line approach for the mode and median, which seem to require considering the whole list at once. But it may very well be that an approach similar to the one for the variance and mean will also work for the skewness and kurtosis...

Jaime
3

The Wikipedia article quoted in the question contains the formulas for calculating skewness and kurtosis on-line.

For mode - I believe - there is no way of doing this on-line. Why? Assume that all values of your input are different except the last one, which duplicates a previous one. In this case you have to remember all values already seen in the input to detect that the last value duplicates a value seen before, making it the most frequent one.

For median it is almost the same - up to the last input you don't know which value will become the median if all input values are different, because it could fall before or after the current median. If you know the length of the input, you can find the median without storing all values in memory, but you will still have to store many of them (I guess around half) because a bad input sequence could shift the median heavily in the second half, possibly making any value from the first half the median.

(Note that I am referring to exact calculation only.)

Daniel Brückner
2

If you have billions of data points, then it's not likely that you need exact answers, as opposed to close answers. Generally, if you have billions of data points the underlying process which generates them will likely obey some kind of statistical stationarity / ergodicity / mixing property. Also it may matter whether you expect the distributions to be reasonably continuous or not.

In these circumstances, there exist algorithms for on-line, low memory, estimation of quantiles (the median is a special case of 0.5 quantile), as well as modes, if you don't need exact answers. This is an active field of statistics.

quantile estimation example: http://www.computer.org/portal/web/csdl/doi/10.1109/WSC.2006.323014

mode estimation example: Bickel DR. Robust estimators of the mode and skewness of continuous data. Computational Statistics and Data Analysis. 2002;39:153–163. doi: 10.1016/S0167-9473(01)00057-3.

These are active fields of computational statistics. You are getting into the fields where there isn't any single best exact algorithm, but a diversity of them (statistical estimators, in truth), which have different properties, assumptions and performance. It's experimental mathematics. There are probably hundreds to thousands of papers on the subject.

The final question is whether you really need skewness and kurtosis by themselves, or more likely some other parameters which may be more reliable at characterizing the probability distribution (assuming you have a probability distribution!). Are you expecting a Gaussian?

Do you have ways of cleaning/preprocessing the data to make it mostly Gaussianish? (for instance, financial transaction amounts are often somewhat Gaussian after taking logarithms). Do you expect finite standard deviations? Do you expect fat tails? Are the quantities you care about in the tails or in the bulk?

Matt Kennel
2

Everyone keeps saying that you can't compute the mode in an online manner, but that is simply not true. Here is an article describing an algorithm for exactly this problem, invented in 1982 by Michael E. Fischer and Steven L. Salzberg of Yale University. From the article:

The majority-finding algorithm uses one of its registers for temporary storage of a single item from the stream; this item is the current candidate for majority element. The second register is a counter initialized to 0. For each element of the stream, we ask the algorithm to perform the following routine. If the counter reads 0, install the current stream element as the new majority candidate (displacing any other element that might already be in the register). Then, if the current element matches the majority candidate, increment the counter; otherwise, decrement the counter. At this point in the cycle, if the part of the stream seen so far has a majority element, that element is in the candidate register, and the counter holds a value greater than 0. What if there is no majority element? Without making a second pass through the data—which isn't possible in a stream environment—the algorithm cannot always give an unambiguous answer in this circumstance. It merely promises to correctly identify the majority element if there is one.

It can also be extended to find the top N with more memory but this should solve it for the mode.
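A minimal Python sketch of that two-register routine (the scheme is usually known as the Boyer–Moore majority vote; the function name is mine):

```python
def majority_candidate(stream):
    """Fischer-Salzberg / Boyer-Moore majority vote in O(1) memory.
    If the stream has a majority element, it is returned; otherwise
    the result is arbitrary and needs a second pass to verify."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1   # install a new candidate
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate
```

Note that the order of the stream does not matter: matching elements need not be adjacent for the counter to recover the majority element.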

hackartist
  • That is an interesting algorithm, but unless I'm missing something, while all majority values will be modes, not all modes will be majority values. – jkebinger Dec 05 '13 at 22:19
  • The link has died, so I'm glad the description is included. BUT, as described, the counter only increments if the majority candidate's 2nd occurrence is adjacent to the 1st occurrence. Which IMPLIES sorted data. Which is NOT guaranteed in the online (streaming) data case. With randomly ordered data, this is unlikely to find any modes. – Jesse Chisholm Feb 25 '19 at 02:18
1

Ultimately if you have no a priori parametric knowledge of the distribution I think you have to store all the values.

That said, unless you are dealing with some sort of pathological situation, the remedian (Rousseeuw and Bassett 1990) may well be good enough for your purposes.

Very simply, it involves calculating the median of batches of medians.
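A minimal two-level sketch in Python (the full remedian applies the idea recursively with a small fixed batch size; here the batch size is a tuning assumption):

```python
import statistics

def remedian(stream, batch=1001):
    """Median of batch medians: stores O(batch + n/batch) values
    instead of all n. The full remedian recurses on the medians
    list when it, too, grows large."""
    medians, buf = [], []
    for x in stream:
        buf.append(x)
        if len(buf) == batch:
            medians.append(statistics.median(buf))
            buf = []
    if buf:                      # don't drop a partial final batch
        medians.append(statistics.median(buf))
    return statistics.median(medians)
```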

0

The median and mode can't be calculated online using only constant space. However, because median and mode are more "descriptive" than "quantitative" anyway, you can estimate them, e.g. by sampling the data set.

If the data is normally distributed in the long run, then you could just use your mean to estimate the median.

You can also estimate the median using the following technique: establish a median estimate M[i] for every, say, 1,000,000 entries in the data stream, so that M[0] is the median of the first one million entries, M[1] the median of the second one million entries, etc. Then use the median of M[0]...M[k] as the median estimator. This of course saves space, and you can control how much space is used by "tuning" the parameter 1,000,000. This can also be generalized recursively.

Antti Huima
0

OK dude, try these:

For C++:

double skew(double* v, unsigned long n){
    double sigma = pow(svar(v, n), 0.5);
    double mu = avg(v, n);

    double* t;
    t = new double[n];

    for(unsigned long i = 0; i < n; ++i){
        t[i] = pow((v[i] - mu)/sigma, 3);
    }

    double ret = avg(t, n);

    delete [] t;
    return ret;
}

double kurt(double* v, unsigned long n){
    double sigma = pow(svar(v, n), 0.5);
    double mu = avg(v, n);

    double* t;
    t = new double[n];

    for(unsigned long i = 0; i < n; ++i){
        t[i] = pow( ((v[i] - mu) / sigma) , 4) - 3;
    }

    double ret = avg(t, n);

    delete [] t;
    return ret;
}

where you say you can already calculate sample variance (svar) and average (avg), you point those at your own functions for doing that.

Also, have a look at Pearson's approximation. On such a large dataset it would be pretty close: skewness ≈ 3 (mean − median) / standard deviation, where you can take (max + min) / 2 as a rough stand-in for the median.

For floats the exact mode has no meaning; one would typically stick the values into bins of a significant size (like 1/100 * (max - min)).

peter
0

This problem was solved by Pebay et al:

https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2008/086212.pdf

user14717
-1
import numpy as np
import numpy.random as npr

# M, T, y0, var, and corr are assumed to be defined by the caller
yy = np.zeros((M, T))
for j in range(M):
    y = np.zeros(T)  # build the vector y
    y[0] = y0

    # generate the white noise
    eps = npr.randn(T - 1) * np.sqrt(var)

    # increment the y vector
    for k in range(1, T):
        y[k] = corr * y[k - 1] + eps[k - 1]

    yy[j] = y
Michele La Ferla
-1

I would tend to use buckets, which could be adaptive. The bucket size should be the accuracy you need. Then, as each data point comes in, you add one to the relevant bucket's count. These should give you simple approximations to the median and kurtosis, by treating each bucket as its value weighted by its count.

The one problem could be loss of resolution in floating point after billions of operations, i.e. adding one does not change the value any more! To get around this, if the maximum bucket count exceeds some limit, you could subtract a large number from all the counts.

dan