44

I'm trying to calculate the median of a set of values, but I don't want to store all the values as that could blow memory requirements. Is there a way of calculating or approximating the median without storing and sorting all the individual values?

Ideally I would like to write my code a bit like the following

var medianCalculator = new MedianCalculator();
foreach (var value in SourceData)
{
  medianCalculator.Add(value);
}
Console.WriteLine("The median is: {0}", medianCalculator.Median);

All I need is the actual MedianCalculator code!

Update: Some people have asked if the values I'm trying to calculate the median for have known properties. The answer is yes. One value is in 0.5 increments from about -25 to -0.5. The other is also in 0.5 increments from -120 to -60. I guess this means I can use some form of histogram for each value.

Thanks

Nick

Noha Kareem
Nick Randell
  • Are you looking for a general purpose algorithm (in which case this seems impossible) or does your input have some properties that could be used for designing a specific solution? – mouviciel Mar 12 '09 at 10:44
  • A C++ implementation by http://stackoverflow.com/users/25188/john-d-cook "Calculating Percentiles in Memory-bound Applications", http://www.codeproject.com/KB/recipes/TailKeeper.aspx See also http://stackoverflow.com/questions/1058813/on-line-iterator-algorithms-for-estimating-statistical-median-mode-skewnes/ – Alessandro Jacopson Jun 07 '12 at 19:04
  • You can keep a running sample (e.g. using reservoir sampling) of whatever size you are comfortable with. Then, in the end, you can calculate the median of the sample. (This can also be extended with running median if you want to have quick access to it during your read.) – Thomas Ahle Sep 14 '18 at 09:56

10 Answers

45

If the values are discrete and the number of distinct values isn't too high, you could just accumulate the number of times each value occurs in a histogram, then find the median from the histogram counts (just add up counts from the top and bottom of the histogram until you reach the middle). Or if they're continuous values, you could distribute them into bins - that wouldn't tell you the exact median but it would give you a range, and if you need to know more precisely you could iterate over the list again, examining only the elements in the central bin.
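
For example, given the asker's update (values on a known 0.5-increment grid), a MedianCalculator along these lines would work. This is a minimal sketch with illustrative names and ranges; for an even count it returns the lower of the two middle values rather than averaging them:

using System;

// Histogram-based median for values on a fixed 0.5-increment grid,
// e.g. new MedianCalculator(-120, -60) for the asker's second value.
public class MedianCalculator
{
    private readonly double _min, _step;
    private readonly long[] _counts;   // one bin per possible value
    private long _total;

    public MedianCalculator(double min, double max, double step = 0.5)
    {
        _min = min;
        _step = step;
        _counts = new long[(int)((max - min) / step) + 1];
    }

    public void Add(double value)
    {
        _counts[(int)Math.Round((value - _min) / _step)]++;
        _total++;
    }

    public double Median
    {
        get
        {
            // Walk up the histogram until we pass half the observations.
            long seen = 0, half = (_total + 1) / 2;
            for (int i = 0; i < _counts.Length; i++)
            {
                seen += _counts[i];
                if (seen >= half)
                    return _min + i * _step;
            }
            throw new InvalidOperationException("No values added.");
        }
    }
}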

David Z
  • I like this answer. It's a great idea as I know the types of values being stored and can construct a histogram reasonably easily – Nick Randell Mar 12 '09 at 10:41
41

There is the 'remedian' statistic. It works by first setting up k arrays, each of length b. Data values are fed into the first array and, when it is full, its median is calculated and stored in the first position of the next array, after which the first array is re-used. When the second array is full, the median of its values is stored in the first position of the third array, and so on. You get the idea :)
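
For concreteness, here is a rough C# sketch of that buffer cascade (my own illustrative code, not from the paper; the simple Estimate below just takes the median of the highest non-empty buffer, whereas the paper uses a weighted median over all partially filled buffers):

using System;

// k buffers of length b hold up to b^k values in O(k*b) space.
public class Remedian
{
    private readonly double[][] _buffers;
    private readonly int[] _fill;     // slots used in each buffer
    private readonly int _b;

    public Remedian(int k, int b)
    {
        _b = b;
        _buffers = new double[k][];
        for (int i = 0; i < k; i++) _buffers[i] = new double[b];
        _fill = new int[k];
    }

    public void Add(double value) => Insert(value, 0);

    private void Insert(double value, int level)
    {
        // Only the top buffer can still be full here: capacity b^k exceeded.
        if (_fill[level] == _b)
            throw new InvalidOperationException("Capacity b^k exceeded.");
        _buffers[level][_fill[level]++] = value;
        // Buffer full: push its median one level up and re-use this buffer.
        if (_fill[level] == _b && level + 1 < _buffers.Length)
        {
            double med = MedianOf(_buffers[level], _b);
            _fill[level] = 0;
            Insert(med, level + 1);
        }
    }

    // Simple estimate: median of the highest non-empty buffer.
    public double Estimate()
    {
        for (int i = _buffers.Length - 1; i >= 0; i--)
            if (_fill[i] > 0) return MedianOf(_buffers[i], _fill[i]);
        throw new InvalidOperationException("No values added.");
    }

    private static double MedianOf(double[] a, int n)
    {
        var s = new double[n];
        Array.Copy(a, s, n);
        Array.Sort(s);
        return s[n / 2];
    }
}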

It's simple and pretty robust. The reference is here...

http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/Remedian.pdf

Hope this helps

Michael

cberzan
michael
  • there is also a newer paper: https://www.sciencedirect.com/science/article/pii/S0304397513004519 – S.A. Apr 07 '18 at 09:43
21

I use these incremental/recursive mean and median estimators, which both use constant storage:

mean += eta * (sample - mean)
median += eta * sgn(sample - median)

where eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources you can use something like eta=1/n for the mean estimator, where n is the number of samples seen so far... unfortunately, this does not appear to work for the median estimator.)

This type of incremental mean estimator seems to be used all over the place, e.g. in unsupervised neural network learning rules, but the median version seems much less common, despite its benefits (robustness to outliers). It seems that the median version could be used as a replacement for the mean estimator in many applications.

Also, I modified the incremental median estimator to estimate arbitrary quantiles. In general, a quantile function tells you the value that divides the data into two fractions: p and 1-p. The following estimates this value incrementally:

quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)

The value p should be within [0,1]. This essentially shifts the sgn() function's symmetrical output {-1,0,1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1-p of the data are less than/greater than the quantile estimate, respectively). Note that for p=0.5, this reduces to the median estimator.
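
As a concrete illustration, the estimators above translate almost line-for-line into a small C# class (the class and parameter names are mine; eta and the starting estimate are up to you):

using System;

public class QuantileEstimator
{
    private readonly double _eta, _p;
    public double Estimate { get; private set; }

    public QuantileEstimator(double p, double eta = 0.001, double initial = 0.0)
    {
        _p = p; _eta = eta; Estimate = initial;
    }

    // One incremental update per sample, constant storage.
    public void Add(double sample)
    {
        Estimate += _eta * (Math.Sign(sample - Estimate) + 2.0 * _p - 1.0);
    }
}

// Usage: p = 0.5 reduces to the running median estimator.
// var median = new QuantileEstimator(0.5);
// foreach (var v in SourceData) median.Add(v);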

I would love to see an incremental mode estimator of a similar form...

(Note: I also posted this to a similar topic here: "On-line" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis?)

Tyler Streeter
  • That median estimation does not work for non-Gaussian distributions. Imagine having a distribution where you have only two numbers: x seen n times, and x+y seen n+m times. If either m or y is relatively large, the estimate breaks. Example: you see 100 10m+1 times, then 1 100k times. True median = 100. Estimated median = 1. – Eli Sep 13 '13 at 22:48
  • Of course, it is still an estimation. Note that the problem can be overcome with a high number of iterations n. For n going to infinity the error will converge to eta, but even that might be overcome, by gradually reducing eta. – Tim Kuipers Nov 25 '13 at 21:23
  • This was exactly what I needed! So simple! Thanks! – Ihle Jan 11 '19 at 12:55
  • Regarding the mean, why use an approximation instead of the exact recursive formula `mean += (sample - mean) / N` (see [better description here](https://math.stackexchange.com/a/106720/65103))? – bluenote10 May 23 '20 at 08:58
  • @bluenote10 Good point. Using a constant *eta* is good for tracking non-stationary sources, whereas *eta* = 1/n (equivalent to what you wrote) is good for the stationary case. I have updated the post to reflect this distinction. – Tyler Streeter May 23 '20 at 14:53
14

Here is a crazy approach that you might try. This is a classical problem in streaming algorithms. The rules are

  1. You have limited memory, say O(log n), where n is the number of items in the stream.
  2. You can look at each item only once and must decide then and there what to do with it; if you store it, it costs memory, and if you throw it away, it is gone forever.

The idea for finding the median is simple. Sample O(1/a^2 * log(1/p)) * log(n) elements from the list at random; you can do this via reservoir sampling (see a previous question). Now simply return the median of your sampled elements, using a classical method.

The guarantee is that the sorted rank of the returned item will be within (1 +/- a)*n/2 with probability at least 1-p. So there is a probability p of failing, and you can make it as small as you like by sampling more elements. It won't return the exact median, nor guarantee that the value of the returned item is anywhere close to the median value; it only guarantees that when you sort the list, the returned item will land close to the halfway point.

This algorithm uses O(log n) additional space and runs in linear time.
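
A minimal C# sketch of this sampling approach, assuming the sample size has been chosen per the bound above (the class name and helper are illustrative):

using System;

// Keep a fixed-size uniform random sample (Algorithm R), then take its median.
public class SampledMedian
{
    private readonly double[] _reservoir;
    private readonly Random _rng = new Random();
    private long _seen;

    public SampledMedian(int sampleSize)
    {
        _reservoir = new double[sampleSize];
    }

    public void Add(double value)
    {
        _seen++;
        if (_seen <= _reservoir.Length)
            _reservoir[_seen - 1] = value;
        else
        {
            // Keep the new value with probability sampleSize / _seen.
            long j = (long)(_rng.NextDouble() * _seen);
            if (j < _reservoir.Length) _reservoir[j] = value;
        }
    }

    public double Median
    {
        get
        {
            int n = (int)Math.Min(_seen, _reservoir.Length);
            if (n == 0) throw new InvalidOperationException("No values added.");
            var s = new double[n];
            Array.Copy(_reservoir, s, n);
            Array.Sort(s);
            return s[n / 2];
        }
    }
}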

Regexident
Pall Melsted
  • Hi, could you please explain what `a` and `p` are and what they represent? By the way, the link you provided is now dead. – user7364588 Mar 02 '19 at 15:37
  • You can choose a and p as small as you like; you get a better approximation with higher probability, but the time complexity increases. – Pall Melsted Apr 15 '19 at 22:43
7

This is tricky to get right in general, especially for degenerate series that are already sorted, or that have a bunch of values in one range at the "start" of the list while the end of the list has values in a different range.

The basic idea of making a histogram is most promising. This lets you accumulate distribution information and answer queries (like median) from it. The median will be approximate since you obviously don't store all values. The storage space is fixed so it will work with whatever length sequence you have.

But you can't just build a histogram from, say, the first 100 values and use that histogram continually; the changing data may make that histogram invalid. So you need a dynamic histogram that can change its range and bins on the fly.

Make a structure which has N bins. You'll store the X value of each slot transition (N+1 values total) as well as the population of the bin.

Stream in your data. Record the first N+1 values. If the stream ends before this, great: you have all the values loaded and can find the exact median and return it. Otherwise, use those values to define your first histogram: just sort them and use them as the bin definitions, each bin having a population of 1. It's OK to have dupes (zero-width bins).

Now stream in new values. For each one, binary search to find the bin it belongs to. In the common case, you just increment the population of that bin and continue. If your sample is beyond the histogram's edges (highest or lowest), just extend the end bin's range to include it. When your stream is done, you find the median sample value by finding the bin which has equal population on both sides of it, and linearly interpolating the remaining bin-width.

But that's not enough: you still need to ADAPT the histogram to the data as it's being streamed in. When a bin gets over-full, you're losing information about that bin's sub-distribution. You can fix this by adapting based on some heuristic. The easiest and most robust one: if a bin reaches a certain threshold population (something like 10*v/N, where v = the number of values seen so far in the stream and N is the number of bins), you SPLIT that overfull bin. Add a new edge at the midpoint of the bin and give each side half of the original bin's population. But now you have too many bins, so you need to DELETE a bin. A good heuristic for that is to find the bin with the smallest product of population and width, and merge it with whichever of its left or right neighbours has the smaller product of width and population. Done! Note that merging or splitting bins loses information, but that's unavoidable: you only have fixed storage.
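
Below is a compressed C# sketch of that bookkeeping (my own illustrative code, not the author's; error handling and smarter heuristics are omitted):

using System;
using System.Collections.Generic;
using System.Linq;

public class AdaptiveHistogram
{
    private readonly List<double> _edges;   // bin boundaries (N+1 values)
    private readonly List<long> _pop;       // bin populations (N values)
    private long _seen;

    // Seed with the first N+1 values: sorted, they become the bin edges.
    public AdaptiveHistogram(IEnumerable<double> seedValues)
    {
        _edges = seedValues.OrderBy(v => v).ToList();
        _pop = Enumerable.Repeat(1L, _edges.Count - 1).ToList();
        _seen = _pop.Count;
    }

    public void Add(double v)
    {
        _seen++;
        if (v < _edges[0]) _edges[0] = v;                 // extend low edge
        if (v > _edges[_edges.Count - 1]) _edges[_edges.Count - 1] = v;
        int bin = FindBin(v);
        _pop[bin]++;
        if (_pop[bin] > 10 * _seen / _pop.Count)          // over-full: adapt
        {
            Split(bin);
            MergeCheapest();
        }
    }

    private int FindBin(double v)                         // binary search
    {
        int i = _edges.BinarySearch(v);
        if (i < 0) i = ~i - 1;
        return Math.Max(0, Math.Min(i, _pop.Count - 1));
    }

    private void Split(int b)                             // halve the bin
    {
        long half = _pop[b] / 2;
        _edges.Insert(b + 1, (_edges[b] + _edges[b + 1]) / 2);
        _pop[b] -= half;
        _pop.Insert(b + 1, half);
    }

    private void MergeCheapest()                          // restore bin count
    {
        Func<int, double> cost = i => _pop[i] * (_edges[i + 1] - _edges[i]);
        int best = 0;
        for (int i = 1; i < _pop.Count; i++)
            if (cost(i) < cost(best)) best = i;
        int nb = best == 0 ? 1
               : best == _pop.Count - 1 ? best - 1
               : cost(best - 1) <= cost(best + 1) ? best - 1 : best + 1;
        int lo = Math.Min(best, nb);
        _pop[lo] += _pop[lo + 1];
        _pop.RemoveAt(lo + 1);
        _edges.RemoveAt(lo + 1);  // drop the boundary between the two bins
    }

    // Approximate median: walk to the half-population bin and interpolate.
    public double Median
    {
        get
        {
            long half = _seen / 2, cum = 0;
            for (int i = 0; i < _pop.Count; i++)
            {
                if (cum + _pop[i] >= half && _pop[i] > 0)
                    return _edges[i] + (half - cum) / (double)_pop[i]
                           * (_edges[i + 1] - _edges[i]);
                cum += _pop[i];
            }
            return _edges[_edges.Count - 1];
        }
    }
}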

This algorithm is nice in that it will deal with all types of input streams and give good results. If you have the luxury of choosing sample order, a random sample is best, since that minimizes splits and merges.

The algorithm also allows you to query any percentile, not just median, since you have a complete distribution estimate.

I use this method in my own code in many places, mostly for debugging logs, where some stats that you're recording have an unknown distribution. With this algorithm you don't need to guess ahead of time.

The downside is that the unequal bin widths mean you have to do a binary search for each sample, so the net algorithm is O(N log N).

SPWorley
  • Thanks - this is a good answer, but may be too expensive for my requirements. I need to have lots of medians over a large geographic area, with different medians for each 200m by 200m area. – Nick Randell Mar 13 '09 at 06:52
  • Some good ideas, but I'm stuck on one thing -- when you split a bin in two, how do you decide how many from that bin go into sub-bin #1 and how many go into sub-bin #2? It seems you would need to record every value in the bin (since a bin may subdivide many times). – j_random_hacker Apr 09 '09 at 18:30
  • JRH: You split in the middle and assign half of the population to each bin. We're not storing any more information about the inner subbin data distribution to do much else, and the split is mostly to allow better data resolution from now on. – SPWorley Apr 09 '09 at 20:48
  • @Arno: Thanks for the clarification. Seems like a lossy but workable approach to me, +1. – j_random_hacker Apr 10 '09 at 05:00
3

I don't think it is possible to do this without having the list in memory. You can obviously approximate with

  • average if you know that the data is symmetrically distributed
  • or calculate a proper median of a small subset of data (that fits in memory) - if you know that your data has the same distribution across the sample (e.g. that the first item has the same distribution as the last one)
Grzenio
3

David's suggestion seems like the most sensible approach for approximating the median.

A running mean for the same problem is much easier to calculate:

M_n = M_(n-1) + ((V_n - M_(n-1)) / n)

Where M_n is the mean of the first n values, M_(n-1) is the previous mean, and V_n is the new value.

In other words, the new mean is the existing mean plus the difference between the new value and the mean, divided by the number of values.

In code this would look something like:

new_mean = prev_mean + ((value - prev_mean) / count)

though obviously you may want to consider language-specific stuff like floating-point rounding errors etc.
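
In C#, that could be wrapped up as a tiny helper (a minimal sketch; double precision is usually fine, but compensated summation can help on very long streams):

public class RunningMean
{
    private double _mean;
    private long _count;

    // Incorporate one new value into the running mean.
    public void Add(double value)
    {
        _count++;
        _mean += (value - _mean) / _count;
    }

    public double Mean => _mean;
}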

GrahamS
2

Find the Min and Max of the list of N items through a linear search and name them HighValue and LowValue. Let MedianIndex = (N+1)/2.

1st Order Binary Search:

Repeat the following 4 steps while LowValue < HighValue.

  1. Compute MedianValue approximately = (HighValue + LowValue) / 2

  2. Count the number of items less than or equal to MedianValue in one pass over the data; call it K

  3. If K = MedianIndex, return MedianValue

  4. If K > MedianIndex, set HighValue = MedianValue; else set LowValue = MedianValue

It will be fast without consuming memory, though each iteration costs one extra pass over the data.
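
A sketch of the first-order search in C# (illustrative: the source() delegate must yield the same sequence on every call, and the 1e-9 tolerance plus the iteration cap are assumptions for floating-point data):

using System;
using System.Collections.Generic;
using System.Linq;

public static class MedianSearch
{
    public static double MedianByBisection(Func<IEnumerable<double>> source)
    {
        double low = double.MaxValue, high = double.MinValue;
        long n = 0;
        foreach (var v in source())              // pass 1: min, max and count
        {
            n++;
            if (v < low) low = v;
            if (v > high) high = v;
        }
        long medianIndex = (n + 1) / 2;

        // 60 halvings shrink any double range below tolerance.
        for (int iter = 0; iter < 60 && high - low > 1e-9; iter++)
        {
            double mid = (low + high) / 2;
            long k = source().LongCount(v => v <= mid);  // one pass per step
            if (k >= medianIndex) high = mid;    // median value is <= mid
            else low = mid;
        }
        return high;
    }
}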

2nd Order Binary Search:

LowIndex = 1, HighIndex = N

Repeat the following 5 steps while LowIndex < HighIndex.

  1. Compute the approximate DistributionPerUnit = (HighValue - LowValue) / (HighIndex - LowIndex)

  2. Compute the approximate MedianValue = LowValue + (MedianIndex - LowIndex) * DistributionPerUnit

  3. Count the number of items less than or equal to MedianValue in one pass over the data; call it K

  4. If K = MedianIndex, return MedianValue

  5. If K > MedianIndex, set HighIndex = K and HighValue = MedianValue; else set LowIndex = K and LowValue = MedianValue

It converges faster than the first-order search, again without consuming memory.

We can also think of fitting HighValue, LowValue and MedianValue together with HighIndex, LowIndex and MedianIndex to a parabola, giving a third-order search that is faster than the second order without consuming memory, and so on...

lakshmanaraj
  • This is a fine way to do it, but please mention explicitly that it requires multiple (log(n) in expectation) passes through the data since you won't be keeping NumberOfItemsWhichAreLessThanorEqualToMedianValue[k] in RAM. – j_random_hacker Apr 09 '09 at 18:39
0

Usually if the input is within a certain range, say 1 to 1 million, it's easy to create an array of counts: read the code for "quantile" and "ibucket" here: http://code.google.com/p/ea-utils/source/browse/trunk/clipper/sam-stats.cpp

This solution can be generalized as an approximation by coercing the input into an integer within some range using a function that you then reverse on the way out, i.e. foo.push((int) input/1000000) and quantile(foo)*1000000.

If your input is an arbitrary double precision number, then you've got to autoscale your histogram as values come in that are out of range (see above).
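
A minimal C# sketch of the count-array idea, with the coercion trick shown in the usage comment (class name, method names and the range are illustrative, not the linked code):

// Exact quantiles for integer inputs in a known range [min, max].
public class CountingQuantile
{
    private readonly long[] _counts;
    private readonly int _offset;
    private long _total;

    public CountingQuantile(int min, int max)
    {
        _offset = min;
        _counts = new long[max - min + 1];
    }

    public void Add(int value)
    {
        _counts[value - _offset]++;
        _total++;
    }

    // Walk the cumulative counts until fraction p of the data is covered.
    public int Quantile(double p)
    {
        long target = (long)(p * _total), cum = 0;
        for (int i = 0; i < _counts.Length; i++)
        {
            cum += _counts[i];
            if (cum > target) return i + _offset;
        }
        return _counts.Length - 1 + _offset;
    }
}

// Coercion trick for doubles: q.Add((int)(input / 1000000.0));
// then scale the result back: q.Quantile(0.5) * 1000000.0.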

Or you can use the median-triplets method described in this paper: http://web.cs.wpi.edu/~hofri/medsel.pdf

Erik Aronesty
0

I picked up the idea of iterative quantile calculation. It is important to have good values for the starting point and eta; these may come from the mean and sigma. So I programmed this:

// Iterative estimate of quantile p of the n values in x.
// signum_smooth() is a smoothed signum function and xy(a, b) computes a^b
// (both are my own helpers, described below).
Function QuantileIterative(Var x : Array of Double; n : Integer; p, mean, sigma : Double) : Double;
Var eta, quantile, q1, dq : Double;
    i : Integer;
Begin
  quantile := mean + 1.25*sigma*(p - 0.5);  // starting point from mean/sigma
  q1 := quantile;
  eta := 0.2*sigma/xy(1 + n, 0.75);         // should not be too large! sets accuracy
  For i := 1 to n Do
    quantile := quantile + eta*(signum_smooth(x[i] - quantile, eta) + 2*p - 1);
  dq := abs(q1 - quantile);
  If dq > eta then
  Begin
    If dq < 3*eta then eta := eta/4;        // refine and run a second pass
    For i := 1 to n Do
      quantile := quantile + eta*(signum_smooth(x[i] - quantile, eta) + 2*p - 1);
  end;
  QuantileIterative := quantile
end;

As the median of two elements would be their mean, I used a smoothed signum function, and xy() is x^y. Are there ideas to make it better? Of course, if we have some more a priori knowledge, we can add code using the min and max of the array, the skew, etc. For big data you would perhaps not use an array, but for testing it is easier.

kenlukas
user32038