21

I have a process that generates values and that I observe. When the process terminates, I want to compute the median of those values.

If I had to compute the mean, I could just store the sum and the number of generated values and thus have O(1) memory requirement. How about the median? Is there a way to save on the obvious O(n) coming from storing all the values?

Edit: Interested in 2 cases: 1) the stream length is known, 2) it's not.

Mau
  • Very interesting question. If you only need to know the median to a certain precision, and you expect that the probability distribution doesn't change over the sampling time, you can estimate the "99% confidence interval" of your median early on, and store only numbers within that interval (and keep track of the ones outside the interval that you discard). This will be more efficient when N is very large - but it does depend on your required precision of the result. – Floris Jan 12 '14 at 17:20

4 Answers

10

You are going to need to store at least ceil(n/2) points, because any one of the first n/2 points could turn out to be the median. It is probably simplest to just store all the points and find the median. If saving ceil(n/2) points is of value, then read the first ceil(n/2) points into a sorted list (a binary tree is probably best); then, as new points are added, throw out the low or high points and keep track of how many points have been thrown out on either end.
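A minimal Python sketch of this scheme (the function name, the assumption that n is odd, and the rule for which end to discard are my own choices, not part of the answer):

```python
import bisect

def stream_median_known_n(stream, n):
    """Median of a stream of known (assumed odd) length n, storing
    only ceil(n/2) values plus two counters."""
    m = n // 2 + 1          # ceil(n/2) for odd n
    buf = []                # sorted buffer of at most m values
    low_count = 0           # values discarded below the buffer
    high_count = 0          # values discarded above the buffer
    for x in stream:
        bisect.insort(buf, x)
        if len(buf) > m:
            # Either end is safe to drop: the dropped value still has
            # m surviving values on the other side of it, so it cannot
            # be the median. Balancing the two discard counts is an
            # arbitrary (but valid) tie-breaking rule.
            if low_count <= high_count:
                buf.pop(0)
                low_count += 1
            else:
                buf.pop()
                high_count += 1
    k = n // 2              # 0-indexed rank of the median
    return buf[k - low_count]
```

Running it on the 0,3,2,1,5,6,8,7,4 stream from the comments below yields 4, matching Stephen's hand trace.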

Edit:

If the stream length is unknown then, as Stephen observed in the comments, we obviously have no choice but to remember everything. If duplicate items are likely, we could possibly save a bit of memory using Dolphin's idea of storing values and counts.

deinst
  • No, I do not think so. With this n = 13, and we only need to store at most 7. I'm not sure what your n is. With this stream we read in the first 7, and then throw out zeros as we read the 2's. I really do not understand your objection. – deinst Jul 30 '10 at 14:05
  • OK, I read the question as a stream of unknown length, but now I realize that wasn't stated... Either way `13/2==6` for me :) Anyways, this is a true observation. Unfortunately, I can't reverse the -1, because I didn't do it. And `n/2` is still `O(n)` :) – Stephen Jul 30 '10 at 14:10
  • deinst: could you please help me to know how you are going to find median for this list with saveing first n/2 points: 0,3,2,1,5,6,8,7,4 – mhshams Jul 30 '10 at 14:16
  • Keep at most 5 points, because ceil(9/2)==5: `[0], [0,3], [0,2,3], [0,1,2,3], [0,1,2,3,5], (1)[1,2,3,5,6], (2)[2,3,5,6,8], (3)[3,5,6,7,8], (3)[3,4,5,6,7](1)`. 5th item is 4. (0,1,2,3,4,5,6,7,8) -> middle item is 4. – Stephen Jul 30 '10 at 14:21
  • Thanks Stephen, that is less muddled than mine was. – deinst Jul 30 '10 at 14:24
  • If there is a good chance that an average sample is going to be repeated more than 2 times, you could save some memory by storing the number and a count, basically Run Length Encoding the values you are keeping. – Dolphin Jul 30 '10 at 16:06
2

You can

  • Use statistics, if that's acceptable - for example, you could use sampling.
  • Use knowledge about your number stream
    • using a counting-sort-like approach: k distinct values means storing O(k) memory
    • or tossing out known outliers and keeping a (high, low) counter.
    • If you know you have no duplicates, you could use a bitmap... but that's just a smaller constant for O(n).
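The counting-sort-like bullet can be sketched as follows (a hedged example; the function name is my own, and memory is O(k) for k distinct values):

```python
from collections import Counter

def median_by_counts(stream):
    """Store each distinct value once with its count; O(k) memory
    for k distinct values instead of O(n)."""
    counts = Counter()
    n = 0
    for x in stream:
        counts[x] += 1
        n += 1
    # Walk the distinct values in sorted order, accumulating counts
    # until we pass the middle of the stream.
    target = (n - 1) // 2        # 0-indexed rank of the (lower) median
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen > target:
            return value
```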
Stephen
1

If you have discrete values and lots of repetition you could store the values and counts, which would save a bit of space.

Possibly at stages through the computation you could discard the top 'n' and bottom 'n' values, as long as you are sure that the median is not in that top or bottom range.
e.g. Let's say you are expecting 100,000 values. Every time your stored count reaches (say) 12,000 you could discard the highest 1,000 and lowest 1,000, dropping storage back to 10,000.

If the distribution of values is fairly consistent, this works well. However, if there is a possibility that you will receive a large number of very high or very low values near the end, that might distort your computation. Basically, if you discard a "high" value that is less than the (eventual) median, or a "low" value that is greater than or equal to the (eventual) median, then your calculation is off.

Update
Bit of an example
Let's say that the data set is the numbers 1,2,3,4,5,6,7,8,9.
By inspection the median is 5.

Let's say that the first 5 numbers you get are 1,3,5,7,9.
To save space we discard the highest and lowest, leaving 3,5,7
Now get two more, 2,6 so our storage is 2,3,5,6,7
Discard the highest and lowest, leaving 3,5,6
Get the last two 4,8 and we have 3,4,5,6,8
Median is still 5 and the world is a good place.

However, let's say that the first five numbers we get are 1,2,3,4,5
Discard top and bottom leaving 2,3,4
Get two more 6,7 and we have 2,3,4,6,7
Discard top and bottom leaving 3,4,6
Get last two 8,9 and we have 3,4,6,8,9
With a median of 6, which is incorrect.

If our numbers are well distributed, we can keep trimming the extremities. If they might be bunched in lots of large or lots of small numbers, then discarding is risky.
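A rough sketch of this trimming heuristic (`cap` and `trim` here echo the 12,000 / 1,000 figures above, scaled down in the tests; as the answer warns, the result can be wrong when extreme values arrive late):

```python
import bisect

def trimmed_median(stream, cap=12000, trim=1000):
    """Keep a sorted buffer; whenever it reaches `cap` values, drop
    the `trim` highest and `trim` lowest. Only a heuristic: if high
    or low values are bunched at the end, the answer can be wrong."""
    buf = []
    for x in stream:
        bisect.insort(buf, x)
        if len(buf) >= cap:
            buf = buf[trim:-trim]
    # Middle of what remains; correct only if trimming stayed
    # balanced around the true median.
    return buf[len(buf) // 2]
```

On the answer's two 9-value examples (with cap=5, trim=1), the first stream correctly yields 5 while the in-order stream yields the incorrect 6, exactly as worked through above.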

Michael J
1

I had the same problem and arrived at an approach that has not been posted here. Hopefully my answer can help someone in the future.

If you know your value range and don't care too much about the precision of the median, you can incrementally build a histogram of quantized values using constant memory. Then it is easy to find the median, or the value at any rank, to within your quantization error.

For example, suppose your data stream consists of image pixel values and you know these values are integers falling within 0~255. To build the image histogram incrementally, just create 256 counters (bins) initialized to zero and increment the bin corresponding to each pixel value while scanning through the input. Once the histogram is built, find the first bin whose cumulative count exceeds half of the data size to get the median.
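A minimal sketch of this pixel-histogram case (the function name is my own):

```python
def histogram_median(pixels):
    """Median of integer values in 0..255 using 256 constant-memory
    bins: count each value, then find the first bin whose cumulative
    count exceeds half the data size."""
    bins = [0] * 256
    n = 0
    for p in pixels:
        bins[p] += 1
        n += 1
    cumulative = 0
    for value, count in enumerate(bins):
        cumulative += count
        if cumulative > n / 2:
            return value
```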

For data that are real numbers, you can still compute a histogram, with each bin covering a quantized range (e.g. bins of width 10, 1, or 0.1, etc.), depending on your expected value range and the precision you want.
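A sketch of the quantized-bin idea for real-valued data (the range `[lo, hi)` and bin width are illustrative assumptions, not from the answer; the result is accurate to within half a bin width):

```python
def quantized_median(stream, lo, hi, width):
    """Constant-memory approximate median for real values assumed to
    lie in [lo, hi); each bin covers `width`, and the bin centre is
    returned, so the error is at most width/2 (plus any clamping)."""
    nbins = int((hi - lo) / width)
    bins = [0] * nbins
    n = 0
    for x in stream:
        # Clamp out-of-range values into the edge bins.
        i = min(nbins - 1, max(0, int((x - lo) / width)))
        bins[i] += 1
        n += 1
    cumulative = 0
    for i, count in enumerate(bins):
        cumulative += count
        if cumulative > n / 2:
            return lo + (i + 0.5) * width   # centre of the median bin
```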

If you don't know the value range of the entire data sample, you can still estimate a plausible range for the median and compute the histogram within that range. This naturally drops outliers, which is exactly what we want when computing the median.

Jimmy Chen