0

Not really a coding question, more of a how can I do this question, so no code snippet.

In my database, imagine a long list of unsorted numbers.

nums = [9, 12, 15, 18, 22, 100, 1 , 4, 3, 2]
This gives me a median of 10.5

But now imagine my list is much longer, [9, 12, 15, 18, 22, 100, 1 , 4, 3, 2, ......] And every day, I introduce a new number to this list x. The list is stored in a database and I want to avoid hitting the database to get all of this data and then calculating the median.

Are there any tricks where I don't have to call all of the data every day to calculate the median for today after a new number is introduced?

Thanks for any ideas!

WhitneyChia
  • 638
  • 1
  • 8
  • 23
  • Possible duplicate of [Find running median from a stream of integers](https://stackoverflow.com/questions/10657503/find-running-median-from-a-stream-of-integers) – High Performance Mark Oct 09 '17 at 15:25

1 Answers1

0

You don't need all individual values for calculating a median. If you have an initial guess for an interval where the median should lie (e.g. between 5 and 20), you can split the values:

  • LOW: count the values below the interval (x <= 5), giving a count of 4.
  • CENTER: query the values within the interval (5 < x < 20), giving 9, 12, 15, 18.
  • HIGH: count the values above the interval (x >= 20), giving a count of 2.

As the LOW count is two more that the HIGH count, delete the two highest values from CENTER, and compute the median of the remaining values.

If the count difference doesn't leave any numbers in CENTER, you have to change the interval and retry.

With proper indexing of the database column, the three queries should be quite fast, and the resulting amount of data should not create too much traffic between database and client software.

A variant not needing an initial guess could be to count the values by bins of e.g. 5 (trunc(x/5)), giving:

  • 0...4: count=4
  • 5...9: count=1
  • 10...14: count=1
  • 15...19: count=2
  • 20...24: count=1
  • 100...104: count=1

If the median count is reached within a bin, you query the numbers from that bin and compute their median. But in our example it's just between the 5...9 and the 10...14 bin, so both bins must be queried (5 <= x <= 14) and the median taken from the (two) resulting values 9 and 12, giving 10.5.

Ralf Kleberhoff
  • 5,711
  • 1
  • 8
  • 6