6

E.g. given an unordered list of N elements, find the medians for sub ranges 0..100, 25..200, 400..1000, 10..500, ... I don't see any better way than going through each sub range and running a standard median-finding algorithm.

A simple example: [5 3 6 2 4]. The median for 0..3 is 5 (not 4, since we are asking for the median of the first three elements of the original list).
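
For reference, the baseline described above (handle each sub range independently) might look like the following minimal sketch. The function name `median_of_subrange`, the half-open `start..stop` convention, and the choice of the lower median for even-length windows are assumptions made for illustration, not part of the original question.

```python
def median_of_subrange(values, start, stop):
    """Median of values[start:stop] (half-open), leaving `values` untouched."""
    window = sorted(values[start:stop])  # copy the window, then sort only the copy
    if not window:
        raise ValueError("empty sub range")
    return window[(len(window) - 1) // 2]  # lower median for even lengths


data = [5, 3, 6, 2, 4]
print(median_of_subrange(data, 0, 3))  # 5
```

Each query costs O(m log m) for a window of m elements, and overlapping windows share no work, which is exactly the duplication the question asks about.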

dabei
  • 1
    If the list is sorted, then just get element number 50, 112, 700, etc? – Blorgbeard Aug 12 '13 at 03:43
  • 2
    Use a selection algorithm (http://en.wikipedia.org/wiki/Selection_algorithm)... there are several from which to select. – andand Aug 12 '13 at 03:50
  • 1
    The list is not sorted. And I'm mostly interested in avoiding duplicate work in find medians in overlapping sub ranges. – dabei Aug 12 '13 at 03:52
  • @Blorgbeard Nothing says that all possible values in the given range are elements. The median of `0,1,2,100` is not 50. – Bernhard Barker Aug 12 '13 at 08:42
  • @Dukeling 50 is not even in that list.. I said get the element in the middle of the range. – Blorgbeard Aug 12 '13 at 10:31
  • So it seems there are 4 possible interpretations (take a list `5,2,7,3,6` with range `[5,7]` as example) - **(1)** A trivially simple problem of finding the middle value based on index. The result here will be `5,2,7` -> `2`. **(2)** Same as 1., except that we sort the numbers first (@Blorgbeard is this what you meant?), so `5,2,7` -> `2,5,7` -> `5`. **(3)** Extract all values in that range, return the middle one, which would be `5,7,6` -> `7` in the example. **(4)** Same as 3., except sorted first, which would be `5,7,6` -> `5,6,7` -> `6` in the example (classic median definition, my answer). – Bernhard Barker Aug 12 '13 at 11:07
  • 1
    @Dukeling yes, **(2)** is what I meant: median = middle element of sorted list. Your answer seems to be calculating the mean of the first and last elements in the range? – Blorgbeard Aug 12 '13 at 21:32
  • @dabei By `0..3` you mean `0..2`, right? :) – Blorgbeard Aug 13 '13 at 05:12
  • @Blorgbeard My answer calculates **(4)**, which is still a median, just making different assumptions about the elements that must be used. – Bernhard Barker Aug 13 '13 at 08:12

4 Answers

2

INTEGER ELEMENTS:

If your elements are integers, then the best way is to keep a bucket for each number that lies in any of your sub-ranges, where each bucket counts how many times its associated integer occurs in your input (for example, bucket[100] stores how many 100s there are in your input sequence). You can achieve this in the following steps:

  1. Create a bucket for each number that lies in any of your sub-ranges.
  2. Iterate through all elements; for each number n, if bucket[n] exists, then bucket[n]++.
  3. Compute the medians based on the aggregated counts stored in your buckets.

Put another way, suppose you have a sub-range [0, 10] and would like to compute its median. The bucket approach counts how many 0s, how many 1s, and so on, appear in your input. Suppose n numbers lie in the range [0, 10]; then the median is the (n/2)th smallest element, which can be identified by finding the i such that bucket[0] + bucket[1] + ... + bucket[i] is greater than or equal to n/2 but bucket[0] + ... + bucket[i - 1] is less than n/2.
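
The counting steps above can be sketched roughly as follows. This assumes the sub-range is a range of values `[lo, hi]` (the interpretation this answer uses), that the elements are integers, and that the lower median is wanted when the count is even; the name `bucket_median` is hypothetical.

```python
def bucket_median(values, lo, hi):
    """Lower median of the integers in `values` whose value lies in [lo, hi]."""
    buckets = [0] * (hi - lo + 1)       # buckets[i] counts occurrences of lo + i
    for v in values:
        if lo <= v <= hi:
            buckets[v - lo] += 1

    n = sum(buckets)
    if n == 0:
        return None                     # no element falls inside the range

    target = (n + 1) // 2               # 1-based rank of the lower median
    running = 0
    for offset, count in enumerate(buckets):
        running += count                # prefix sum bucket[0] + ... + bucket[i]
        if running >= target:
            return lo + offset


print(bucket_median([5, 2, 7, 3, 6], 5, 7))  # 6
```

Memory grows with the width of the value range rather than with N, so this only pays off when `hi - lo` is modest.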

The nice thing about this is that even if your input elements are stored on multiple machines (i.e., the distributed case), each machine can maintain its own buckets, and only the aggregated counts need to be passed over the network.
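
As a rough illustration of the distributed case, each machine could build its own counts and only those counts would be merged. In this sketch the "machines" are simulated by plain lists, and the helper names `local_counts` and `median_from_counts` are made up for the example.

```python
from collections import Counter


def local_counts(chunk, lo, hi):
    """Counts a single machine would compute for its own share of the data."""
    return Counter(v for v in chunk if lo <= v <= hi)


def median_from_counts(counts):
    """Lower median recovered from merged value -> count buckets."""
    n = sum(counts.values())
    if n == 0:
        return None
    target = (n + 1) // 2               # 1-based rank of the lower median
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if running >= target:
            return value


chunks = [[5, 2], [7, 3, 6]]            # stand-ins for two machines
merged = sum((local_counts(c, 5, 7) for c in chunks), Counter())
print(median_from_counts(merged))       # 6
```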

You can also use hierarchical buckets, which involves multiple passes. In each pass, bucket[i] counts the number of input elements that lie in a specific range (for example, [i * 2^K, (i+1) * 2^K)). You then narrow down the problem space by identifying which bucket the median lies in, decrease K by 1 in the next pass, and repeat until you can identify the median exactly.
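
One simplified way to picture the multi-pass idea is to use just two buckets per pass, halving the candidate value range each time. This is an assumption-laden sketch rather than the exact scheme described above: the function name `kth_smallest_by_passes` is hypothetical, the elements must be integers, and `k` must be a valid 0-based rank among the elements inside `[lo, hi]`.

```python
def kth_smallest_by_passes(values, k, lo, hi):
    """k-th smallest (0-based) among integers v in `values` with lo <= v <= hi.

    Every loop iteration is one full pass over the input that counts how many
    elements fall in the lower half of the remaining candidate range.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        lower = sum(1 for v in values if lo <= v <= mid)  # one counting pass
        if k < lower:
            hi = mid            # the answer lies in the lower half
        else:
            k -= lower          # skip everything counted in the lower half
            lo = mid + 1
    return lo


data = [5, 2, 7, 3, 6]
print(kth_smallest_by_passes(data, 2, 0, 7))  # 5
```

The number of passes is about log2(hi - lo), and only a couple of counters are kept in memory at any time.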


FLOATING-POINT ELEMENTS

All elements fit into memory:

If all your elements fit into memory, the best option is to first sort the N elements and then find the median of each sub-range. The linear-time heap solution also works well in this case if the number of your sub-ranges is less than log N.

The elements do not fit into memory but are stored on a single machine:

An external sort typically requires three disk scans. Therefore, if the number of your sub-ranges is greater than or equal to 3, the best choice is to first sort the N elements and then find the median of each sub-range by loading only the necessary elements from disk. Otherwise, it is better to simply perform one scan per sub-range and pick up the elements that fall in it.

The elements are stored on multiple machines: Finding the median is a holistic operation, meaning you cannot derive the final median of the entire input from the medians of its parts. It is a hard problem whose solution cannot be described in a few sentences, but there is research (see this as an example) that focuses on it.

keelar
  • How do you calculate a median with buckets? Sounds like you're talking about mode rather than median.. – Blorgbeard Aug 12 '13 at 10:43
  • Since `bucket[i]` stores the number of elements equal to `i`, you're able to effectively compute the `N/2th` largest element based on that. Note that this only works for integers. – keelar Aug 12 '13 at 18:22
  • Thanks for the nice write up, but I don't think it would work. I've added an example in the question to clarify. I'm trying to find the median of sub ranges of original list. So any solution involves rearranging the whole list wouldn't work. – dabei Aug 13 '13 at 00:43
0

I think that as the number of sub ranges increases you will very quickly find that it is quicker to sort and then retrieve the element numbers you want.

In practice, because there will be highly optimized sort routines you can call.

In theory, and perhaps in practice too, because you are dealing with integers and so need not pay n log n for a sort - see http://en.wikipedia.org/wiki/Integer_sorting.

If your data are in fact floating point and not NaNs, then a little bit-twiddling will in fact allow you to use integer sort on them. From http://en.wikipedia.org/wiki/IEEE_754-1985#Comparing_floating-point_numbers: "The binary representation has the special property that, excluding NaNs, any two numbers can be compared like sign and magnitude integers (although with modern computer processors this is no longer directly applicable): if the sign bit is different, the negative number precedes the positive number (except that negative zero and positive zero should be considered equal), otherwise, relative order is the same as lexicographical order but inverted for two negative numbers; endianness issues apply."

So you could check for NaNs and other funnies, pretend the floating point numbers are sign + magnitude integers, subtract when negative to correct the ordering for negative numbers, and then treat as normal 2s complement signed integers, sort, and then reverse the process.
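
A minimal sketch of this kind of key transformation is shown below. Note that it uses a common unsigned variant (invert all bits of negative values, set the top bit of non-negative ones) rather than the exact subtract-and-treat-as-signed recipe described above; the function name `float_to_sortable_int` is hypothetical, and 64-bit IEEE 754 doubles are assumed.

```python
import struct


def float_to_sortable_int(x):
    """Map a non-NaN double to an unsigned 64-bit key with the same ordering."""
    if x != x:
        raise ValueError("NaN has no position in the ordering")
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    if bits >> 63:
        # Negative: invert every bit so that larger magnitudes sort first.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Non-negative: set the top bit so these sort after all negatives.
    return bits | 0x8000000000000000


data = [3.5, -1.0, 0.0, -2.25, 1e300, -1e-300, float("inf"), float("-inf")]
print(sorted(data, key=float_to_sortable_int) == sorted(data))  # True
```

With this mapping `-0.0` gets a key one below `+0.0` instead of comparing equal, which is one of the "funnies" that would need separate handling if it matters.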

mcdowella
0

My idea:

  • Sort the list into an array (using any appropriate sorting algorithm)

  • For each range, find the indices of the start and end of the range using binary search

  • Find the median by simply adding their indices and dividing by 2 (i.e. median of range [x,y] is arr[(x+y)/2])

Preprocessing time: O(n log n) for a generic sorting algorithm (like quick-sort) or the running time of the chosen sorting routine

Time per query: O(log n)
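
A small sketch of this idea, under the assumption that a range `[x, y]` refers to element values (interpretation **(4)** from the comments on the question) and that the lower middle element is taken when the count is even. The name `make_value_range_median` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def make_value_range_median(values):
    """Sort once, then answer value-range median queries in O(log n) each."""
    arr = sorted(values)                    # O(n log n) preprocessing

    def query(x, y):
        first = bisect_left(arr, x)         # index of first element >= x
        last = bisect_right(arr, y) - 1     # index of last element <= y
        if first > last:
            return None                     # no element has a value in [x, y]
        return arr[(first + last) // 2]

    return query


median_in = make_value_range_median([5, 2, 7, 3, 6])
print(median_in(5, 7))  # 6
```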

Dynamic list:

The above assumes that the list is static. If elements can freely be added or removed between queries, a modified Binary Search Tree could work, with each node keeping a count of the number of descendants it has. This will allow the same running time as above with a dynamic list.

Bernhard Barker
  • I don't think this works.. the median of the numbers in the sorted list between elements x and y is arr[(x+y)/2], but why does that map to the unsorted list just because the start and end elements were mapped? E.g. `5,2,7,3,6`. Sorted: `2,3,5,6,7`. Let's find the median of the first 3 (i.e. `5,2,7` - answer is 5). In the sorted list, the range between `5` and `7` is `5,6,7`, so your algorithm would say the answer is `6`? – Blorgbeard Aug 12 '13 at 10:41
  • It seems there's a discrepancy between our understanding of the question. Will attempt to clarify. – Bernhard Barker Aug 12 '13 at 10:53
0

The answer is ultimately going to be "it depends". There are a variety of approaches, any one of which will probably be suitable under most of the cases you may encounter. The problem is that each is going to perform differently for different inputs. Where one may perform better for one class of inputs, another will perform better for a different class of inputs.

As an example, the approach of sorting and then performing a binary search on the extremes of your ranges and then directly computing the median will be useful when the number of ranges you have to test is greater than log(N). On the other hand, if the number of ranges is smaller than log(N) it may be better to move elements of a given range to the beginning of the array and use a linear time selection algorithm to find the median.
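
For the few-ranges case, a selection routine applied to a copy of each positional sub range could look like this sketch. It uses randomized quickselect, which runs in expected (not worst-case) linear time, and works on a copy instead of moving elements within the original array; the names `quickselect` and `subrange_median` are hypothetical, and the window is assumed to be non-empty.

```python
import random


def quickselect(items, k):
    """k-th smallest (0-based) of a non-empty sequence, expected linear time."""
    items = list(items)                     # work on a copy
    while True:
        pivot = random.choice(items)
        smaller = [v for v in items if v < pivot]
        equal = [v for v in items if v == pivot]
        if k < len(smaller):
            items = smaller                 # answer is among the smaller values
        elif k < len(smaller) + len(equal):
            return pivot                    # the pivot itself has rank k
        else:
            k -= len(smaller) + len(equal)
            items = [v for v in items if v > pivot]


def subrange_median(values, start, stop):
    """Lower median of values[start:stop] without sorting the whole list."""
    window = values[start:stop]
    return quickselect(window, (len(window) - 1) // 2)


print(subrange_median([5, 3, 6, 2, 4], 0, 3))  # 5
```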

All of this boils down to profiling to avoid premature optimization. If the approach you implement turns out to not be a bottleneck for your system's performance, figuring out how to improve it isn't going to be a useful exercise relative to streamlining those portions of your program which are bottlenecks.

andand