
Is there a fast algorithm that runs on the MapReduce framework to find the median of a huge set of integers?

user882107
  • Possible duplicate of [Computing median in map reduce](http://stackoverflow.com/questions/10109514/computing-median-in-map-reduce) – Jakub Kukul Jul 22 '16 at 13:01

1 Answer


Here's how I would do it. This is a sort of parallel version of sequential quickselect. (Some map/reduce tools might not let you do things quite as easily...)

Pick a small, arbitrary chunk of the input set. Sort this sequentially. We're going to use these as a whole bunch of pivots, in parallel. Call this array pivots, and let its size be k.
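
For concreteness, the pivot step might look something like this in plain Python (just a local sketch; the choice of k and the use of random.sample are illustrative, and a real job would sample in a distributed fashion):

```python
import random

def choose_pivots(data, k):
    # Sample k values from the input and sort them to serve as the pivots.
    return sorted(random.sample(data, k))
```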

Perform a map/reduce as follows: for each value x in the input set, binary-search to find x's position relative to pivots; call this position bucket(x). This is an integer between 0 and k. The reduce step is to count the number of elements in each bucket; define bucket[b] to be the number of x with bucket(x) = b.
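
Simulating the framework locally in plain Python, the map and reduce steps might look roughly like this (the helper names are mine; bucket indices run from 0 to k):

```python
from bisect import bisect_right
from collections import Counter

def bucket_of(x, pivots):
    # Map step: binary-search x against the sorted pivots.
    # The result is an integer in 0..k, where k = len(pivots).
    return bisect_right(pivots, x)

def count_buckets(data, pivots):
    # Reduce step: count how many values land in each bucket.
    return Counter(bucket_of(x, pivots) for x in data)
```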

The median must be in the "median bucket." Pick out all the values in that median bucket, and use a traditional sequential selection algorithm to find the element with the correct index.
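
Reusing the helpers above, the whole pipeline can be sketched sequentially like so (this assumes an odd-length input so the median is a single element, and it sorts the median bucket for simplicity, where any selection algorithm would do):

```python
def median_via_buckets(data, pivots):
    counts = count_buckets(data, pivots)
    target = len(data) // 2                  # 0-based rank of the median

    # Walk the bucket counts to find which bucket contains that rank.
    seen = 0
    for b in range(len(pivots) + 1):
        if seen + counts[b] > target:
            median_bucket, rank_in_bucket = b, target - seen
            break
        seen += counts[b]

    # Keep only the median bucket and select sequentially within it.
    values = sorted(x for x in data if bucket_of(x, pivots) == median_bucket)
    return values[rank_in_bucket]

# Example: median_via_buckets([7, 1, 9, 4, 3, 8, 5], choose_pivots([7, 1, 9, 4, 3, 8, 5], 2)) == 5
```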

Louis Wasserman
  • Assuming I understand this algorithm correctly, there are several issues with it. First of all, you should pick a statistically significant number of pivots. Second, there is no guarantee that the median will be in the "median bucket": as an example, if you have 3 reducers and you choose your pivots to be, after sorting, the first and second numbers, all your numbers will end up in the third reducer, and clearly not in the median reducer, which will have nothing in it. Thirdly, if you send everything in bucket i to reducer i, there is no guarantee that it will fit in memory. – delmet Jun 11 '12 at 01:09
  • And by "you should pick statistically significant number of pivots", I meant you should pick your pivots from a much larger number of sample than the number of reducers, even if you decide to have a few pivots. This might have been implicit in your algorithm. But the number of pivots should not be a function of the number of reducers, but a function of your memory and input size. I suppose by "median bucket", you might have meant the bucket the median is in, which you figure out from the bucket counts. Even then the 3rd objection remains. – delmet Jun 11 '12 at 01:42
  • 1
    "I suppose by "median bucket", you might have meant the bucket the median is in, which you figure out from the bucket counts" Correct, obviously; why else would we need the count step? "But the number of pivots should not be a function of the number of reducers, but a function of your memory and input size." The number of pivots is totally flexible here; I didn't mean to imply otherwise. – Louis Wasserman Jun 11 '12 at 02:36
  • "Pick a small, arbitrary chunk of the input set, around the same size as the number of available machines...We're going to use these as a whole bunch of pivots" This directly says otherwise. At any rate, how does your "median bucket" know it is the median bucket? – delmet Jun 11 '12 at 04:34
  • ...After we get the bucket counts, we can figure that out, no? – Louis Wasserman Jun 11 '12 at 15:00
  • Actually, that is the main complication in this algorithm. The simplest way is to label the buckets from 1 to k, whereby each bucket prints out its label, count, and values. Then a simple step figures out the counts and the relevant bucket. After that you filter out everything but the median bucket, and find the median. The problem is that sorting is easy in MR, but labeling is hard (or not so easy). At any rate, all this discussion added value to the answer. – delmet Jun 11 '12 at 18:57