
I've searched around the web and visited the wiki page for the median-of-medians algorithm, but I can't seem to find an explicit answer to my question:

If one has a very large list of integers (TBs in size) and wants to find its median in a distributed manner, would breaking the list into sublists (of varying or equal sizes; it doesn't really matter), computing the medians of those smaller sublists, and then computing the median of those medians yield the median of the original large list?

Furthermore, does the same hold for any of the k-th order statistics? I'd be interested in links to research etc. in this area.

Jared Krumsie
    why was this question downvoted? – Jared Krumsie Dec 12 '11 at 02:46
    This question would have been perfect for the upcoming [Computer Science Stack Exchange](http://area51.stackexchange.com/proposals/35636/computer-science-non-programming?referrer=pdx8p7tVWqozXN85c5ibxQ2). So, if you like to have a place for questions like this one, please go ahead and help this proposal to take off! – Raphael Dec 12 '11 at 11:46

2 Answers


The answer to your question is no.

If you want to understand how to actually select the k-th order statistic (including, of course, the median) in a parallel setting (a distributed setting is not really different), take a look at this recent paper, in which I proposed a new algorithm improving on the previous state-of-the-art algorithm for parallel selection:

Deterministic parallel selection algorithms on coarse-grained multicomputers

Here, we use two weighted 3-medians as pivots and partition around them using five-way partitioning. We also implemented and tested the algorithm using MPI. The results are very good, taking into account that this is a deterministic algorithm exploiting the worst-case O(n) selection algorithm. Using the randomized O(n) QuickSelect algorithm instead yields an extremely fast parallel algorithm.
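This is not the paper's distributed algorithm, but a minimal single-process sketch of the five-way partitioning idea: split around two pivots and recurse only into the group containing rank k. The pivot choice here is random for brevity, not the paper's weighted 3-medians.

```python
import random

def five_way_partition(xs, p1, p2):
    """Split xs into five groups relative to pivots p1 <= p2."""
    lt, e1, mid, e2, gt = [], [], [], [], []
    for x in xs:
        if x < p1:
            lt.append(x)
        elif x == p1:
            e1.append(x)
        elif x < p2:
            mid.append(x)
        elif x == p2:
            e2.append(x)
        else:
            gt.append(x)
    return lt, e1, mid, e2, gt

def select(xs, k):
    """Return the k-th smallest element of xs (0-based rank k)."""
    if len(xs) == 1:
        return xs[0]
    # Random pivots for illustration; the paper derives them deterministically.
    p1, p2 = sorted(random.sample(xs, 2))
    lt, e1, mid, e2, gt = five_way_partition(xs, p1, p2)
    for group, pivot in ((lt, None), (e1, p1), (mid, None), (e2, p2), (gt, None)):
        if k < len(group):
            # Groups of pivot-equal elements resolve immediately.
            return pivot if pivot is not None else select(group, k)
        k -= len(group)
```

In the distributed version, each node would partition its local chunk and the nodes would exchange only the five group sizes to decide which group rank k falls into.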

Massimo Cafaro

> If one has a very large list of integers (TBs in size) and wants to find its median in a distributed manner, would breaking the list into sublists (of varying or equal sizes; it doesn't really matter), computing the medians of those smaller sublists, and then computing the median of those medians yield the median of the original large list?

No. The actual median of the entire list is not necessarily a median of any of the sublists.
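A small counterexample (the data and chunk boundaries here are made up, chosen only to show the failure):

```python
import statistics

data = [1, 2, 6, 3, 4, 5, 7, 8, 9]
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]  # [1,2,6], [3,4,5], [7,8,9]
sub_medians = [statistics.median(c) for c in chunks]      # [2, 4, 8]

mom = statistics.median(sub_medians)    # median of medians: 4
true_median = statistics.median(data)   # actual median of 1..9: 5
```

The median of medians is 4, but the true median of the full list is 5: the element 6 "hides" in the first chunk and never influences any sub-median.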

Median-of-medians can give you a good choice of pivot for quickselect, since it is guaranteed to be nearer the actual median than a randomly selected element, but you would still have to run the rest of the quickselect algorithm to locate the actual median of the larger list.
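A minimal sketch of that combination, the textbook median-of-medians (BFPRT) pivot rule inside quickselect (not production code, and not distributed):

```python
def mom_select(xs, k):
    """Return the k-th smallest element of xs (0-based), using the
    median-of-medians pivot rule to guarantee O(n) worst case."""
    if len(xs) <= 5:
        return sorted(xs)[k]
    # Median of each group of 5; the median of these is guaranteed
    # to land near the middle of xs, so partitions can't be too lopsided.
    groups = [sorted(xs[i:i + 5]) for i in range(0, len(xs), 5)]
    medians = [g[len(g) // 2] for g in groups]
    pivot = mom_select(medians, len(medians) // 2)
    lo = [x for x in xs if x < pivot]
    eq = [x for x in xs if x == pivot]
    hi = [x for x in xs if x > pivot]
    if k < len(lo):
        return mom_select(lo, k)
    if k < len(lo) + len(eq):
        return pivot
    return mom_select(hi, k - len(lo) - len(eq))
```

Note that the recursion after partitioning is exactly the "rest of the quickselect" step: the median-of-medians pivot alone is only near the true median, not equal to it.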

Don Roby
  • So the quickselect part has to be the computation run on each node; my stumbling block is how one should go about merging the results, if that is at all possible. – Jared Krumsie Dec 12 '11 at 19:37