63

I want to understand the "median of medians" algorithm on the following example:

We have 45 distinct numbers divided into 9 groups of 5 elements each (one group per column):

48 43 38 33 28 23 18 13 8

49 44 39 34 29 24 19 14 9 

50 45 40 35 30 25 20 15 10

51 46 41 36 31 26 21 16 53

52 47 42 37 32 27 22 17 54
  1. The first step is to sort every group (in this case they are already sorted).
  2. The second step is to recursively find the "true" median of the medians (50 45 40 35 30 25 20 15 10), i.e. the set will be divided into 2 groups:

    50 25
    
    45 20 
    
    40 15
    
    35 10
    
    30
    

    Sorting these 2 groups gives:

    30 10
    
    35 15 
    
    40 20
    
    45 25
    
    50
    

The medians are 40 and 15 (when a group has an even number of elements, I take the left median), so the returned value is 15. However, the "true" median of the medians (50 45 40 35 30 25 20 15 10) is 30. Moreover, only 5 elements are less than 15, which is far fewer than the 30% of 45 mentioned on Wikipedia,

and so T(n) <= T(n/5) + T(7n/10) + O(n) fails.

By the way, in the Wikipedia example I get 36 as the result of the recursion. However, the true median is 47.

So I think that in some cases this recursion may not return the true median of medians. I want to understand where my mistake is.

crisron
simon
  • 3
    @kaoD: Site community policy, "Admit that the question is homework." See: http://meta.stackexchange.com/a/10812 – Orbling Feb 28 '12 at 20:28
  • 4
    @kaoD: Nothing essentially wrong with posting a homework question, but it affects how most members answer the question. So it should be stated as such, and what progress has been made shown. Answers are usually attempts to guide, rather than to solve. – Orbling Feb 28 '12 at 20:29
  • 19
    @Orbling is that relevant? Whatever the reason behind this question, smnvhn (as well as others) will be able to learn from a good answer. I think the question in itself already shows that smnvhn has put some thought into this. As such, if this turns out to be a homework assignment after all, the poster will learn more from any posted remarks or answers. – Joris Feb 28 '12 at 20:31
  • 7
    @Orbling no, it is not homework; I just came to this question while reading the book "Introduction to Algorithms" by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. – simon Feb 28 '12 at 20:32
  • 1
    @smnvhn: Because it looks like a question from a book, which it is, an interesting book too, you can understand why I might think it was homework. It is standard practice to ask, just for clarity in that case - no offence is meant, thank you for clarifying. :-) – Orbling Feb 28 '12 at 20:38

2 Answers

37

The problem is in the step where you say to find the true median of the medians. In your example, you had these medians:

50 45 40 35 30 25 20 15 10

The true median of this data set is 30, not 15. You don't find this median by splitting the medians into blocks of five and taking the median of those blocks' medians, but instead by recursively calling the selection algorithm on this smaller group. The error in your logic is assuming that the median of this group is found by splitting the above sequence into two blocks

50 45 40 35 30

and

25 20 15 10

then finding the median of each block. Instead, the median-of-medians algorithm recursively calls itself on the complete data set 50 45 40 35 30 25 20 15 10. Internally, this call splits the group into blocks of five and sorts them, etc., but it does so only to choose the pivot for its own partitioning step, and it is this partitioning step (plus the recursion that follows it) that lets the recursive call find the true median of the medians, which in this case is 30. If you use 30 as the pivot for the partitioning step in the original algorithm, you do indeed get a very good split as required.
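To make the difference concrete, here is a small Python check. It is only a sketch: `lower_median`, `shortcut` and the other names are made up for illustration, and "left median" is assumed to mean the lower of the two middle values. It compares the two-block shortcut from the question with the true median of the nine medians.

```python
# The 45 numbers from the question: 8, 9, 10 and 13..54.
data = [8, 9, 10] + list(range(13, 55))

# The nine group medians, in the order given in the question.
medians = [50, 45, 40, 35, 30, 25, 20, 15, 10]


def lower_median(values):
    """Middle element of the sorted values (the left one if the count is even)."""
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]


# Shortcut from the question: split into blocks of five, take each block's
# median, then take the median of those two values and stop there.
blocks = [medians[i:i + 5] for i in range(0, len(medians), 5)]
block_medians = [lower_median(block) for block in blocks]
shortcut = lower_median(block_medians)

# What the algorithm actually needs: the true median of all nine medians.
true_median = lower_median(medians)

# How many of the 45 elements are smaller than the shortcut value.
below_shortcut = sum(1 for value in data if value < shortcut)

print(block_medians, shortcut, true_median, below_shortcut)
# prints: [40, 15] 15 30 5
```

The shortcut value 15 is at best a pivot for the recursive call, not the value that call returns.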

Hope this helps!

templatetypedef
  • 12
    I couldn't understand from the part where you try to tell the difference between smnvhn's error and "internal split into blocks of five". How are they different? Could you continue on with smnvhn's example after you describe his error? What I understand is that after recursion on the new array, the array will again be divided in groups of five as smnvhn says and thus it would pass [40, 15] again in the next recursion, so then again 15 will be returned. –  Dec 08 '13 at 02:33
  • 4
    moreover in this example finding partition will not help, since the array is already sorted, and so whichever of the 9 elements you choose, your array will remain unchanged. –  Dec 08 '13 at 02:35
  • 1
    @templatetypedef, could you please elaborate on this. Looks like recursive approach is wrong, because it does exactly the same what author tried in the question. – Oleg Yaroshevych Jan 12 '15 at 15:56
  • @templatetypedef I accidentally down voted your answer. Stack Overflow is not allowing me to revert now with the message "Your vote is now locked in unless this answer is edited". Can you do some minor edit so that I can upvote? – sultan.of.swing Jan 22 '15 at 10:58
34

Here is the pseudocode for the median of medians algorithm (slightly modified to suit your example). The pseudocode on Wikipedia fails to portray the inner workings of the selectIdx function call.

I've added comments to the code for explanation.

// L is the array to search.
// k is the position (counted from 1, in sorted order) of the element wanted.
// To get the median, the first select call might look like:
// select (array, N/2), where 'array' is an array of numbers of length N

select(L,k)
{

    if (L has 5 or fewer elements) {
        sort L
        return the element in the kth position
    }

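    n = length(L)

    partition L into subsets S[i] of five elements each
        (there will be about n/5 subsets; the last one may be shorter).

    // x[i] is the median of subset S[i]: its 3rd smallest element, or its
    // largest element if a short last subset has fewer than 3 elements
    for (i = 1 to number of subsets) do
        x[i] = select(S[i], min(3, length(S[i])))

    // {x[i]} is the array of all subset medians. Its median M is found
    // by calling select itself, not by repeating only the grouping step
    M = select({x[i]}, n/10)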

    // The code to follow ensures that even if M turns out to be the
    // smallest/largest value in the array, we'll get the kth smallest
    // element in the array

    // Partition array into three groups based on their value as
    // compared to median M

    partition L into L1<M, L2=M, L3>M

    // Compare the expected median position k with length of first array L1
    // Run recursive select over the array L1 if k is less than length
    // of array L1
    if (k <= length(L1))
        return select(L1,k)

    // Check if k falls in L3 array. Recurse accordingly
    else if (k > length(L1)+length(L2))
        return select(L3,k-length(L1)-length(L2))

    // Simply return M since k falls in L2
    else return M

}
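For anyone who wants to run this, below is a minimal Python sketch of the same idea. It is not a line-by-line translation, and several details are my own assumptions: `k` is counted from 1, a short last group contributes its lower median, the pivot is requested at the middle position of the medians list (`(len(medians) + 1) // 2`), and the partition builds new lists instead of rearranging the array in place.

```python
def select(values, k):
    """Return the k-th smallest element of values (k is counted from 1)."""
    # Base case: small inputs are sorted directly.
    if len(values) <= 5:
        return sorted(values)[k - 1]

    # Split into groups of five (the last group may be shorter)
    # and take the lower median of each group.
    groups = [values[i:i + 5] for i in range(0, len(values), 5)]
    medians = [sorted(group)[(len(group) - 1) // 2] for group in groups]

    # Recursive call to select itself: the pivot is the true median
    # of the group medians.
    pivot = select(medians, (len(medians) + 1) // 2)

    # Three-way partition around the pivot.
    smaller = [v for v in values if v < pivot]
    equal = [v for v in values if v == pivot]
    larger = [v for v in values if v > pivot]

    if k <= len(smaller):
        return select(smaller, k)
    if k > len(smaller) + len(equal):
        return select(larger, k - len(smaller) - len(equal))
    return pivot


# The 45 numbers from the question, listed group by group.
data = [48, 49, 50, 51, 52, 43, 44, 45, 46, 47, 38, 39, 40, 41, 42,
        33, 34, 35, 36, 37, 28, 29, 30, 31, 32, 23, 24, 25, 26, 27,
        18, 19, 20, 21, 22, 13, 14, 15, 16, 17, 8, 9, 10, 53, 54]

print(select(data, 23))  # prints 32, the 23rd smallest of the 45 numbers
```

Because the pivot always comes from the list itself, `equal` is never empty, so both recursive calls receive strictly shorter lists and the recursion terminates.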

Taking your example:

The select function is first called on the entire array of 45 elements, with k = 23 (45/2 rounded up, since positions are counted from 1):

median = select({48 49 50 51 52 43 44 45 46 47 38 39 40 41 42 33 34 35 36 37 28 29 30 31 32 23 24 25 26 27 18 19 20 21 22 13 14 15 16 17 8 9 10 53 54}, 45/2)
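  1. The first time M = select({x[i]}, n/10) is called, array {x[i]} will contain the nine group medians: 50 45 40 35 30 25 20 15 10. In this call, n = 45, and n/10 = 4.5 is rounded up to 5 (the middle position of nine values), so the select function call will be M = select({50 45 40 35 30 25 20 15 10}, 5)

  2. Inside that call, the nine values are split into the groups {50 45 40 35 30} and {25 20 15 10}, and the 3rd smallest element of each is taken, so the second time M = select({x[i]}, n/10) is called, array {x[i]} will contain the following numbers: 40 20. In this call, n = 9, n/10 is rounded up to 1, and hence the call will be M = select({40 20}, 1). This select call has 5 or fewer elements, so it sorts them, returns the 1st one and assigns the value M = 20.

    Now, coming to the point where you had a doubt: 20 is only a pivot, not the value that gets returned. We now partition the array L around M = 20 with k = 5.

    Remember array L here is: 50 45 40 35 30 25 20 15 10.

    The array will be partitioned into L1, L2 and L3 according to the rules L1 < M, L2 = M and L3 > M. Hence:
    L1: 10 15
    L2: 20
    L3: 25 30 35 40 45 50

    Since k = 5, it's greater than length(L1) + length(L2) = 3. Hence, the search will be continued with the following recursive call now:
    return select(L3,k-length(L1)-length(L2))

    which translates to:
    return select({25 30 35 40 45 50}, 2)

  3. L3 has six elements, so this call goes through the same steps once more. The groups are {25 30 35 40 45} and {50}, which contribute 35 and 50 (a single element is its own median). In this call, n = 6, n/10 is rounded up to 1, and M = select({35 50}, 1) = 35. Partitioning around M = 35 gives:
    L1: 25 30
    L2: 35
    L3: 40 45 50

    Since k = 2 is not greater than length(L1) = 2, the call becomes:
    return select({25 30}, 2)

    which will return 30 as a result (since this array has 5 or fewer elements, it'll return the element in the kth, i.e. 2nd, position in the sorted array, which is 30).

Now, M = 30 will be received in the first select function call over the entire array of 45 elements, and the same partitioning logic, which separates the array L around M = 30, will apply to finally get the median of the whole array.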
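As a sanity check on the split that M = 30 produces, here is a short Python sketch (variable names are mine, and the 45 numbers are reconstructed from the question as 8, 9, 10 and 13 to 54). It partitions the full array around 30 and compares the larger side with the 7n/10 term from the recurrence in the question.

```python
# The 45 numbers from the question: 8, 9, 10 and 13..54.
data = [8, 9, 10] + list(range(13, 55))
pivot = 30

# Three-way partition around the pivot, as in the pseudocode.
smaller = [v for v in data if v < pivot]
equal = [v for v in data if v == pivot]
larger = [v for v in data if v > pivot]

# The recurrence T(n) <= T(n/5) + T(7n/10) + O(n) needs the side we
# recurse into to hold at most about 7n/10 elements.
worst_side = max(len(smaller), len(larger))
limit = 7 * len(data) / 10

print(len(smaller), len(equal), len(larger))  # prints: 20 1 24
print(worst_side <= limit)                    # prints: True
```

So with the true median of medians as the pivot, neither side comes close to the 5 versus 39 split that a pivot of 15 would give.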

Phew! I hope I was verbose and clear enough to explain median of medians algorithm.

sultan.of.swing
  • 2
    I think this answer deserves to at least go up by votes. – Milad Naseri Jul 13 '15 at 02:53
  • 1
    I looked for a median of median calculation and found this thread. I tried to rebuild the pseudocode in java, but i get an exception because of the array length in the second call of select... Can someone explain what the x[i] and the {x[i]} means? and which size it should have? Thank you! – D. Müller Sep 10 '15 at 22:00
  • 2
    Downvoted as the variables are all one letter, thus making the code much more difficult to follow. – Rick Mac Gillis Oct 17 '15 at 20:59
  • 3
    @RickMacGillis I would consider single letter variables a good thing here. Sure, the comment at the start should explain variables `M` and `x` but otherwise this is similar to any math book. You never see `1 + an_unknown_value = 3` instead of `1 + x = 3` in any math book either. – Mikko Rantalainen Sep 22 '16 at 05:00
  • 1
    can't believe this algo is O(n)! – Varun Garg Oct 15 '16 at 10:55
  • If we are writing such massive code for finding the median, we should also sort those 5 numbers with the minimum number of comparisons: https://stackoverflow.com/questions/1534748/design-an-efficient-algorithm-to-sort-5-distinct-keys-in-fewer-than-8-comparison – Varun Garg Oct 15 '16 at 11:06