
There is something I don't understand about the median-of-medians algorithm. One key step of this algorithm is to find an approximate median, and according to Wikipedia, this approximate median is guaranteed to be greater than 30% of the elements of the initial set.

To find this approximate median, we compute the median of each group of 5 elements, gather these medians in a new set, and repeat until the resulting set has at most 5 elements. At that point, we take the median of that set. (See the Wikipedia page if my explanation is not clear.)

But consider the following set of 125 elements:

1 2 3 1001 1002
4 5 6 1003 1004
7 8 9 1005 1006
1020 1021 1022 1023 1034 
1025 1026 1027 1028 1035 

10 11 12 1007 1008
13 14 15 1009 1010
16 17 18 1011 1013
1029 1030 1031 1032 1033 
1036 1037 1038 1039 1040 

19 20 21 1014 1015
22 23 24 1016 1017
25 26 27 1018 1019
1041 1042 1043 1044 1045
1046 1047 1048 1049 1050

1051 1052 1053 1054 1055
1056 1057 1058 1059 1060
1061 1062 1063 1064 1065
1066 1067 1068 1069 1070
1071 1072 1073 1074 1075

1076 1077 1078 1079 1080
1081 1082 1083 1084 1085
1086 1087 1088 1089 1090
1091 1092 1093 1094 1095
1096 1097 1098 1099 1100 

So we divide the set into groups of 5 elements, compute and gather the medians, and obtain the following set:

3 6 9 1022 1027
12 15 18 1031 1038
21 24 27 1043 1048
1053 1058 1063 1068 1073
1078 1083 1088 1093 1098

We apply the same procedure again, and we obtain the following set:

9 18 27 1063 1088

So we obtain that the approximate median is 27. But this number is greater than or equal to only 27 elements, and 27/125 = 21.6% < 30%!

So my question is: where am I wrong? Why, in my case, is the approximate median not greater than 30% of the elements?

Thank you for your replies!!

Truc Truca

2 Answers


The cause of your confusion about the median-of-medians algorithm is that, while median-of-medians returns an approximate result within 20% of the actual median, at some stages in the algorithm we also need to calculate exact medians. If you mix up the two, you will not get the expected result, as demonstrated in your example.

Median-of-medians uses three functions as its building blocks:

medianOfFive(array, first, last) {
    // ...
    return median;
}

This function returns the exact median of five (or fewer) elements from (part of) an array. There are several ways to code this, based on e.g. a sorting network or insertion sort. The details are not important for this question, but it is important to note that this function returns the exact median, not an approximation.
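
As a minimal sketch of one way to do this (the Python name `median_of_five` and the choice of the lower median for an even number of values are assumptions made for illustration, not part of the original answer), an insertion sort over at most five values is enough:

```python
def median_of_five(values):
    """Return the exact (lower) median of a list of one to five values."""
    if not 1 <= len(values) <= 5:
        raise ValueError("expected between 1 and 5 values")
    ordered = list(values)  # work on a copy, leave the input untouched
    # Insertion sort: cheap for such a tiny input.
    for i in range(1, len(ordered)):
        current = ordered[i]
        j = i - 1
        while j >= 0 and ordered[j] > current:
            ordered[j + 1] = ordered[j]
            j -= 1
        ordered[j + 1] = current
    # Middle element; for 2 or 4 values this picks the lower of the two middles.
    return ordered[(len(ordered) - 1) // 2]


print(median_of_five([1025, 1026, 1027, 1028, 1035]))  # 1027
```

Picking the lower median for two or four values matches the pivots chosen in the worked example further down (1048 out of 1048, 1053 and 1043 out of 1038, 1043, 1048, 1053).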

medianOfMedians(array, first, last) {
    // ...
    return median;
}

This function returns an approximation of the median from (part of) an array, which is guaranteed to be larger than the smallest 30% of the elements and smaller than the largest 30%. We'll go into more detail below.

select(array, first, last, n) {
    // ...
    return element;
}

This function returns the n-th smallest element from (part of) an array. This function too returns an exact result, not an approximation.

At its most basic, the overall algorithm works like this:

medianOfMedians(array, first, last) {
    call medianOfFive() for every group of five elements
    fill an array with these medians
    call select() for this array to find the middle element
    return this middle element (i.e. the median of medians)
}

So this is where your calculation went wrong. After creating an array with the median-of-fives, you then used the median-of-medians function again on this array, which gives you an approximation of the median (27), but here you need the actual median (1038).
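
To make the difference concrete, here is a small sketch that applies both procedures to the 25 medians of five from the question. The function name `repeated_group_medians` is made up for this illustration, and `sorted()` stands in for the exact selection step:

```python
medians_of_five = [
    3, 6, 9, 1022, 1027, 12, 15, 18, 1031, 1038, 21, 24, 27, 1043,
    1048, 1053, 1058, 1063, 1068, 1073, 1078, 1083, 1088, 1093, 1098,
]


def repeated_group_medians(values):
    """The question's procedure: regroup by five and take medians until one value is left."""
    while len(values) > 1:
        groups = [sorted(values[i:i + 5]) for i in range(0, len(values), 5)]
        values = [group[(len(group) - 1) // 2] for group in groups]
    return values[0]


approximate = repeated_group_medians(medians_of_five)
# Exact median of the 25 medians (the 13th smallest), found here by sorting.
exact = sorted(medians_of_five)[len(medians_of_five) // 2]

print(approximate, exact)  # 27 1038
```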

This all sounds fairly straightforward, but where it becomes complicated is that the function select() calls medianOfMedians() to get a first estimate of the median, which it then uses to calculate the exact median, so you get a two-way recursion where two functions call each other. This recursion stops when medianOfMedians() is called for 25 elements or fewer, because then there are only 5 medians, and instead of using select() to find their median, it can use medianOfFive().

The reason why select() calls medianOfMedians() is that it uses partitioning to split (part of) the array into two parts of close to equal size, and it needs a good pivot value to do that. After it has partitioned the array into two parts with the elements which are smaller and larger than the pivot, it then checks which part the n-th smallest element is in, and recurses with this part. If the size of the part with the smaller values is n-1, the pivot is the n-th value, and no further recursion is needed.

select(array, first, last, n) {
    call medianOfMedians() to get approximate median as pivot
    partition (the range of) the array into smaller and larger than pivot
    if part with smaller elements is size n-1, return pivot
    call select() on the part which contains the n-th element
}

As you see, the select() function recurses (unless the pivot happens to be the n-th element), but on ever smaller ranges of the array, so at some point (e.g. two elements) finding the n-th element will become trivial, and recursing further is no longer needed.

So finally we get, in some more detail:

medianOfFive(array, first, last) {
    // some algorithmic magic ...
    return median;
}

medianOfMedians(array, first, last) {
    if 5 elements or fewer, call medianOfFive() and return result
    call medianOfFive() for every group of five elements
    store the results in an array medians[]
    if medians[] has 5 elements or fewer, call medianOfFive() on it and return result
    call select(medians[]) to find the middle element
    return the result (i.e. the median of medians)
}

select(array, first, last, n) {
    if 2 elements, compare and return n-th element
    if 5 elements or fewer, call medianOfFive() to get median as pivot
    else call medianOfMedians() to get approximate median as pivot
    partition (the range of) the array into smaller and larger than pivot
    if part with smaller elements is size n-1, return pivot
    if n-th value is in part with larger values, recalculate value of n
    call select() on the part which contains the n-th element
}
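
The pseudocode above could be turned into runnable code along these lines. This is only a sketch under a few assumptions that are not in the pseudocode: it works on new lists instead of partitioning an array range in place, it uses `sorted()` inside `median_of_five` for brevity, it takes the lower median when a group has an even size, and it adds a group of values equal to the pivot so that duplicates are handled.

```python
def median_of_five(values):
    """Exact (lower) median of at most five values."""
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]


def median_of_medians(values):
    """Approximate median, used only as a pivot."""
    if len(values) <= 5:
        return median_of_five(values)
    medians = [median_of_five(values[i:i + 5]) for i in range(0, len(values), 5)]
    if len(medians) <= 5:
        return median_of_five(medians)
    # The key step: the EXACT median of the medians, found with select().
    return select(medians, (len(medians) + 1) // 2)


def select(values, n):
    """Exact n-th smallest value, with n counting from 1."""
    if not 1 <= n <= len(values):
        raise ValueError("n is out of range")
    if len(values) == 1:
        return values[0]
    pivot = median_of_medians(values)
    smaller = [v for v in values if v < pivot]
    equal = [v for v in values if v == pivot]
    larger = [v for v in values if v > pivot]
    if n <= len(smaller):
        return select(smaller, n)
    if n <= len(smaller) + len(equal):
        return pivot
    # The n-th value is among the larger ones, so recalculate n.
    return select(larger, n - len(smaller) - len(equal))


medians_of_five = [
    3, 6, 9, 1022, 1027, 12, 15, 18, 1031, 1038, 21, 24, 27, 1043,
    1048, 1053, 1058, 1063, 1068, 1073, 1078, 1083, 1088, 1093, 1098,
]
print(median_of_medians(medians_of_five))  # 27 (approximate, good enough as a pivot)
print(select(medians_of_five, 13))         # 1038 (exact median of the medians)
```

Because the pivot always comes from the input, the group of equal values is never empty, so each recursive call gets a strictly shorter list and the two-way recursion terminates.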

EXAMPLE

Input array (125 values, 25 groups of five):

 #1    #2    #3    #4    #5    #6    #7    #8    #9    #10   #11   #12   #13   #14   #15   #16   #17   #18   #19   #20   #21   #22   #23   #24   #25

   1     4     7  1020  1025    10    13    16  1029  1036    19    22    25  1041  1046  1051  1056  1061  1066  1071  1076  1081  1086  1091  1096
   2     5     8  1021  1026    11    14    17  1030  1037    20    23    26  1042  1047  1052  1057  1062  1067  1072  1077  1082  1087  1092  1097
   3     6     9  1022  1027    12    15    18  1031  1038    21    24    27  1043  1048  1053  1058  1063  1068  1073  1078  1083  1088  1093  1098
1001  1003  1005  1023  1028  1007  1009  1011  1032  1039  1014  1016  1018  1044  1049  1054  1059  1064  1069  1074  1079  1084  1089  1094  1099
1002  1004  1006  1034  1035  1008  1010  1013  1033  1040  1015  1017  1019  1045  1050  1055  1060  1065  1070  1075  1080  1085  1090  1095  1100

Medians of groups of five (25 values):

3, 6, 9, 1022, 1027, 12, 15, 18, 1031, 1038, 21, 24, 27, 1043,  
1048, 1053, 1058, 1063, 1068, 1073, 1078, 1083, 1088, 1093, 1098

Groups of five for approximate median:

 #1    #2    #3    #4    #5

   3    12    21  1053  1078
   6    15    24  1058  1083
   9    18    27  1063  1088
1022  1031  1043  1068  1093
1027  1038  1048  1073  1098

Medians of five for approximate median:

9, 18, 27, 1063, 1088

Approximate median as pivot:

27

Medians of five partitioned with pivot 27 (depends on method):

small: 3, 6, 9, 24, 21, 12, 15, 18
pivot: 27
large: 1031, 1038, 1027, 1022, 1043, 1048, 1053, 1058,  
       1063, 1068, 1073, 1078, 1083, 1088, 1093, 1098

The smaller group has 8 elements, the larger group 16 elements. We were looking for the middle 13th element out of 25, so now we look for the 13 - 8 - 1 = 4th element out of 16:

Groups of five:

 #1    #2    #3    #4

1031  1048  1073  1098
1038  1053  1078
1027  1058  1083
1022  1063  1088
1043  1068  1093

Medians of groups of five:

1031, 1058, 1083, 1098

Approximate median as pivot:

1058

Range of medians of five partitioned with pivot 1058 (depends on method):

small: 1031, 1038, 1027, 1022, 1043, 1048, 1053
pivot: 1058
large: 1063, 1068, 1073, 1078, 1083, 1088, 1093, 1098

The smaller group has 7 elements. We were looking for the 4th element of 16, so now we look for the 4th element out of 7:

Groups of five:

 #1    #2

1031  1048
1038  1053
1027
1022
1043

Medians of groups of five:

1031, 1048

Approximate median as pivot:

1031

Range of medians of five partitioned with pivot 1031 (depends on method):

small: 1022, 1027
pivot: 1031
large: 1038, 1043, 1048, 1053

The smaller part has 2 elements, and the larger has 4, so now we look for the 4 - 2 - 1 = 1st element out of 4:

Median of five as pivot:

1043

Range of medians of five partitioned with pivot 1043 (depends on method):

small: 1038
pivot: 1043
large: 1048, 1053

The smaller part has only one element, and we were looking for the first element, so we can return the small element 1038.

As you will see, 1038 is the exact median of the original 25 median-of-fives, and there are 62 smaller values in the original array of 125:

1 ~ 27, 1001 ~ 1011, 1013 ~ 1023, 1025 ~ 1037

which not only puts it in the 30~70% range, but means it is actually the exact median (note that this is a coincidence of this particular example).
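
The count can be checked with a few lines of Python. The list below is a reconstruction of the 125 input values, assuming (as the table above shows) that they are 1 to 27 plus 1001 to 1100 without 1012 and 1024:

```python
# Reconstruction of the 125 input values (order does not matter for counting).
data = list(range(1, 28)) + [v for v in range(1001, 1101) if v not in (1012, 1024)]

smaller = sum(1 for v in data if v < 1038)
larger = sum(1 for v in data if v > 1038)

print(len(data), smaller, larger)  # 125 62 62
```

With 62 values on each side, 1038 sits at 62/125 = 49.6%, well inside the 30~70% range.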

  • I also had the same confusion as the OP. I had thought (up until reading your post) that the approximate median is within 20% of the median of the INITIAL ARRAY (i.e., at the very beginning of the program), but it's actually within 20% of the median of the array you passed in, which is not the initial array when you recurse more than 1 level deep. – user5965026 Mar 14 '21 at 19:14

I'm completely with your analysis up through the point where you get the medians of each of the blocks of five elements, when you're left with this collection of elements:

3 6 9 1022 1027 12 15 18 1031 1038 21 24 27 1043 1048 1053 1058 1063 1068 1073 1078 1083 1088 1093 1098

You are correct that, at this point, we need to get the median of this collection of elements. However, the way that the median-of-medians algorithm accomplishes this is different than what you've proposed.

When you were working through your analysis, you attempted to get the median of this set of values by, once again, splitting the input into blocks of size five and taking the median of each. However, that approach won't actually give you the median of the medians. (You can see this by noting that you got back 27, which isn't the true median of that collection of values).

The way that the median-of-medians algorithm actually gets back the median of the medians is by recursively invoking the overall algorithm to obtain the median of those elements. This is subtly different from just repeatedly breaking things apart into blocks and computing the medians of each block. In particular, each recursive call will

  • get an estimate of the pivot by using the groups-of-five heuristic,
  • recursively invoke the function on itself to find the median of those medians, then
  • apply a partitioning step on that median and use that to determine how to proceed from there.

This algorithm is, in my opinion, something that's way too complicated to actually trace through by hand. You really need to trust that, since each recursive call you're making works on a smaller array than what you started with, each recursive call will indeed do what it says to do. So when you're left with the medians of each group, as you were before, you should just trust that when you need to get the median by a recursive call, you end up with the true median.

If you look at the true median of the medians that you've generated in the first step, you'll find that it indeed will be between the 30th and 70th percentiles of the original data set.
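
Here is a rough sketch of that one-layer view, with `sorted()` standing in for the recursive call that is trusted to return the true median. The function names are invented for this illustration, and the data set is a reconstruction of the question's 125 values (1 to 27 plus 1001 to 1100 without 1012 and 1024). It reshuffles the input many times and records where the resulting pivot lands:

```python
import random


def pivot_one_layer(values):
    """Medians of groups of five, then the TRUE median of those medians."""
    groups = [sorted(values[i:i + 5]) for i in range(0, len(values), 5)]
    medians = [group[(len(group) - 1) // 2] for group in groups]
    # Stand-in for the recursive call: trust that it returns the exact median.
    return sorted(medians)[(len(medians) - 1) // 2]


def smaller_fraction(values, pivot):
    """Fraction of the values that are strictly smaller than the pivot."""
    return sum(1 for v in values if v < pivot) / len(values)


data = list(range(1, 28)) + [v for v in range(1001, 1101) if v not in (1012, 1024)]
rng = random.Random(0)  # fixed seed so runs are repeatable

fractions = []
for _ in range(200):
    rng.shuffle(data)
    fractions.append(smaller_fraction(data, pivot_one_layer(data)))

print(min(fractions), max(fractions))
```

For 125 distinct values, 12 of the 25 group medians are below the pivot, and each of those 12 groups plus the pivot's own group contributes at least 3 values that are no larger than the pivot. That gives at least 38 strictly smaller values (30.4%), and by symmetry at least 38 larger ones, so every recorded fraction falls between 0.3 and 0.7 no matter how the input is ordered.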

If this seems confusing, don't worry - you're in really good company. This algorithm is famously tricky to understand. For me, the easiest way to understand it is to just trust that recursion works and to trace through it only one layer deep, working under the assumption that all the recursive calls work, rather than trying to walk all the way down to the bottom of the recursion tree.

templatetypedef
  • As I understand it from the Wikipedia page, median-of-medians does not recursively call itself on the list of median-of-5s, but it calls the quickselect algorithm, which then calls median-of-medians. The difference is that quickselect returns the actual median, not an approximate median. Is that a correct interpretation? – m69 ''snarky and unwelcoming'' Sep 22 '18 at 23:09
  • The median-of-medians algorithm is separate from quickselect, so it shouldn’t be making any recursive calls to quickselect. (Quickselect is a randomized selection algorithm that chooses pivots at random. One of the reasons median-of-medians was such a big deal when it was discovered was that it was fully deterministic and worst-case efficient). – templatetypedef Sep 23 '18 at 00:39
  • Hm, then the Wikipedia article is at best confusing and possibly incorrect. – m69 ''snarky and unwelcoming'' Sep 23 '18 at 01:09
  • @m69 Yeah, I agree. I think I should go fix that. :-) – templatetypedef Sep 23 '18 at 01:59