
I've written this implementation of the median of medians algorithm in Python, but it doesn't seem to output the right result, and it doesn't look like it runs in linear time either. Any idea where I went off track?

def select(L):
    if len(L) < 10:
        L.sort()
        return L[int(len(L)/2)]
    S = []
    lIndex = 0
    while lIndex+5 < len(L)-1:
        S.append(L[lIndex:lIndex+5])
        lIndex += 5
    S.append(L[lIndex:])
    Meds = []
    for subList in S:
        print(subList)
    Meds.append(select(subList))
    L2 = select(Meds)
    L1 = L3 = []
    for i in L:
        if i < L2:
            L1.append(i)
        if i > L2:
            L3.append(i)
    if len(L) < len(L1):
        return select(L1)
    elif len(L) > len(L1) + 1:
        return select(L3)
    else:
        return L2

The function is called like so:

L = list(range(100))
shuffle(L)
print(select(L))

LE: Sorry. GetMed was a function that simply sorted the list and returned the element at len(list); it should have been select there. I've fixed it now, but I still get the wrong outputs. As for the indentation, the code runs without error, and I see nothing wrong with it :-??

LE2: I'm expecting 50 (for the current L), but it gives me outputs anywhere from 30 to 70, no more, no less (so far).

LE3: Thank you very much, that did the trick, it works now. I'm confused though. I'm trying to compare this method with the naive one, where I simply sort the array and output the result. From what I've read so far, the time complexity of the select method should be O(n) (deterministic selection). Although I probably can't compete with the optimisation the Python developers did, I expected closer results than I got. For example, if I change the range of the list to 10000000, select outputs the result in 84.10837116255952 seconds, while the sort-and-return method does it in 18.92556029528825 seconds. What are some good ways to make this algorithm faster?

cpp_ninja

2 Answers


1) Your code indentation is wrong, try this:

def select(L):
    if len(L) < 10:
        L.sort()
        return L[int(len(L)/2)]
    S = []
    lIndex = 0
    while lIndex+5 < len(L)-1:
        S.append(L[lIndex:lIndex+5])
        lIndex += 5
    S.append(L[lIndex:])
    Meds = []
    for subList in S:
        print(subList)
        Meds.append(select(subList))
    L2 = select(Meds)
    L1 = L3 = []
    for i in L:
        if i < L2:
            L1.append(i)
        if i > L2:
            L3.append(i)
    if len(L) < len(L1):
        return select(L1)
    elif len(L) > len(L1) + 1:
        return select(L3)
    else:
        return L2

2) The method you use does not return the median; it just returns a number that is not far from the median. To get the median, count how many numbers are greater than your pseudo-median. If a majority are greater, repeat the algorithm on the numbers greater than the pseudo-median; otherwise repeat on the other numbers.

def select(L, j):
    if len(L) < 10:
        L.sort()
        return L[j]
    S = []
    lIndex = 0
    while lIndex+5 < len(L)-1:
        S.append(L[lIndex:lIndex+5])
        lIndex += 5
    S.append(L[lIndex:])
    Meds = []
    for subList in S:
        Meds.append(select(subList, int((len(subList)-1)/2)))
    med = select(Meds, int((len(Meds)-1)/2))
    L1 = []
    L2 = []
    L3 = []
    for i in L:
        if i < med:
            L1.append(i)
        elif i > med:
            L3.append(i)
        else:
            L2.append(i)
    if j < len(L1):
        return select(L1, j)
    elif j < len(L2) + len(L1):
        return L2[0]
    else:
        return select(L3, j-len(L1)-len(L2))

Warning: L = M = [] is not the same as L = [] and M = []; the first form makes both names refer to the same list.
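A short sketch of why the warning above matters. The variable names here (`A`, `B`, `aliased`, `independent`) are only for illustration and are not part of the answer's code.

```python
# Chained assignment binds both names to the same list object.
L = M = []
L.append(5)
aliased = (L is M, M)        # M sees the element appended through L

# Separate list literals create two independent lists.
A, B = [], []
A.append(5)
independent = (A is B, B)    # B is untouched

print(aliased)      # (True, [5])
print(independent)  # (False, [])
```

In the original question, `L1 = L3 = []` means every element appended to `L1` also shows up in `L3`, so the two partitions can never differ.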

Thomash
  • Would fail for a simple test case 1, 2, 3, 4, 4, 5, 6, 12, 17, 20 # returns 5, should return 4.5 – waka-waka-waka Nov 11 '13 at 05:53
  • @VikhyathReddy No, 4.5 is not an element of the sequence, how can it be the median ? – Thomash Nov 12 '13 at 14:10
  • @waka-waka-waka have you seen this comment by @thomash? – tommy.carstensen Apr 21 '15 at 18:23
  • Old post, but @waka-waka-waka is right. The median of a list with an even number of elements is usually taken as the mean of the TWO central elements; otherwise, whether you take the lower or the higher value, you'd be biased. – Jblasco Oct 18 '15 at 14:56
  • @Jblasco Returning an element that is not in the input may be incorrect depending on what you need a median for. In this example there is a set of integers and you want to return a fractional number, which may not be appropriate. For integers you know that there is a superset with good properties that allows you to return something between 4 and 5, but that won't work for every kind of object. It is not possible to have a perfect definition of median which guarantees both existence and uniqueness, but my definition is good enough for **all** practical purposes. – Thomash Oct 19 '15 at 09:33
  • Hi Thomash, probably at this point it would be more sensible to agree to disagree, but I will make a last try. I don't know about other kind of object that will not allow a similar trick for the median. In any case, as the definition of median goes, for me it's clear enough that you cannot give a biased answer by returning always the smaller of the two. I would always defend the average of the central two points because, lacking more information about the distribution, a point in between 4 and 5 will *really* split the list in two sections of equal number of elements. – Jblasco Oct 19 '15 at 13:39
  • I wouldn't dare saying that my definition is good enough for *all* practical purposes, though... That's too good to be true, and of that I am absolutely certain. – Jblasco Oct 19 '15 at 13:40
  • "I don't know about other kind of object that will not allow a similar trick for the median." >> strings for instance. And if you want to *really* split the list in two sections of equal number of elements, how do you find the median of (0, 0, 1)? – Thomash Oct 19 '15 at 14:37
  • it would be more sensible to agree to disagree >> totally, this discussion is just about a small detail nobody cares about (that's why I claim my solution is good for all practical purposes because in real life nobody cares about this). – Thomash Oct 19 '15 at 14:39
  • Strings? As in, what is the median of ['potato', 'carrot', 'spoon']? The median of a list with an odd number of elements is clearly defined: in your case zero is the median, because it splits the list into a 0 on one side, and 1 on the other. And if you say of (0,0,0,1), it's still zero, because there are two elements that are below or equal to zero, and two above or equal to it. – Jblasco Oct 19 '15 at 15:34
  • I do care about my measures of centralization not being biased, thank you very much. – Jblasco Oct 19 '15 at 15:34
  • "And if you say of (0,0,0,1), it's still zero, because there is two elements that are below or equal to zero, and two above or equal to it" >> I count three and four. – Thomash Oct 19 '15 at 15:38
  • More clearly written: "I can split my list in two sublists that contain half the elements each. Those lists contain elements (not THE elements) that are less or equal, in one case (0,0) and greater or equal (0,1)". – Jblasco Oct 19 '15 at 15:53
  • If you just want to split the list in two parts of the same size, I really don't understand what you don't like in my algorithm. – Thomash Oct 19 '15 at 15:58
  • Despite the theological war about the definition of median, also one can make the argument of consistency. I can find many references as to what to do when the list has an even number of elements, and not a single one of them says to use the lowest/highest of the two central values. – Jblasco Oct 19 '15 at 15:58
  • Your algorithm is biased: in case of two elements being considered 'central', it returns the higher. If you do a random sample of 10 numbers from any distribution, calculate the median from them, and compare it with the real one, you should see that your median with 10 elements is *systematically* above the real median. If you use the mean of the two central values as the median, you should see as often as not that it is above the more precise value. I cannot argue it much better than this, sorry! – Jblasco Oct 19 '15 at 16:05
  • For any computer science students who stumble across this, it is far more practical for the median of medians algorithm to tie-break to either the lower or higher central value. See CLRS p220. – Pockets Mar 17 '16 at 17:06
  • how come L = M = [] is not L = [] and M = [] ? – Claudiu Creanga Oct 29 '19 at 17:10
  • @ClaudiuCreanga If you use `L = M = []` you only create one list with two names which means that when you modify the list named `L` you also modify the list named `M` at the same time. Try `L.append(5)` and look at `M`. – Thomash Nov 01 '19 at 06:00
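The tie-breaking conventions debated in the comments above can be compared side by side with Python's standard `statistics` module (Python 3.4+). This is only an illustration of the three definitions on the test case from the comments; it is not part of either answer, and the variable names are made up for the example.

```python
import statistics

data = [1, 2, 3, 4, 4, 5, 6, 12, 17, 20]  # test case from the comments

low = statistics.median_low(data)      # lower of the two central elements
high = statistics.median_high(data)    # upper of the two central elements
mean_of_two = statistics.median(data)  # average of the two central elements

print(low, high, mean_of_two)  # 4 5 4.5

# median_low only needs ordering, so it also works on strings,
# where averaging two elements would be meaningless.
words = ["potato", "carrot", "spoon"]
print(statistics.median_low(words))  # potato
```

A selection algorithm such as median of medians returns an element of the input, so it corresponds to `median_low` or `median_high` depending on the rank requested. The mean-of-two definition can be obtained by selecting both central ranks and averaging them.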

Below is my Python implementation. For more speed, you might want to use PyPy instead.

For your question about speed: the theoretical cost with 5 numbers per column is ~10N, so I use 15 numbers per column, for a 2x speedup at ~5N, while the optimal cost is ~4N. I could be wrong about the optimal speed of the most state-of-the-art solution, though. In my own test, my program runs slightly faster than the one using sort(). Certainly, your mileage may vary.

Assuming the Python program is saved as "median.py", you can run it with "python ./median.py 100". For a speed benchmark, you might want to comment out the validation code and use PyPy.

#!/bin/python
#
# TH @stackoverflow, 2016-01-20, linear time "median of medians" algorithm
#
import sys, random


items_per_column = 15


def find_i_th_smallest( A, i ):
    t = len(A)
    if(t <= items_per_column):
        # if A is a small list with at most items_per_column items, then:
        #     1. do sort on A
        #     2. return the i-th smallest item of A
        #
        return sorted(A)[i]
    else:
        # 1. partition A into columns of items_per_column items each. items_per_column is odd, say 15.
        # 2. find the median of every column
        # 3. put all medians in a new list, say, B
        #
        # (// keeps the indices integers under both Python 2 and Python 3)
        columns = [A[j:(j + items_per_column)] for j in range(0, len(A), items_per_column)]
        B = [find_i_th_smallest(k, (len(k) - 1) // 2) for k in columns]

        # 4. find M, the median of B
        #
        M = find_i_th_smallest(B, (len(B) - 1) // 2)

        # 5. split A into 3 parts by M, { < M }, { == M }, and { > M }
        # 6. find which above set has A's i-th smallest, recursively.
        #
        P1 = [ j for j in A if j < M ]
        if(i < len(P1)):
            return find_i_th_smallest( P1, i)
        P3 = [ j for j in A if j > M ]
        L3 = len(P3)
        if(i < (t - L3)):
            return M
        return find_i_th_smallest( P3, i - (t - L3))


# How many numbers should be randomly generated for testing?
#
number_of_numbers = int(sys.argv[1])


# create a list of random non-negative integers
#
L = [ random.randint(0, number_of_numbers) for i in range(0, number_of_numbers) ]


# Show the original list
#
print(L)


# This is for validation
#
print(sorted(L)[(len(L) - 1) // 2])


# This is the result of the "median of medians" function.
# Its result should be the same as the validation.
#
print(find_i_th_smallest(L, (len(L) - 1) // 2))
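To check the timing claim on your own machine, here is a minimal Python 3 sketch that validates a selection function against `sorted()` and times both. The names `kth_smallest`, `GROUP`, and `time_call` are hypothetical helpers written for this example, and the input size is arbitrary. Results will depend heavily on the interpreter (CPython vs. PyPy) and on the hardware, so no expected timings are given.

```python
import random
import time

GROUP = 15  # assumed column size, mirroring items_per_column above


def kth_smallest(a, k):
    """Return the k-th smallest item of a (0-based) using median of medians."""
    if len(a) <= GROUP:
        return sorted(a)[k]
    # Median of every column of at most GROUP items.
    columns = [a[j:j + GROUP] for j in range(0, len(a), GROUP)]
    medians = [sorted(c)[(len(c) - 1) // 2] for c in columns]
    pivot = kth_smallest(medians, (len(medians) - 1) // 2)
    # Recurse only into the part that contains rank k.
    smaller = [x for x in a if x < pivot]
    if k < len(smaller):
        return kth_smallest(smaller, k)
    larger = [x for x in a if x > pivot]
    not_larger = len(a) - len(larger)  # items < pivot plus items == pivot
    if k < not_larger:
        return pivot
    return kth_smallest(larger, k - not_larger)


def time_call(fn, *args):
    """Run fn(*args) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


n = 100_000
data = [random.randint(0, n) for _ in range(n)]
k = (n - 1) // 2

expected, t_sort = time_call(lambda: sorted(data)[k])
actual, t_select = time_call(kth_smallest, data, k)

assert actual == expected
print("sort:   %.4f s" % t_sort)
print("select: %.4f s" % t_select)
```

On CPython, `sorted()` runs in optimized C while the selection runs as interpreted Python, so the asymptotically slower sort can still win for inputs that fit in memory.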