
For a project, I want to compare the runtime of different median-finding algorithms. I started with "Median of Medians" and basically used the code I found on GeeksforGeeks.

I tested it by comparing it against the standard Python way of calculating the median.

import random
import statistics
import time

# kthSmallest is the median-of-medians selection function taken from GeeksforGeeks

if __name__ == '__main__':
    arr = random.sample(range(1, 10000000000), 10000001)
    arr1 = arr[:]  # copy the list so both runs start from the same data

    t1 = time.time()
    print("std median:", statistics.median(arr))
    t2 = time.time()
    print("time std median:", t2 - t1)

    t12 = time.time()
    n = len(arr1)
    k = n // 2 + 1  # k is 1-indexed, so this is the median for an odd number of elements
    print("Med of Med:", kthSmallest(arr1, 0, n - 1, k))
    t21 = time.time()
    print("time med of med:", t21 - t12)

For an unknown reason my runtimes are way too high and just seem wrong. Finding the median in an array of ~10 million elements took the following times:

Standard Python method:                  13.28 seconds
My implementation of median of medians:  28.91 seconds

Is there something wrong with the implementation I found on GeeksforGeeks? It should be the other way around: the standard Python method has a runtime of O(n log n), while Median of Medians runs in O(n), so it should be faster!

Does anyone know what I did wrong and can give me a hint on how to fix it?

  • Issue is you're not directly comparing the two algorithms. One is pure Python (GeeksforGeeks) while the other is a library module (the statistics module, which is probably C/C++ code). It would be better to compare them at several values of n and check whether one increases linearly in time while the other grows as n*log(n) (a benchmark sketch along these lines follows the comment thread). – DarrylG Nov 08 '20 at 10:18
  • Implementations matter. That's the reason Quick-sort is the most widely accepted generic sorting algo, even though its time complexity is O(n^2) – Serial Lazer Nov 08 '20 at 10:21
  • @SerialLazer--Agree that implementation matters, but quick-sort's time complexity is O(n*log(n)). Also, Python's built-in sort uses a different algorithm than quick-sort, namely TimSort, which is also used by Java, Swift, V8, Rust, etc.; so it may actually be more popular. – DarrylG Nov 08 '20 at 10:28
  • @DarrylG QuickSort's worst-case time complexity is O(N^2); it's O(N*log(N)) in the average case. – Serial Lazer Nov 08 '20 at 10:48
  • @DarrylG thanks for the tip, I'm going to try this. As for statistics.median: it just sorts the list and then returns the value at the n // 2 position. It uses the standard Python sorting function sorted(). I'm not sure how that one is implemented, but I could write a sorting function of my own instead. Would that be better? – maxpower Nov 08 '20 at 10:53
  • @SerialLazer--right, but the average case is what's typically quoted, since that's what's normally obtained (and to distinguish it from truly bad algorithms like bubble, selection and insertion sort). By contrast, TimSort's average and worst case is O(n*log(n)), while its best case is O(n). – DarrylG Nov 08 '20 at 10:55
  • @maxpower--Python's sort uses TimSort, as explained in [Timsort — the fastest sorting algorithm you’ve never heard of](https://hackernoon.com/timsort-the-fastest-sorting-algorithm-youve-never-heard-of-36b28417f399). Yes, for the comparison it would be better to use pure-Python sorting code such as [Quicksort](https://www.geeksforgeeks.org/python-program-for-quicksort/) (see the quicksort sketch after this thread). – DarrylG Nov 08 '20 at 11:00
  • @maxpower--a faster method to find the median would be to use heapify from the [heapq module](https://docs.python.org/3.0/library/heapq.html). Heapify [has O(n) complexity](https://stackoverflow.com/questions/9755721/how-can-building-a-heap-be-on-time-complexity). The first element in the heap would be the median for an odd-length array, or the average of the first and second elements for even-length arrays. Since you're using a library, you're using C/C++ code again (a heapq sketch also follows this thread). – DarrylG Nov 08 '20 at 11:12
  • @DarrylG thanks for the advice, but I also wanted to compare the runtimes against other median-finding algorithms I implemented myself. So I guess I'll either have to find all of them in a library or code them in Python, so the results don't differ that much. I also used numpy.median (which uses introselect), and it runs pretty fast, but I guess this uses C/C++ code as well :( [btw. you should have written a "real" answer to my question, so that I could give you a "thanks" or "accept your answer" to appreciate your help :) ] – maxpower Nov 08 '20 at 11:47
  • @maxpower--is the goal to 1) find the fastest algorithm overall, or 2) check the complexity of the algorithms? The fastest overall would of course be the Python modules with minimal pure-Python code. If it's to check the behavior of the algorithms, you need to run each one over a range of sizes, such as 2 to 1M (normally using exponential spacing between points; see the benchmark sketch below). – DarrylG Nov 08 '20 at 11:51
  • @DarrylG my main goal is to compare the runtimes of different algorithms on test data and then use the results to find the median of a real data set. I'm doing this for my thesis, and my subject is "efficient ways of finding the median in large datasets - a comparison of different existing algorithms". So I talked to my prof and he said I should present each algorithm and explain how it works, and for the time complexity it would be enough to show how it behaves with an increasing amount of data. – maxpower Nov 08 '20 at 12:04
  • So I'm not sure, but I would guess for that kind of comparison it would be stupid if I had some algorithms implemented in pure Python code and some in C/C++ code. What do you think? – maxpower Nov 08 '20 at 12:06
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/224281/discussion-between-darrylg-and-maxpower). – DarrylG Nov 08 '20 at 12:35
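
Following DarrylG's suggestion to compare against pure-Python sorting code, a sorting-based median could look like the sketch below. The names quicksort and median_by_sorting are my own placeholders, and it assumes an odd-length input like the test above:

def quicksort(arr):
    # simple out-of-place quicksort; good enough as a pure-Python baseline
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

def median_by_sorting(arr):
    # sort, then take the middle element (odd-length input assumed)
    s = quicksort(arr)
    return s[len(s) // 2]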
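
For the heapq idea, a sketch using the standard heapq.nsmallest: it returns the k smallest elements in ascending order, so its last element is the k-th smallest. Note this costs O(n log k), so for k ≈ n/2 it is not linear:

import heapq

def median_heapq(arr):
    # nsmallest returns the k smallest elements in ascending order,
    # so the last one is the k-th smallest (odd-length input assumed)
    k = len(arr) // 2 + 1
    return heapq.nsmallest(k, arr)[-1]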
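
And for the growth-rate check, a minimal benchmark that times each candidate at exponentially spaced sizes might look like this. The candidates dict and the time_median helper are placeholders for whatever implementations end up being compared:

import random
import statistics
import time

def time_median(fn, arr):
    # time a single median call on a fresh copy, so every function sees the same input
    data = arr[:]
    t0 = time.time()
    fn(data)
    return time.time() - t0

if __name__ == '__main__':
    candidates = {
        "statistics.median": statistics.median,
        # "median of medians": lambda a: kthSmallest(a, 0, len(a) - 1, len(a) // 2 + 1),
    }
    for exp in range(2, 8):   # n = 101 ... 10,000,001, exponentially spaced
        n = 10 ** exp | 1     # force an odd length
        arr = random.sample(range(1, 10 ** 10), n)
        for name, fn in candidates.items():
            print(n, name, time_median(fn, arr))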

0 Answers