5

Apart from the median-of-medians algorithm, is there any other way to do k-selection in worst-case O(n) time? And does implementing median-of-medians make sense in practice; that is, is the performance advantage good enough for practical purposes?

Bill the Lizard
  • 369,957
  • 201
  • 546
  • 842
Harman
  • 1,531
  • 1
  • 18
  • 31
  • Sorting first and then simply picking the kth element is only O(n log n) and there are fast implementations, so whether something more complicated for O(n) is worth it really depends on your specific details, such as the value of n. Also, don't forget quickselect with random pivoting, which is O(n) expected time. – ShreevatsaR Sep 09 '11 at 08:26

4 Answers

12

There is another algorithm for computing kth order statistics based on the soft heap data structure, which is a variant of a standard priority queue that is allowed to "corrupt" some number of the priorities it stores. The algorithm is described in more detail in the Wikipedia article, but the basic idea is to use the soft heap to efficiently (in O(n) time) pick a pivot for the partition function that is guaranteed to produce a good split. In a sense, this is simply a modified version of the median-of-medians algorithm that uses an (arguably) more straightforward approach to choosing the pivot element.

Soft heaps are not particularly intuitive, but there is a pretty good description of them available in this paper ("A simpler implementation and analysis of Chazelle's soft heaps"), which includes a formal description and analysis of the data structure.

However, if you want a really fast, worst-case O(n) algorithm, consider looking into introselect. This algorithm is actually quite brilliant. It starts off by using the quickselect algorithm, which picks a pivot unintelligently and uses it to partition the data. This is extremely fast in practice, but has bad worst-case behavior. Introselect fixes this by keeping an internal counter that tracks its progress. If the algorithm ever looks like it's about to degrade to O(n^2) time, it switches algorithms and uses something like median-of-medians to ensure the worst-case guarantee. Specifically, it watches how much of the array is discarded at each step, and if some constant number of steps occur before half the input is discarded, the algorithm switches to the median-of-medians algorithm to ensure that the next pivot is good, before then restarting with quickselect. This guarantees worst-case O(n) time.

The advantage of this algorithm is that it's extremely fast on most inputs (since quickselect is very fast), but has great worst-case behavior. A description of this algorithm, along with the related sorting algorithm introsort, can be found in this paper ("Introspective Sorting and Selection Algorithms").
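As a rough illustrative sketch of that switching idea (not the paper's actual implementation): the code below runs quickselect, counts rounds that fail to discard at least half the remaining range, and falls back once progress stalls. The fallback here delegates to `std::nth_element` purely as a stand-in for a worst-case-linear routine such as median-of-medians, which keeps the sketch short.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Illustrative sketch of the introselect idea (not the paper's exact code).
// Returns the kth smallest element (0-based) of a, partially reordering it.
int introselect(std::vector<int>& a, int k) {
    int lo = 0, hi = static_cast<int>(a.size());   // half-open range [lo, hi)
    int slow_rounds = 0;                           // rounds with poor progress
    while (hi - lo > 1) {
        if (slow_rounds >= 2) {
            // Progress too slow: switch strategies. std::nth_element stands
            // in for a guaranteed-O(n) routine like median-of-medians here.
            std::nth_element(a.begin() + lo, a.begin() + k, a.begin() + hi);
            return a[k];
        }
        int old_size = hi - lo;
        int pivot = a[lo + std::rand() % old_size];            // naive pivot
        // Three-way partition into [< pivot][== pivot][> pivot].
        auto mid1 = std::partition(a.begin() + lo, a.begin() + hi,
                                   [&](int x) { return x < pivot; });
        auto mid2 = std::partition(mid1, a.begin() + hi,
                                   [&](int x) { return x == pivot; });
        int i = static_cast<int>(mid1 - a.begin());
        int j = static_cast<int>(mid2 - a.begin());
        if (k < i)       hi = i;          // answer lies in the < partition
        else if (k >= j) lo = j;          // answer lies in the > partition
        else             return a[k];     // k falls inside the == block
        if (hi - lo > old_size / 2) ++slow_rounds;  // discarded < half
    }
    return a[lo];
}
```

A real introselect would use median-of-medians (or similar) in the fallback branch so the whole routine is worst-case O(n); the structure of the switch is the point of the sketch.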

Hope this helps!

templatetypedef
  • 328,018
  • 92
  • 813
  • 992
  • Can you please provide the paper names as well? The first link doesn't seem to be correct. Also, do you have any good explanation available for the median-of-medians algorithm? – Dexters May 16 '16 at 01:36
  • @Dexters Links updated and paper titles included! As for median-of-medians, I don't have a go-to good resource on it. Having taught it before in an algorithms class, I've found that the major sticking point most people have is the recursion - even people with a good handle on recursion have trouble understanding why the recursive call works. If you do find any good links on it, please feel free to let me know! – templatetypedef May 16 '16 at 03:04
  • Sure, I am looking out for one to get a deeper understanding of that algorithm. I appreciate you updating the links. I also looked up soft heaps; very interesting. – Dexters May 29 '16 at 07:21
3

I think that you should really test it and find out what the performance is when you have N million elements in your container. This algorithm has already been implemented in the C++ standard library: the templated std::nth_element is guaranteed to run in expected O(n) time. So if you use C++, you can easily run some tests and see if the performance is good enough for what you seek.

Tony The Lion
  • 57,181
  • 57
  • 223
  • 390
  • Nice to know that, as a matter of fact I do use C++. – Harman Sep 09 '11 at 16:40
  • 1
    I could be wrong about this, but doesn't the text above say that the algorithm has to run in **expected** O(n) time and not **worst-case** O(n) time? – templatetypedef Sep 09 '11 at 18:27
  • I apologize if I'm being too critical, but doesn't this still not answer the OP's question, which is to find a better worst-case O(n) selection algorithm? – templatetypedef Sep 10 '11 at 21:06
1

It depends. If you're concerned about the worst case happening accidentally, I wouldn't bother. As the data grows large enough to care, the worst case becomes so unlikely that it's hardly worth protecting against.

If you're doing the selection in a situation where a client could provide the data in the worst-case order to do a denial of service on your server, then it's probably worth using a median of medians to assure that the worst-case order doesn't hurt performance to any significant degree.

Jerry Coffin
  • 437,173
  • 71
  • 570
  • 1,035
0

Updated:

There is a linear-time algorithm, a modification of quicksort, suggested by quicksort's inventor Hoare himself. I suggest referring to section 9.3, "Selection in worst-case linear time", in the CLRS book. Here is the brief algorithm, assuming we have a method random_partition from quicksort (which uses a random pivot for partitioning):

FindKth(array, l, u, k)
{
   int m = random_partition(array, l, u);
   if (m == k) return array[k];                  /* the pivot landed at index k */
   if (m > k)  return FindKth(array, l, m-1, k); /* the kth element lies in the left partition */
   else        return FindKth(array, m+1, u, k); /* the kth element lies in the right partition */
}

You can also refer to Donald Knuth's TAOCP Vol. 3, Sorting and Searching, p. 633. The beauty of this method is that the array need not be completely sorted! I think the STL's nth_element uses this technique; you can refer to the notes section.
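The random_partition helper assumed above can be sketched as follows. This is one possible implementation (a Lomuto-style partition with a random pivot); the inclusive l..u bounds and the convention of returning the pivot's final absolute index match the pseudocode above but are otherwise an assumption:

```cpp
#include <cstdlib>
#include <utility>

// Sketch of the random_partition helper assumed by FindKth:
// pick a random pivot in [l, u], partition around it (Lomuto scheme),
// and return the pivot's final (absolute) index.
int random_partition(int array[], int l, int u) {
    int p = l + std::rand() % (u - l + 1);  // random pivot position
    std::swap(array[p], array[u]);          // park the pivot at the end
    int pivot = array[u];
    int i = l;                              // next slot for a small element
    for (int j = l; j < u; ++j) {
        if (array[j] < pivot) {
            std::swap(array[i], array[j]);
            ++i;
        }
    }
    std::swap(array[i], array[u]);          // move the pivot into place
    return i;                               // array[l..i-1] < array[i] <= array[i+1..u]
}
```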

vine'th
  • 4,516
  • 2
  • 25
  • 26
  • 1
    This is QuickSelect, which is only expected linear time (if you pick a random pivot), but quadratic time in the worst case. – ShreevatsaR Sep 10 '11 at 14:18
  • Yes, you are right; the CLRS book uses a randomized partition scheme, ensuring an expected linear run time; you can refer to the above-mentioned section. – vine'th Sep 10 '11 at 14:28
  • 1
    Even the randomized pivot selection doesn't guarantee linear time. It just says that **on expectation** the behavior is linear. You absolutely can degrade to O(n^2) with this algorithm. – templatetypedef Sep 10 '11 at 19:09
  • The degradation to O(n^2) happens with very low probability in the case of a random pivot, around 10^-8 or so; see Knuth's TAOCP Vol. 3, p. 122 for Knuth's mathematical analysis. I find it tough to digest his mathematics :) Knuth simply says, "Even a mildly random choice of q should be safe." I believe the STL's nth_element uses the same algorithm, as evident from the notes section; even they use the qualifier "on average, linear". – vine'th Sep 11 '11 at 06:32