What's the equivalent 'nth_element' function in Python?

Question

I want to implement Vantage Point Tree in the python, but it use the std::nth_element in the C++.

So I want to find the equivalent 'nth_element' function in Python or in numpy.

Notice, the nth_element would only partial order an array, and it's O(N).

int the_array[10] = {4,5,7,3,6,0,1,2,9,8};
std::vector<int> the_v(the_array,the_array+10);
std::nth_element (the_v.begin()+0, the_v.begin()+5, the_v.begin()+10);

And now the vector could be:

3,0,2,1,4,5,6,7,9,8

And I not only want to get the nth element, but also want to get the re-arrange the two part of list, [3,0,2,1,4] and [6,7,9,8].

Moreover, nth_element support accept a function that could compare the two elements, such as, in the below as, the vector is a vector op DataPoint, and the DistanceComparator function will compare the two points distance with the_v.begin():

vector<DataPoint> the_v;
for(int n = 0; n < N; n++) the_v[n] = DataPoint(D, n, X + n * D);
std::nth_element (the_v.begin()+0, the_v.begin()+5, the_v.begin()+10,
    DistanceComparator(the_v.begin()));

EDIT:

I've used the bhuvan-venkatesh's answer, and write some code to test.

partition_timer = timeit.Timer("numpy.partition(a, 10000)",
    "import numpy;numpy.random.seed(2);"+
    "a = numpy.random.rand(10000000)")
print(partition_timer.timeit(10))

sort_timer = timeit.Timer("numpy.sort(a)",
    "import numpy;numpy.random.seed(2);"+
    "a = numpy.random.rand(10000000)")
print(sort_timer.timeit(10))

sorted_timer = timeit.Timer("sorted(a)",
    "import numpy;numpy.random.seed(2);"+
    "a = numpy.random.rand(10000000)")
print(sorted_timer.timeit(10))

and the result:

2.2217168808
17.0386350155
281.301710844

And then, I will do more test using C++ code.

But there is a problem, when using the numpy, it would always return a new array, it will waste a lot of memory, when my array is huge. How can I handle it. Or I just have to write a C++ extend for python.

EDIT2:

@bhuvan-venkatesh Thanks for recommending the partition function.

I use partition like the below:

import numpy

@profile
def for_numpy():
    numpy.random.seed(2)
    a = numpy.random.rand(1e7)
    for i in range(100):
        a.partition(numpy.random.randint(1e6))

if __name__ == '__main__':
    for_numpy()

and ran the profiler like:

python -m memory_profiler profiler_test.py

and the result is:

Line #    Mem usage    Increment   Line Contents
================================================
    25   23.613 MiB    0.000 MiB   @profile
    26                             def for_numpy():
    27   23.613 MiB    0.000 MiB       numpy.random.seed(2)
    28   99.934 MiB   76.320 MiB       a = numpy.random.rand(1e7)
    29  100.004 MiB    0.070 MiB       for i in range(100):
    30  100.004 MiB    0.000 MiB           a.partition(numpy.random.randint(1e6))

And it would not copy the whole array like: numpy.partition(a, 3)

Conclusion: numpy.ndarray.partition is the one I want to find.

Did you read [this](http://stackoverflow.com/questions/1034846/finding-nth-item-of-unsorted-list-without-sorting-the-list)? — Paul Rooney, Aug 04 '16 at 01:35
I read it. But I am not only want to get nth element, but also want to arrange the list. — Colin Ji, Aug 04 '16 at 01:43
Hi user2357112, you mean the bhuvan-venkatesh answer, don't you? where are the tools? — Colin Ji, Aug 04 '16 at 04:29
I was actually talking about the one in Paul's link, but `numpy.partition` is a good implementation of those tools. — user2357112 supports Monica, Aug 04 '16 at 15:55

bhuvy · Accepted Answer · 2016-08-04T14:24:19.070

http://docs.scipy.org/doc/numpy/reference/generated/numpy.partition.html

Just make sure that the numpy partition will create two new arrays, meaning that you will quickly make a lot of new arrays. They are more efficient than python lists but will not do the exact same thing as in c++.

If you want the exact element then you can do a filter search which will still be O(n)

array = np.array(...)
partition = np.partition(array, 5) # O(n)
element = np.where(partition==array[5]) # O(n)
left, right = partition[:element], partition[element+1:] # O(n)

So your new code is slower but that is the python-y way to do it.

EDIT:

So you need a comparator? Apart from writing a small function of your own there is no way -- in pure numpy as a keyword -- because each numpy operation is implemented in highly optimized c-code meaning that passing in a python function or a python lambda would force numpy to go to the object level every time and eval.

numpy.vectorize goes to the object level, but in the end you will have to write your own code; Rosetta code has the impelmentation if you want to create a more "optimized algorithm". (I put that in quotes because with python objects you will still be much slower than c or numpy code because of object level access). If speed is your true concern, but you want the python readability consider making an extension with cython.

Why are you doing that thing with `where`? It doesn't look right; you seem to be picking the wrong element, and `where` returns a tuple. Shouldn't it just be `left, right = partition[:5], partition[6:]`? — user2357112 supports Monica, Aug 04 '16 at 04:26
Hi, I like your answer, but I just add some on my question. It means I also want like the nth_element, it could accept a comparator. — Colin Ji, Aug 04 '16 at 04:27
@user2357112 There is no guarantee that the fifth element remains in the fifth spot after the partition, so you need the where; You are right where you need to unpack the tuple (which I forgot) because the element could be repeated. — bhuvy, Aug 04 '16 at 13:58
@bhuvan-venkatesh: But we're looking for the fifth element in sorted order, not the fifth element in the original order. We don't need anything fancy to find the fifth element in the original order. — user2357112 supports Monica, Aug 04 '16 at 15:06

What's the equivalent 'nth_element' function in Python?

1 Answers1