I want to implement Vantage Point Tree in the python, but it use the std::nth_element in the C++.
So I want to find the equivalent 'nth_element' function in Python or in numpy.
Notice, the nth_element would only partial order an array, and it's O(N).
int the_array[10] = {4,5,7,3,6,0,1,2,9,8};
std::vector<int> the_v(the_array,the_array+10);
std::nth_element (the_v.begin()+0, the_v.begin()+5, the_v.begin()+10);
And now the vector could be:
3,0,2,1,4,5,6,7,9,8
And I not only want to get the nth element, but also want to get the re-arrange the two part of list, [3,0,2,1,4] and [6,7,9,8].
Moreover, nth_element support accept a function that could compare the two elements, such as, in the below as, the vector is a vector op DataPoint, and the DistanceComparator function will compare the two points distance with the_v.begin():
vector<DataPoint> the_v;
for(int n = 0; n < N; n++) the_v[n] = DataPoint(D, n, X + n * D);
std::nth_element (the_v.begin()+0, the_v.begin()+5, the_v.begin()+10,
DistanceComparator(the_v.begin()));
EDIT:
I've used the bhuvan-venkatesh's answer, and write some code to test.
partition_timer = timeit.Timer("numpy.partition(a, 10000)",
"import numpy;numpy.random.seed(2);"+
"a = numpy.random.rand(10000000)")
print(partition_timer.timeit(10))
sort_timer = timeit.Timer("numpy.sort(a)",
"import numpy;numpy.random.seed(2);"+
"a = numpy.random.rand(10000000)")
print(sort_timer.timeit(10))
sorted_timer = timeit.Timer("sorted(a)",
"import numpy;numpy.random.seed(2);"+
"a = numpy.random.rand(10000000)")
print(sorted_timer.timeit(10))
and the result:
2.2217168808
17.0386350155
281.301710844
And then, I will do more test using C++ code.
But there is a problem, when using the numpy, it would always return a new array, it will waste a lot of memory, when my array is huge. How can I handle it. Or I just have to write a C++ extend for python.
EDIT2:
@bhuvan-venkatesh Thanks for recommending the partition function.
I use partition like the below:
import numpy
@profile
def for_numpy():
numpy.random.seed(2)
a = numpy.random.rand(1e7)
for i in range(100):
a.partition(numpy.random.randint(1e6))
if __name__ == '__main__':
for_numpy()
and ran the profiler like:
python -m memory_profiler profiler_test.py
and the result is:
Line # Mem usage Increment Line Contents
================================================
25 23.613 MiB 0.000 MiB @profile
26 def for_numpy():
27 23.613 MiB 0.000 MiB numpy.random.seed(2)
28 99.934 MiB 76.320 MiB a = numpy.random.rand(1e7)
29 100.004 MiB 0.070 MiB for i in range(100):
30 100.004 MiB 0.000 MiB a.partition(numpy.random.randint(1e6))
And it would not copy the whole array like: numpy.partition(a, 3)
Conclusion: numpy.ndarray.partition is the one I want to find.