I am using std::nth_element to get a (roughly correct) value for a percentile of a vector, like so:

double percentile(std::vector<double> &vectorIn, double percent)
{
    std::nth_element(vectorIn.begin(), vectorIn.begin() + (percent*vectorIn.size())/100, vectorIn.end());

    return vectorIn[(percent*vectorIn.size())/100];
}  

I noticed that for vectorIn lengths of up to 32 elements, the vector gets completely sorted. Starting from 33 elements it is never sorted (as expected).

Not sure whether this matters but the function is in a "(Matlab-)mex c++ code" that is compiled via Matlab using the "Microsoft Windows SDK 7.1 (C++)".

EDIT:

Also see the following histogram of the lengths of the longest sorted blocks in 1e5 vectors passed to the function (vectors contained 1e4 random elements and a random percentile was calculated). Note the peak at very small values.

[Figure: histogram of the lengths of the longest sorted blocks]
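
For reference, the length of the longest sorted block in a vector can be measured along the following lines. This is only a minimal sketch: the helper name longestSortedRun is hypothetical, "sorted" is assumed to mean non-decreasing, and it is not necessarily the code that produced the histogram above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: length of the longest contiguous non-decreasing block in v.
std::size_t longestSortedRun(const std::vector<double>& v)
{
    if (v.empty()) return 0;

    std::size_t best = 1;     // longest block seen so far
    std::size_t current = 1;  // length of the block ending at index i
    for (std::size_t i = 1; i < v.size(); ++i) {
        // Extend the current block if the order is kept, otherwise start a new one.
        current = (v[i - 1] <= v[i]) ? current + 1 : 1;
        best = std::max(best, current);
    }
    assert(best <= v.size());  // sanity check: a block cannot be longer than the vector
    return best;
}
```

Calling it on the vector after percentile() has run gives one sample for the histogram; a fully sorted vector yields v.size().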

火乔治
stack_horst
  • The function does a partial sort in order to return the value you requested. How much of a partial sort it does is up to the implementation. – Jonathan Potter Feb 16 '15 at 19:20
  • Nope, not Mex related, but cool question. – chappjc Feb 16 '15 at 20:02
  • The spike at the left-hand side of your plot looks a lot like the histogram of the length of the longest consecutive subsequence in a random vector. That might correspond to the small fraction of randomly selected percentile values so close to an end of the vector that the longest subsequence is in the part of the vector never touched by nth_element. But that is just a guess. – rici Feb 16 '15 at 23:55
  • @rici: Good idea, but I checked it, and this is not the case. For the runs where the vector ended up with these very short sorted sequences, the corresponding percentiles were also evenly distributed between 0 and 100. – stack_horst Feb 17 '15 at 21:36

1 Answer

This will vary from standard library implementation to standard library implementation (and may vary based on other factors) but in general terms:

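  • std::nth_element is allowed to rearrange the input container as it sees fit, provided that the element that ends up at position n is the one a full sort would put there, and the container is partitioned at position n (nothing before n is greater than that element, nothing after n is less).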

  • For small containers, it is usually faster to do a full insertion sort than a quickselect, even though insertion sort is quadratic and does not scale to large inputs.
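
The guarantee in the first bullet can be checked directly. The following is a minimal sketch; both helper names are hypothetical, and the check says nothing about how much extra sorting a particular implementation happens to do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: true if v is partitioned around position n, i.e. nothing
// before n is greater than v[n] and nothing from n onwards is less than v[n].
bool isPartitionedAt(const std::vector<double>& v, std::size_t n)
{
    assert(n < v.size());
    const auto nth = v.begin() + static_cast<std::ptrdiff_t>(n);
    return std::all_of(v.begin(), nth, [&](double x) { return !(v[n] < x); })
        && std::all_of(nth, v.end(), [&](double x) { return !(x < v[n]); });
}

// Hypothetical helper: runs std::nth_element on a copy of v and checks both
// guarantees: the right value sits at position n, and v is partitioned there.
bool nthElementKeepsItsPromise(std::vector<double> v, std::size_t n)
{
    assert(n < v.size());
    std::vector<double> sorted = v;  // reference result from a full sort
    std::sort(sorted.begin(), sorted.end());

    std::nth_element(v.begin(), v.begin() + static_cast<std::ptrdiff_t>(n), v.end());
    return v[n] == sorted[n] && isPartitionedAt(v, n);
}
```

Whether the two halves come back sorted, partly sorted, or shuffled is not part of this contract, which is why the check deliberately ignores their internal order.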

Since standard library authors will usually opt for the fastest solution, most nth_element implementations (and, for that matter, sort implementations) use customized algorithms for small inputs (or for small segments at the bottom of the recursion), which may sort the container more aggressively than seems necessary. For vectors of scalar values, insertion sort is extremely fast, since it takes maximum advantage of the cache. With streaming extensions, it is possible to speed it up even more by doing parallel compares.
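
To illustrate the idea, here is a minimal sketch of a selection routine that falls back to insertion sort for small ranges. It is not the code of any particular standard library: the names hybridSelect, insertionSort and kSmallRange are hypothetical, the cutoff of 32 is an assumption chosen to match the observation in the question, and the pivot choice is deliberately naive.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Iter = std::vector<double>::iterator;

// Assumed cutoff, for illustration only; real libraries pick their own value.
constexpr std::ptrdiff_t kSmallRange = 32;

// Insertion sort on [first, last), written with std::upper_bound and std::rotate.
void insertionSort(Iter first, Iter last)
{
    for (Iter it = first; it != last; ++it)
        std::rotate(std::upper_bound(first, it, *it), it, it + 1);
}

// Quickselect-style narrowing; once the range is small it is simply sorted in full.
void hybridSelect(Iter first, Iter nth, Iter last)
{
    assert(first <= nth && nth <= last);
    while (last - first > kSmallRange) {
        const double pivot = *(first + (last - first) / 2);  // naive middle pivot
        // Three-way partition: [first, lo) < pivot, [lo, hi) == pivot, [hi, last) > pivot.
        Iter lo = std::partition(first, last, [&](double x) { return x < pivot; });
        Iter hi = std::partition(lo, last, [&](double x) { return !(pivot < x); });
        if (nth < lo)      last = lo;   // target lies among the smaller elements
        else if (nth < hi) return;      // target equals the pivot: already in place
        else               first = hi;  // target lies among the larger elements
    }
    insertionSort(first, last);  // small range: a full sort is the cheap option
}
```

With this sketch, any input of at most kSmallRange elements comes back completely sorted, while larger inputs are only sorted in the final small segment around the target position, which is consistent with the behaviour described in the question.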

By the way, you can save a tiny amount of calculation, and arguably make the code more readable, by computing the threshold iterator only once:

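#include <algorithm>
#include <vector>

double percentile(std::vector<double> &vectorIn, double percent)
{
    // Compute the target iterator once; the cast makes the truncation to an index explicit.
    auto nth = vectorIn.begin()
             + static_cast<std::vector<double>::difference_type>((percent*vectorIn.size())/100);
    std::nth_element(vectorIn.begin(), nth, vectorIn.end());
    return *nth;  // requires 0 <= percent < 100 and a non-empty vector, otherwise nth may be end()
}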
rici
  • I cannot vote yet, so first of all: thanks. Do you have any comments on the plot I added? – stack_horst Feb 16 '15 at 21:11
  • @stack_horst: nice graph. But there are too many variables and I don't know the details of the Windows std:: implementation. Do you search for sorted runs throughout the vector or just up to the partition point? What was the range of the random percentile? and is it restricted to integer percentages? – rici Feb 16 '15 at 22:05
  • I am searching throughout the whole vector. The 1e5 input vectors each held 1e4 double values randomly distributed between 0 and 100, and the percentile was a random double between 0 and 100. – stack_horst Feb 16 '15 at 22:14