51

Let's say I need to retrieve the median from a sequence of 1000000 random numeric values.

If using anything but std::list, I have no (built-in) way to sort sequence for median calculation.

If using std::list, I can't randomly access values to retrieve middle (median) of sorted sequence.

Is it better to implement sorting myself and go with e.g. std::vector, or is it better to use std::list and use std::list::iterator to for-loop-walk to the median value? The latter seems less overheadish, but also feels more ugly..

Or are there more and better alternatives for me?

xskxzr
  • 10,946
  • 11
  • 32
  • 70
sharkin
  • 11,386
  • 19
  • 83
  • 119

10 Answers10

108

Any random-access container (like std::vector) can be sorted with the standard std::sort algorithm, available in the <algorithm> header.

For finding the median, it would be quicker to use std::nth_element; this does enough of a sort to put one chosen element in the correct position, but doesn't completely sort the container. So you could find the median like this:

int median(vector<int> &v)
{
    size_t n = v.size() / 2;
    nth_element(v.begin(), v.begin()+n, v.end());
    return v[n];
}
Mike Seymour
  • 235,407
  • 25
  • 414
  • 617
  • 2
    Huh. I didn't realize that `nth_element` existed, I apparently re-implemented it in my answer... – ephemient Nov 12 '09 at 03:10
  • 7
    It should be noted that `nth_element` modifies the vector in unpredictable ways! You might want to sort a vector of indexes if necessary. – Matthieu M. Nov 12 '09 at 13:27
  • 46
    If the number of items is even, the median is the average of the middle *two*. – sje397 Jul 02 '10 at 01:12
  • 5
    @sje397 true, this algorithm is incorrect half of the times, namely when the vector contains an even number of elements. Is calling the nth_element function 2 times (for the 2 middle elements) costlier than calling sort once? Thanks. – Agostino Jan 21 '15 at 14:06
  • Sorting only half the vector using partial_sort() also does the job and allows to get the right median for even-sized vectors. Has anybody checked how expensive nth_element is if called twice? – Fabian Jul 10 '18 at 07:04
  • 1
    @Fabian partial_sort is still O(N*log(N)) and nth_element is O(N) (or O(2N) if done twice, which is still linear) so I would expect nth_element to be faster as N increases, but I haven't done any analysis to confirm that. – ClydeTheGhost Dec 04 '18 at 22:22
  • 1
    @sje397 according to this definition https://mathworld.wolfram.com/StatisticalMedian.html of median it's true that this current answer is wrong when the vector size is even... but just for fun consider what it is written in Numerical Recipes: _When N is even, statistics books define the median as the arithmetic mean of the elements k = N/2 and k = N/2 + 1 (that is, N/2 from the bottom and N/2 from the top). If you accept such pedantry, you must perform two separate selections to find these elements. For N > 100 we usually define k = N/2 to be the median element, **pedants be damned**._ – Alessandro Jacopson Aug 05 '20 at 08:23
38

The median is more complex than Mike Seymour's answer. The median differs depending on whether there are an even or an odd number of items in the sample. If there are an even number of items, the median is the average of the middle two items. This means that the median of a list of integers can be a fraction. Finally, the median of an empty list is undefined. Here is code that passes my basic test cases:

///Represents the exception for taking the median of an empty list
class median_of_empty_list_exception:public std::exception{
  virtual const char* what() const throw() {
    return "Attempt to take the median of an empty list of numbers.  "
      "The median of an empty list is undefined.";
  }
};

///Return the median of a sequence of numbers defined by the random
///access iterators begin and end.  The sequence must not be empty
///(median is undefined for an empty set).
///
///The numbers must be convertible to double.
template<class RandAccessIter>
double median(RandAccessIter begin, RandAccessIter end) 
  throw(median_of_empty_list_exception){
  if(begin == end){ throw median_of_empty_list_exception(); }
  std::size_t size = end - begin;
  std::size_t middleIdx = size/2;
  RandAccessIter target = begin + middleIdx;
  std::nth_element(begin, target, end);

  if(size % 2 != 0){ //Odd number of elements
    return *target;
  }else{            //Even number of elements
    double a = *target;
    RandAccessIter targetNeighbor= target-1;
    std::nth_element(begin, targetNeighbor, end);
    return (a+*targetNeighbor)/2.0;
  }
}
Eponymous
  • 4,271
  • 3
  • 37
  • 40
  • 19
    I know this is from forever ago, but because I just found this on the google: `std::nth_element` actually also guarantees that any preceding elements are <= the target and any following elements are >=. So you could just use `targetNeighbor = std::min_element(begin, target)` and skip the partial sort, which is probably a little bit faster. (`nth_element` is on-average linear, while `min_element` is obviously linear.) And even if you'd rather use `nth_element` again, it'd be equivalent and probably a little faster to just do `nth_element(begin, targetNeighbor, target)`. – Danica Feb 08 '12 at 21:11
  • 10
    @Dougal I take it you meant `targetNeighbor = std::max_element(begin, target)` in this case? – izak May 09 '13 at 02:41
  • @Dougal I know this comment is from forever ago ;), but I have no clue how your approach is supposed to work, are you sure that this gives the correct result? – 463035818_is_not_a_number Sep 02 '16 at 18:42
  • @tobi303 Your forever is twice as long as mine. :) And yes, it definitely should: the point is that after calling `std::nth_element`, the sequence is like `[smaller_than_target, target, bigger_than_target]`. So you know that the `target-1`th element is in the first half of the array, and you only need to find the max of the elements before `target` to get the median. – Danica Sep 02 '16 at 18:47
  • @Dougal ah now I got it. Thanks – 463035818_is_not_a_number Sep 02 '16 at 19:53
  • Note that `nth_element()` doesn't work (at least in g++) when target is equal to `end()`. So you have two more special cases: when the range is just 1 or 2 elements. – Alexis Wilke Oct 09 '18 at 15:54
  • @AlexisWilke why would these be special cases? If there's 1 element, then `size=1`, `middleIdx=0`, `target=begin`, so `nth_element` is a no-op. If there're 2 elements, `size=2`, `middleIdx=1`, `target=begin+1=end-1`, so the first `nth_element` is called with `(begin, end-1, end)` and the second one with `(begin,begin,end)`. Nowhere does `target` equal `end. – Ruslan May 16 '19 at 11:16
14

This algorithm handles both even and odd sized inputs efficiently using the STL nth_element (amortized O(N)) algorithm and the max_element algorithm (O(n)). Note that nth_element has another guaranteed side effect, namely that all of the elements before n are all guaranteed to be less than v[n], just not necessarily sorted.

//post-condition: After returning, the elements in v may be reordered and the resulting order is implementation defined.
double median(vector<double> &v)
{
  if(v.empty()) {
    return 0.0;
  }
  auto n = v.size() / 2;
  nth_element(v.begin(), v.begin()+n, v.end());
  auto med = v[n];
  if(!(v.size() & 1)) { //If the set size is even
    auto max_it = max_element(v.begin(), v.begin()+n);
    med = (*max_it + med) / 2.0;
  }
  return med;    
}
Matthew Fioravante
  • 1,320
  • 13
  • 18
  • 1
    I like your answer but returning zero when the vector is empty is not suitable to my application where I would prefer an exception in case of an empty vector. – Alessandro Jacopson Aug 05 '20 at 08:18
11

Here's a more complete version of Mike Seymour's answer:

// Could use pass by copy to avoid changing vector
double median(std::vector<int> &v)
{
  size_t n = v.size() / 2;
  std::nth_element(v.begin(), v.begin()+n, v.end());
  int vn = v[n];
  if(v.size()%2 == 1)
  {
    return vn;
  }else
  {
    std::nth_element(v.begin(), v.begin()+n-1, v.end());
    return 0.5*(vn+v[n-1]);
  }
}

It handles odd- or even-length input.

Alec Jacobson
  • 5,326
  • 4
  • 41
  • 75
  • 1
    For pass by copy, did you mean to remove the reference (`&`) on the input? – chappjc Jun 17 '14 at 17:00
  • 1
    I just meant that comment as a note that one _could_ use pass by copy, in which case yes one should remove the `&`. – Alec Jacobson Jun 17 '14 at 18:18
  • There is a bug in this version. You need to extract `v[n]` before doing nth_element again because after the second round `v[n]` may contain a different value. – Matthew Fioravante Dec 05 '14 at 16:44
  • 1
    @MatthewFioravante, I see. According to the [docs](http://en.cppreference.com/w/cpp/algorithm/nth_element), I guess nth_element does not need to be stable. (edited my answer, accordingly). – Alec Jacobson Dec 11 '14 at 16:06
  • 3
    Instead of calling `nth_element` a second time, wouldn't it be much more efficient to just iterate from `v[0]` to `v[n]` and determine the maximum in that half? – bluenote10 Oct 23 '16 at 10:19
5

putting together all the insights from this thread I ended up having this routine. it works with any stl-container or any class providing input iterators and handles odd- and even-sized containers. It also does its work on a copy of the container, to not modify the original content.

template <typename T = double, typename C>
inline const T median(const C &the_container)
{
    std::vector<T> tmp_array(std::begin(the_container), 
                             std::end(the_container));
    size_t n = tmp_array.size() / 2;
    std::nth_element(tmp_array.begin(), tmp_array.begin() + n, tmp_array.end());

    if(tmp_array.size() % 2){ return tmp_array[n]; }
    else
    {
        // even sized vector -> average the two middle values
        auto max_it = std::max_element(tmp_array.begin(), tmp_array.begin() + n);
        return (*max_it + tmp_array[n]) / 2.0;
    }
}
Croc Dialer
  • 51
  • 1
  • 2
  • As Matthew Fioravante http://stackoverflow.com/questions/1719070/what-is-the-right-approach-when-using-stl-container-for-median-calculation#comment43100577_24235744 has mentioned, "You need to extract v[n] before doing nth_element again because after the second round v[n] may contain a different value." So, let med = tmp_array[n], then the correct return line is: return (*max_it + med) / 2.0; – trig-ger Jan 19 '17 at 11:07
  • 4
    @trig-ger nth_element is only used once in this solution. It is not a problem. – denver Jan 31 '17 at 18:55
  • `static_assert(std::is_same_v, "mismatched container and element types")` maybe? – einpoklum May 21 '21 at 16:17
4

You can sort an std::vector using the library function std::sort.

std::vector<int> vec;
// ... fill vector with stuff
std::sort(vec.begin(), vec.end());
Charles Salvia
  • 48,775
  • 12
  • 118
  • 138
2

There exists a linear-time selection algorithm. The below code only works when the container has a random-access iterator, but it can be modified to work without — you'll just have to be a bit more careful to avoid shortcuts like end - begin and iter + n.

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <sstream>
#include <vector>

template<class A, class C = std::less<typename A::value_type> >
class LinearTimeSelect {
public:
    LinearTimeSelect(const A &things) : things(things) {}
    typename A::value_type nth(int n) {
        return nth(n, things.begin(), things.end());
    }
private:
    static typename A::value_type nth(int n,
            typename A::iterator begin, typename A::iterator end) {
        int size = end - begin;
        if (size <= 5) {
            std::sort(begin, end, C());
            return begin[n];
        }
        typename A::iterator walk(begin), skip(begin);
#ifdef RANDOM // randomized algorithm, average linear-time
        typename A::value_type pivot = begin[std::rand() % size];
#else // guaranteed linear-time, but usually slower in practice
        while (end - skip >= 5) {
            std::sort(skip, skip + 5);
            std::iter_swap(walk++, skip + 2);
            skip += 5;
        }
        while (skip != end) std::iter_swap(walk++, skip++);
        typename A::value_type pivot = nth((walk - begin) / 2, begin, walk);
#endif
        for (walk = skip = begin, size = 0; skip != end; ++skip)
            if (C()(*skip, pivot)) std::iter_swap(walk++, skip), ++size;
        if (size <= n) return nth(n - size, walk, end);
        else return nth(n, begin, walk);
    }
    A things;
};

int main(int argc, char **argv) {
    std::vector<int> seq;
    {
        int i = 32;
        std::istringstream(argc > 1 ? argv[1] : "") >> i;
        while (i--) seq.push_back(i);
    }
    std::random_shuffle(seq.begin(), seq.end());
    std::cout << "unordered: ";
    for (std::vector<int>::iterator i = seq.begin(); i != seq.end(); ++i)
        std::cout << *i << " ";
    LinearTimeSelect<std::vector<int> > alg(seq);
    std::cout << std::endl << "linear-time medians: "
        << alg.nth((seq.size()-1) / 2) << ", " << alg.nth(seq.size() / 2);
    std::sort(seq.begin(), seq.end());
    std::cout << std::endl << "medians by sorting: "
        << seq[(seq.size()-1) / 2] << ", " << seq[seq.size() / 2] << std::endl;
    return 0;
}
ephemient
  • 180,829
  • 34
  • 259
  • 378
2

Here is an answer that considers the suggestion by @MatthieuM. ie does not modify the input vector. It uses a single partial sort (on a vector of indices) for both ranges of even and odd cardinality, while empty ranges are handled with exceptions thrown by a vector's at method:

double median(vector<int> const& v)
{
    bool isEven = !(v.size() % 2); 
    size_t n    = v.size() / 2;

    vector<size_t> vi(v.size()); 
    iota(vi.begin(), vi.end(), 0); 

    partial_sort(begin(vi), vi.begin() + n + 1, end(vi), 
        [&](size_t lhs, size_t rhs) { return v[lhs] < v[rhs]; }); 

    return isEven ? 0.5 * (v[vi.at(n-1)] + v[vi.at(n)]) : v[vi.at(n)];
}

Demo

Lorah Attkins
  • 3,879
  • 2
  • 22
  • 52
1

Armadillo has an implementation that looks like the one in the answer https://stackoverflow.com/a/34077478 by https://stackoverflow.com/users/2608582/matthew-fioravante

It uses one call to nth_element and one call to max_element and it is here: https://gitlab.com/conradsnicta/armadillo-code/-/blob/9.900.x/include/armadillo_bits/op_median_meat.hpp#L380

//! find the median value of a std::vector (contents is modified)
template<typename eT>
inline 
eT
op_median::direct_median(std::vector<eT>& X)
  {
  arma_extra_debug_sigprint();
  
  const uword n_elem = uword(X.size());
  const uword half   = n_elem/2;
  
  typename std::vector<eT>::iterator first    = X.begin();
  typename std::vector<eT>::iterator nth      = first + half;
  typename std::vector<eT>::iterator pastlast = X.end();
  
  std::nth_element(first, nth, pastlast);
  
  if((n_elem % 2) == 0)  // even number of elements
    {
    typename std::vector<eT>::iterator start   = X.begin();
    typename std::vector<eT>::iterator pastend = start + half;
    
    const eT val1 = (*nth);
    const eT val2 = (*(std::max_element(start, pastend)));
    
    return op_mean::robust_mean(val1, val2);
    }
  else  // odd number of elements
    {
    return (*nth);
    }
  }
Alessandro Jacopson
  • 16,221
  • 13
  • 91
  • 139
0
you can use this approch. It also takes care of sliding window.
Here days are no of trailing elements for which we want to find median and this makes sure the original container is not changed


#include<bits/stdc++.h>

using namespace std;

int findMedian(vector<int> arr, vector<int> brr, int d, int i)
{
    int x,y;
    x= i-d;
    y=d;
    brr.assign(arr.begin()+x, arr.begin()+x+y);


    sort(brr.begin(), brr.end());

    if(d%2==0)
    {
        return((brr[d/2]+brr[d/2 -1]));
    }

    else
    {
        return (2*brr[d/2]);
    }

    // for (int i = 0; i < brr.size(); ++i)
    // {
    //     cout<<brr[i]<<" ";
    // }

    return 0;

}

int main()
{
    int n;
    int days;
    int input;
    int median;
    int count=0;

    cin>>n>>days;

    vector<int> arr;
    vector<int> brr;

    for (int i = 0; i < n; ++i)
    {
        cin>>input;
        arr.push_back(input);
    }

    for (int i = days; i < n; ++i)
    {
        median=findMedian(arr,brr, days, i);

        
    }



    return 0;
}