Efficient way to get middle (median) of an std::set?

Question

std::set is a sorted tree. It provides begin and end methods, so I can get the minimum and maximum, and lower_bound and upper_bound for binary search. But what if I want an iterator pointing to the middle element (or to one of the two middle elements if the number of elements is even)?

Is there an efficient way (O(log(size)) not O(size)) to do that?

{1} => 1
{1,2} => 1 or 2
{1,2,3} => 2
{1,2,3,4} => 2 or 3 (but in the same direction from middle as for {1,2})
{1,312,10000,14000,152333} => 10000

PS: Same question in Russian.

Sorted binary tree may be and usually is implementation detail of std::set but that is not required. If you need sorted array or a binary tree then it is better to use what you need. — Öö Tiib, Nov 19 '17 at 11:41
@ÖöTiib, I need to dynamically insert elements and get the middle of the set. A sorted array/vector makes insertion `O(n)`, but I'd like both insertion and query to work in `O(lb(n))`. I know that a Cartesian tree (treap) with implicit keys allows that, but I don't want to implement one and hoped that `std::set` is good enough to achieve it. — Qwertiy, Nov 19 '17 at 12:07
@Qwertiy in most use cases inserting into a vector will be very fast due to cache locality. `std::set`, as well as linked lists, use pointers to child elements scattered everywhere, so it may be slower in many cases. Read [Why you should never, ever, EVER use linked-list in your code again](https://kjellkod.wordpress.com/2012/02/25/why-you-should-never-ever-ever-use-linked-list-in-your-code-again/), [Bjarne Stroustrup: Why you should avoid Linked Lists](https://youtu.be/YQs6IC-vgmo), [Are lists evil?](https://isocpp.org/blog/2014/06/stroustrup-lists) — phuclv, Nov 19 '17 at 12:45
Do you really need to have sorted elements or just the min, max and median? In the latter case, consider using `std::nth_element` and a `std::vector`. — D Drmmr, Nov 19 '17 at 13:40
@DDrmmr, I need only the median, but in logarithmic time, not with a full scan. Currently I think that the idea of keeping the corresponding iterator is the best one. — Qwertiy, Nov 19 '17 at 14:22

pmdj · Accepted Answer · 2017-11-19T13:35:22.180

Depending on how often you insert/remove items versus look up the middle/median, a possibly more efficient solution than the obvious one is to keep a persistent iterator to the middle element and update it whenever you insert/delete items from the set. There are a bunch of edge cases which will need handling (odd vs even number of items, removing the middle item, empty set, etc.), but the basic idea would be that when you insert an item that's smaller than the current middle item, your middle iterator may need decrementing, whereas if you insert a larger one, you need to increment. It's the other way around for removals.

At lookup time, this is of course O(1), but it also adds an essentially O(1) cost on top of each insertion/deletion, i.e. O(N) extra work after N insertions, which needs to be amortised across a sufficient number of lookups to make it more efficient than brute forcing.
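The bookkeeping described above can be sketched as follows. This is only one possible reading of the idea, not the answerer's own code: the class name `MedianSet`, its member names, and the choice to track the lower median (index `(size - 1) / 2`) are assumptions made for illustration, and the sketch assumes C++17 for structured bindings.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical wrapper: a std::set<int> plus an iterator that is kept on
// the lower median, i.e. the element at index (size - 1) / 2.
class MedianSet {
public:
    // Returns false if the value was already present.
    bool insert(int value) {
        auto [pos, inserted] = items_.insert(value);
        if (!inserted) return false;
        if (items_.size() == 1) {
            mid_ = pos;
        } else if (value < *mid_) {
            // Landed left of mid_: step left when the size became even.
            if (items_.size() % 2 == 0) --mid_;
        } else {
            // Landed right of mid_: step right when the size became odd.
            if (items_.size() % 2 == 1) ++mid_;
        }
        return true;
    }

    // Returns false if the value was not present.
    bool erase(int value) {
        auto pos = items_.find(value);
        if (pos == items_.end()) return false;
        if (items_.size() == 1) {
            items_.erase(pos);
            mid_ = items_.end();
            return true;
        }
        // Adjust mid_ before erasing so it never refers to the erased node.
        if (items_.size() % 2 == 1) {
            if (value >= *mid_) --mid_;  // odd size: new median is one to the left
        } else {
            if (value <= *mid_) ++mid_;  // even size: new median is one to the right
        }
        items_.erase(pos);
        return true;
    }

    // Precondition: the set is not empty.
    int median() const {
        assert(!items_.empty());
        return *mid_;
    }

    std::size_t size() const { return items_.size(); }

private:
    std::set<int> items_;
    std::set<int>::iterator mid_ = items_.end();
};
```

Each call does one ordinary O(log n) set operation plus at most one iterator step, and `median()` is O(1). Copying such an object would leave `mid_` pointing into the original set, so a real implementation would need to handle or forbid copies.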

Clark · Answer 2 · 2019-08-05T08:36:44.410

The suggestion quoted below is pure magic and will fail if there are duplicated items:

Depending on how often you insert/remove items versus look up the middle/median, a possibly more efficient solution than the obvious one is to keep a persistent iterator to the middle element and update it whenever you insert/delete items from the set. There are a bunch of edge cases which will need handling (odd vs even number of items, removing the middle item, empty set, etc.), but the basic idea would be that when you insert an item that's smaller than the current middle item, your middle iterator may need decrementing, whereas if you insert a larger one, you need to increment. It's the other way around for removals.

Suggestions

my first suggestion is to use a std::multiset instead of a std::set, so that it works when items can be duplicated
my second suggestion is to use 2 multisets to track the smaller portion and the bigger portion, and to balance the sizes between them

Algorithm

1. keep the sets balanced, so that size_of_small==size_of_big or size_of_small + 1 == size_of_big

void balance(multiset<int> &small, multiset<int> &big)
{
    while (true)
    {
        int ssmall = small.size();
        int sbig = big.size();

        if (ssmall == sbig || ssmall + 1 == sbig) break; // OK

        if (ssmall < sbig)
        {
            // big to small
            auto v = big.begin();
            small.emplace(*v);
            big.erase(v);
        }
        else 
        {
            // small to big
            auto v = small.end();
            --v;
            big.emplace(*v);
            small.erase(v);
        }
    }
}

2. if the sets are balanced, the median item is always the first item in the big set

auto median = big.begin();
cout << *median << endl;

3. take care when adding a new item

auto v = big.begin();
if (v != big.end() && new_item > *v)
    big.emplace(new_item );
else
    small.emplace(new_item );

balance(small, big);

Complexity explained

finding the median value is O(1)
adding a new item takes O(log n)
you can still search for an item in O(log n), but you need to search both sets
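The three steps and the two-set lookup can be put together in one small class. This is a sketch under assumptions rather than the answer's exact code: the name `TwoSetMedian` and its members are made up for illustration, and the rebalancing is simplified to a single move, which is enough when it runs after every single insertion.

```cpp
#include <cassert>
#include <iterator>
#include <set>

// Hypothetical wrapper around the two-multiset scheme.
// Invariants: every element of small_ <= every element of big_, and
// big_.size() is either small_.size() or small_.size() + 1.
class TwoSetMedian {
public:
    void add(int value) {
        if (!big_.empty() && value > *big_.begin())
            big_.insert(value);
        else
            small_.insert(value);
        rebalance();
    }

    // Searches both halves; each find is O(log n).
    bool contains(int value) const {
        return small_.find(value) != small_.end() ||
               big_.find(value) != big_.end();
    }

    // Precondition: at least one value has been added.
    int median() const {
        assert(!big_.empty());
        return *big_.begin();
    }

private:
    // After one insertion at most one element has to change sides.
    void rebalance() {
        if (big_.size() > small_.size() + 1) {
            small_.insert(*big_.begin());
            big_.erase(big_.begin());
        } else if (small_.size() > big_.size()) {
            auto last = std::prev(small_.end());
            big_.insert(*last);
            small_.erase(last);
        }
    }

    std::multiset<int> small_;
    std::multiset<int> big_;
};
```

With this layout the reported median is the element at index `size / 2` of the combined sorted sequence, so for two elements it is the larger one.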

Adding is O(log(n)) not O(n). Anyway, keeping the median worked fine for me. — Qwertiy, Apr 04 '19 at 10:13
For me it seems that you answered question "Efficient way to get middle (median) of an std::multiset?" since `std::set` cannot `fail if there are some duplicated items`, as by definition it cannot have such. I'd suggest you to create new question about `std::multiset` and move this answer there. PS. Mods can move answer between questions without losing its score. — R2RT, Aug 05 '19 at 08:45

score 8 · Answer 3 · answered Nov 19 '17 at 11:55

Getting the middle of a std::set by iteration is going to be O(size), because its iterators are bidirectional rather than random access. You can do it with std::advance() as follows:

std::set<int>::iterator it = s.begin();
std::advance(it, s.size() / 2);
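
The same linear walk can be wrapped in a small helper that also copes with an empty set. The names `middle` and `middle_example` are hypothetical and used only for illustration; `std::next` does the same stepping as `std::advance` but returns the moved iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>

// Hypothetical helper: iterator to the element at index size / 2, or end()
// for an empty set. The cost is linear in the size of the set.
template <typename T>
typename std::set<T>::const_iterator middle(const std::set<T>& s) {
    return std::next(s.begin(), static_cast<std::ptrdiff_t>(s.size() / 2));
}

// Usage example with the sets from the question.
inline void middle_example() {
    const std::set<int> odd{1, 2, 3};
    const std::set<int> even{1, 2, 3, 4};
    const std::set<int> empty;
    assert(*middle(odd) == 2);
    assert(*middle(even) == 3);   // upper of the two middle elements
    assert(middle(empty) == empty.end());
}
```

Pass a named set rather than a temporary, since the returned iterator refers into the argument.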

Martin Broadhurst

I think Martin means O(height), where the height of a *balanced* binary tree is logarithmic in the size of the tree. – chepner Nov 19 '17 at 18:11
@chepner, nope, `std::advance` just calls `++` corresponding number of times in this case. – Qwertiy Nov 19 '17 at 22:44

Norgannon · Answer 4 · score 4 · 2019-02-08T12:22:44.540

Be aware that the std::set does NOT store duplicate values. If you insert the following values {1, 2, 3, 3, 3, 3, 3, 3, 3}, the median you will retrieve is 2.

std::set<int>::iterator it = s.begin();
std::advance(it, s.size() / 2);
int median = *it;

If you want to include duplicates when computing the median, you can use std::multiset (for {1, 2, 3, 3, 3, 3, 3, 3, 3} the median would be 3):

std::multiset<int>::iterator it = s.begin();
std::advance(it, s.size() / 2);
int median = *it;

If the only reason you want the data sorted is to get the median, you are better off with a plain old std::vector + std::sort, in my opinion.

With a large test sample and multiple iterations, I completed the test in 5s with std::vector and std::sort, and in 13 to 15s with either std::set or std::multiset. Your mileage may vary depending on the size of the data and the number of duplicate values you have.
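
For a one-off median of data that is already in a vector, a full sort is not strictly needed. The sketch below swaps in `std::nth_element`, the technique D Drmmr suggests in the comments on the question, rather than the `std::sort` used in this answer; the helper name `median_of` is an assumption made for illustration. It does not help with the question's dynamic-insertion case, where every query would have to redo the selection.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the element that would sit at index size / 2
// if the data were sorted. The vector is taken by value because
// std::nth_element reorders its input.
// Precondition: data is not empty.
int median_of(std::vector<int> data) {
    assert(!data.empty());
    auto mid = data.begin() + static_cast<std::ptrdiff_t>(data.size() / 2);
    std::nth_element(data.begin(), mid, data.end());
    return *mid;
}
```

Duplicates are counted here, as with std::multiset, so the input {1, 2, 3, 3, 3, 3, 3, 3, 3} yields 3.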

edited Feb 08 '19 at 12:22

answered Feb 08 '19 at 11:23

How is it related to my question? – Qwertiy Feb 08 '19 at 11:59
I think in most use cases, when you want the median, you want it from the full set of data and not from the subset of unique values. I made that mistake myself, so I thought I would mention `std::multiset` to keep someone else from making it too. But you are right, it does not directly answer the question. Still, more information in secondary answers can't hurt, right? – Norgannon Feb 08 '19 at 12:18

score 0 · Answer 5 · answered May 25 '21 at 08:20

As suggested by @pmdj, we use an iterator to keep track of the middle element. Below is a code implementation of that idea:

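#include <iterator>  // next
#include <set>
using namespace std;

class RollingMedian {
public:
    multiset<int> order;
    multiset<int>::iterator it;  // lower median: index (size - 1) / 2

    RollingMedian() {}

    void add(int val) {
        order.insert(val);
        if (order.size() == 1) {
            it = order.begin();
        } else {
            // val landed before *it: step back when the size became even
            if (val < *it and order.size() % 2 == 0) {
                --it;
            }
            // val landed after *it: step forward when the size became odd
            if (val >= *it and order.size() % 2 != 0) {
                ++it;
            }
        }
    }

    // Precondition: at least one value has been added.
    double median() {
        if (order.size() % 2 != 0) {
            return double(*it);
        } else {
            // even size: average the two middle elements
            auto one = *it, two = *next(it);
            return double(one + two) / 2.0;
        }
    }
};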

Feel free to copy and use any part of this code. You can also use set instead of multiset if there are no repeated values.

score -1 · Answer 6 · answered Nov 27 '17 at 18:32

If your data is static, so that you can precompute it and never insert new elements, it is simpler to use a vector, sort it, and access the median by index in O(1):

vector<int> data;
// fill data
std::sort(data.begin(), data.end());
auto median = data[data.size() / 2];

ALEXANDER KONSTANTINOV

But you can't get a median in O(1) – ALEXANDER KONSTANTINOV Jul 12 '18 at 06:25