I'm a programming student, and one of the things I have to do for a project I'm working on is compute the median value of a vector of int values; this must be done by passing the vector through functions. The vector is initially generated randomly using the C++ random generator mt19937, which I have already written in my code. I'm to do this using the sort function and vector member functions such as .begin(), .end(), and .size().

I'm supposed to make sure I find the median value of the vector and then output it.

I'm stuck; my attempt is included below. Where am I going wrong? I would appreciate any pointers or resources to get me going in the right direction.

Code:

#include<iostream>
#include<vector>
#include<cstdlib>
#include<ctime>
#include<random>

using namespace std;

double find_median(vector<double>);
double find_median(vector<double> len)
{
    {
        int i;
        double temp;
        int n=len.size();
        int mid;
        double median;
        bool swap;

        do
        {
            swap = false;
            for (i = 0; i< len.size()-1; i++)
            {
                if (len[i] > len[i + 1])
                {
                    temp = len[i];
                    len[i] = len[i + 1];
                    len[i + 1] = temp;
                    swap = true;
                }
            }
        }
        while (swap);

        for (i=0; i<len.size(); i++)
        {
            if (len[i]>len[i+1])
            {
                temp=len[i];
                len[i]=len[i+1];
                len[i+1]=temp;
            }
            mid=len.size()/2;
            if (mid%2==0)
            {
                median= len[i]+len[i+1];
            }
            else
            {
                median= (len[i]+0.5);
            }
        }
        return median;
    }
}
    int main()
    {
        int n,i;

        cout<<"Input the vector size: "<<endl;
        cin>>n;
        vector <double> foo(n);
        mt19937 rand_generator;
        rand_generator.seed(time(0));

        uniform_real_distribution<double> rand_distribution(0,0.8);
        cout<<"original vector: "<<" ";
        for (i=0; i<n; i++)
        {
            double rand_num=rand_distribution(rand_generator);
            foo[i]=rand_num;
            cout<<foo[i]<<" ";

        }
        double median;
        median=find_median(foo);

        cout<<endl;

        cout<<"The median of the vector is:  "<<" ";
        cout<<median<<endl;
    }
Nicole Alison

3 Answers

The median is given by

const auto median_it = len.begin() + len.size() / 2;
std::nth_element(len.begin(), median_it , len.end());
auto median = *median_it;

For a vector with an even number of elements you need to be a bit more precise, because the median is then the mean of the two middle elements. E.g., you can use

assert(!len.empty());
if (len.size() % 2 == 0) {
    const auto median_it1 = len.begin() + len.size() / 2 - 1;
    const auto median_it2 = len.begin() + len.size() / 2;

    std::nth_element(len.begin(), median_it1 , len.end());
    const auto e1 = *median_it1;

    std::nth_element(len.begin(), median_it2 , len.end());
    const auto e2 = *median_it2;

    return (e1 + e2) / 2;

} else {
    const auto median_it = len.begin() + len.size() / 2;
    std::nth_element(len.begin(), median_it , len.end());
    return *median_it;
}

There are of course many different ways to get element e1; we could also take the maximum of the elements before median_it2, for example. Obtaining it explicitly is important, though, because nth_element only places the nth element correctly: the remaining elements are partitioned so that those before it are not greater and those after it are not less, but neither of those two ranges is sorted.
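
As a rough sketch of the "use max" variant mentioned above (the function name `median_via_max` is made up for illustration and is not part of the original answer): a single `std::nth_element` call places the upper middle element, and for an even size the lower middle element is then the largest value in the partition before it, which `std::max_element` can find.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: median via one nth_element call plus max_element.
// Takes the vector by value, so the caller's data is left untouched.
double median_via_max(std::vector<double> values)
{
    assert(!values.empty());

    // Place the upper middle element; everything before it is <= *mid.
    const auto mid = values.begin() + values.size() / 2;
    std::nth_element(values.begin(), mid, values.end());

    if (values.size() % 2 != 0)
        return *mid;                      // odd size: single middle element

    // Even size: the lower middle element is the largest value in the
    // (unsorted) partition before mid. That range is non-empty here
    // because an even, non-zero size is at least 2.
    const auto lower = std::max_element(values.begin(), mid);
    return (*lower + *mid) / 2;
}
```

Both steps are linear on average, so this should keep the complexity of the two-call version while making the partitioning guarantee explicit.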

This code is guaranteed to have linear complexity on average, i.e., O(N); it is therefore asymptotically better than sort, which is O(N log N).


Regarding your code:

    for (i=0; i<len.size(); i++){
        if (len[i]>len[i+1])

This will not work: in the last iteration you access len[len.size()], which does not exist (reading it is undefined behaviour).
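
A minimal sketch of a loop that compares neighbours without stepping past the end (the function name `count_descents` is invented for this example). Writing the condition as `i + 1 < len.size()` also behaves for an empty vector, whereas `i < len.size() - 1` would wrap around there because `size()` is unsigned.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: counts adjacent pairs that are out of order.
// The loop stops while i + 1 is still a valid index, so len[i + 1]
// never refers to len[len.size()].
std::size_t count_descents(const std::vector<double>& len)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i + 1 < len.size(); ++i) {
        if (len[i] > len[i + 1])
            ++count;
    }
    return count;
}
```

During debugging, `len.at(i + 1)` can be used instead of `len[i + 1]`; it throws `std::out_of_range` on a bad index rather than silently reading invalid memory.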

overseas
  • Nice; better than my answer. Upvoted especially now you've done the bit for even numbers. – Bathsheba Mar 14 '17 at 17:02
  • @steve Do you agree now? – overseas Mar 14 '17 at 17:03
  • I wouldn't write the code like this. Using a variable would be so much cleaner. In addition your statement about n log n is incorrect. It uses a partition algorithm, which guarantees o(n) run time complexity as it does not need to sort all the elements, only the nth element. – steve Mar 14 '17 at 17:11
  • You need to reread my argument. I say it is on average O(N), while sort is O(N log N), which is exactly what you say - I'll try to clarify this point. Of course I could write it differently using a variable, but I don't think this is necessary because this is not the main focus of the question. – overseas Mar 14 '17 at 17:13
  • It is guaranteed to be O(n). – steve Mar 14 '17 at 17:16
  • Pre-C++17, it is guaranteed to be O(n) *on average*. Check section 25.4.2 in the standard (document n3690, [alg.nth.element]). It may be O(n) in most implementations, but there will be no guarantee. – overseas Mar 14 '17 at 17:23
  • I don't think that [nth_element](http://en.cppreference.com/w/cpp/algorithm/nth_element) is the correct choice, given that it doesn't guarantee that the first elements in the range would be sorted, but only that _"all of the elements before this new nth element are less than or equal to the elements after the new nth element."_. See [here](https://ideone.com/fDyNNi), for example. – Bob__ Mar 14 '17 at 17:55
  • Thanks @Bob__, very good point! I just edited my code. Do you agree now? Still, I think that `nth_element` is the correct choice, because of asymptotically better performance. Given a reasonably large vector, this will be faster than `sort` very soon (I would be surprised if there were a large constant in `nth_element`). – overseas Mar 14 '17 at 18:06
  • Yes, I agree. It should be faster too, on average. – Bob__ Mar 14 '17 at 18:17
  • For something that works in both even and odd cases, you have `auto index1 = (len.size() - 1) / 2; auto index2 = len.size() / 2;`. If they are equal, then the averaging step *could* be skipped, for efficiency, but the result is correct either way. Don't forget to check `.size() > 0` in any implementation! – Toby Speight Mar 15 '17 at 09:13
  • After the second call to `std::nth_element` , median_it1 might not be pointing to the original value. – Pasan W. Jun 23 '18 at 10:22
std::sort(len.begin(), len.end());
double median = len[len.size() / 2];

will do it. You might need to take the average of the middle two elements if size() is even, depending on your requirements:

0.5 * (len[len.size() / 2 - 1] + len[len.size() / 2]);
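
Putting the two fragments above together, a complete function might look like the sketch below. The name `median_sorted` and the choice to throw on an empty vector are assumptions made for this example; the question only requires `sort`, `.begin()`, `.end()` and `.size()`.

```cpp
#include <algorithm>
#include <stdexcept>
#include <vector>

// Hypothetical helper: median via a full sort.
// The vector is taken by value, so sorting does not disturb the caller's copy.
double median_sorted(std::vector<double> len)
{
    if (len.empty())
        throw std::invalid_argument("median of an empty vector is undefined");

    std::sort(len.begin(), len.end());

    const auto n = len.size();
    if (n % 2 == 0)
        return 0.5 * (len[n / 2 - 1] + len[n / 2]);  // mean of the two middle elements
    return len[n / 2];                               // single middle element
}
```

In the question's `main`, this would be called as `median_sorted(foo)` after the vector has been filled.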
Bathsheba
  • not you might, you have to. This is how the median is defined. – steve Mar 14 '17 at 16:56
  • `std::nth_element` has better complexity than `std::sort`. – Jarod42 Mar 14 '17 at 17:01
  • Indeed, I'll leave this up anyway for good measure: have upvoted the other answer to set it on its way. – Bathsheba Mar 14 '17 at 17:02
  • I think that your solution is better than using [nth_element](http://en.cppreference.com/w/cpp/algorithm/nth_element) which doesn't guarantee that the first elements would be sorted, but only that _"all of the elements before this new nth element are less than or equal to the elements after the new nth element."_. See [here](https://ideone.com/fDyNNi), for example. – Bob__ Mar 14 '17 at 17:58
  • @Bob__ Is that not the requirement for the median? – pingul Mar 14 '17 at 18:01
  • @pingul As far as I know, _"A median is only defined on ordered one-dimensional data"_ (wiki) – Bob__ Mar 14 '17 at 18:12

Instead of trying to do everything at once, you should start with simple test cases and work upwards:

#include<vector>

double find_median(std::vector<double> len);

// Return the number of failures - shell interprets 0 as 'success',
// which suits us perfectly.
int main()
{
    return find_median({0, 1, 1, 2}) != 1;
}

This already fails with your code (even after fixing i to be an unsigned type), so you could start debugging (even 'dry' debugging, where you trace the code through on paper; that's probably enough here).

I do note that with a smaller test case, such as {0, 1, 2}, I get a crash rather than merely failing the test, so there's something that really needs to be fixed.

Let's replace the implementation with one based on overseas's answer:

#include <algorithm>
#include <limits>
#include <vector>

double find_median(std::vector<double> len)
{
    if (len.size() < 1)
        return std::numeric_limits<double>::signaling_NaN();

    const auto alpha = len.begin();
    const auto omega = len.end();

    // Find the two middle positions (they will be the same if size is odd)
    const auto i1 = alpha + (len.size()-1) / 2;
    const auto i2 = alpha + len.size() / 2;

    // Partial sort to place the correct elements at those indexes (it's okay to modify the vector,
    // as we've been given a copy; otherwise, we could use std::partial_sort_copy to populate a
    // temporary vector).
    std::nth_element(alpha, i1, omega);
    std::nth_element(i1, i2, omega);

    return 0.5 * (*i1 + *i2);
}

Now, our test passes. We can write a helper function to allow us to create more tests:

#include <cmath>
#include <iostream>
bool test_median(const std::vector<double>& v, double expected)
{
    auto actual = find_median(v);
    // std::abs from <cmath>; an unqualified abs may resolve to the int overload
    if (std::abs(expected - actual) > 0.01) {
        std::cerr << actual << " - expected " << expected << std::endl;
        return true;
    } else {
        std::cout << actual << std::endl;
        return false;
    }
}

int main()
{
    return test_median({0, 1, 1, 2}, 1)
        +  test_median({5}, 5)
        +  test_median({5, 5, 5, 0, 0, 0, 1, 2}, 1.5);
}

Once you have the simple test cases working, you can manage more complex ones. Only then is it time to create a large array of random values to see how well it scales:

#include <ctime>
#include <functional>
#include <iterator>   // std::back_inserter
#include <random>
#include <string>     // std::stoi

int main(int argc, char **argv)
{
    std::vector<double> foo;

    const int n = argc > 1 ? std::stoi(argv[1]) : 10;
    foo.reserve(n);

    std::mt19937 rand_generator(std::time(0));
    std::uniform_real_distribution<double> rand_distribution(0,0.8);
    std::generate_n(std::back_inserter(foo), n, std::bind(rand_distribution, rand_generator));

    std::cout << "Vector:";
    for (auto v: foo)
        std::cout << ' ' << v;

    std::cout << "\nMedian = " << find_median(foo) << std::endl;
}

(I've taken the number of elements as a command-line argument; that's more convenient in my build than reading it from cin.) Notice that instead of allocating n doubles in the vector, we simply reserve capacity for them, but don't create any until needed.
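
To make the difference between reserving and allocating concrete, here is a small sketch; the helper names `make_reserved` and `make_sized` are invented for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: reserves room for n elements but creates none.
// size() stays 0, so push_back / back_inserter append from index 0.
std::vector<double> make_reserved(std::size_t n)
{
    std::vector<double> v;
    v.reserve(n);
    return v;
}

// Hypothetical helper: creates n value-initialised (0.0) elements up front,
// as `vector<double> foo(n);` does in the question.
std::vector<double> make_sized(std::size_t n)
{
    return std::vector<double>(n);
}
```

With the sized version, elements are overwritten through `foo[i]`; with the reserved version, `foo[i]` would be out of range until elements have actually been appended.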

For fun and kicks, we can now make find_median() generic. I'll leave that as an exercise; I suggest you start with:

template<class Iterator>
auto find_median(Iterator alpha, Iterator omega)
{
    using value_type = typename Iterator::value_type;

    if (alpha == omega)
        return std::numeric_limits<value_type>::signaling_NaN();
}
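
For readers who want to compare notes after trying the exercise, one possible completion is sketched below. It is not the answerer's solution: the name `find_median_generic`, the fixed `double` return type, the internal copy, and throwing `std::range_error` for an empty range (an option raised in the comments) are all assumptions made here.

```cpp
#include <algorithm>
#include <iterator>
#include <stdexcept>
#include <vector>

// Hypothetical generic median: works for any iterator pair whose
// value type converts to double (e.g. int, float, double).
template<class Iterator>
double find_median_generic(Iterator alpha, Iterator omega)
{
    if (alpha == omega)
        throw std::range_error("median of an empty range is undefined");

    // iterator_traits also covers raw pointers, which have no ::value_type.
    using value_type = typename std::iterator_traits<Iterator>::value_type;

    // Work on a copy so the caller's range is not reordered.
    std::vector<value_type> copy(alpha, omega);

    // Two middle positions; they coincide when the size is odd.
    const auto i1 = copy.begin() + (copy.size() - 1) / 2;
    const auto i2 = copy.begin() + copy.size() / 2;

    std::nth_element(copy.begin(), i1, copy.end());
    std::nth_element(i1, i2, copy.end());

    return 0.5 * (static_cast<double>(*i1) + static_cast<double>(*i2));
}
```

Returning `double` regardless of the element type sidesteps the problem that integer types have no NaN, at the cost of an extra copy of the range.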
Toby Speight
  • This might not appear to be an answer, but I think that a lesson in debugging will be more useful than merely spoon-feeding a correct implementation. I hope you agree, and find this useful! – Toby Speight Mar 14 '17 at 17:13
  • @Downvoter, what's wrong? Question is specifically asking for advice, not just working code. – Toby Speight Mar 15 '17 at 09:37
  • Very interesting. I see that you have used a signaling NaN of the type `value_type` in the (partial) generic function too. Wouldn't it be better to use the (maybe different) return type (or throw an exception)? I mean, even if iterators to `int`s are passed to the function, one still probably expects a floating point number as return type and integers have no signaling NaNs. Also, I'm wondering if you used `std::time` to be closer to OP code or if you were more concerned about `std::random_device` being deterministic. – Bob__ Mar 16 '17 at 13:01
  • I kept `std::time` to be consistent with the question - `random_device` is probably more idiomatic. The choice to return a NaN from when passed an empty collection was fairly arbitrary, I'm afraid - throwing a `std::range_error` seems equally valid to me. – Toby Speight Mar 16 '17 at 13:52