11

So, I've posted a few times and previously my problems were pretty vague. I started C++ this week and have been doing a little project.

I'm trying to calculate standard deviation & variance. My code loads a file of 100 integers and puts them into an array, counts them, calculates the mean, sum, variance and SD. But I'm having a little trouble with the variance.

I keep getting a huge number - I have a feeling it's to do with its calculation.

My mean and sum are ok.

NB: [image: sd & mean calcs]

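#include <cmath>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <numeric>
#include <sstream>
#include <string>

using namespace std;

int main()
{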

int n = 0;
int Array[100];
float mean;
float var;
float sd;
string line;
float numPoints;

ifstream myfile("numbers.txt");

if (myfile.is_open())

{
    while (!myfile.eof())
      
    {
        getline(myfile, line);
            
        stringstream convert(line);
        
        if (!(convert >> Array[n]))
        
        {
            Array[n] = 0;
        }
        cout << Array[n] << endl;
        
        n++;
        
    }
    
    myfile.close();

    numPoints = n;

}
else cout<< "Error loading file" <<endl;

int sum = accumulate(begin(Array), end(Array), 0, plus<int>());

cout << "The sum of all integers: " << sum << endl;

mean = sum/numPoints;

cout << "The mean of all integers: " << mean <<endl;

var = ((Array[n] - mean) * (Array[n] - mean)) / numPoints;

sd = sqrt(var);

cout << "The standard deviation is: " << sd <<endl;

return 0;

}
Adrian Mole
Jack
  • 1
    In `(Array[n] - mean)` isn't `n` one more than the number of elements you have read? Also, [`while (!myfile.eof())` is almost always wrong](http://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong) – Bo Persson Oct 21 '15 at 20:35
  • 1
    You should use double instead of float – FredK Oct 21 '15 at 20:47
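
To illustrate the first comment above, here is a minimal sketch of a read loop that tests the extraction itself instead of `eof()`. The helper name `read_ints` is hypothetical (it is not from the question), and it takes any `std::istream`, so it should work with the `std::ifstream` from the question as well as with a `std::istringstream`.

```cpp
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper: reads whitespace-separated integers until extraction fails.
// The loop condition is the extraction itself, so a failed read at the end of the
// input never appends a bogus element (the usual problem with `while (!in.eof())`).
std::vector<int> read_ints(std::istream& in) {
    std::vector<int> values;
    int value = 0;
    while (in >> value) {
        values.push_back(value);
    }
    return values;
}
```

Note that, unlike the question's code, this sketch stops at the first token that is not an integer rather than storing a 0 for it.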

6 Answers

14

As the other answer by horseshoe correctly suggests, you will have to use a loop to calculate variance otherwise the statement

var = ((Array[n] - mean) * (Array[n] - mean)) / numPoints;

will just consider a single element from the array.

Here is a slightly improved version of horseshoe's suggested code:

var = 0;
for( n = 0; n < numPoints; n++ )
{
  var += (Array[n] - mean) * (Array[n] - mean);
}
var /= numPoints;
sd = sqrt(var);

Your sum works without an explicit loop because the accumulate function loops over the range internally, although that is not evident from the calling code. Take a look at the equivalent behavior of accumulate for a clear understanding of what it is doing.

Note: X ?= Y is short for X = X ? Y, where ? can be any binary arithmetic or bitwise operator (for example +, -, * or /). You can also use pow(Array[n] - mean, 2) to take the square instead of multiplying the term by itself, which some find tidier.
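
As a rough illustration of the point about accumulate hiding a loop, the two functions below should return the same result. The names `sum_with_accumulate` and `sum_with_loop` are made up for this sketch, and both sum only the first `count` elements, on the assumption that `count` is the number of values actually read.

```cpp
#include <cstddef>
#include <numeric>

// Sums the first `count` elements using the standard algorithm.
int sum_with_accumulate(const int* data, std::size_t count) {
    return std::accumulate(data, data + count, 0);
}

// The same computation written as an explicit loop.
int sum_with_loop(const int* data, std::size_t count) {
    int total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += data[i];  // short for: total = total + data[i]
    }
    return total;
}
```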

Ahmed Akhtar
  • 1
    Thanks for the 'Note', it was useful. Comparing your code to horseshoe's, why is the for statement better than the while? Or is there no real difference? – Jack Oct 22 '15 at 11:32
  • 2
    @jack technically there is no difference between the **for** and the **while** loops (except syntax), but usually when you need: (1) initialization of a variable before starting the loop, (2) an increment in the variable at the end of the loop and then (3) want to check for a condition to reiterate; then **for** makes the code much more readable and also ensures that you don't forget any of the three. – Ahmed Akhtar Oct 23 '15 at 03:56
  • Am I missing something? var /= (numPoints-1) , not / numPoints – WurmD Jun 07 '19 at 11:15
  • @WurmD Why do you think it should be divided by `numPoints - 1` and not by `numPoints`? – Ahmed Akhtar Jun 18 '19 at 13:53
  • Look at the other responses, half of them are size-1 @AhmedAkhtar – WurmD Jun 19 '19 at 21:25
  • @WurmD The `N` in the formula of variance means the number of observations, which is `numPoints` in our case, not `numPoints-1` – Ahmed Akhtar Jun 24 '19 at 11:35
  • 1
    Usually you divide by the number of points minus 1 to get an unbiased estimate of the variance. https://stats.stackexchange.com/q/100041/86678 – rayryeng Sep 27 '19 at 04:28
  • 1
    @rayryeng Thanks for the explanation to why `numPoints-1` could be used. However, I used just `numPoints` because it was in line with the formula posted by the OP. But thanks again for clarifying. – Ahmed Akhtar Sep 28 '19 at 05:13
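
The difference discussed in these comments can be sketched as two functions that share the same sum of squared deviations and differ only in the divisor. All three function names are hypothetical, and returning 0 when there are too few values is an assumption of this sketch rather than anything stated in the thread.

```cpp
#include <cstddef>
#include <vector>

// Sum of squared deviations from the mean; shared by both variants below.
double sum_squared_deviations(const std::vector<double>& xs) {
    if (xs.empty()) return 0.0;
    double sum = 0.0;
    for (double x : xs) sum += x;
    const double mean = sum / xs.size();
    double ss = 0.0;
    for (double x : xs) ss += (x - mean) * (x - mean);
    return ss;
}

// Population variance: divide by N (the formula used in the question).
double population_variance(const std::vector<double>& xs) {
    if (xs.empty()) return 0.0;  // assumption: report 0 for no data
    return sum_squared_deviations(xs) / xs.size();
}

// Sample variance: divide by N - 1 (Bessel's correction, the unbiased estimate).
double sample_variance(const std::vector<double>& xs) {
    if (xs.size() < 2) return 0.0;  // assumption: report 0 for fewer than two values
    return sum_squared_deviations(xs) / (xs.size() - 1);
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the squared deviations sum to 32, so the population variance is 32/8 = 4 and the sample variance is 32/7, roughly 4.57.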
8

Here's another approach using std::accumulate but without using pow. In addition, we can use a lambda (an anonymous function) to define how each value contributes to the variance once the mean is known. Note that this computes the unbiased sample variance, so it divides by the number of values minus one.

#include <cstddef>
#include <numeric>
#include <vector>

template<typename T>
T variance(const std::vector<T> &vec) {
    const std::size_t sz = vec.size();
    // With fewer than two values the sample variance is undefined, so return 0.
    // (Checking only sz == 1 would let an empty vector divide by zero below.)
    if (sz <= 1) {
        return 0.0;
    }

    // Calculate the mean
    const T mean = std::accumulate(vec.begin(), vec.end(), 0.0) / sz;

    // Now calculate the variance
    auto variance_func = [&mean, &sz](T accumulator, const T& val) {
        return accumulator + ((val - mean)*(val - mean) / (sz - 1));
    };

    return std::accumulate(vec.begin(), vec.end(), 0.0, variance_func);
}

A sample of how to use this function:

#include <iostream>
int main() {
    const std::vector<double> vec = {1.0, 5.0, 6.0, 3.0, 4.5};
    std::cout << variance(vec) << std::endl;
}
rayryeng
1

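Your variance calculation is outside the loop, so it is based on the single value Array[n]. By that point n equals the number of values read (100), so it does not even refer to one of your elements. You need an additional loop.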

You need:

var = 0;
n=0;
while (n<numPoints){
   var = var + ((Array[n] - mean) * (Array[n] - mean));
   n++;
}
var /= numPoints;
sd = sqrt(var);
horseshoe
1

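Two simple functions to calculate the standard deviation and variance in C++. Variance keeps a running sum and updates the sum of squared deviations in a single pass, then divides by size - 1 (sample variance).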

#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

double StandardDeviation(const std::vector<double>& samples);
double Variance(const std::vector<double>& samples);

int main()
{
     std::vector<double> samples;
     samples.push_back(2.0);
     samples.push_back(3.0);
     samples.push_back(4.0);
     samples.push_back(5.0);
     samples.push_back(6.0);
     samples.push_back(7.0);

     // Named sd rather than std to avoid hiding namespace std
     double sd = StandardDeviation(samples);
     std::cout << sd << std::endl;
     return 0;
}

double StandardDeviation(const std::vector<double>& samples)
{
     return std::sqrt(Variance(samples));
}

double Variance(const std::vector<double>& samples)
{
     const std::size_t size = samples.size();
     if (size < 2) return 0.0; // sample variance needs at least two values

     double variance = 0;
     double t = samples[0]; // running sum of the values seen so far
     for (std::size_t i = 1; i < size; i++)
     {
          t += samples[i];
          double diff = ((i + 1) * samples[i]) - t;
          variance += (diff * diff) / ((i + 1.0) * i);
     }

     return variance / (size - 1);
}
D.Zadravec
  • Do you have a reference for that approach? Is it this one? https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm – Tilman Vogel Jun 16 '20 at 11:20
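
For comparison with the algorithm linked in the comment, here is a minimal sketch of Welford's online algorithm as it is commonly written, with a running mean instead of the running sum used in the answer. The thread does not confirm which reference the answer follows, and the struct and member names below are hypothetical.

```cpp
#include <cstddef>

// Running statistics via Welford's online algorithm.
// (Struct and member names are hypothetical, chosen for this sketch.)
struct RunningStats {
    std::size_t count = 0;
    double mean = 0.0;
    double m2 = 0.0;  // running sum of squared deviations from the current mean

    // Folds one more value into the running mean and m2.
    void add(double x) {
        ++count;
        const double delta = x - mean;  // deviation from the old mean
        mean += delta / count;
        m2 += delta * (x - mean);       // second factor uses the updated mean
    }

    // Sample variance (divides by count - 1); returns 0 for fewer than two
    // values, which is an assumption of this sketch.
    double sample_variance() const {
        return count < 2 ? 0.0 : m2 / (count - 1);
    }
};
```

Feeding it the answer's samples 2, 3, 4, 5, 6, 7 one at a time should give a mean of 4.5 and a sample variance of 3.5 (up to floating-point rounding), without storing the values.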
0

Rather than writing out more loops, you can create a function object to pass to std::accumulate to calculate the sum of squared deviations from the mean.

#include <cmath> // for pow

template <typename T>
struct normalize {
    T operator()(T initial, T value) {
        return initial + pow(value - mean, 2);
    }
    T mean;
};

While we are at it, we can use std::istream_iterator to do the file loading, and std::vector because we don't know how many values there are at compile time. This gives us:

#include <cmath>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <numeric>
#include <vector>

int main()
{
    std::vector<int> values; // empty for now; filled from the file below

    std::ifstream myfile("numbers.txt");
    if (myfile)
    {
        values.assign(std::istream_iterator<int>(myfile), {});
    }
    else { std::cout << "Error loading file" << std::endl; }

    float sum = std::accumulate(values.begin(), values.end(), 0, std::plus<int>()); // plus is the default for accumulate, can be omitted
    std::cout << "The sum of all integers: " << sum << std::endl;
    float mean = sum / values.size();
    std::cout << "The mean of all integers: " << mean << std::endl;
    // The initial value must be a float (0.0f): with an int 0 the running total would be truncated to int at every step
    float var = std::accumulate(values.begin(), values.end(), 0.0f, normalize<float>{ mean }) / values.size();
    float sd = std::sqrt(var);
    std::cout << "The standard deviation is: " << sd << std::endl;
    return 0;
}
Caleth
0
#include <iostream>
#include <numeric>
#include <vector>
#include <cmath>
#include <utility>
#include <array>

template <class InputIterator, class T>
void Mean(InputIterator first, InputIterator last, T& mean) {
  int n = std::distance(first, last);
  mean = std::accumulate(first, last, static_cast<T>(0)) / n;
}

template <class InputIterator, class T>
void StandardDeviation(InputIterator first, InputIterator last, T& mean, T& standard_deviation) {
  int n = std::distance(first, last);
  mean = std::accumulate(first, last, static_cast<T>(0)) / n;
  // Sum of squared deviations from the mean
  T s = std::accumulate(first, last, static_cast<T>(0), [mean](double x, double y) {
    T delta = y - mean;
    return x + delta*delta;
  });
  // s/n is the (population) variance; the standard deviation is its square root
  standard_deviation = std::sqrt(s/n);
}

int main () {
  std::vector<int> v = {10, 20, 30};

  double mean = 0;
  Mean(v.begin(), v.end(), mean);
  std::cout << mean << std::endl;

  double standard_deviation = 0;
  StandardDeviation(v.begin(), v.end(), mean, standard_deviation);
  std::cout << mean << " " << standard_deviation << std::endl;

  double a[3] = {10.5, 20.5, 30.5};
  Mean(a, a+3, mean);
  std::cout << mean << std::endl;
  StandardDeviation(a, a+3, mean, standard_deviation);
  std::cout << mean << " " << standard_deviation << std::endl;

  std::array<int, 3> m = {1, 2, 3};
  Mean(m.begin(), m.end(), mean);
  std::cout << mean << std::endl;
  StandardDeviation(m.begin(), m.end(), mean, standard_deviation);
  std::cout << mean << " " << standard_deviation << std::endl;
  return 0;
}
  • While this code may provide a solution to the question, it's better to add context as to why/how it works. This can help future users learn, and apply that knowledge to their own code. You are also likely to have positive feedback from users in the form of upvotes, when the code is explained. – borchvm Aug 11 '20 at 10:53
  • Thank you. My code had a performance problem, which I have fixed; I hope it is better now. I am still not satisfied with the source code, though: when I want to compute both the mean and the standard deviation, the mean calculation is performed twice. – manh duong Aug 11 '20 at 17:52
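
One possible way to address the concern in the last comment is to return both results from a single function, so the mean is computed once and reused. This is only a sketch: the name `mean_and_stddev` is hypothetical, it computes the population standard deviation (dividing by n), and returning {0, 0} for an empty range is an assumption made here.

```cpp
#include <cmath>
#include <iterator>
#include <numeric>
#include <utility>

// Hypothetical helper: returns {mean, population standard deviation}.
// The range is traversed twice, so it needs forward iterators or better.
template <class ForwardIt>
std::pair<double, double> mean_and_stddev(ForwardIt first, ForwardIt last) {
    const auto n = std::distance(first, last);
    if (n <= 0) return {0.0, 0.0};  // assumption: empty range reports {0, 0}

    // First pass: the mean, computed once.
    const double mean = std::accumulate(first, last, 0.0) / n;

    // Second pass: sum of squared deviations from that mean.
    const double sum_sq = std::accumulate(first, last, 0.0,
        [mean](double acc, double x) {
            const double delta = x - mean;
            return acc + delta * delta;
        });

    return {mean, std::sqrt(sum_sq / n)};
}
```

A caller would read the results as `.first` (mean) and `.second` (standard deviation), for example `auto r = mean_and_stddev(v.begin(), v.end());`.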