
If I do a standard deviation calculation for a sample using this code (modified somewhat from this SO question):

    public double CalculateStandardDeviation(List<double> values, bool sample = false)
    {
        // Running calculation: the mean and the sum of squared deviations from
        // the mean are updated in a single pass over the values.
        double mean = 0.0;
        double sum = 0.0;
        double stdDev = 0.0;
        int count = 0;
        foreach (double val in values)
        {
            count++;
            double delta = val - mean;
            mean += delta / count;            // incremental mean
            sum += delta * (val - mean);      // accumulated squared deviations
        }
        if (count > 1)
            stdDev = Math.Sqrt(sum / (count - (sample ? 1 : 0)));  // N-1 for a sample, N for a population
        return stdDev;
    }

Using this unit test:

    [Test]
    public void Sample_Standard_Deviation_Returns_Expected_Value()
    {
        //original cite: http://warrenseen.com/blog/2006/03/13/how-to-calculate-standard-deviation/
        double expected = 2.23606797749979;
        double tolerance = 1.0 / System.Math.Pow(10, 13);
        var cm = new CommonMath();//a library of math functions we use a lot
        List<double> values = new List<double> { 4.0, 2.0, 5.0, 8.0, 6.0 };
        double actual = cm.CalculateStandardDeviation(values, true);
        Assert.That(actual, Is.EqualTo(expected).Within(tolerance));
    }

The test passes with a resultant value within the specified tolerance.

However, if I use this Linq-ified code, it fails, returning a value of 2.5 (as if it were a population standard deviation instead):

        double meanOfValues = values.Average();
        double sumOfValues = values.Sum();
        int countOfValues = values.Count;
        double standardDeviationOfValues = 
            Math.Sqrt(sumOfValues / (countOfValues - (sample ? 1 : 0)));

        return standardDeviationOfValues;

As I've never taken statistics (so please be gentle), the Linq-ification (that's a word) of the values from the list seems like it should give me the same results, but it doesn't, and I don't understand what I've done wrong. The decision between N and N-1 is handled the same way in both, so why isn't the answer the same?

delliottg
  • What's sample here for? – Herbert Yu May 23 '17 at 21:07
  • 1
    You missed a step: `var standardDeviationOfValues = Math.Sqrt(values.Select(v=>Math.Pow(v - meanOfValues,2)).Average());` or you can not use `sumOfValues` but `sumOfDeltasSquared`, which you haven't calculated. In either case, your current formula when sample is false calculates the average/mean, not the standard deviation. – Robert McKee May 23 '17 at 21:49
  • Ah, thanks for the insight, this wasn't as simple as I thought it was. – delliottg May 24 '17 at 14:06

3 Answers


Your LINQ version does not compute the standard deviation. The standard deviation is based on the sum of the squared differences from the mean, so change it to:

double meanOfValues = values.Average();
double sumOfValues = values.Select(v => (v-meanOfValues)*(v-meanOfValues)).Sum();
int countOfValues = values.Count;
double standardDeviationOfValues =
    Math.Sqrt(sumOfValues / (countOfValues - (sample ? 1 : 0)));

return standardDeviationOfValues;
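
As a quick sanity check, here's what the intermediate values look like for the test data in the question (the variable names here are just illustrative):

// Illustration with the question's test data:
var values = new List<double> { 4.0, 2.0, 5.0, 8.0, 6.0 };
double mean = values.Average();                                            // 5.0
double sumOfSquaredDeltas = values.Sum(v => (v - mean) * (v - mean));      // 1 + 9 + 0 + 9 + 1 = 20.0
double sampleStdDev = Math.Sqrt(sumOfSquaredDeltas / (values.Count - 1));  // sqrt(20 / 4) = sqrt(5) ≈ 2.23606797749979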

To traverse values only once, you can use Aggregate, though it isn't really better than an ordinary method:

var g = values.Aggregate(new { mean = 0.0, sum = 0.0, count = 0 },
    (acc, val) =>
    {
        var newcount = acc.count + 1;
        double delta = val - acc.mean;
        var newmean = acc.mean + delta / newcount;
        return new { mean = newmean, sum = acc.sum + delta * (val - newmean), count = newcount };
    });
var stdDev = Math.Sqrt(g.sum / (g.count - (sample ? 1 : 0)));
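
If the per-element allocation of the anonymous accumulator bothers you, the same math can be sketched with value tuples (C# 7 or later), assuming the same `values` and `sample` variables:

// Same single-pass accumulation as above, but with a value tuple instead of an
// anonymous type, so no object is allocated per element.
var acc = values.Aggregate(
    (mean: 0.0, sum: 0.0, count: 0),
    (a, val) =>
    {
        var count = a.count + 1;
        double delta = val - a.mean;
        var mean = a.mean + delta / count;
        return (mean: mean, sum: a.sum + delta * (val - mean), count: count);
    });
var tupleStdDev = Math.Sqrt(acc.sum / (acc.count - (sample ? 1 : 0)));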

NetMage
  • Nicely done, all my unit tests for both population & sample are working with your modifications, thank you! – delliottg May 24 '17 at 14:06
  • 1
    Note that the LINQ version is not as efficient as it traverses `values` multiple times. – NetMage May 24 '17 at 19:33
  • Thanks. Is there anything that can be done to reduce the traversals? Or is it intrinsic to how LINQ handles data? – delliottg May 24 '17 at 23:04
  • Only if you replace this all with an `Aggregate` in which case you are just doing your other code in a slightly less efficient way. I'll add it to the answer. – NetMage May 24 '17 at 23:20

Put sample as false, and you get the same answer: 2.23606797749979. If you put sample as true, you get 2.5!

So you do need to put the same "sample" value in both places.

Herbert Yu
  • The idea is I can calculate either a sample (n-1) or population (n) standard deviation using the same code. The ternary expression makes the determination depending on the value of "sample" passed in. I'm expecting to get the same answer from both pieces of code. – delliottg May 23 '17 at 21:15

Let's start with the fact that

values.Sum();

and the sum you're getting from

sum += delta * (val - mean);

are not the same.
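
To make that concrete, here's a rough illustration using the test data from the question (variable names are just for the example):

// Illustration with the question's test data:
var values = new List<double> { 4.0, 2.0, 5.0, 8.0, 6.0 };
double plainSum = values.Sum();      // 25.0 -> Math.Sqrt(25.0 / 4) = 2.5, the wrong answer
double mean = values.Average();      // 5.0
double squaredDeltaSum = values.Sum(v => (v - mean) * (v - mean));  // 20.0 -> Math.Sqrt(20.0 / 4) ≈ 2.236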

Next time, you could approach this kind of problem with TDD and check each intermediate value along the way.

EDIT: Standard Deviation in LINQ

  • Your link is to the same SO question linked in my question; the original iterative code is derived from that question. I see your point about the differences in the sums: the iterative process takes the square root of 5, whereas the LINQ process is taking the square root of 6.25. – delliottg May 23 '17 at 21:29