84

Does LINQ model the aggregate SQL function STDDEV() (standard deviation)?

If not, what is the simplest / best-practices way to calculate it?

Example:

  SELECT test_id, AVERAGE(result) avg, STDDEV(result) std 
    FROM tests
GROUP BY test_id
Steven
  • 12,263
  • 24
  • 94
  • 137
  • 2
    possible duplicate of [How do I determine the standard deviation (stddev) of a set of values?](http://stackoverflow.com/questions/895929/how-do-i-determine-the-standard-deviation-stddev-of-a-set-of-values) – nawfal Apr 22 '13 at 18:22
  • @Steven, you may like to revisit the accepted answer here. There are issues with the currently selected approach that people who don't scroll down and read further might not see. – Drew Noakes Jun 19 '15 at 14:17
  • Why would anyone want to do this using *LINQ*? – Ant_222 Jan 04 '18 at 19:53

8 Answers8

103

You can make your own extension calculating it

public static class Extensions
{
    public static double StdDev(this IEnumerable<double> values)
    {
       double ret = 0;
       int count = values.Count();
       if (count  > 1)
       {
          //Compute the Average
          double avg = values.Average();

          //Perform the Sum of (value-avg)^2
          double sum = values.Sum(d => (d - avg) * (d - avg));

          //Put it all together
          ret = Math.Sqrt(sum / count);
       }
       return ret;
    }
}

If you have a sample of the population rather than the whole population, then you should use ret = Math.Sqrt(sum / (count - 1));.

Transformed into extension from Adding Standard Deviation to LINQ by Chris Bennett.

jww
  • 83,594
  • 69
  • 338
  • 732
Dynami Le Savard
  • 4,760
  • 2
  • 23
  • 22
  • 3
    I'd make that test "values.Count() > 1", because if it's exactly 1 you'll have a divide by zero error when you calculate the return value. – duffymo Feb 12 '10 at 17:59
  • 3
    Math.pow(d-avg, 2)? I'd skip the function call and use (d-avg)*(d-avg) – duffymo Feb 12 '10 at 18:00
  • How about adding a usage example. Thanks! – O.O Jan 07 '11 at 21:30
  • 2
    The line ret = Math.Sqrt((sum) / values.Count()-1); is missing parentheses around values.Count()-1, it should be ret = Math.Sqrt(sum / (values.Count()-1)); – Alex Peck Jun 22 '11 at 08:01
  • 1
    I was searching for this and it took me a while to figure out how to use the extension, but here is the way to apply the methods given above: `stdev = g.Select(o => o.number).StdDev()`. – Andrew Mao Aug 03 '12 at 22:30
  • 2
    Downside of this is that evaluates the input twice which is discouraged for `IEnumerable` parameters. – CodesInChaos Dec 17 '13 at 18:20
  • 2
    @Yevgeniy Rozhkov - Why did you remove the `- 1`? According to [this](http://www.excelfunctions.net/Excel-Stdev-S-Function.html) the `- 1` is required. – John Mills Jan 06 '14 at 04:23
  • Is there a way to do this similar to a moving average as in: http://stackoverflow.com/a/23493727/75129 – zumalifeguard Jul 15 '16 at 15:21
66

Dynami's answer works but makes multiple passes through the data to get a result. This is a single pass method that calculates the sample standard deviation:

public static double StdDev(this IEnumerable<double> values)
{
    // ref: http://warrenseen.com/blog/2006/03/13/how-to-calculate-standard-deviation/
    double mean = 0.0;
    double sum = 0.0;
    double stdDev = 0.0;
    int n = 0;
    foreach (double val in values)
    {
        n++;
        double delta = val - mean;
        mean += delta / n;
        sum += delta * (val - mean);
    }
    if (1 < n)
        stdDev = Math.Sqrt(sum / (n - 1));

    return stdDev;
}

This is the sample standard deviation since it divides by n - 1. For the normal standard deviation you need to divide by n instead.

This uses Welford's method which has higher numerical accuracy compared to the Average(x^2)-Average(x)^2 method.

Community
  • 1
  • 1
David Clarke
  • 12,002
  • 8
  • 80
  • 105
  • 2
    You may not have iterated the entire sequence more than once, but your method will still make two calls to GetEnumerator (which could be triggering a complex SQL query). Why not skip the condition and check n at the end of the loop? – Gideon Engelberth May 21 '10 at 23:47
  • Thanks Gideon, removes a level of nesting too. You're correct about the SQL, it's not relevant to what I'm working on so I hadn't considered the implication. – David Clarke May 23 '10 at 22:42
  • 4
    You're missing a definition of n. Also it should be noted that dividing the sum by (n-1) instead of n makes this a sample standard deviation – Neil Dec 07 '11 at 15:32
  • Thanks Neil, added declaration for n and your point about sample standard deviation is noted. – David Clarke Dec 07 '11 at 20:28
  • 3
    To make this more carefully replicate the SQL method, I changed `this IEnumerable values` and `val in values.Where(val => val != null)`. Also, I will note that this method (Welford's method) is more accurate and faster than the method above. – Andrew Mao Aug 06 '12 at 18:58
  • 2
    I've edited your answer to make it clear that you're computing the **sample standard deviation**, not the normal **standard deviation**. – CodesInChaos Dec 17 '13 at 19:03
32

This converts David Clarke's answer into an extension that follows the same form as the other aggregate LINQ functions like Average.

Usage would be: var stdev = data.StdDev(o => o.number)

public static class Extensions
{
    public static double StdDev<T>(this IEnumerable<T> list, Func<T, double> values)
    {
        // ref: https://stackoverflow.com/questions/2253874/linq-equivalent-for-standard-deviation
        // ref: http://warrenseen.com/blog/2006/03/13/how-to-calculate-standard-deviation/ 
        var mean = 0.0;
        var sum = 0.0;
        var stdDev = 0.0;
        var n = 0;
        foreach (var value in list.Select(values))
        {
            n++;
            var delta = value - mean;
            mean += delta / n;
            sum += delta * (value - mean);
        }
        if (1 < n)
            stdDev = Math.Sqrt(sum / (n - 1));

        return stdDev; 

    }
} 
Community
  • 1
  • 1
Will Mathies
  • 321
  • 3
  • 2
  • 1
    Note that `Average`/`Min`/`Max`/etc have overloads with and without selector functions. They also have overloads for integral types, float, etc. – Drew Noakes Jun 19 '15 at 14:25
5
var stddev = Math.Sqrt(data.Average(z=>z*z)-Math.Pow(data.Average(),2));
Vitas
  • 233
  • 4
  • 10
2

Straight to the point (and C# > 6.0), Dynamis answer becomes this:

    public static double StdDev(this IEnumerable<double> values)
    {
        var count = values?.Count() ?? 0;
        if (count <= 1) return 0;

        var avg = values.Average();
        var sum = values.Sum(d => Math.Pow(d - avg, 2));

        return Math.Sqrt(sum / count);
    }

Edit 2020-08-27:

I took @David Clarke comments to make some performance tests and this are the results:

    public static (double stdDev, double avg) StdDevFast(this List<double> values)
    {
        var count = values?.Count ?? 0;
        if (count <= 1) return (0, 0);

        var avg = GetAverage(values);
        var sum = GetSumOfSquareDiff(values, avg);

        return (Math.Sqrt(sum / count), avg);
    }

    private static double GetAverage(List<double> values)
    {
        double sum = 0.0;
        for (int i = 0; i < values.Count; i++) 
            sum += values[i];
        
        return sum / values.Count;
    }
    private static double GetSumOfSquareDiff(List<double> values, double avg)
    {
        double sum = 0.0;
        for (int i = 0; i < values.Count; i++)
        {
            var diff = values[i] - avg;
            sum += diff * diff;
        }
        return sum;
    }

I tested this with a list of one million random doubles
the original implementation had an runtime of ~48ms
the performance optimized implementation 2-3ms
so this is an significant improvement.

Some interesting details:
getting rid of Math.Pow brings a boost of 33ms!
List instead of IEnumerable 6ms
manually Average calculation 4ms
For-loops instead of ForEach-loops 2ms
Array instead of List brings just an improvement of ~2% so i skipped this
using single instead of double brings nothing

Further lowering the code and using goto (yes GOTO... haven't used this since the 90s assembler...) instead of for-loops does not pay, Thank goodness!

I have tested also parallel calculation, this makes sense on list > 200.000 items It seems that Hardware and Software needs to initialize a lot and this is for small lists contra-productive.

All tests were executed two times in a row to get rid of the warmup-time.

  • Be aware this makes multiple passes through the data when evaluating `Count()`, `Average()`, and `Sum()`. That's ok for small values of `count` but has potential to impact performance if `count` is large. – David Clarke May 28 '20 at 20:44
  • @ david, so the simplest solution in my opinion would be to replace the signature with `(this IList values)`, performance tests would show the impact, and how many items make an significant difference – Ernst Greiner Jun 15 '20 at 10:20
  • Yeah that doesn't solve the issue - those extension methods (`Count`, `Average`, `Sum`) each iterate the collection so you still have three full iterations to produce a result. – David Clarke Jun 15 '20 at 20:38
0
public static double StdDev(this IEnumerable<int> values, bool as_sample = false)
{
    var count = values.Count();
    if (count > 0) // check for divide by zero
    // Get the mean.
    double mean = values.Sum() / count;

    // Get the sum of the squares of the differences
    // between the values and the mean.
    var squares_query =
        from int value in values
        select (value - mean) * (value - mean);
    double sum_of_squares = squares_query.Sum();
    return Math.Sqrt(sum_of_squares / (count - (as_sample ? 1 : 0)))
}
David Clarke
  • 12,002
  • 8
  • 80
  • 105
duc14s
  • 161
  • 3
  • 3
  • Note this is still making multiple passes through the data - ok if a small dataset but not good for large values of `count`. – David Clarke May 10 '17 at 04:04
0

Simple 4 lines, I used a List of doubles but one could use IEnumerable<int> values

public static double GetStandardDeviation(List<double> values)
{
    double avg = values.Average();
    double sum = values.Sum(v => (v - avg) * (v - avg));
    double denominator = values.Count - 1;
    return denominator > 0.0 ? Math.Sqrt(sum / denominator) : -1;
}
Baddack
  • 1,548
  • 20
  • 30
0

In general case we want to compute StdDev in one pass: what if values is file or RDBMS cursor which can be changed between computing average and sum? We are going to have inconsistent result. The code below uses just one pass:

// Population StdDev
public static double StdDev(this IEnumerable<double> values) {
  if (null == values)
    throw new ArgumentNullException(nameof(values));

  double N = 0;
  double Sx = 0.0;
  double Sxx = 0.0;

  foreach (double x in values) {
    N += 1;
    Sx += x;
    Sxx += x * x;
  }

  return N == 0
    ? double.NaN // or throw exception
    : Math.Sqrt((Sxx - Sx * Sx / N) / N);
}

The very same idea for sample StdDev:

// Sample StdDev
public static double StdDev(this IEnumerable<double> values) {
  if (null == values)
    throw new ArgumentNullException(nameof(values));

  double N = 0;
  double Sx = 0.0;
  double Sxx = 0.0;

  foreach (double x in values) {
    N += 1;
    Sx += x;
    Sxx += x * x;
  }

  return N <= 1
    ? double.NaN // or throw exception
    : Math.Sqrt((Sxx - Sx * Sx / N) / (N - 1));
}
Dmitry Bychenko
  • 149,892
  • 16
  • 136
  • 186