66

I need to calculate the standard deviation of a generic list. I will try to include my code. It's a generic list with data in it. The data is mostly floats and ints. Here is the code that is relevant to it, without getting into too much detail:

namespace ValveTesterInterface
{
    public class ValveDataResults
    {
        private List<ValveData> m_ValveResults;

        public ValveDataResults()
        {
            if (m_ValveResults == null)
            {
                m_ValveResults = new List<ValveData>();
            }
        }

        public void AddValveData(ValveData valve)
        {
            m_ValveResults.Add(valve);
        }

Here is the function where the standard deviation needs to be calculated:

        public float LatchStdev()
        {

            float sumOfSqrs = 0;
            float meanValue = 0;
            foreach (ValveData value in m_ValveResults)
            {
                meanValue += value.LatchTime;
            }
            meanValue = (meanValue / m_ValveResults.Count) * 0.02f;

            for (int i = 0; i <= m_ValveResults.Count; i++) 
            {   
                sumOfSqrs += Math.Pow((m_ValveResults - meanValue), 2);  
            }
            return Math.Sqrt(sumOfSqrs /(m_ValveResults.Count - 1));

        }
    }
}

Ignore what's inside the LatchStdev() function because I'm sure it's not right. It's just my poor attempt to calculate the standard deviation. I know how to do it for a list of doubles, but not for a generic list of data objects. If someone has experience with this, please help.

Brian Webster
Tom Hangler

4 Answers

176

The example in LBushkin's answer is slightly incorrect and could hit a divide-by-zero error if your data set contains only one element. The following code is somewhat simpler and gives the "population standard deviation" result. (http://en.wikipedia.org/wiki/Standard_deviation)

using System;
using System.Linq;
using System.Collections.Generic;

public static class Extend
{
    public static double StandardDeviation(this IEnumerable<double> values)
    {
        double avg = values.Average();
        return Math.Sqrt(values.Average(v=>Math.Pow(v-avg,2)));
    }
}
Jonathan DeMarks
  • 4
    This one should be the answer, it calculates Standard Deviation as opposed to the answer by LBushkin which really calculates Sample Standard Deviation – Wouter Jun 21 '12 at 10:48
  • +1 This is the actual Standard Deviation (aka population standard deviation) as opposed to Sample Standard Deviation in LBushkin's answer. – Levitikon Apr 15 '16 at 12:45
  • 4
    return Math.Sqrt(values.Average(v=> (v-avg) * (v-avg))); is 3.37x faster on my machine. Math.Pow() is much slower than normal multiplication. – BlueSky Jun 18 '19 at 00:10
  • @BlueSky Thanks for doing the benchmark! I love having both options available to see clearly. Math.Pow() might be a bit more readable but your code is more performant, so folks can choose what is right for their scenario. – Jonathan DeMarks Jun 18 '19 at 13:37
  • 2
    Mathematically, this is the right answer. However, you should definitely avoid using this code in production: the parameter is an IEnumerable, and this code enumerates it twice. For example, what if this function is invoked on an EF query? The best approach is to check whether the IEnumerable can be cast to a collection and, if not, call .ToList() first. – Steven.Xi Dec 24 '20 at 10:19
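The double-enumeration concern raised in the comments above can be handled by buffering the input once. The following is only a sketch: the class name `StatisticsExtensions` and method name `PopulationStdDev` are made up for illustration, and throwing on an empty sequence is an assumption that mirrors what `Average()` itself does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StatisticsExtensions // hypothetical name
{
    // Population standard deviation that runs a lazy source at most once.
    public static double PopulationStdDev(this IEnumerable<double> values)
    {
        if (values == null) throw new ArgumentNullException(nameof(values));

        // Reuse the source if it is already an in-memory collection;
        // otherwise buffer it so that a lazy query is executed only once.
        ICollection<double> data = values as ICollection<double> ?? values.ToList();
        if (data.Count == 0)
            throw new InvalidOperationException("Sequence contains no elements."); // assumption

        double avg = data.Average();
        return Math.Sqrt(data.Average(v => (v - avg) * (v - avg)));
    }
}
```

The two passes (mean, then squared differences) still happen, but both run over the buffered collection rather than over the original query.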
74

This article should help you. It creates a function that computes the deviation of a sequence of double values. All you have to do is supply a sequence of appropriate data elements.

The resulting function is:

private double CalculateStandardDeviation(IEnumerable<double> values)
{   
  double standardDeviation = 0;

  if (values.Any()) 
  {      
     // Compute the average.     
     double avg = values.Average();

     // Perform the Sum of (value-avg)^2.
     double sum = values.Sum(d => Math.Pow(d - avg, 2));

     // Put it all together.      
     standardDeviation = Math.Sqrt((sum) / (values.Count()-1));   
  }  

  return standardDeviation;
}

This is easy enough to adapt for any generic type, as long as we provide a selector for the value being computed. LINQ is great for that: the Select function lets you project your generic list of custom types into a sequence of numeric values for which to compute the standard deviation:

List<ValveData> list = ...
var result = CalculateStandardDeviation(list.Select( v => (double)v.SomeField ));
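To make the selector idea concrete for the question's type, here is a minimal sketch of a generic extension method. Several things in it are assumptions: the `ValveData` class is a stand-in that models only `LatchTime` (assumed to be a `float`), the names `StdDevExtensions` and `SampleStdDev` are hypothetical, and returning 0 for fewer than two values is just one possible convention.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the question's ValveData; only LatchTime comes from the question.
public class ValveData
{
    public float LatchTime { get; set; } // assumed to be a float
}

public static class StdDevExtensions // hypothetical name
{
    // Sample standard deviation (divides by n - 1) of the values picked by selector.
    public static double SampleStdDev<T>(this IEnumerable<T> source, Func<T, double> selector)
    {
        // Project once into a list so the source is enumerated a single time.
        List<double> values = source.Select(selector).ToList();
        if (values.Count < 2)
            return 0.0; // assumption: fewer than two samples means "no spread"

        double avg = values.Average();
        double sum = values.Sum(d => (d - avg) * (d - avg));
        return Math.Sqrt(sum / (values.Count - 1));
    }
}
```

With this in place, the body of the question's `LatchStdev()` could reduce to something like `return (float)m_ValveResults.SampleStdDev(v => v.LatchTime);`.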
Bern
LBushkin
  • My C# doesn't have an Average. It doesn't show up. That's one of my problems. Also, I cannot pass a generic list to my function as a parameter. The mean needs to be computed inside the stdev method, like in my code above. My standard deviation is off, though. – Tom Hangler Jun 29 '10 at 14:42
  • Also, guys, C# doesn't have an average (Math.Average), so I calculate the mean myself, like in my code above. It's the standard deviation that I have the most trouble with. Thanks – Tom Hangler Jun 29 '10 at 14:43
  • 2
    @Tom Hangler, make sure you add `using System.Linq;` at the top of your file to include the library of LINQ functions. These include both `Average()` and `Select()`. – LBushkin Jun 29 '10 at 14:43
  • Oh, OK, thanks. I'm sorry, I'm a noob. I don't think that Visual Studio recognizes System.Ling. Also, what do the v=> and the d=> stand for? And should all the code you gave me be in my one standard deviation function? Thanks – Tom Hangler Jun 29 '10 at 14:51
  • It's a 'Q' not a 'G' at the end of System.Linq. I assumed you're using .NET 3.5, if not, then you will not have access to LINQ, and a slightly different solution would be appropriate. – LBushkin Jun 29 '10 at 14:53
  • The `v=>` and `d=>` syntax (and what follows) creates a lambda expression: essentially an anonymous function that accepts a parameter `v` or `d` (respectively) and uses it to compute some result. You can read more about them here: http://msdn.microsoft.com/en-us/library/bb397687.aspx – LBushkin Jun 29 '10 at 14:55
  • 12
    Take note that this algorithm implements Sample Standard Deviation as opposed to "plain" Standard Deviation. – Jesse C. Slicer Jun 29 '10 at 15:49
  • 11
    the `if(values.Count()>0)` line should probably check for > 1, since you're dividing by `values.Count() - 1`. – tenpn Jul 05 '11 at 07:42
  • 4
    For much faster performance (3.37x on my machine), multiply the terms instead of using Math.Pow: (d - avg) * (d - avg) instead of: Math.Pow(d - avg, 2) – BlueSky Oct 05 '18 at 19:23
  • 2
    double sum = values.Sum(d => (d - avg) * (d - avg)); – BlueSky Oct 05 '18 at 19:33
  • When all values are equal to the mean, the standard deviation will be zero. In this case shouldn't `ret` be assigned an invalid value such as -1 at first to indicate when the standard deviation could not be calculated? Otherwise, there is the (admittedly very rare) possibility of returning a false negative since zero is a valid result. – Aric Jul 24 '19 at 12:07
  • After more thought, returning zero for an empty population could work, but it may be useful to indicate that there was no data in the return value. – Aric Jul 24 '19 at 12:15
  • Same as my comment on the other answer: avoid iterating an IEnumerable multiple times in a helper/extension function, since you never know where the IEnumerable is coming from. It could come from a DB query, in which case iterating multiple times results in duplicate DB reads. Please cast or convert it to a collection before iterating. – Steven.Xi Dec 24 '20 at 10:22
25

Even though the accepted answer seems mathematically correct, it is wrong from a programming perspective: it enumerates the same sequence 4 times. This might be OK if the underlying object is a list or an array, but if the input is a filtered/aggregated/etc. LINQ expression, or if the data is coming directly from a database or network stream, it would cause much lower performance.

I would highly recommend not reinventing the wheel and instead using one of the better open-source math libraries, Math.NET. We have been using that library in our company and are very happy with its performance.

PM> Install-Package MathNet.Numerics

using MathNet.Numerics.Statistics;

var populationStdDev = new List<double> { 1d, 2d, 3d, 4d, 5d }.PopulationStandardDeviation();

var sampleStdDev = new List<double> { 2d, 3d, 4d }.StandardDeviation();

See http://numerics.mathdotnet.com/docs/DescriptiveStatistics.html for more information.

Lastly, for those who want the fastest possible result and are willing to sacrifice some precision, read about the "one-pass" algorithm: https://en.wikipedia.org/wiki/Standard_deviation#Rapid_calculation_methods
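As an illustration of a single-pass calculation, here is a sketch using Welford's online algorithm. Note that this is a swapped-in variant: it updates a running mean instead of accumulating a plain sum of squares, which tends to be numerically more stable. The class name `OnePassStatistics` is hypothetical, and returning 0 for fewer than two values is an assumption.

```csharp
using System;
using System.Collections.Generic;

public static class OnePassStatistics // hypothetical name
{
    // Sample standard deviation computed in a single pass (Welford's method).
    public static double SampleStdDev(IEnumerable<double> values)
    {
        long n = 0;
        double mean = 0.0;
        double m2 = 0.0; // running sum of squared differences from the current mean

        foreach (double x in values)
        {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean); // second factor uses the updated mean
        }

        // Assumption: report 0 when fewer than two values are available.
        return n < 2 ? 0.0 : Math.Sqrt(m2 / (n - 1));
    }
}
```

Because the input is enumerated exactly once, this also works on sequences that cannot be replayed, such as a network stream.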

Chris Marisic
Yuri Astrakhan
0

I see what you're doing, and I use something similar. It seems to me you're not going far enough. I tend to encapsulate all data processing into a single class; that way I can cache the calculated values until the list changes. For instance:

using System.Collections.Generic;
using System.Linq;

public class StatProcessor
{
    private readonly List<double> _data = new List<double>(); // holds the current data
    private double _avg;    // we cache the average here
    private bool _avgValid; // a flag that says whether the cached average is up to date

    // Calculate the average of the list, cache it in _avg, and set _avgValid.
    private void CalcAvg()
    {
        _avg = _data.Count > 0 ? _data.Average() : 0.0; // 0 for an empty list is an arbitrary choice
        _avgValid = true;
    }

    public double Average
    {
        get
        {
            if (!_avgValid) // if the cache is still valid, skip the calculation
                CalcAvg();  // otherwise calculate, cache, and set the flag
            return _avg;    // now _avg is guaranteed to be good, so return it
        }
    }

    // ...more stuff

    public void Add(double value)
    {
        _data.Add(value);  // add to the list and reset the flag
        _avgValid = false;
    }
}

You'll notice that with this method, only the first request for the average actually computes it. After that, as long as we don't add anything to the list (or remove or modify anything, though those aren't shown), we can get the average for basically nothing.

Additionally, since the average is used in the algorithm for the standard deviation, computing the standard deviation first will give us the average for free, and computing the average first will give us a little performance boost in the standard deviation calculation, assuming we remember to check the flag.

Furthermore, places like the average function, where you're already looping through every value anyway, are a great place to cache things like the minimum and maximum values. Of course, requests for this information need to first check whether they've been cached, and that can cause a relative slowdown compared to just finding the max using the list, since it does all the extra work of setting up all the concerned caches, not just the one you're accessing.
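To show how the cached average and a cached standard deviation could fit together, here is a rough sketch. The class name `CachedStats` is hypothetical, the population formula (dividing by n) is chosen arbitrarily, and returning 0 for an empty list is an assumption rather than anything the answer specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class CachedStats // hypothetical name
{
    private readonly List<double> _data = new List<double>();
    private double _avg, _stdDev;
    private bool _avgValid, _stdDevValid;

    public void Add(double value)
    {
        _data.Add(value);
        _avgValid = false;    // any change to the list invalidates both caches
        _stdDevValid = false;
    }

    public double Average
    {
        get
        {
            if (!_avgValid)
            {
                _avg = _data.Count > 0 ? _data.Average() : 0.0; // assumption: 0 when empty
                _avgValid = true;
            }
            return _avg;
        }
    }

    // Population standard deviation; reuses (and fills) the cached average.
    public double StandardDeviation
    {
        get
        {
            if (!_stdDevValid)
            {
                double avg = Average;
                _stdDev = _data.Count > 0
                    ? Math.Sqrt(_data.Average(v => (v - avg) * (v - avg)))
                    : 0.0; // assumption: 0 when empty
                _stdDevValid = true;
            }
            return _stdDev;
        }
    }
}
```

Reading `StandardDeviation` first fills the average cache as a side effect, so a later read of `Average` costs nothing until `Add` is called again.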

Benjamin