Refactoring LINQ methods

Question

I have a function that takes a DataTable as a parameter and returns an object of type NormalData for each column in the data table NormalData definition

public class NormalData
{
    //AttributeName = ColumnName of DataTable
    public string AttributeName { get; set; }

    //each column will have its mean and standard deviation computed
    public double Mean { get; set; }
    public double StandardDeviation { get; set; }

    //a DataTable with three columns will create an IEnumerable<NormalData>
    //with a count of three
}

The following works, but I would like a second opinion on how I implemented it:

public static IEnumerable<NormalData>  GetNormalDataByTableColumns(DataTable dt)
{
    //get list of column names to iterate over
    List<string> columnList = GetDataTableColumnNames(dt);
    List<NormalData> normalDataList = new List<NormalData>();
    for (int i = 0; i < columnList.Count; i++)
    {
        //creates a NormalData object for each column in the DataTable
        NormalData normalData = new NormalData();

        //find average
        normalData.Mean = GetColumnAverage(dt, columnList[i]);

        //find stDev
        normalData.StandardDeviation = GetColumnStDev(dt,columnList[i],normalData.Mean);
        normalData.AttributeName = columnList[i];

        //add to NormalDataList
        normalDataList.Add(normalData);
    }

    return normalDataList;
}

private static List<string> GetDataTableColumnNames(DataTable dt)
{
    return (from DataColumn dc in dt.Columns
        select dc.ColumnName).ToList();
}

private static double GetColumnAverage(DataTable dt, string columnName)
{
    return dt.AsEnumerable().Average(x => x.Field<double>(columnName));
}

private static double GetColumnStDev(DataTable dt, string columnName,double average)
{
    var squaredDiffs = (dt.AsEnumerable()
        .Sum(x => (x.Field<double>(columnName) - average) *
        x.Field<double>(columnName) - average));

    return Math.Sqrt(squaredDiffs / dt.Rows.Count);
}

What I feel is poor design is the parameter list that GetColumnAverage and GetColumnStDev are required to take. In reality, they should only need a list of numeric types (not necessarily double, but it's hardcoded at the moment) to compute their values. This is, however, the only way I've gotten this to work this morning. What are some rules that I'm breaking here in this design? How can I amend this so that the GetColumn.. functions only take the DataColumn that's being iterated over in the for loop of columnList?

EDIT: the average variable is changed for each column and cannot be reused. Or is it possible that this is OK design and I need to have overloaded versions of these methods if I don't need to compute the standard deviation and yes, only the average?

This question appears to be off-topic because it is asking for a code review, rather than asking a specific programming question. — Servy, Dec 17 '13 at 17:24
@Servy Of course I'm biased, but I disagree. I believe that there's a fundamental concept of factoring LINQ methods that I'm not implementing correctly, and that's really what I'm getting at. If the majority agrees with you, however, I have no problem with the question being migrated to CodeReview. — wootscootinboogie, Dec 17 '13 at 17:26
Unless you expect the method to return an object named St. Dev, consider a more meaningful name. — Magus, Dec 17 '13 at 17:32
I'd split `GetColumnStDev` into two parts. One part `StdDev` takes an `IEnumerable` similar to LINQ's `Average`, the other selects the correct column and calls `StdDev`. — CodesInChaos, Dec 17 '13 at 17:42
@CodesInChaos that's a good idea. I had done this same sort of thing earlier in JavaScript and I took for granted how cool lexical scoping is. — wootscootinboogie, Dec 17 '13 at 17:50

score 1 · Answer 1 · edited May 23 '17 at 12:24

Disclaimer: Been a while since I've touched .NET so this may all be nonsense :)

When you look at static methods with one or two parameters like this, I always ask my self "can this method belong to the object itself", so as an example if you have a method like:

public static String FullAddress(Address address){
 //Build full address from address properties
}

That would be glaringly obvious to me that this method should belong on the Address object directly i.e. address.FullAddress.

Also, I look at method names to see if what I am passing in is really relevant to the task. For something like average and standard deviation these methods should only really take in the values and have no interest whether they come from a DataTable, file or anything else.

With that in mind my first task would be to extract the average calculations i.e. (just showing interface as you guess implementation)

interface NormalCalculator {
  double CalculateAverage(IEnumerable<double> values)
  double CalculateStDev(IEnumerable<double> values) 
}

Back to what I initially said about seeing a method with one input, you ask if the methods can live directly on IEnumerable instead of as a separate class. Now, we can only do this as extension methods, and as your aware there is already an Average extension method.

So the question then is why not add another extension method to IEnumerable to calculate the standard deviation.

Something like: Standard Deviation in LINQ

Additionally, I don't think it would be too much abuse to add an extension method to get the get the values from a i.e.

public static class DataTableExtensions
{
    public static IEnumerable<Double> Values(this DataTable dt, DataColumn column)
    {
       return dt.AsEnumerable().Select(x => x.Field<double>(column.ColumnName));
    }
}

Then all your left with in you transformation method is:

    public static IEnumerable<NormalData>  GetNormalDataByTableColumns(DataTable dt)
    {
        dt.Columns.Select(c => {
          var values = dt.Values(c);
          return new NormalData(values.Average(), values.StDev, c.ColoumnName); //Added constructor too as it makes sense to me, but can use named args if not 
        });
    }

Finally, this is basically a mapper/adaptor/transformer, whatever you want to call it (i.e. it takes in one object and maps/adapts or transforms it to another).

I don't know if your testing much in your app but having this as a static utility method means that each time you might want to test something that uses NormalData you'd have to create a DataTable with the correct initial data setup. I would move this out into a separate mapping/transforming class i.e.

interface INormalDataMapper {
   NormalData Map(DataTable dataTable);
}

This was you can mock the Map method return with your POCO object. Though that might be too much info :).

Chris

score 1 · Accepted Answer · edited Feb 08 '17 at 14:49

You can replace your for loop by a foreach loop using a meaningful iterator columnName instead of the meaningless i.

Alternatively you can replace the loop by Select.
I'd put the column selection logic in that loop, separating it from the computations.
You don't need to pass the average to the StdDev function if you use the following identity:

(from http://en.wikipedia.org/wiki/Standard_deviation)

Your loop becomes:

foreach (string columnName in columnList)
{
    var columnData = dt.AsEnumerable().Select(x => x.Field<double>(columnName));

    //creates a NormalData object for each column in the DataTable
    NormalData normalData = new NormalData();
    normalData.Mean = columnData.Average();
    normalData.StandardDeviation = StdDev(columnData);
    normalData.AttributeName = columnName;

    //add to NormalDataList
    normalDataList.Add(normalData);
}

with a helper method:

public static double StdDev(IEnumerable<double> seq)
{
    long count = 0;
    double sum = 0;
    double sumOfSquares = 0;
    foreach(var value in seq)
    {
        sum += value;
        sumOfSquares += value * value;
        count++;
    }
    double average = sum / count;
    double averageSquare = sumOfSquares / count;
    return Math.Sqrt(averageSquare - average * average);
}

score 1 · Answer 3 · answered Dec 17 '13 at 18:19

How about something like this:

public sealed class NormalData
{
    private readonly string _attributeName;
    private uint _count;
    private double _sum;
    private double _sumOfSquares;

    private NormalData(string attributeName)
    {
        _attributeName = attributeName;
    }

    public string AttributeName 
    {
        get { return _attributeName; }
    }

    public double Mean
    {
        get { return _count == 0 ? double.NaN : _sum / _count; }
    }

    public double StandardDeviation
    {
        get
        {
            if (_count == 0) return double.NaN;

            var diff = _sumOfSquares - (_sum * _sum / _count);
            return Math.Sqrt(diff / _count);
        }
    }

    public double EstimatedStandardDeviation
    {
        get
        {
            if (_count < 2) return double.NaN;

            var diff = _sumOfSquares - (_sum * _sum / _count);
            return Math.Sqrt(diff / (_count - 1));
        }
    }

    public void Add(double value)
    {
        _count = checked(_count + 1);
        _sum += value;
        _sumOfSquares += (value * value);
    }

    public static NormalData Create(string attributeName)
    {
        return new NormalData(attributeName, 0, 0, 0);
    }
}

public static IEnumerable<NormalData>  GetNormalDataByTableColumns(DataTable dt)
{
    var normalDataList = dt.Columns.Cast<DataColumn>().Select(c => NormalData.Create(c.ColumnName)).ToList();

    foreach (DataRow row in dt.Rows)
    {
        foreach (NormalData item in normalDataList)
        {
            double value = row.Field<double>(item.AttributeName);
            item.Add(value);
        }
    }

    return normalDataList;
}

Refactoring LINQ methods

3 Answers3