14

I am having an issue with the performance of a LINQ query and so I created a small simplified example to demonstrate the issue below. The code takes a random list of small integers and returns the list partitioned into several smaller lists each which totals 10 or less.

The problem is that (as I've written this) the code takes exponentially longer with N. This is only an O(N) problem. With N=2500, the code takes over 10 seconds to run on my pc.

I would appriciate greatly if someone could explain what is going on. Thanks, Mark.

int N = 250;
Random r = new Random();
var work = Enumerable.Range(1,N).Select(x => r.Next(0, 6)).ToList();
var chunks = new List<List<int>>();

// work.Dump("All the work.");  // LINQPad Print
var workEnumerable = work.AsEnumerable();

Stopwatch sw = Stopwatch.StartNew();
while(workEnumerable.Any())  // or .FirstorDefault() != null
{
    int soFar = 0;
    var chunk = workEnumerable.TakeWhile( x => 
                          {
                              soFar += x;               
                              return  (soFar <= 10);
                          }).ToList();
    chunks.Add(chunk);          // Commented out makes no difference.
    workEnumerable = workEnumerable.Skip(chunk.Count); // <== SUSPECT
}
sw.Stop();

// chunks.Dump("Work Chunks.");   // LINQPad Print
sw.Elapsed.Dump("Time elapsed.");
Mark
  • 500
  • 1
  • 5
  • 18
  • 10
    What `Skip` does is create a new `IEnumerable` that loops over the source, and only begins yielding results after the first N elements. You chain who knows how many of these after each other. Everytime you call `.Any()`, you need to loop over *all* the previously skipped elements again. – millimoose Nov 20 '12 at 23:05
  • @millimoose, you should post that as an answer – Thomas Levesque Nov 20 '12 at 23:06
  • `.Any()` doesn't seem to be an issue, it can be replaced by `.FirstOrDefault()` which should be O(1) no? – Mark Nov 20 '12 at 23:07
  • Also looks like a ***prime candidate*** for TPL (Parallel.ForEach). All you do is chunk a work load... – sehe Nov 20 '12 at 23:08
  • 2
    @Mark It's O(1) **after** you know which elements to check. It's getting these elements (from a chain of `Skip()`s) that's the slow part. – millimoose Nov 20 '12 at 23:09
  • 2
    @Mark: But the `Any()` is working on the `Skip()` -- so every time you call `Any()` it has to start from the beginning and skip all the elements before testing if there's any. – Cameron Nov 20 '12 at 23:09
  • @sehe: essentially yes i am chunking a work load, however it needs to be executed in sequence, 1 chunk at a time. – Mark Nov 20 '12 at 23:14
  • @Mark, You'll be happy to see my answer, which implements chunking by 10 elements, but a lot simpler – sehe Nov 20 '12 at 23:14

3 Answers3

10

What .Skip() does is create a new IEnumerable that loops over the source, and only begins yielding results after the first N elements. You chain who knows how many of these after each other. Everytime you call .Any(), you need to loop over all the previously skipped elements again.

Generally speaking, it's a bad idea to set up very complicated operator chains in LINQ and enumerating them repeatedly. Also, since LINQ is a querying API, methods like Skip() are a bad choice when what you're trying to achieve amounts to modifying a data structure.

millimoose
  • 36,982
  • 8
  • 75
  • 128
5

You effectively keep chaining Skip() onto the same enumerable. In a list of 250, the last chunk will be created from a lazy enumerable with ~25 'Skip' enumerator classes on the front.

You would find things become a lot faster, already if you did

workEnumerable = workEnumerable.Skip(chunk.Count).ToList();

However, I think the whole approach could be altered.

How about using standard LINQ to achieve the same:

See it live on http://ideone.com/JIzpml

using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    private readonly static Random r = new Random();

    public static void Main(string[] args)
    {
        int N = 250;
        var work = Enumerable.Range(1,N).Select(x => r.Next(0, 6)).ToList();

        var chunks = work.Select((o,i) => new { Index=i, Obj=o })
            .GroupBy(e => e.Index / 10)
            .Select(group => group.Select(e => e.Obj).ToList())
            .ToList();

        foreach(var chunk in chunks)
            Console.WriteLine("Chunk: {0}", string.Join(", ", chunk.Select(i => i.ToString()).ToArray()));
    }
}
Community
  • 1
  • 1
sehe
  • 328,274
  • 43
  • 416
  • 565
  • I should of elaborated, the output is not correct, ie it doesn't do what the original code does. – Mark Nov 20 '12 at 23:22
  • @Mark Arrrrg. I switched `(i,o)`, should have been `(o,i)` in the Select extension. Sorry. I should have tested it :). Now it works, see it live on **[http://ideone.com/JIzpml](http://ideone.com/JIzpml)** – sehe Nov 20 '12 at 23:27
  • Nice live link! I do like your solution however the problem is slightly more complex in that each `int` represents complexity, and each chunk needs to have a complexity of less than 10, or have only 1 entry (with a complexity higher than 10). I think millimouse is right, this is just a poor choice for Linq, I was trying to be fancy. – Mark Nov 20 '12 at 23:37
  • @Mark May I suggest a priority queue (see [here](http://stackoverflow.com/q/102398/85371)), and just bite of as much as you can chew – sehe Nov 21 '12 at 00:34
2

The Skip() method and others like it basically create a placeholder object, implementing IEnumerable, that references its parent enumerable and contains the logic to perform the skipping. Skips in loops, therefore, are non-performant, because instead of throwing away elements of the enumerable, like you think they are, they add a new layer of logic that's lazily executed when you actually need the first element after all the ones you've skipped.

You can get around this by calling ToList() or ToArray(). This forces "eager" evaluation of the Skip() method, and really does get rid of the elements you're skipping from the new collection you will be enumerating. That comes at an increased memory cost, and requires all of the elements to be known (so if you're running this on an IEnumerable that represents an infinite series, good luck).

The second option is to not use Linq, and instead use the IEnumerable implementation itself, to get and control an IEnumerator. Then instead of Skip(), simply call MoveNext() the necessary number of times.

KeithS
  • 65,745
  • 16
  • 102
  • 161