
To prepare an optimization in an existing software framework, I ran a standalone performance test so I could assess the potential gains before spending a large amount of time on it.

The Situation

There are N different types of components, some of which implement an IUpdatable interface - those are the interesting ones. They are grouped into M objects, each of which maintains a list of its components. Updating them works like this:

foreach (GroupObject obj in objects)
{
    foreach (Component comp in obj.Components)
    {
        IUpdatable updatable = comp as IUpdatable;
        if (updatable != null)
            updatable.Update();
    }
}

The Optimization

My goal was to optimize these updates for large numbers of grouping objects and components. The first step: make sure all components of one kind are updated in a row, by caching them in one array per kind. Essentially, this:

foreach (IUpdatable[] compOfType in typeSortedComponents)
{
    foreach (IUpdatable updatable in compOfType)
    {
        updatable.Update();
    }
}
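
For illustration, the per-kind cache could be built by grouping the components by their concrete type, e.g. via a Dictionary<Type, List<IUpdatable>>. This is just a sketch of the idea, not my actual framework code:

// Sketch: group IUpdatable components by concrete type
// (requires System.Collections.Generic and System.Linq).
Dictionary<Type, List<IUpdatable>> buckets = new Dictionary<Type, List<IUpdatable>>();
foreach (GroupObject obj in objects)
{
    foreach (Component comp in obj.Components)
    {
        IUpdatable updatable = comp as IUpdatable;
        if (updatable == null)
            continue;

        List<IUpdatable> bucket;
        if (!buckets.TryGetValue(comp.GetType(), out bucket))
        {
            bucket = new List<IUpdatable>();
            buckets.Add(comp.GetType(), bucket);
        }
        bucket.Add(updatable);
    }
}
// One array per component type, as consumed by the loop above.
IUpdatable[][] typeSortedComponents = buckets.Values.Select(b => b.ToArray()).ToArray();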

The thought behind it was that the JIT or the CPU might have an easier time operating on the same object type over and over again than on a shuffled mix of types.

In the next step, I wanted to further improve the situation by making sure that all data for one component type lies contiguously in memory, by storing it in a struct array - something like this:

foreach (ComponentDataStruct[] compDataOfType in typeSortedComponentData)
{
    for (int i = 0; i < compDataOfType.Length; i++)
    {
        compDataOfType[i].Update();
    }
}
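
Here, ComponentDataStruct stands in for a plain struct holding the per-component data - again only a sketch of the shape I have in mind, not the real type:

// Sketch: a plain struct holding the per-component data. Stored in a flat
// struct array, all instances lie contiguously in memory, and
// compDataOfType[i].Update() modifies the array element in place.
public struct ComponentDataStruct
{
    public float A, B, C, D;

    public void Update()
    {
        // Some primitive math work, similar in spirit to the test workload.
        A = A * B + C - D;
    }
}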

The Problem

In my standalone performance tests, there is no significant performance gain from either of these changes, and I'm not sure why. "No significant performance gain" means: with 10,000 components, each batch running 100 update cycles, all main tests take around 85 milliseconds +/- 2 milliseconds.

(The only difference arises from introducing the as cast and if check, but that's not really what I was testing for.)

  • All tests were performed in Release mode, without attached debugger.
  • External disturbances were reduced by using this code:

        // requires System.Diagnostics (Process) and System.Threading (Thread)
        Process currentProc = Process.GetCurrentProcess();
        Thread currentThread = Thread.CurrentThread;
        currentProc.ProcessorAffinity = new IntPtr(2);
        currentProc.PriorityClass = ProcessPriorityClass.High;
        currentThread.Priority = ThreadPriority.Highest;
    
  • Each test actually did some primitive math work, so it's not just measuring empty method calls which could potentially be optimized away.

  • Garbage Collection was performed explicitly before each test, to rule out that interference as well (a simplified sketch of the measurement pattern follows this list).
  • The full source code (VS Solution, Build & Run) is available here
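
A simplified sketch of the kind of measurement loop I mean (RunUpdateCycle is a placeholder for one of the update loop variants above; Stopwatch is from System.Diagnostics):

// Illustrative measurement pattern, not the exact harness from the solution.
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();

Stopwatch watch = Stopwatch.StartNew();
for (int cycle = 0; cycle < 100; cycle++)
{
    RunUpdateCycle(); // placeholder: one of the update loop variants above
}
watch.Stop();
Console.WriteLine("{0} ms", watch.ElapsedMilliseconds);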

I would have expected a significant change due to the contiguous memory layout and the repetition in update patterns. So my core question is: why wasn't I able to measure a significant improvement? Am I overlooking something important, or did I miss something in my tests?

    @Jamel The post you linked to is for Swift, not for C#. How is that related? – Ed Chapel Jul 05 '15 at 15:37
  • _The thought behind it was that ..._ - that thought was wrong. Don't 'guess' where the problem is, measure. – Henk Holterman Jul 05 '15 at 15:58
  • http://stackoverflow.com/questions/7484735/c-struct-vs-class-faster – Jamil Jul 05 '15 at 16:02
  • @HenkHolterman I'm asking for help in interpreting my performance measurement results - not quite sure what to make of your statement? – Adam Jul 05 '15 at 16:11
  • What does Update do? Right now you manage to perform 10m calls per second which is a lot slower than I would expect for those loops. – usr Jul 05 '15 at 16:20
  • You made an assumption that still looks really weird to me, changed (optimized) something based on that assumption and found no change. What do you think is wrong, your assumption or your measurements? – Henk Holterman Jul 05 '15 at 17:16
  • @HenkHolterman Both my assumption and my measurements could be wrong. I made an assumption, I tried to verify it, I failed. Now I want to find out, why exactly that is. That's why I asked. To me, your comment of "don't guess, measure" doesn't seem to make sense when issued to someone who just performed a measurement and is genuinely trying to discuss it. "Your assumption was wrong" is a valid answer and certainly among the ones I was looking for - I was just thrown off a little by the comment after that. – Adam Jul 05 '15 at 17:33
  • Measuring always means using instrumentation on the _actual_ software and find the critical sections. – Henk Holterman Jul 05 '15 at 17:38

1 Answer


The main reason you might traditionally prefer the latter implementation is locality of reference. If the contents of the array fit into the CPU cache, then your code runs a lot faster. Conversely, if you have a lot of cache misses, your code runs much more slowly.

Your mistake, I suspect, is that the objects in your first test probably already have good locality of reference. If you allocate a lot of small objects all at once, those objects are likely to be contiguous in memory even though they're on the heap. (I'm looking for a better source for that, but I've experienced the same thing anecdotally in my own work.) Even if they aren't already contiguous, the GC might be moving them around such that they are. Modern CPUs have large caches, so the entire data structure may well fit in L2 cache, since there isn't much else around to compete with it. And even when the cache isn't that large, modern CPUs have gotten very good at predicting usage patterns and prefetching.

It may also be the case that your code has to box/unbox your structs. This seems unlikely, however, if the performance is really so similar.
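
To make that concrete: if the structs end up stored behind the interface, every element becomes a separate boxed copy on the heap, which defeats the point of a struct array. A minimal sketch, not taken from the question's code - it only assumes the IUpdatable interface described there:

// Illustration of boxing when a struct is stored behind an interface.
public struct DataStruct : IUpdatable
{
  public float Value;
  public void Update() { Value += 1f; }
}

public static class BoxingDemo
{
  public static void Run()
  {
    // Each element is a separate boxed copy on the heap; the array only
    // holds references to those boxes, so the float data is scattered.
    IUpdatable[] boxed = { new DataStruct(), new DataStruct() };
    boxed[0].Update();   // updates the boxed copy

    // Keeping the concrete struct type avoids boxing and keeps data packed.
    DataStruct[] unboxed = new DataStruct[2];
    unboxed[0].Update(); // updates the array element in place
  }
}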

The big thing with low-level stuff like this in C# is that you really need to either a) trust the framework to do its job, or b) profile under realistic conditions after you've identified a low-level performance issue. I appreciate that this may be a toy project, or you may just be playing around with memory optimisation for giggles, but a priori optimisation as you've done in your OP is quite unlikely to yield appreciable performance improvements at project scale.

I haven't yet gone through your code in detail, but I suspect your problem here is unrealistic conditions. With more memory pressure, and especially more dynamic allocation of components, you might see the performance differential you expect. Then again, you might not, which is why it's so important to profile.

It's worth noting that if you know for certain in advance that strict manual optimisation of memory locality is critical to the proper functionality of your application, you may need to consider whether a managed language is the correct tool for the job.

Edit: Yeah, the problem is almost certainly here:

public static void PrepareTest()
{
  data = new Base[Program.ObjCount]; // 10000
  for (int i = 0; i < data.Length; i++)
    data[i] = new Data(); // Data consists of four floats
}

Those 10,000 instances of Data are probably contiguous in memory. Furthermore, they probably all fit in your cache anyway, so I doubt you'd see any performance impact from cache misses in this test.
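
If you want to provoke the effect, one option - purely a sketch, untested against your solution, with PrepareScatteredTest and padding as made-up names - would be to break that contiguity in the preparation step:

// Sketch: interleave throw-away allocations so consecutive Data instances
// are less likely to end up as neighbours in memory.
static List<byte[]> padding; // kept reachable so the gaps survive collections

public static void PrepareScatteredTest()
{
  padding = new List<byte[]>();
  data = new Base[Program.ObjCount];
  for (int i = 0; i < data.Length; i++)
  {
    data[i] = new Data();
    padding.Add(new byte[256]); // junk allocation between two Data objects
  }
  // Dropping part of the padding afterwards fragments the heap a little more.
  for (int i = 0; i < padding.Count; i += 2)
    padding[i] = null;
}

Whether that comes any closer to your real framework's allocation pattern is a separate question, which is again why profiling under realistic conditions matters.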
