97

Yesterday I found an article by Christoph Nahr titled ".NET Struct Performance", which benchmarked several languages (C++, C#, Java, JavaScript) using a method that adds two point structs (double tuples).

As it turned out, the C++ version takes about 1000 ms to execute (1e9 iterations), while C# cannot get under ~3000 ms on the same machine (and performs even worse in x64).

To test it myself, I took the C# code (simplified slightly to call only the method where parameters are passed by value) and ran it on an i7-3610QM machine (3.1 GHz single-core boost), 8 GB RAM, Win 8.1, using .NET 4.5.2, 32-bit RELEASE build (x86 under WoW64, since my OS is 64-bit). This is the simplified version:

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

public static class CSharpTest
{
    private const int ITERATIONS = 1000000000;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Point AddByVal(Point a, Point b)
    {
        return new Point(a.X + b.Y, a.Y + b.X);
    }

    public static void Main()
    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms", 
            a.X, a.Y, sw.ElapsedMilliseconds);
    }
}

With Point defined as simply:

public struct Point 
{
    private readonly double _x, _y;

    public Point(double x, double y) { _x = x; _y = y; }

    public double X { get { return _x; } }

    public double Y { get { return _y; } }
}

Running it produces results similar to those in the article:

Result: x=1000000001 y=1000000001, Time elapsed: 3159 ms

First strange observation

Since the method should be inlined, I wondered how the code would perform if I removed the structs altogether and simply inlined the whole thing myself:

public static class CSharpTest
{
    private const int ITERATIONS = 1000000000;

    public static void Main()
    {
        // not using structs at all here
        double ax = 1, ay = 1, bx = 1, by = 1;

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
        {
            ax = ax + by;
            ay = ay + bx;
        }
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms", 
            ax, ay, sw.ElapsedMilliseconds);
    }
}

I got practically the same result (actually 1% slower after several retries), meaning that the JITter seems to be doing a good job of optimizing away all the function calls:

Result: x=1000000001 y=1000000001, Time elapsed: 3200 ms

It also means that the benchmark doesn't really measure struct performance at all; it only seems to measure basic double arithmetic (after everything else gets optimized away).

The weird stuff

Now comes the weird part. If I merely add another stopwatch outside the loop (yes, I narrowed it down to this crazy step after several retries), the code runs three times faster:

public static void Main()
{
    var outerSw = Stopwatch.StartNew();     // <-- added

    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }

    outerSw.Stop();                         // <-- added
}

Result: x=1000000001 y=1000000001, Time elapsed: 961 ms

That's ridiculous! And it's not like Stopwatch is giving me wrong results because I can clearly see that it ends after a single second.
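
One way to double-check the Stopwatch reading is to time the same loop against the wall clock as well (just a minimal sanity-check sketch, not part of the original benchmark; DateTime has coarse resolution, but that is negligible at this scale):

// Sanity-check sketch: compare Stopwatch against wall-clock time.
// Note: adding any extra local before the loop may itself perturb the timing,
// which is exactly the effect this question is about.
var wallStart = DateTime.UtcNow;

Point a = new Point(1, 1), b = new Point(1, 1);
var sw = Stopwatch.StartNew();
for (int i = 0; i < ITERATIONS; i++)
    a = AddByVal(a, b);
sw.Stop();

var wallMs = (long)(DateTime.UtcNow - wallStart).TotalMilliseconds;
Console.WriteLine("Stopwatch: {0} ms, wall clock: {1} ms",
    sw.ElapsedMilliseconds, wallMs);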

Can anyone tell me what might be happening here?

(Update)

Here are two methods in the same program, which show that the reason is not JITting:

public static class CSharpTest
{
    private const int ITERATIONS = 1000000000;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Point AddByVal(Point a, Point b)
    {
        return new Point(a.X + b.Y, a.Y + b.X);
    }

    public static void Main()
    {
        Test1();
        Test2();

        Console.WriteLine();

        Test1();
        Test2();
    }

    private static void Test1()
    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Test1: x={0} y={1}, Time elapsed: {2} ms", 
            a.X, a.Y, sw.ElapsedMilliseconds);
    }

    private static void Test2()
    {
        var swOuter = Stopwatch.StartNew();

        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Test2: x={0} y={1}, Time elapsed: {2} ms", 
            a.X, a.Y, sw.ElapsedMilliseconds);

        swOuter.Stop();
    }
}

Output:

Test1: x=1000000001 y=1000000001, Time elapsed: 3242 ms
Test2: x=1000000001 y=1000000001, Time elapsed: 974 ms

Test1: x=1000000001 y=1000000001, Time elapsed: 3251 ms
Test2: x=1000000001 y=1000000001, Time elapsed: 972 ms

Here is a pastebin. You need to run it as a 32-bit release on .NET 4.x (there are a couple of checks in the code to ensure this).
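
Such checks can look roughly like this (a sketch; the actual guards in the pastebin may differ):

// Sketch of environment guards: make sure the benchmark runs as a 32-bit process,
// on the .NET 4.x CLR, and that no debugger is suppressing JIT optimizations.
if (Environment.Is64BitProcess)
    throw new InvalidOperationException("Run this as a 32-bit (x86) process.");
if (Environment.Version.Major < 4)
    throw new InvalidOperationException("Run this on the .NET 4.x runtime.");
if (Debugger.IsAttached)
    Console.WriteLine("Warning: a debugger is attached; the JIT will not fully optimize.");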

(Update 4)

Following @usr's comments on @Hans' answer, I checked the optimized disassembly for both methods, and they are rather different:

Test1 on the left, Test2 on the right

This seems to show that the difference might be due to the compiler acting funny in the first case, rather than to double field alignment?

Also, if I add two variables (a total offset of 8 bytes), I still get the same speed boost, so it no longer seems to be related to the field alignment mentioned by Hans Passant:

// this is still fast?
private static void Test3()
{
    var magical_speed_booster_1 = "whatever";
    var magical_speed_booster_2 = "whatever";

    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Test2: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }

    GC.KeepAlive(magical_speed_booster_1);
    GC.KeepAlive(magical_speed_booster_2);
}
Groo
  • Can you run the 2 different tests in the same program? Like `void Main(){ Test1(); Test2(); Test1(); Test2();}` The reason for this: I haven't seen your results in a single process space yet. – Stefan Aug 20 '15 at 09:34
  • Beside the JIT thing it also depends on the optimizations of the compiler, the newest Ryujit does more optimizations and even introduced limited SIMD instructions support. – Felix K. Aug 20 '15 at 09:34
  • @Stefan: good point, I'll add it. – Groo Aug 20 '15 at 09:35
  • @FelixK: I think these are mostly x64 optimizations, and you explicitly need to use the `Numerics` namespace to get SIMD stuff. – Groo Aug 20 '15 at 09:36
  • Jon Skeet found a performance problem with readonly fields in structs: [Micro-optimization: the surprising inefficiency of readonly fields](http://codeblog.jonskeet.uk/2014/07/16/micro-optimization-the-surprising-inefficiency-of-readonly-fields/). Try making the private fields non-readonly. – dbc Aug 20 '15 at 09:38
  • @dbc: I did a test with only local `double` variables, no `struct`s, so I've ruled out struct layout/method call inefficiencies. – Groo Aug 20 '15 at 09:42
  • I don't see a difference between the 2 on Win7 i7 3770. Also, it takes ~15 seconds. I think this might be CPU related. Edit: Scrap that: Was running a debug build. Same (and faster times) in release mode. – leppie Aug 20 '15 at 10:09
  • I couldn't reproduce this. On my i7something the outer SW makes no difference at all. – Henk Holterman Aug 20 '15 at 10:09
  • How did you run this, a Release build outside Studio? Done any overclocking lately? – Henk Holterman Aug 20 '15 at 10:10
  • @HenkHolterman: Optimized build, run as Start w/o debugging. Got 2700ms and 800ms respectively. – leppie Aug 20 '15 at 10:11
  • Seems to only happen on 32-bit, with RyuJIT, I get 1600ms both times. – leppie Aug 20 '15 at 10:14
  • Running this in JetBrains profiler using line by line tracing also makes the difference go away, while it is there when running w/o the profiler attached – Roger Johansson Aug 20 '15 at 10:17
  • @Groo - make sure you post the _exact_ code, and I think the question currently suffers from too much variations and side branches. Try to shorten it. – Henk Holterman Aug 20 '15 at 10:18
  • Switching to CLR 2.0, I get 800ms both times. So there seems to be an issue with CLR 4.0 on 32-bit. – leppie Aug 20 '15 at 10:19
  • I'm getting slower result with the stopwatch - 2100ms with the stopwatch, 1200 without it. – Kobi Aug 20 '15 at 10:25
  • @Groo - please do specify the .NET version. I can't see a speed-up for CLR 2 but on Fx4.5.2 it's a lot slower. – Henk Holterman Aug 20 '15 at 10:26
  • if you `Stopwatch.StartNew();` before `Test1()` in `Main`, everything is constant too. Seems something is wrong with `Stopwatch` in this scenario. – leppie Aug 20 '15 at 10:26
  • Even `new Stopwatch()` does the trick. – leppie Aug 20 '15 at 10:31
  • I'm getting opposite results. `Test2` takes x3 time on x32. – Yuval Itzchakov Aug 20 '15 at 10:33
  • Experimentation points to the static constructor of `Stopwatch`, did this behavior not change on CLR 4? – leppie Aug 20 '15 at 10:37
  • Another funny: If you add `Console.WriteLine(Stopwatch.Frequency);` after `var sw = Stopwatch.StartNew();`, both run the 'slow' time. – leppie Aug 20 '15 at 10:39
  • @HenkHolterman: I believe the last program with two test functions (`Test1`/`Test2`) should be the best to use because it shows the difference. Yes, this was a Release build, .NET 4.5.2, I ran it with Ctrl+F5 (and I just tried simply running the exe in `bin\x86\Release\`, same results). – Groo Aug 20 '15 at 10:42
  • @leppie See my answer below. This is not an issue with `Stopwatch`, it seems to be a deeper issue. Putting a simple conditional `if` statement in the `Console` output makes the whole thing faster too. Weird. – InBetween Aug 20 '15 at 10:49
  • @InBetween: Something is funny, see my answer :) I suspect the `Stopwatch` static initializer is run more than once. – leppie Aug 20 '15 at 10:51
  • @HenkHolterman: and [here](http://pastebin.com/9RykzS6g) is the pastebin, if you get some time to check it. – Groo Aug 20 '15 at 11:02
  • When changing double to float in the struct, you get the same difference (it is quite a bit slower though). – leppie Aug 20 '15 at 11:03
  • @leppie: and with `int`s it's the other way around, the "boosted" version becomes slower. – Groo Aug 20 '15 at 11:07
  • I have looked at the disassembly of both methods. There is nothing interesting to see. Test1 generates inefficient code without apparent reason. JIT bug or by design. In Test1 the JIT loads and stores the doubles for each iteration to the stack. This could be to ensure exact precision because the x86 float unit uses 80 bit internal precision. I found that any non-inlined function call at the top of the function makes it go fast again. – usr Aug 20 '15 at 11:13
  • RyuJIT seems to always generate the slow kind of code. It loads two doubles from the stack for each iteration. These seem to be the constants `1.0`. Disappointing. I'm happy that it blasted the struct to registers, though. – usr Aug 20 '15 at 12:35

4 Answers

75

There is a very simple way to always get the "fast" version of your program. Project > Properties > Build tab, untick the "Prefer 32-bit" option, ensure that the Platform target selection is AnyCPU.

You really don't prefer 32-bit; unfortunately, it is always turned on by default for C# projects. Historically, the Visual Studio toolset worked much better with 32-bit processes, an old problem that Microsoft has been chipping away at. Time to get that option removed; VS2015 in particular addressed the last few real road-blocks to 64-bit code with a brand-new x64 jitter and universal support for Edit+Continue.
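
If you'd rather edit the project file than click through the IDE, the equivalent change boils down to these MSBuild properties in the .csproj (a sketch; where exactly they live depends on your project's PropertyGroup layout):

<!-- in the .csproj, for the relevant build configuration -->
<PropertyGroup>
  <PlatformTarget>AnyCPU</PlatformTarget>
  <Prefer32Bit>false</Prefer32Bit>
</PropertyGroup>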

Enough chatter, what you discovered is the importance of alignment for variables. The processor cares about it a great deal. If a variable is misaligned in memory, then the processor has to do extra work to shuffle the bytes to get them in the right order. There are two distinct misalignment problems: one where the bytes are still inside a single L1 cache line, which costs an extra cycle to shift them into the right position; and the extra bad one, the one you found, where part of the bytes are in one cache line and part in another. That requires two separate memory accesses and gluing them together. Three times as slow.

The double and long types are the trouble-makers in a 32-bit process. They are 64 bits in size and can thus get misaligned by 4; the CLR can only guarantee a 32-bit alignment. Not a problem in a 64-bit process, all variables are guaranteed to be aligned to 8. This is also the underlying reason why the C# language cannot promise them to be atomic, and why arrays of double are allocated in the Large Object Heap when they have more than 1000 elements: the LOH provides an alignment guarantee of 8. And it explains why adding a local variable solved the problem: an object reference is 4 bytes, so it moved the double variable by 4, now getting it aligned. By accident.
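
You can see the alignment of stack locals for yourself with a small unsafe sketch (not part of the original answer; requires "Allow unsafe code" and an x86 build). Keep in mind that taking the address forces the locals onto the stack, so this shows where the stack slots land, not what the JIT does when it can keep the values in registers:

// Sketch: print the stack addresses of two double locals in a 32-bit process.
// A remainder of 4 means the double is only 4-aligned and can straddle a cache line.
private static unsafe void DumpLocalAlignment()
{
    double x = 1, y = 1;
    Console.WriteLine("x: address % 8 = {0}, y: address % 8 = {1}",
        (ulong)&x % 8, (ulong)&y % 8);
}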

A 32-bit C or C++ compiler does extra work to ensure that double cannot be misaligned. Not exactly a simple problem to solve: the stack can be misaligned when a function is entered, given that the only guarantee is that it is aligned to 4. The prologue of such a function needs to do extra work to get it aligned to 8. The same trick doesn't work in a managed program; the garbage collector cares a great deal about where exactly a local variable is located in memory. Necessary so it can discover that an object in the GC heap is still referenced. It cannot deal properly with such a variable getting moved by 4 because the stack was misaligned when the method was entered.

This is also the underlying problem with .NET jitters not easily supporting SIMD instructions. They have much stronger alignment requirements, the kind that the processor cannot solve by itself either. SSE2 requires an alignment of 16, AVX requires an alignment of 32. Can't get that in managed code.

Last but not least, also note that this makes the perf of a C# program that runs in 32-bit mode very unpredictable. When you access a double or long that's stored as a field in an object, perf can drastically change when the garbage collector compacts the heap. Compacting moves objects in memory, so such a field can suddenly become misaligned (or aligned again). Very random of course, can be quite a head-scratcher :)
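
The field case can be watched in the same way (again just a hedged sketch, unsafe code required; whether the object actually moves depends on what the GC decides to do):

// Sketch: the alignment of a double field in a heap object can change after the GC
// compacts the heap and moves the object.
class Holder { public double Value; }

static unsafe void DumpFieldAlignment()
{
    var garbage = new byte[50000];   // allocated before 'holder' so that collecting it lets 'holder' slide down
    var holder = new Holder();

    fixed (double* p = &holder.Value)
        Console.WriteLine("before GC: address % 8 = {0}", (ulong)p % 8);

    garbage = null;
    GC.Collect();                    // a compacting collection may now move 'holder'

    fixed (double* p = &holder.Value)
        Console.WriteLine("after GC:  address % 8 = {0}", (ulong)p % 8);
}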

Well, no simple fixes but one: 64-bit code is the future. Remove the jitter forcing until Microsoft changes the project template. Maybe in the next version, when they feel more confident about RyuJIT.

Hans Passant
  • Thanks for the suggestion. Unfortunately, while the x64 compiler in VS2012 (pre-RyuJIT) does give consistent results, it actually makes both paths run twice as slow: 6 seconds in both cases! – Groo Aug 20 '15 at 11:48
  • Is there an easy way to see this misalignment? I tried getting a pointer in `unsafe` and it didn't look any different (but I don't know what I am doing). Actually, I'm not getting the same effect as everyone else... – Kobi Aug 20 '15 at 11:58
  • @Kobi: add `GC.KeepAlive(your_variable);` at the end of the method to ensure compiler doesn't remove it. – Groo Aug 20 '15 at 12:00
  • Not sure how alignment plays into this when the double variables could be (and are in Test2) enregistered. Test1 uses the stack, Test2 does not. – usr Aug 20 '15 at 12:22
  • @usr: how did you find the `Test2` variables to be placed in registers? I checked the disassembly but didn't find any differences between these two methods. – Groo Aug 20 '15 at 13:30
  • @Groo if you didn't find any differences how could performance be different? Maybe you looked at unoptimized code or at the wrong version of the code. The OP has posted many. – usr Aug 20 '15 at 13:31
  • @usr: I am the OP. :-) Well, the performance could be different due to misalignment, as Hans suggested. I haven't spotted any differences between these methods, but if I add a single 32-bit variable on top of the 2nd method, it gets the speedup (see the "Solution" part at the end of my question). If you try to compare these two methods, I am pretty sure you will find that the second one only has an additional variable placed on the stack. – Groo Aug 20 '15 at 13:35
  • @Groo OK. Your "Update 2" shows the difference unmodified. .NET 4.5.2, x86, Release, no debugger attached. Open Debugger=>Modules to make sure that the EXE is being optimized. Very clear code difference in the loop. You can find the loop by finding the backedge jump. That's the end of the loop. You can use Debugger.Break(); to late attach the debugger or you need to change VS settings. – usr Aug 20 '15 at 13:37
  • @Groo they fit onto the x87 floating point stack. You should see these instructions: https://en.wikipedia.org/wiki/X86_instruction_listings#x87_floating-point_instructions That's not a memory stack though. It's like registers. – usr Aug 20 '15 at 13:52
  • @Groo http://pastebin.com/0X3RWESJ that's Test1. Can you reproduce that? Look at the loop. Just terrible code with loads and strange moves that I do not understand. – usr Aug 20 '15 at 13:57
  • You're right: [**disassembly is different for these two methods**](http://i.imgur.com/HJAdaMY.png). @Hans, does this mean it's JITter's fault after all? Also, if I add **two** or **three** 32-bit reference variables on top of `Test2`, it's still faster in all benchmarks (and the disassembly is indeed simpler as usr wrote), so it doesn't seem to be related to alignment after all? – Groo Aug 20 '15 at 14:13
  • This question is changing too fast for me to keep track of. You have to watch out for the test itself affecting the outcome of the test. You need to put [MethodImpl(MethodImplOptions.NoInlining)] on the test methods to compare apples to oranges. You'll now see that the optimizer can keep the variables on the FPU stack in both cases. – Hans Passant Aug 20 '15 at 14:33
  • @Hans: I did, `AddByVal` is marked with `AggressiveInlining`, while `Test1` and `Test2` are `NoInlining`. But nevertheless, `Test2` only uses `fld1`/`faddp` twice inside the loop and keeps both values in the floating point stack, while `Test1` constantly loads and stores these same values between these calls. – Groo Aug 20 '15 at 14:38
  • Just swap the Test1() and Test2() calls in the Main() method to get happy numbers again. – Hans Passant Aug 20 '15 at 15:33
  • Omg, it's true. Why does the method alignment have any impact on the instructions generated?! There should not be any difference for the loop body. All should be in registers. The alignment prologue should be irrelevant. Still seems like a JIT bug. – usr Aug 20 '15 at 15:46
  • @Hans: thanks, that's correct. But I still don't see the connection with alignment: whether the variables are aligned or not, shouldn't JIT reorder them into the fp stack registers and then work on them inside the loop in both cases? – Groo Aug 20 '15 at 21:20
  • I have to significantly revise the answer, bummer. I'll get to it by tomorrow. – Hans Passant Aug 20 '15 at 22:37
  • x87 is very slow. SSE is much faster – phuclv Aug 21 '15 at 06:40
  • @HansPassant are you going to dig through the JIT sources? That would be fun. At this point all I know is it's a random JIT bug. – usr Aug 21 '15 at 10:16
  • @HansPassant: Why is this not an issue with CLR 2.0 JIT? Seems like regression to me. – leppie Aug 21 '15 at 18:55
  • I wish I could upvote this answer a dozen more times – Slugart Jan 11 '17 at 22:24
10

Update 4 explains the problem: in the first case, the JIT keeps the calculated values (a, b) on the stack; in the second case, the JIT keeps them in registers.

In fact, Test1 works slowly because of the Stopwatch. I wrote the following minimal benchmark based on BenchmarkDotNet:

[BenchmarkTask(platform: BenchmarkPlatform.X86)]
public class Jit_RegistersVsStack
{
    private const int IterationCount = 100001;

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithoutStopwatch()
    {
        double a = 1, b = 1;
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1  
            // faddp       st(1),st
            a = a + b;
        }
        return string.Format("{0}", a);
    }

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithStopwatch()
    {
        double a = 1, b = 1;
        var sw = new Stopwatch();
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1  
            // fadd        qword ptr [ebp-14h]
            // fstp        qword ptr [ebp-14h]
            a = a + b;
        }
        return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
    }

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithTwoStopwatches()
    {
        var outerSw = new Stopwatch();
        double a = 1, b = 1;
        var sw = new Stopwatch();
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1  
            // faddp       st(1),st
            a = a + b;
        }
        return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
    }
}

The results on my computer:

BenchmarkDotNet=v0.7.7.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4702MQ CPU @ 2.20GHz, ProcessorCount=8
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit  [RyuJIT]
Type=Jit_RegistersVsStack  Mode=Throughput  Platform=X86  Jit=HostJit  .NET=HostFramework

             Method |   AvrTime |    StdDev |       op/s |
------------------- |---------- |---------- |----------- |
   WithoutStopwatch | 1.0333 ns | 0.0028 ns | 967,773.78 |
      WithStopwatch | 3.4453 ns | 0.0492 ns | 290,247.33 |
 WithTwoStopwatches | 1.0435 ns | 0.0341 ns | 958,302.81 |

As we can see:

  • WithoutStopwatch works quickly (because a = a + b uses the registers)
  • WithStopwatch works slowly (because a = a + b uses the stack)
  • WithTwoStopwatches works quickly again (because a = a + b uses the registers)

The behavior of JIT-x86 depends on a large number of different conditions. For some reason, the first stopwatch forces JIT-x86 to use the stack, and the second stopwatch allows it to use the registers again.
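
If you want to check which flavor the JIT picked on your machine, one option (following usr's suggestion in the comments above) is to run the Release build without a debugger, break late, and inspect the loop in the Disassembly window. A minimal sketch:

// Sketch: build as x86 Release and start without debugging (Ctrl+F5). Run the loop
// first so it gets JITted with full optimizations, then break so a late-attached
// debugger can show the loop in Debug > Windows > Disassembly.
double a = 1, b = 1;
var sw = Stopwatch.StartNew();
for (int i = 0; i < 100000000; i++)
    a = a + b;
sw.Stop();

Debugger.Break();   // attach here; the loop above is already compiled
Console.WriteLine("{0} {1}", a, sw.ElapsedMilliseconds);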

AndreyAkinshin
  • This doesn't really explain the cause. If you check my tests, it would appear that the test which has an additional `Stopwatch` actually runs *faster*. But if you swap the order in which they are invoked in the `Main` method, then the other method gets optimized. – Groo Aug 24 '15 at 07:34
5

Narrowed it down somewhat (it only seems to affect the 32-bit CLR 4.0 runtime).

Notice that the placement of var f = Stopwatch.Frequency; makes all the difference.

Slow (2700ms):

static void Test1()
{
  Point a = new Point(1, 1), b = new Point(1, 1);
  var f = Stopwatch.Frequency;

  var sw = Stopwatch.StartNew();
  for (int i = 0; i < ITERATIONS; i++)
    a = AddByVal(a, b);
  sw.Stop();

  Console.WriteLine("Test1: x={0} y={1}, Time elapsed: {2} ms",
      a.X, a.Y, sw.ElapsedMilliseconds);
}

Fast (800ms):

static void Test1()
{
  var f = Stopwatch.Frequency;
  Point a = new Point(1, 1), b = new Point(1, 1);

  var sw = Stopwatch.StartNew();
  for (int i = 0; i < ITERATIONS; i++)
    a = AddByVal(a, b);
  sw.Stop();

  Console.WriteLine("Test1: x={0} y={1}, Time elapsed: {2} ms",
      a.X, a.Y, sw.ElapsedMilliseconds);
}
leppie
  • Modifying the code without touching `Stopwatch` also drastically changes the speed. Changing the signature of the method to `Test1(bool warmup)` and adding a conditional in the `Console` output: `if (!warmup) { Console.WriteLine(...); }` also has the same effect (stumbled upon this while building my tests to repro the issue). – InBetween Aug 20 '15 at 10:57
  • @InBetween: I saw, something is fishy. Also only happens on structs. – leppie Aug 20 '15 at 11:01
4

There seems to be some bug in the jitter, because the behavior is even weirder. Consider the following code:

public static void Main()
{
    Test1(true);
    Test1(false);
    Console.ReadLine();
}

public static void Test1(bool warmup)
{
    Point a = new Point(1, 1), b = new Point(1, 1);

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < ITERATIONS; i++)
        a = AddByVal(a, b);
    sw.Stop();

    if (!warmup)
    {
        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }
}

This will run in 900 ms, same as the outer-stopwatch case. However, if we remove the if (!warmup) condition, it will run in 3000 ms. What's even stranger is that the following code will also run in 900 ms:

public static void Test1()
{
    Point a = new Point(1, 1), b = new Point(1, 1);

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < ITERATIONS; i++)
        a = AddByVal(a, b);
    sw.Stop();

    Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
        0, 0, sw.ElapsedMilliseconds);
}

Note that I've removed the a.X and a.Y references from the Console output.

I have no idea what's going on, but this smells pretty buggy to me, and it's not related to having an outer Stopwatch or not; the issue seems a bit more general.

InBetween
  • When you remove calls to `a.X` and `a.Y`, the compiler is probably free to optimize away pretty much everything inside the loop, because the results of the operation are unused. – Groo Aug 20 '15 at 10:30
  • @Groo: yes, that seems reasonable, but not when you take into account the other strange behavior we are seeing. Removing `a.X` and `a.Y` isn't making it go any faster than when you include the `if (!warmup)` condition or the OP's `outerSw`, which implies it's not optimizing anything away; it's just eliminating whatever bug is making the code run at a suboptimal speed (`3000` ms instead of `900` ms). – InBetween Aug 20 '15 at 10:35
  • Oh, ok, I thought the speed improvement happened when `warmup` was true, but in that case the line is not even printed, so the case where it *does* get printed actually references `a`. I nevertheless like to make sure I am always referencing calculation results somewhere near the end of the method, whenever I am benchmarking stuff. – Groo Aug 20 '15 at 10:46