
When running a sum loop over an array in Rust, I noticed a huge performance drop when CAPACITY >= 240. CAPACITY = 239 is about 80 times faster.

Is there a special compiler optimization Rust is doing for "short" arrays?

Compiled with rustc -C opt-level=3.

use std::time::Instant;

const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

fn main() {
    let mut arr = [0; CAPACITY];
    for i in 0..CAPACITY {
        arr[i] = i;
    }
    let mut sum = 0;
    let now = Instant::now();
    for _ in 0..IN_LOOPS {
        let mut s = 0;
        for i in 0..arr.len() {
            s += arr[i];
        }
        sum += s;
    }
    println!("sum:{} time:{:?}", sum, now.elapsed());
}
RonJohn
Guy Korland
  • https://github.com/gkorland/benchmark-rust – Guy Korland Aug 12 '19 at 10:54
  • Maybe with 240 you are overflowing a CPU cache line? If that is the case, your results would be very CPU specific. – rodrigo Aug 12 '19 at 11:30
  • Reproduced [here](https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=257a3b6f7bdf66aaa4ce3d81855b6ed5). Now I'm guessing that it has something to do with loop unrolling. – rodrigo Aug 12 '19 at 11:51

2 Answers


Summary: below 240, LLVM fully unrolls the inner loop and that lets it notice it can optimize away the repeat loop, breaking your benchmark.



You found a magic threshold above which LLVM stops performing certain optimizations. The threshold is 8 bytes * 240 = 1920 bytes (your array is an array of usizes, so its length is multiplied by 8 bytes, assuming an x86-64 CPU). In this benchmark, one specific optimization – only performed for length 239 – is responsible for the huge speed difference. But let's start slowly:

(All code in this answer is compiled with -C opt-level=3)

pub fn foo() -> usize {
    let arr = [0; 240];
    let mut s = 0;
    for i in 0..arr.len() {
        s += arr[i];
    }
    s
}

This simple code will produce roughly the assembly one would expect: a loop adding up elements. However, if you change 240 to 239, the emitted assembly differs quite a lot. See it on Godbolt Compiler Explorer. Here is a small part of the assembly:

movdqa  xmm1, xmmword ptr [rsp + 32]
movdqa  xmm0, xmmword ptr [rsp + 48]
paddq   xmm1, xmmword ptr [rsp]
paddq   xmm0, xmmword ptr [rsp + 16]
paddq   xmm1, xmmword ptr [rsp + 64]
; more stuff omitted here ...
paddq   xmm0, xmmword ptr [rsp + 1840]
paddq   xmm1, xmmword ptr [rsp + 1856]
paddq   xmm0, xmmword ptr [rsp + 1872]
paddq   xmm0, xmm1
pshufd  xmm1, xmm0, 78
paddq   xmm1, xmm0

This is what's called loop unrolling: LLVM pastes the loop body a bunch of times to avoid having to execute all those "loop management" instructions, i.e. incrementing the loop variable, checking whether the loop has ended, and jumping back to the start of the loop.

In case you're wondering: paddq and similar instructions are SIMD instructions, which allow summing up multiple values in parallel. Moreover, two 16-byte SIMD registers (xmm0 and xmm1) are used in parallel so that the CPU's instruction-level parallelism can basically execute two of these instructions at the same time; after all, they are independent of one another. In the end, both registers are added together and then horizontally summed down to the scalar result.
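The same pattern can be sketched in scalar Rust (an illustrative sketch, not LLVM's actual output; the function name and the chunking are made up for clarity): two independent accumulators let the CPU overlap additions, just like xmm0 and xmm1 above.

```rust
/// Sums a slice with two independent running sums, mirroring the
/// two-accumulator structure of the unrolled assembly above.
pub fn sum_two_accumulators(arr: &[usize]) -> usize {
    let mut s0 = 0usize; // plays the role of xmm0
    let mut s1 = 0usize; // plays the role of xmm1
    let mut chunks = arr.chunks_exact(2);
    for pair in &mut chunks {
        // the two additions have no data dependency on each other,
        // so the CPU can execute them in the same cycle
        s0 += pair[0];
        s1 += pair[1];
    }
    // handle a possible odd trailing element
    let tail: usize = chunks.remainder().iter().sum();
    s0 + s1 + tail
}

fn main() {
    let arr: Vec<usize> = (0..239).collect();
    println!("{}", sum_two_accumulators(&arr)); // 28441 = 238 * 239 / 2
}
```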

Modern mainstream x86 CPUs (not low-power Atom) really can do 2 vector loads per clock when they hit in L1d cache, and paddq throughput is also at least 2 per clock, with 1 cycle latency on most CPUs. See https://agner.org/optimize/ and also this Q&A about multiple accumulators to hide latency (of FP FMA for a dot product) and bottleneck on throughput instead.

LLVM does partially unroll small loops even when it's not fully unrolling them, and it still uses multiple accumulators. So usually, front-end bandwidth and back-end latency bottlenecks aren't a huge problem for LLVM-generated loops even without full unrolling.


But loop unrolling alone is not responsible for a performance difference of factor 80! Let's take a look at the actual benchmarking code, which nests one loop inside another:

const CAPACITY: usize = 239;
const IN_LOOPS: usize = 500000;

pub fn foo() -> usize {
    let mut arr = [0; CAPACITY];
    for i in 0..CAPACITY {
        arr[i] = i;
    }

    let mut sum = 0;
    for _ in 0..IN_LOOPS {
        let mut s = 0;
        for i in 0..arr.len() {
            s += arr[i];
        }
        sum += s;
    }

    sum
}

(On Godbolt Compiler Explorer)

The assembly for CAPACITY = 240 looks normal: two nested loops. (At the start of the function there is quite a bit of code just for initialization, which we will ignore.) For 239, however, it looks very different! We see that the initialization loop and the inner loop got unrolled: so far so expected.

The important difference is that for 239, LLVM was able to figure out that the result of the inner loop does not depend on the outer loop! As a consequence, LLVM emits code that basically first executes only the inner loop (calculating the sum) and then simulates the outer loop by adding up sum a bunch of times!

First we see almost the same assembly as above (the assembly representing the inner loop). Afterwards we see this (I commented to explain the assembly; the comments with * are especially important):

        ; at the start of the function, `rbx` was set to 0

        movq    rax, xmm1     ; result of SIMD summing up stored in `rax`
        add     rax, 711      ; add up missing terms from loop unrolling
        mov     ecx, 500000   ; * init loop variable outer loop
.LBB0_1:
        add     rbx, rax      ; * rbx += rax
        add     rcx, -1       ; * decrement loop variable
        jne     .LBB0_1       ; * if loop variable != 0 jump to LBB0_1
        mov     rax, rbx      ; move rbx (the sum) back to rax
        ; two unimportant instructions omitted
        ret                   ; the return value is stored in `rax`

As you can see here, the result of the inner loop is taken, added up as often as the outer loop would have run, and then returned. LLVM can only perform this optimization because it understood that the inner loop is independent of the outer one.

This means the runtime changes from CAPACITY * IN_LOOPS to CAPACITY + IN_LOOPS. And this is responsible for the huge performance difference.
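In Rust terms, LLVM has effectively rewritten the function into something like the following (a hand-written sketch of the transformation, not compiler output; note that, as in the assembly above, the multiplication is still expressed as a small add loop):

```rust
pub fn foo_as_llvm_sees_it() -> usize {
    const CAPACITY: usize = 239;
    const IN_LOOPS: usize = 500000;

    // the inner loop is computed exactly once...
    let inner: usize = (0..CAPACITY).sum();

    // ...and the outer loop merely adds that result up IN_LOOPS times,
    // corresponding to the tiny .LBB0_1 loop in the assembly
    let mut sum = 0;
    for _ in 0..IN_LOOPS {
        sum += inner;
    }
    sum
}

fn main() {
    println!("{}", foo_as_llvm_sees_it());
}
```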


An additional note: can you do anything about this? Not really. LLVM has to have such magic thresholds; without them, LLVM optimizations could take forever to complete on certain code. But we can also agree that this code was highly artificial. In practice, I doubt that such a huge difference would occur. The difference due to full loop unrolling alone is usually not even a factor of 2 in these cases. So no need to worry about real use cases.

As a last note about idiomatic Rust code: arr.iter().sum() is a better way to sum up all elements of an array. Changing this in the second example does not lead to any notable differences in the emitted assembly. You should use the short and idiomatic version unless you have measured that it hurts performance.
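For reference, here is a sketch of the idiomatic version of the summing function (the function name is mine):

```rust
const CAPACITY: usize = 239;

pub fn sum_idiomatic() -> usize {
    let mut arr = [0usize; CAPACITY];
    // idiomatic initialization without manual indexing
    for (i, slot) in arr.iter_mut().enumerate() {
        *slot = i;
    }
    // replaces the manual `for i in 0..arr.len() { s += arr[i]; }` loop
    arr.iter().sum()
}

fn main() {
    println!("{}", sum_idiomatic()); // 28441 = 238 * 239 / 2
}
```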

Lukas Kalbertodt
  • @lukas-kalbertodt thanks for the great answer! Now I also understand why the original code that updated `sum` directly instead of a local `s` was running much slower. ```for i in 0..arr.len() { sum += arr[i]; }``` – Guy Korland Aug 12 '19 at 14:00
  • @LukasKalbertodt [Something else is going on in LLVM](https://godbolt.org/z/fIbhYX) turning on AVX2 shouldn't make that big of a difference. [Repro'd in rust too](https://rust.godbolt.org/z/QwZnUQ) – Mgetz Aug 12 '19 at 14:22
  • @Mgetz Interesting! But it doesn't sound too crazy to me to make that threshold dependent on the available SIMD instructions, as this ultimately determines the number of instructions in a completely unrolled loop. But unfortunately, I cannot say for sure. Would be sweet to have an LLVM dev answering this. – Lukas Kalbertodt Aug 12 '19 at 14:30
  • @LukasKalbertodt I'm curious if this whole thing related to some issues the LLVM team have discussed in regards to unsigned iterators and UB cases in C/C++ that they've done talks on. But that's purely speculation. – Mgetz Aug 12 '19 at 14:34
  • Great answer! I would say there are things you can "do about this" though. Which is to write the more optimized version in the first place, so you are less likely to be subject to the whims of varying compilers and settings. – jackmott Aug 13 '19 at 12:56
  • Why doesn't the compiler or LLVM realize that the entire calculation can be made at compile time? I would have expected to have the loop result hardcoded. Or is the use of `Instant` preventing that? – Uncreative Name Aug 13 '19 at 13:34
  • Why should the analysis to determine a calculation in an inner loop can be moved outside it depend on any number of bytes? That should be a property of the loop regardless of number of iterations. – Joseph Garvin Aug 13 '19 at 16:12
  • You could also add that the rust book talks about the code unrolling and about the idiomatic recommendations: https://doc.rust-lang.org/book/ch13-04-performance.html – Muqito Aug 13 '19 at 17:17
  • @JosephGarvin: I assume it's because fully unrolling happens to allow the later optimization pass to see that. Remember that optimizing compilers still care about compiling quickly, as well as making efficient asm, so they have to limit the worst-case complexity of any analysis they do so it doesn't take hours / days to compile some nasty source code with complicated loops. But yes, this is obviously a missed optimization for size >= 240. I wonder if not optimizing away loops inside of loops is intentional to avoid breaking simple benchmarks? Probably not, but maybe. – Peter Cordes Aug 14 '19 at 02:58

In addition to Lukas' answer, if you want to use an iterator, try this:

const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

pub fn bar() -> usize {
    (0..CAPACITY).sum::<usize>() * IN_LOOPS
}

Thanks to @Chris Morgan for suggesting the range-based version.

The optimized assembly is quite good:

example::bar:
        movabs  rax, 14340000000
        ret
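The constant checks out: LLVM has folded the whole computation down to Gauss's formula, which a few lines of Rust can verify:

```rust
const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

fn main() {
    // Gauss: 0 + 1 + ... + (CAPACITY - 1) = CAPACITY * (CAPACITY - 1) / 2
    let per_pass = CAPACITY * (CAPACITY - 1) / 2; // 28680
    println!("{}", per_pass * IN_LOOPS); // 14340000000, matching the movabs
}
```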
mja
  • Or better still, `(0..CAPACITY).sum::<usize>() * IN_LOOPS`, which yields the same result. – Chris Morgan Aug 13 '19 at 14:58
  • I would actually explain that the assembly is not actually doing the calculation, but LLVM has precomputed the answer in this case. – Josep Aug 13 '19 at 18:09
  • I’m kind of surprised that `rustc` is missing the opportunity to do this strength-reduction. In this specific context, though, this appears to be a timing loop, and you deliberately want it not to be optimized out. The whole point is to repeat the computation that number of times from scratch and divide by the number of repetitions. In C, the (unofficial) idiom for that is to declare the loop counter as `volatile`, e.g. the BogoMIPS counter in the Linux kernel. Is there a way to achieve this in Rust? There might be, but I don’t know it. Calling an external `fn` might help. – Davislor Aug 14 '19 at 16:08
  • @Davislor: A `volatile` loop counter also forces it to be stored in memory, not a register. Not a big deal for an outer loop, but for an inner loop it introduces store-forwarding latency. e.g. on modern x86 it makes a tight loop up to 5x slower, from a 1-cycle loop-carried dep chain to 5 or 6 cycles. If you *just* want a delay loop that's fine, though, and acceptable in a microbenchmark as long as the loop body is big enough, and it doesn't CSE across iterations. (i.e. hoist the real work out and turn it into 2 sequent loops instead of nested loops. `volatile` doesn't stop that at all.) – Peter Cordes Aug 15 '19 at 11:33
  • TL:DR: a `volatile` loop counter isn't a specially-recognized idiom, it's just a consequence of what C `volatile` means on a normal CPU with registers and memory. – Peter Cordes Aug 15 '19 at 11:35
  • @PeterCordes How to do it in C is getting a little off-topic, but in general a timing loop will be an outer loop and a delay loop will always be innermost. The reason you’d use `volatile` here is specifically to force the computation to actually be performed, not optimized away. (If you wanted to do synchronization in modern C, you would use atomics, not `volatile`. Low-level code might also use `volatile` for memory-mapped hardware.) – Davislor Aug 15 '19 at 16:06
  • @Davislor: `volatile` forces that memory to be in sync. Applying it to the loop counter only forces actual reload/store of the loop counter value. It doesn't directly affect the loop body. That's why a better way to use it is normally to assign the actual important result to `volatile int sink` or something either after the loop (if there's a loop-carried dependency) or every iteration, to let the compiler optimize the loop counter however it wants but force it to materialize *the result you want* in a register so it can store it. – Peter Cordes Aug 15 '19 at 20:52
  • @Davislor: I think Rust has inline asm syntax something like GNU C. You can use inline asm to force the compiler to materialize a value *in a register* without forcing it to store it. Using that on the result of each loop iteration can stop it from optimizing away. (But also from auto-vectorizing if you aren't careful). e.g. ["Escape" and "Clobber" equivalent in MSVC](//stackoverflow.com/q/33975479) explains 2 macros (while asking how to port them to MSVC which isn't really possible) and links to Chandler Carruth's talk where he shows their use. – Peter Cordes Aug 15 '19 at 21:00
  • @PeterCordes Rust also lets you call a function written in C. Is there a way to guarantee that a timing loop works, natively? Perhaps updating an atomic counter with relaxed memory ordering? – Davislor Aug 15 '19 at 21:15
  • @Davislor: The inline asm statement I'm talking about has an empty template so it's not ISA specific, assuming Rust supports it. Delay-loop timing is never portable, though; different platforms have different performance, and frequency-scaling... A relaxed-atomic increment will be significantly slower than `volatile` (separate load and store), but yeah that could work too with fewer iterations. Modern uarches only need a cache-lock in the core doing the increment so it shouldn't interfere with DMA or other cores. Spinning on a gettimeofday or equivalent can also work. – Peter Cordes Aug 15 '19 at 23:12
  • @PeterCordes To clarify, by “timing loop,” I was thinking of a loop like the one in this example: run an operation N times to get a more accurate measurement of the average time. A delay loop, I suppose, might better yield the CPU. – Davislor Aug 15 '19 at 23:56
  • @Davislor: oh lol, no you wouldn't use an atomic increment in a benchmark repeat loop!! Except maybe on a non-x86 system where relaxed is actually possible in asm. On x86 `lock add` is a full memory barrier, effectively strengthening relaxed to seq_cst. Like I said, use inline asm to force the compiler to materialize some value from the inner loop in a register, or assign it to a volatile object, every iteration. Or if you can get away with it, sum it in the loop and use the result. https://doc.rust-lang.org/1.8.0/book/inline-assembly.html Rust inline asm does roughly match GNU C. – Peter Cordes Aug 16 '19 at 00:01
  • @PeterCordes Is that really the best way to do it in Rust? I’m surprised. – Davislor Aug 16 '19 at 00:11
  • @Davislor: It's the best way in C, if you want to force the compiler to actually compute the value of each loop body. An empty inline asm statement that makes the compiler forget what it knows about a variable's value is exactly what you need to block optimizations without actually making the asm do unnecessary work that could be a bottleneck itself. And forcing it to actually count a loop counter a certain way is not useful; what you care about is making it do the work in the loop body. I don't know Rust very well, but it seems pretty obvious that's the best way to control the optimizer. – Peter Cordes Aug 16 '19 at 00:15
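As a follow-up to the discussion above: on current Rust, `std::hint::black_box` (stable since 1.66; in 2019 this required nightly's `test::black_box` or an empty inline-asm block) is the standard way to get the effect described. It makes a value opaque to the optimizer without emitting extra work, so LLVM can no longer prove the inner sum is loop-invariant. A sketch of the original benchmark with the hoisting blocked:

```rust
use std::hint::black_box;
use std::time::Instant;

const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

fn main() {
    let mut arr = [0usize; CAPACITY];
    for (i, slot) in arr.iter_mut().enumerate() {
        *slot = i;
    }

    let mut sum = 0usize;
    let now = Instant::now();
    for _ in 0..IN_LOOPS {
        // black_box makes `arr` opaque on every iteration, so LLVM
        // cannot hoist the inner sum out of the outer loop
        let s: usize = black_box(&arr).iter().sum();
        sum += s;
    }
    println!("sum:{} time:{:?}", sum, now.elapsed());
}
```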