
While I was working on my fast ADD loop (Speed up x64 assembler ADD loop), I was testing memory access with SSE and AVX instructions. To add, I have to read two inputs and produce one output. So I wrote a dummy routine that reads two x64 values into registers and writes one back to memory without doing any operation. This is of course useless; I only did it for benchmarking.

I use an unrolled loop that handles 64 bytes per loop iteration. It consists of 8 blocks like this:

mov rax, QWORD PTR [rdx+r11*8-64]
mov r10, QWORD PTR [r8+r11*8-64]
mov QWORD PTR [rcx+r11*8-64], rax
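
The question shows only one of the eight blocks. For reference, here is a minimal sketch of what the complete loop might look like; the routine name, the loop control, the choice of r9 as the qword count, and the register assignments (rcx = destination, rdx/r8 = sources, following the Windows x64 calling convention) are my assumptions, not code from the question:

copy_x64 PROC
    mov     r11, 8                          ; qword index; first pass covers qwords 0..7
scalar_loop:
    mov     rax, QWORD PTR [rdx+r11*8-64]   ; block 1: read 8 bytes from each source,
    mov     r10, QWORD PTR [r8+r11*8-64]    ;          write 8 bytes to the destination
    mov     QWORD PTR [rcx+r11*8-64], rax
    mov     rax, QWORD PTR [rdx+r11*8-56]   ; block 2
    mov     r10, QWORD PTR [r8+r11*8-56]
    mov     QWORD PTR [rcx+r11*8-56], rax
    mov     rax, QWORD PTR [rdx+r11*8-48]   ; block 3
    mov     r10, QWORD PTR [r8+r11*8-48]
    mov     QWORD PTR [rcx+r11*8-48], rax
    mov     rax, QWORD PTR [rdx+r11*8-40]   ; block 4
    mov     r10, QWORD PTR [r8+r11*8-40]
    mov     QWORD PTR [rcx+r11*8-40], rax
    mov     rax, QWORD PTR [rdx+r11*8-32]   ; block 5
    mov     r10, QWORD PTR [r8+r11*8-32]
    mov     QWORD PTR [rcx+r11*8-32], rax
    mov     rax, QWORD PTR [rdx+r11*8-24]   ; block 6
    mov     r10, QWORD PTR [r8+r11*8-24]
    mov     QWORD PTR [rcx+r11*8-24], rax
    mov     rax, QWORD PTR [rdx+r11*8-16]   ; block 7
    mov     r10, QWORD PTR [r8+r11*8-16]
    mov     QWORD PTR [rcx+r11*8-16], rax
    mov     rax, QWORD PTR [rdx+r11*8-8]    ; block 8
    mov     r10, QWORD PTR [r8+r11*8-8]
    mov     QWORD PTR [rcx+r11*8-8], rax
    add     r11, 8                          ; advance 8 qwords = 64 bytes per source
    cmp     r11, r9                         ; r9 = total qword count (assumed)
    jbe     scalar_loop
    ret
copy_x64 ENDP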

Then I upgraded it to SSE2. Now I use 4 blocks like this:

movdqa xmm0, XMMWORD PTR [rdx+r11*8-64]
movdqa xmm1, XMMWORD PTR [r8+r11*8-64]
movdqa XMMWORD PTR [rcx+r11*8-64], xmm0
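
Again, only one of the four blocks is shown above. One full 64-byte SSE2 iteration body could look like the sketch below, assuming the displacements step by 16 while r11 still advances by 8 qwords per pass, with the same loop control as in the scalar sketch above:

    movdqa  xmm0, XMMWORD PTR [rdx+r11*8-64]   ; block 1: bytes 0..15 of the 64-byte chunk
    movdqa  xmm1, XMMWORD PTR [r8+r11*8-64]
    movdqa  XMMWORD PTR [rcx+r11*8-64], xmm0
    movdqa  xmm0, XMMWORD PTR [rdx+r11*8-48]   ; block 2: bytes 16..31
    movdqa  xmm1, XMMWORD PTR [r8+r11*8-48]
    movdqa  XMMWORD PTR [rcx+r11*8-48], xmm0
    movdqa  xmm0, XMMWORD PTR [rdx+r11*8-32]   ; block 3: bytes 32..47
    movdqa  xmm1, XMMWORD PTR [r8+r11*8-32]
    movdqa  XMMWORD PTR [rcx+r11*8-32], xmm0
    movdqa  xmm0, XMMWORD PTR [rdx+r11*8-16]   ; block 4: bytes 48..63
    movdqa  xmm1, XMMWORD PTR [r8+r11*8-16]
    movdqa  XMMWORD PTR [rcx+r11*8-16], xmm0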

And later on I used AVX (256 bits per register). Here I have 2 blocks like this:

vmovdqa ymm0, YMMWORD PTR [rdx+r11*8-64]
vmovdqa ymm1, YMMWORD PTR [r8+r11*8-64]
vmovdqa YMMWORD PTR [rcx+r11*8-64], ymm0
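
The corresponding AVX routine, again as a hedged sketch rather than the exact benchmark code: the second block's -32 displacement, the loop control, and the trailing vzeroupper (often recommended before returning to code that may run legacy SSE instructions; see the transition-penalty link in the comments below) are my additions.

copy_avx PROC
    mov     r11, 8                           ; qword index, as in the scalar sketch
avx_loop:
    vmovdqa ymm0, YMMWORD PTR [rdx+r11*8-64] ; block 1: bytes 0..31 of the 64-byte chunk
    vmovdqa ymm1, YMMWORD PTR [r8+r11*8-64]
    vmovdqa YMMWORD PTR [rcx+r11*8-64], ymm0
    vmovdqa ymm0, YMMWORD PTR [rdx+r11*8-32] ; block 2: bytes 32..63
    vmovdqa ymm1, YMMWORD PTR [r8+r11*8-32]
    vmovdqa YMMWORD PTR [rcx+r11*8-32], ymm0
    add     r11, 8                           ; 8 qwords = 64 bytes per iteration
    cmp     r11, r9                          ; r9 = total qword count (assumed)
    jbe     avx_loop
    vzeroupper                               ; avoid AVX/SSE transition penalties afterwards
    ret
copy_avx ENDP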

So far, so not-so-extremely-spectacular. What is interesting is the benchmarking result: when I run the three different approaches on 1k + 1k = 1k 64-bit words (i.e. two times 8 KB of input and one time 8 KB of output), I get strange results. Each of the following timings is for processing two times 64 bytes of input into 64 bytes of output.

  • The x64 register method runs at about 15 cycles/64 bytes
  • The SSE2 method runs at about 8.5 cycles/64 bytes
  • The AVX method runs at about 9 cycles/64 bytes

My question is: how come the AVX method is slower (though not by a lot) than the SSE2 method? I expected it to be at least on par. Does using the YMM registers cost so much extra time? The memory was aligned (you get GPFs otherwise).

Does anyone have an explanation for this?

cxxl
  • I seem to recall that on current architectures, AVX memory accesses are chopped up into 2 separate 128-bit accesses in some circumstances. Perhaps that's what you're running into here. The real benefits of AVX come when you start doing actual computations, as you can obviously do twice as many in parallel as with SSE. – Jason R Dec 20 '12 at 15:51
  • Ah, interesting. Do you have any pointer on that? A quick search found this, but they claim the memory path to be full 256-bit: – cxxl Dec 20 '12 at 16:03
  • Also be careful about mixing legacy (non VEX) SSE instructions with AVX instructions - without seeing the rest of the benchmarking code it's not clear whether this is relevant, but you probably should be aware of it anyway: http://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties – Paul R Dec 20 '12 at 16:15
  • I don't much like your 8 multipliers in the SSE and AVX versions, you'd need to use 16 and 32 (but you can't do that in the addressing) and run the loop less times of course. Did your benchmark code correctly account for this? – Jester Dec 20 '12 at 16:51
  • @Paul R: Thanks for the comment, but the penalties are paid when switching from one instruction set to the other. But my whole AVX loop only uses AVX and regular instructions, no SSE/FP. – cxxl Dec 20 '12 at 21:29
  • No problem - I just wanted to make sure that you didn't have any legacy SSE instructions in your AVX benchmarking code. – Paul R Dec 20 '12 at 21:35

1 Answer


On Sandybridge/Ivybridge, 256b AVX loads and stores are cracked into two 128b operations [as Peter Cordes notes in the comments, these aren't quite two µops, but the operation takes two cycles to clear the port] in the load/store execution units, so there's no reason to expect the version using those instructions to be much faster.

Why is it slower? Two possibilities come to mind:

  • for base + index + offset addressing, the latency of a 128b load is 6 cycles, whereas the latency of a 256b load is 7 cycles (Table 2-8 in the Intel Optimization Manual). Although your benchmark should be bound by throughput and not latency, the longer latency means that the processor takes longer to recover from any hiccups (pipeline bubbles, branch mispredictions, interrupt servicing, ...), which does have some impact.

  • in 11.6.2 of the same document, Intel suggests that the penalty for cache-line and page crossing may be larger for 256b loads than it is for 128b loads. If your loads are not all 32-byte aligned, this may also explain the slowdown you are seeing when using the 256b load/store operations; a sketch of the two-load alternative follows the quote below:

Example 11-12 shows two implementations for SAXPY with unaligned addresses. Alternative 1 uses 32 byte loads and alternative 2 uses 16 byte loads. These code samples are executed with two source buffers, src1, src2, at 4 byte offset from 32-Byte alignment, and a destination buffer, DST, that is 32-Byte aligned. Using two 16-byte memory operations in lieu of 32-byte memory access performs faster.
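
For illustration, here is a sketch (not the manual's actual Example 11-12) of what "two 16-byte memory operations in lieu of a 32-byte memory access" can look like for one unaligned 32-byte source read; the addressing mirrors the question's loop:

    ; one potentially cache-line/page-splitting 32-byte load ...
    vmovups     ymm0, YMMWORD PTR [rdx+r11*8-64]

    ; ... versus the same 32 bytes fetched as two 16-byte loads:
    vmovups     xmm0, XMMWORD PTR [rdx+r11*8-64]             ; low 16 bytes
    vinsertf128 ymm0, ymm0, XMMWORD PTR [rdx+r11*8-48], 1    ; high 16 bytes into the upper lane

Note that this only matters for unaligned accesses; as pointed out in the comments below, the question's vmovdqa would fault rather than run slowly on unaligned addresses.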

Stephen Canon
  • Note that this is not the case for Haswell, which has been released since I originally wrote this answer. – Stephen Canon Sep 16 '13 at 14:11
  • It's not 2 uops, but it takes 2 cycles in the execution unit to do both halves. The AGU is only needed on the first cycle, and is free (to e.g. compute a store address) on the 2nd cycle, which is why SnB/IvB designers didn't feel the need to include a separate store-address port. Haswell has one, because it can do 256b transfers in a single cycle. Anyway, the difference between 1 uop or not is in the 4 uops / cycle throughput of the pipeline. – Peter Cordes Jul 03 '15 at 03:58
  • Unaligned loads / stores can't be the problem, because the OP used `vmovdqa`, which faults on unaligned. Including that paragraph still makes the answer better, though. – Peter Cordes Feb 06 '16 at 16:54