What type of addresses can the port 7 store AGU handle on recent Intel x86?

Question

Starting with Haswell, Intel CPU micro-architectures have had a dedicated store-address unit on port 7, which can handle the address-generation uop for some store operations (the other uop, store-data, always goes to port 4).

Originally it was believed that this unit could handle any type of address, but this seems not to be the case. What types of addresses can this port handle?

AFAIK, it can handle any non-indexed addressing mode. Agner Fog's tables claim that some instructions like `pextrd [mem], xmm0, 2` can't use p7, but according to my testing on Haswell/Skylake the store-address uop can use p2/p3/p7 just like a regular store. But `vextracti128 [rdi], ymm0, 0 or 1` can't use p7 even with a simple addressing mode. (Re-confirmed on SKL just now) — Peter Cordes, May 28 '18 at 02:41
I thought there was a limitation on the size of the `offset` to 2048, or is that only related to the 4 -> 5 bump for L1 latency? Oddly although I feel like I have seen this discussed many times, including discussing it myself, I don't find a reference (not in Agner's guides anyways). @PeterCordes — BeeOnRope, May 28 '18 at 02:51
I wondered the same thing, so I tested with `[rsi - 4000]`, which is not in the `[+0, +2047]` range for 4c latency, and still saw counts distributed across all three AGU ports. RIP-relative was fine, too. So was a `[disp32]` absolute address `[abs buf]` (in 64-bit mode). — Peter Cordes, May 28 '18 at 02:54
Thanks @PeterCordes. I added it to a list of [x86 performance limiters](https://github.com/travisdowns/uarch-bench/wiki/Performance-limiters) I'm compiling. — BeeOnRope, May 28 '18 at 03:04
[Related Q&A](https://stackoverflow.com/questions/25899395/obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62/25966091#25966091). — Iwillnotexist Idonotexist, May 29 '18 at 02:47

score 4 · Accepted Answer · edited Jun 20 '20 at 09:12

This answer applies to Haswell and Skylake (/ Kaby Lake / Coffee Lake). Future microarchitectures (Cannon Lake / Ice Lake) will have to be checked when they're available. The port 7 AGU was new in Haswell.

For instructions that can use port 7 at all (e.g. not `vextracti128`), any non-indexed addressing mode can use port 7.

This includes RIP-relative, and 64-bit absolute (`mov [qword abs buf], eax`, even in a PIE executable loaded above 2^32, so the address really doesn't fit in 32 bits), as well as normal `[reg + disp0/8/32]` or absolute `[disp32]`.

An index register always prevents use of port 7, e.g. `[rdi + rax]` or `[disp32 + rax*2]`. Even `[NOSPLIT disp32 + rax*1]` can't use port 7 (so HSW/SKL doesn't internally convert an indexed addressing mode with scale=1 and no base register into a base+disp32 addressing mode).
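The rule above can be written down as a tiny lookup. This is only a sketch of the findings reported in this answer for Haswell/Skylake, not Intel documentation; `AddrMode`, `store_agu_ports`, and `NO_P7_MNEMONICS` are hypothetical names, and the exclusion set is deliberately not exhaustive.

```python
from dataclasses import dataclass
from typing import Optional

# Instructions this answer reports as never using port 7 (not exhaustive).
NO_P7_MNEMONICS = {"vextracti128"}


@dataclass(frozen=True)
class AddrMode:
    """Hypothetical description of an x86 memory operand."""
    base: Optional[str] = None   # e.g. "rdi" or "rip"; None for absolute
    index: Optional[str] = None  # e.g. "rax"; None for a non-indexed mode
    scale: int = 1
    disp: int = 0


def store_agu_ports(mnemonic: str, mode: AddrMode) -> str:
    """Ports the store-address uop may use, per the rule described above."""
    if mnemonic.lower() in NO_P7_MNEMONICS:
        return "p23"    # excluded instruction, even with a simple address
    if mode.index is not None:
        return "p23"    # any index register, even scale=1 with no base
    return "p237"       # non-indexed: the displacement size doesn't matter


print(store_agu_ports("mov", AddrMode(base="rsi", disp=-4000)))   # p237
print(store_agu_ports("mov", AddrMode(base="rdi", index="rax")))  # p23
```

Note that the displacement never enters the decision, which matches the observation that the `[+0, +2047]` range is irrelevant for stores.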

I tested this myself with `ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_dispatched_port.port_2,uops_dispatched_port.port_3,uops_dispatched_port.port_7 ./testloop` on a Skylake i7-6700k.
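One way to read the three port counters from such a run is to look at what share of the AGU uops landed on port 7. The helper below is a minimal sketch; the function names, the 5% noise threshold, and the example counts are all assumptions for illustration, not measured data.

```python
def agu_port_shares(port_2: int, port_3: int, port_7: int) -> dict:
    """Fraction of the counted uops that dispatched to each AGU port."""
    total = port_2 + port_3 + port_7
    if total == 0:
        raise ValueError("no uops counted on ports 2, 3, or 7")
    return {"p2": port_2 / total, "p3": port_3 / total, "p7": port_7 / total}


def store_uses_port7(port_2: int, port_3: int, port_7: int,
                     threshold: float = 0.05) -> bool:
    """True if port 7 took more than `threshold` of the AGU uops.

    The 5% default is an arbitrary assumption to ignore background noise.
    """
    return agu_port_shares(port_2, port_3, port_7)["p7"] > threshold


# Hypothetical counter values, for illustration only (not measured data).
print(store_uses_port7(port_2=330, port_3=340, port_7=330))  # True
print(store_uses_port7(port_2=500, port_3=499, port_7=1))    # False
```

This reading only makes sense when the loop body contains no loads, since load uops also dispatch to ports 2 and 3 and would dilute the port 7 share.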

The `[+0, +2047]` range of displacements makes no difference for stores: `mov [rsi - 4000], rax` can use port 7.

Non-indexed loads with small positive displacements have 1c lower latency. No special case for stores is mentioned in Intel's optimization manual. Skylake's variable-latency store-forwarding (with worse latency when the load tries to execute right away after the store) makes it hard to construct a microbenchmark that includes store latency but isn't affected by having store-address uops compete with loads for fewer ports. I haven't come up with a microbenchmark with a loop-carried dependency chain through a store-address uop but not through the store-data uop. Presumably it's possible, but maybe needs an array instead of a single location.

Some instructions can't use port7 at all:

`vextracti128 [rdi], ymm0, 0` includes a store-address uop (of course), but it can only run on port 2 or port 3.

Agner Fog's instruction tables have at least one error here, though: he lists `pextrb/w/d/q` as only running the store-address uop on p23, but in fact it can use any of p237 on HSW/SKL.

I haven't tested this exhaustively, but one difference between HSW and SKL I found¹ was VCVTPS2PH [mem], xmm/ymm, imm8. (The instruction changed to use fewer ALU uops, so that doesn't indicate a change in p7 between HSW and SKL).

On Haswell: VCVTPS2PH is 4 uops (fused and unfused domain): p1 p4 p5 p23 (Agner Fog is right).
On Skylake: VCVTPS2PH xmm is 2 fused / 3 unfused uops: p01 p4 p237
On Skylake: VCVTPS2PH ymm is 3 fused / 3 unfused uops: p01 p4 p237

(Agner Fog lists VCVTPS2PH v as 3F/3U (one entry for both vector widths), missing the micro-fusion with the xmm version, and incorrectly lists the port breakdown as p01 p4 p23).

In general, beware that Agner's recent updates seem a little sloppy, like copy/paste or typo errors (e.g. 5 instead of 0.5 for Ryzen vbroadcastf128 y,m128 throughput).

1: HSW testing was on an old laptop that's no longer usable (I used its RAM to upgrade another machine that still gets regular use). I don't have a Broadwell to test on. Everything in this answer is definitely true on Skylake: I double checked it just now. I tested some of this a while ago on Haswell, and still had my notes from that.

answered May 28 '18 at 03:47 by Peter Cordes · edited Jun 20 '20 at 09:12 by Community

Perhaps the fast LEA unit(s) can only handle part of the computation of base+index+offset and then make use of one of the AGU units including p7 to compute the final result. – Hadi Brais May 28 '18 at 04:15
@HadiBrais - are you talking about the actual `lea` instruction? AFAIK there is no evidence this runs on any of the AGU ports on recent Intel, and such evidence should be easy to find since it would compete with agens needed by loads/stores but that hasn't shown up. – BeeOnRope May 28 '18 at 04:27
@HadiBrais: On SnB-family, an LEA with all 3 components (base + index + displacement) is a slow LEA (3c latency), and can only run on port 1. An LEA with up to 2 components (even if that includes a scaled index) is "fast" and can run on p15. There's no indication that ALU uops ever block store-address uops from dispatching to port 7. Intel handled the case of AVX512 multiply and FP SIMD instructions borrowing resources from port 1 by shutting down the entire vector ALU (even for booleans) on port1 when any 512-bit uops are in the scheduler. – Peter Cordes May 28 '18 at 04:28
Great answer, I wasn't aware of the `vextracti128` difference. Perhaps those instructions work by simply transforming it into a store with an adjusted offset depending on the immediate value, and only p23 are capable enough to do that. I don't know about `VCVTPS2PH` though, that guy is just weird. – BeeOnRope May 28 '18 at 04:30
BTW do you have any reference for the fact that store-forwarding slows down when you have more stores in flight? We've discussed it before but wondering if there is something new. My finding was that it just depends on when the load is executed relative to the corresponding store (the store it hits). If the load executes 3 cycles later, the latency is 3 cycles (4 cycles, 4 cycles, etc). If it comes _earlier_ than that though, the latency is worse, about 4.5 cycles and not stable. – BeeOnRope May 28 '18 at 04:32
This could be explained by different mechanisms when waking a "sleeping" load that tried to execute before the store was ready, versus forwarding an already ready store to a load as soon as the load is ready. Reading of various patents seems to support this view that different mechanisms are involved. – BeeOnRope May 28 '18 at 04:35
Yes, that is true on SnB. Indexed LEAs will only be issued to p1, not p0 (I think). But that may not be the case in Haswell because the fast LEA unit was relocated to a different port (p5). So even the fast LEA on Haswell (and later) might be able to handle a base+index+offset LEA by making use of p7. So either p7 will compute base+offset first and then pass the result to the fast LEA or the fast LEA will compute base+index and pass the result to p7 to compute the final result. That would not make the fast LEA "fast", but it will improve overall performance in certain situations. – Hadi Brais May 28 '18 at 04:37
@BeeOnRope: ALU+store instructions are funky. They sometimes can't micro-fuse the store, and that seems more common when there are more total uops. So maybe reducing it from 2 to 1 ALU uop is what made the diff. But for `vextracti/f128`, the immediate has no effect on which bytes of memory are written. It's a 16-byte store. Maybe the store-address execution unit needs to write something more than *just* the address+length into the memory-order buffer for store-forwarding to work? Or the immediate is important somehow? BTW, `movhps` can use p7. – Peter Cordes May 28 '18 at 04:37
@HadiBrais: There's zero evidence for ALUs on one port borrowing cycles on execution units that are accessible from a different port, except for the Skylake-AVX512 case where the possibly-borrowed execution unit is shut down entirely with a coarse heuristic. If your theory was right, you could create a loop where LEAs stole cycles from AGUs, but I haven't seen or read any evidence of that. It would make the (power-intensive) scheduler unnecessarily complicated just to avoid replicating an adder or something! https://en.wikipedia.org/wiki/Dark_silicon is not a problem. – Peter Cordes May 28 '18 at 04:42
@BeeOnRope: I don't have any more info about the variable store-forwarding latency. In-flight was probably not the best description; I was trying to describe what we already know, that store-forwarding gets slower when the load address is ready too soon. Reworded, thanks for reminding me that it's not an issue of how many are in-flight. – Peter Cordes May 28 '18 at 04:44
@HadiBrais - I'm not following you on the fast lea thing. What do you mean by "fast lea unit"? When I say "fast lea" I mean a `lea` that completes in 1 cycle. Those 2-input `lea` run on 2 ALUs and I have never seen a 3-input lea run in 1 cycle. It seems very unlikely that an unrelated EU could "help out" a single-uop instruction executing on another port: that is basically unprecedented and it's hard to see how it would work in terms of coordination (certainly not within 1 cycle). About the slow LEA it seems easy to accomplish in 3 cycles since most of the work (add + scale) happens in 1. – BeeOnRope May 28 '18 at 04:50
@BeeOnRope: BTW, the only reason slow LEA is 3 cycles instead of 2 is to keep uop latencies standardized at 1, 3, or 4 cycles (or 5c on pre-Skylake), or high/variable latency. This is to simplify the scheduler to save power in SnB-family vs. Nehalem. (The 5c latency vector-integer multiply ALU instructions on Skylake are actually 4c + 1c domain-crossing latency, to or from the FMA unit I guess. PHMINPOSUW is 4c on SKL, down from 5 for this reason. IDK how to explain SKL x87 `fmul` still being 5c; I guess it's special.) – Peter Cordes May 28 '18 at 05:01
Yes, I know - my point is that fact gives it _tons_ of time to do the extra addition. – BeeOnRope May 28 '18 at 05:02
You're right. It doesn't work :(. One has to manually break a store with base+index+offset address into an LEA with base+index (can be issued to any of the LEA units including "fast LEA") and a store with base+offset (can be issued to any of the AGU units including p7). @BeeOnRope I'm referring to the fast LEA unit at port 5 in Haswell. An LEA with base+index+offset cannot be broken to make use of p7. – Hadi Brais May 28 '18 at 05:05
What we've been saying is there is no specific "fast LEA" port in recent Intel. There are only two ports for LEA and they can both do the "fast LEA" (1 cycle) equally well. Only one of the ports can do a complex _slow_ lea, however. @HadiBrais – BeeOnRope May 28 '18 at 05:07
@BeeOnRope Yea but the one that cannot do complex LEA is called "fast LEA". [https://www.realworldtech.com/haswell-cpu/4/](https://www.realworldtech.com/haswell-cpu/4/). It's just the name that I'm using. – Hadi Brais May 28 '18 at 05:09
Got it. Anyways, I'm quite sure the "fast lea" unit doesn't get any help from `p7` to secretly do complex leas. It is a bit weird to me that one of the lea units can't do complex lea - since those can take 3 cycles it seems like the complex lea capability is almost free: the hard stuff (scaled addition in 1 cycle) is already done and you have 2 cycles just to do the last addition? It's not like you'd need special hardware to do that... Maybe it's something like the uop format is special for the complex lea and only the complex lea port understands it... – BeeOnRope May 28 '18 at 05:15
@BeeOnRope You've to look at it from low-level hardware perspective. Each of the components of the address requires wires to carry data and control signals and some control logic. If the designers ran out of space to route these wires around stuff, then the easiest approach would be to just support less types of addresses. – Hadi Brais May 28 '18 at 05:27
@BeeOnRope: The integer ALU on port5 doesn't run any 3-cycle latency uops, so it doesn't have/need 3 pipeline stages. The vector ALU on that port does (lane-crossing shuffles), but execution units aren't the same as ports. It's not a coincidence that all the 3c-latency integer uops go to port 1. (`imul`, `popcnt`, `shrd`, `crc32`, slow-LEA, PEXT/PDEP) – Peter Cordes May 28 '18 at 05:32
@PeterCordes - that makes sense. – BeeOnRope May 28 '18 at 05:34
@HadiBrais - sure, but my claim is that the additional complexity of the final addition is essentially zero: the unit is already capable of basically the full complexity of `lea`, minus the last addition. The unit could just re-use the existing adder in the next cycle to do that. The unit is already very complex, supporting a dozen or more instructions, so it is not really feasible that the designers ran out of room for some `lea`-specific wires and couldn't solve this. Peter's comment clears it up: this unit doesn't support any 3-cycle instructions period, so the whole thing can be simpler. – BeeOnRope May 28 '18 at 05:36
I've verified this on Broadwell. – Hadi Brais Oct 04 '18 at 09:49
