This answer applies to Haswell and Skylake (/Kaby Lake / Coffee Lake). Future microarchitectures (Cannon Lake / Ice Lake) will have to be checked when they're available. The port 7 AGU was new in Haswell.
For instructions that can use port 7 at all (e.g. not vextracti128), any non-indexed addressing mode can use port 7.
This includes RIP-relative, and 64-bit absolute (mov [qword abs buf], eax, even in a PIE executable loaded above 2^32, so the address really doesn't fit in 32 bits), as well as normal [reg + disp0/8/32] or absolute [disp32].
An index register always prevents use of port 7, e.g. [rdi + rax], or [disp32 + rax*2]. Even [NOSPLIT disp32 + rax*1] can't use port 7 (so HSW/SKL doesn't internally convert an indexed addressing mode with scale=1 and no base register into a base+disp32 addressing mode).
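The rule above is simple enough to write down as a predicate. The following is only a minimal Python sketch of the measured behaviour, not anything the hardware or a tool exposes: the `AddrMode` class, the function name, and the example displacement values are made up for illustration, and the rule only applies to instructions whose store-address uop can use port 7 at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AddrMode:
    """Hypothetical model of one memory operand: [base + index*scale + disp]."""
    base: Optional[str] = None   # e.g. "rdi", or "rip" for RIP-relative
    index: Optional[str] = None  # e.g. "rax"; None means non-indexed
    scale: int = 1
    disp: int = 0


def store_addr_can_use_port7(mode: AddrMode) -> bool:
    # Rule as measured on HSW/SKL: any index register rules out port 7,
    # regardless of scale, base register, or displacement.
    return mode.index is None


# Displacements standing in for "disp32" / "buf" are arbitrary placeholders.
examples = {
    "[rdi]": AddrMode(base="rdi"),
    "[rsi - 4000]": AddrMode(base="rsi", disp=-4000),
    "[rel buf]": AddrMode(base="rip", disp=0x1000),
    "[disp32]": AddrMode(disp=0x1000),
    "[rdi + rax]": AddrMode(base="rdi", index="rax"),
    "[disp32 + rax*2]": AddrMode(index="rax", scale=2, disp=0x1000),
    "[NOSPLIT disp32 + rax*1]": AddrMode(index="rax", scale=1, disp=0x1000),
}

for text, mode in examples.items():
    print(f"{text:26} port 7 possible: {store_addr_can_use_port7(mode)}")
```

The point of modelling it this way is that scale and base don't enter into the decision at all: the presence of an index register is the only thing that matters.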
I tested this myself on a Skylake i7-6700k with:
ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_dispatched_port.port_2,uops_dispatched_port.port_3,uops_dispatched_port.port_7 ./testloop
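One way to read the three port counters is to look at what fraction of the port 2/3/7 uops landed on port 7. This is a rough Python sketch under the assumption that the test loop contains only stores (no loads competing for ports 2 and 3); the function name is hypothetical and the sample numbers are invented for illustration, not measurements.

```python
def port7_share(counts):
    """Fraction of the uops dispatched to ports 2, 3 and 7 that went to port 7."""
    p2 = counts["uops_dispatched_port.port_2"]
    p3 = counts["uops_dispatched_port.port_3"]
    p7 = counts["uops_dispatched_port.port_7"]
    total = p2 + p3 + p7
    if total == 0:
        raise ValueError("no uops counted on ports 2, 3 or 7")
    return p7 / total


# Made-up counter values, for illustration only (not real measurements).
sample = {
    "uops_dispatched_port.port_2": 1_000_000,
    "uops_dispatched_port.port_3": 1_000_000,
    "uops_dispatched_port.port_7": 2_000_000,
}
print(f"port 7 share: {port7_share(sample):.2f}")
```

In a store-only loop, a share that stays near zero when the addressing mode changes is the sign that the store-address uops can no longer use port 7.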
The [+0, +2047] range of displacements makes no difference for stores: mov [rsi - 4000], rax can use port 7.
Non-indexed loads with small positive displacements have 1c lower latency. No special case for stores is mentioned in Intel's optimization manual. Skylake's variable-latency store-forwarding (with worse latency when the load tries to execute right away after the store) makes it hard to construct a microbenchmark that includes store latency but isn't affected by having store-address uops compete with loads for fewer ports. I haven't come up with a microbenchmark with a loop-carried dependency chain through a store-address uop but not through the store-data uop. Presumably it's possible, but maybe needs an array instead of a single location.
Some instructions can't use port 7 at all: vextracti128 [rdi], ymm0, 0 includes a store-address uop (of course), but it can only run on port 2 or port 3.
Agner Fog's instruction tables have at least one error here, though: he lists pextrb/w/d/q as only running the store-address uop on p23, but in fact it can use any of p237 on HSW/SKL.
I haven't tested this exhaustively, but one difference between HSW and SKL I found (see footnote 1) was VCVTPS2PH [mem], xmm/ymm, imm8. (The instruction changed to use fewer ALU uops, so that doesn't indicate a change in p7 between HSW and SKL.)
On Haswell: VCVTPS2PH is 4 uops (fused and unfused domain): p1 p4 p5 p23 (Agner Fog is right).
On Skylake: VCVTPS2PH xmm is 2 fused / 3 unfused uops: p01 p4 p237
On Skylake: VCVTPS2PH ymm is 3 fused / 3 unfused uops: p01 p4 p237
(Agner Fog lists VCVTPS2PH v as 3F/3U (one entry for both vector widths), missing the micro-fusion with the xmm version, and incorrectly lists the port breakdown as p01 p4 p23.)
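For comparing these entries side by side, the measurements for the store form of VCVTPS2PH can be kept as a small lookup table. This is just a Python sketch for bookkeeping: the table name, key layout, and helper function are hypothetical, and the values are copied from the results listed above (the Haswell line doesn't distinguish vector widths, so it is keyed as "any").

```python
# Measured uop breakdowns for VCVTPS2PH [mem], xmm/ymm, imm8, as listed above.
VCVTPS2PH_STORE = {
    ("Haswell", "any"): {"fused": 4, "unfused": 4, "ports": ["p1", "p4", "p5", "p23"]},
    ("Skylake", "xmm"): {"fused": 2, "unfused": 3, "ports": ["p01", "p4", "p237"]},
    ("Skylake", "ymm"): {"fused": 3, "unfused": 3, "ports": ["p01", "p4", "p237"]},
}


def store_addr_uses_port7(entry):
    # The store-address uop is the one listed as p23 or p237;
    # only the p237 form can run on port 7.
    return "p237" in entry["ports"]


def is_micro_fused(entry):
    # Fewer fused-domain than unfused-domain uops means some pair was micro-fused.
    return entry["fused"] < entry["unfused"]


for (uarch, width), entry in VCVTPS2PH_STORE.items():
    print(f"{uarch:8} {width:4} port 7: {store_addr_uses_port7(entry)}, "
          f"micro-fused: {is_micro_fused(entry)}")
```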
In general, beware that Agner's recent updates seem a little sloppy, with copy/paste or typo errors (e.g. 5 instead of 0.5 for Ryzen vbroadcastf128 y,m128 throughput).
1: HSW testing was on an old laptop that's no longer usable (I used its RAM to upgrade another machine that still gets regular use). I don't have a Broadwell to test on. Everything in this answer is definitely true on Skylake: I double checked it just now. I tested some of this a while ago on Haswell, and still had my notes from that.