Intel's intrinsics guide can be useful for finding vector instructions. It lists the asm mnemonic as well as the intrinsic (and you can search by mnemonic instead of intrinsic, since the search matches on the whole text of the entry).
Intel's PDF reference manual also has an index. The insn set ref manual is volume 2. See links to Intel's manuals in the x86 tag wiki.
SSE4.1 PINSRB could do exactly what you asked, but it will bottleneck on one shuffle per clock on Haswell and later, not achieving 2-loads-per-clock throughput. (pinsrb xmm, [mem], imm8 is 2 uops: one for port 5, one for the load ports.)
You don't need to shift the vector left, because the integer -> vector insert with merging instructions (PINSR*) take an index for the insert position. (And already require a shuffle uop, so using the same position every time and shifting the vector is no good for performance.)
For this problem: inserting 16 bytes into a vector separately is not the most efficient approach. Assembling them in groups of 4 or 8 in integer registers might be a better way to go.
;; b0 .. b15 are whatever addressing mode you want.
;; if you could get more than 1 of b0..b15 with a single vector load (i.e. there is some locality in the source bytes)
;; then DON'T DO THIS: do vector loads and shuffle + combine (pshufb if needed)
movzx eax, byte [b2] ; movzx writes the full register, breaking the dependency on the old value of RAX
mov ah, byte [b3]
shl eax, 16 ; partial-reg merge is pretty cheap on SnB/IvB, but very slow on Intel CPUs before Sandybridge. AMD has no merging penalty, just the (true, in this case) dependencies
mov al, byte [b0]
mov ah, byte [b1]
;; 5 uops to load + merge 4 bytes into an integer reg, plus 2x merging costs
movd xmm0, eax ; cheaper than pinsrd xmm0, eax, 0. Also zeros the rest of the vector
; alternative strategy using an extra OR, probably not better anywhere: I don't think merging AL and AH is cheaper than merging just AH
; two short dep chains instead of one longer one isn't helpful when we're doing 16 bytes
movzx eax, byte [b4]
mov ah, byte [b5]
movzx edx, byte [b6]
mov dh, byte [b7]
shl edx, 16
or edx, eax
pinsrd xmm0, edx, 1
;; Then repeat for the next two dwords.
...
pinsrd xmm0, edx, 2
...
pinsrd xmm0, edx, 3
You could even keep going in integer regs up to qwords for movq / pinsrq, but 4 separate dep chains and only one shl per integer reg is probably better.
Update: AH-merging is not free on Haswell/Skylake. The merging uop may even need to issue in a cycle by itself (i.e. using up 4 slots of front-end issue bandwidth). See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent.
For other uarches: Why doesn't GCC use partial registers?. Specifically on AMD, and Silvermont, partial-reg writes have a dependency on the full reg. That's exactly what we want here for throughput: no extra merging uop. (This is the case on anything except Intel P6-family and its Sandybridge-family descendant, where partial-register renaming is sometimes helpful but in this case harmful.)
If you can't assume SSE4, you could use pinsrw (SSE2). Or maybe it would be better to use movd and shuffle vectors together with PUNPCKLDQ / PUNPCKLQDQ. (That link is to an HTML extract from Intel's manuals.)
See Agner Fog's Optimizing Assembly guide (and instruction tables/microarch guide) to decide what sequence of instructions would actually be good.