
I have a zeroed 128-bit register that I want to shift left and add a byte to. I can shift it with:

pslldq xmm0, 1 

...but now I want to copy al into the empty space. Something like:

or xmm0, al

which of course doesn't work. I only want the lowest 8 bits affected. This will be in a loop where successive values of al will be used to fill the register. So I need some kind of mov instruction or other alternative.

The ideal would be a single instruction to shift left 8 bits and insert, but I don't think one exists.

I have spent a lot of time rummaging around in the x86-64 instruction set data but can't find anything that will allow me to do what I want. Can it be done?

UPDATE: I found an error in my code logic after trying pinsrb. pinsrb would be great but unfortunately it can only use an immediate index, not a register.

I'm taking bytes from non-contiguous locations, so I think I need to do it a byte at a time. The number of bytes can be anywhere from 1 to 16. The first byte I grab should end up in the lowest byte of xmm0, the next byte in the next lowest, etc.

poby

  • you want SSE4.1 `pinsrb xmm0, eax, 1`, but repeating that 16 times is slow. Instead of shifting the vector each time, just use it with 16 different indices. – Peter Cordes Sep 17 '16 at 23:55
  • Unroll your insert loop (keeping the tests for exit) so you can use pinsrb with index = 0, 1, 2, ... You could almost certainly do something more efficient (especially if you know ahead of time how many total bytes you will insert), but that will work. – Peter Cordes Sep 18 '16 at 00:40
  • I can't give you any more specific advice about what would be optimal, because there are too many unknowns about the surrounding code (e.g. are you bottlenecking on shuffle throughput, latency, uop throughput, cache misses... And do you need a lot of these byte-gathers? Or is there a lot of other computation besides that?) In some circumstances, it might even be optimal to copy bytes into a 16B scratch array and do a vector load from that (e.g. if the latency of a store-forwarding failure wasn't a problem, and all those stores weren't a problem); see the sketch just after these comments. – Peter Cordes Sep 18 '16 at 00:48
  • I think on Haswell or later, or on AMD, doing some merging in integer registers before inserting into XMM would be a really good idea. – Peter Cordes Sep 18 '16 at 00:49
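
A minimal sketch of that scratch-array idea from the comments, under some loud assumptions: b0..b15 are placeholder addressing modes, the scratch buffer lives in the red zone below rsp (System V x86-64), and the wide reload is expected to hit a store-forwarding stall:

mov     al, [b0]
mov     [rsp-16], al      ; store each gathered byte to contiguous scratch space
mov     al, [b1]
mov     [rsp-15], al
; ... bytes 2 through 15 the same way, then:
movdqu  xmm0, [rsp-16]    ; one wide reload; expect a store-forwarding stall here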

1 Answer


Intel's intrinsics guide can be useful for finding vector instructions. It lists the asm mnemonic as well as the intrinsic (and you can search by mnemonic instead of intrinsic, since the search matches on the whole text of the entry).

Intel's PDF reference manual also has an index. The insn set ref manual is volume 2. See links to Intel's manuals in the tag wiki.


SSE4.1 PINSRB could do exactly what you asked, but a chain of those will bottleneck on one shuffle per clock on Haswell and later, not achieving 2 loads per clock throughput. (pinsrb xmm, [mem], imm8 is 2 uops: one for port 5 and one for the load ports.)

You don't need to shift the vector left, because the integer -> vector insert-with-merge instructions (PINSR*) take an index for the insert position. (They already require a shuffle uop, so using the same position every time and shifting the vector gains nothing.)
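
For reference, a minimal sketch of that straightforward unrolled-PINSRB approach (b0..b15 are placeholder addressing modes, as in the code further down; it works, but each pinsrb costs a shuffle uop):

movzx  eax, byte [b0]
movd   xmm0, eax          ; byte 0; also zeros the rest of the vector
movzx  eax, byte [b1]
pinsrb xmm0, eax, 1       ; merge into byte 1 (shuffle uop)
movzx  eax, byte [b2]
pinsrb xmm0, eax, 2
; ... and so on, up to  pinsrb xmm0, eax, 15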

For this problem: inserting 16 bytes into a vector separately is not the most efficient approach. Assembling them in groups of 4 or 8 in integer registers might be a better way to go.

;; b0 .. b15 are whatever addressing mode you want.
;; if you could get more than 1 of b0..b15 with a single vector load (i.e. there is some locality in the source bytes)
;; then DON'T DO THIS: do vector loads and shuffle + combine (pshufb if needed)

movzx  eax, byte [b2]   ; break the dependency on the old value of rax (movzx writes the full register)
mov    ah,  byte [b3]
shl    eax, 16         ; partial-reg merge is pretty cheap on SnB/IvB, but very slow on Intel CPUs before Sandybridge.  AMD has no penalty, just (true in this case) dependencies
mov    al,  byte [b0]
mov    ah,  byte [b1]
    ;; 5 uops to load + merge 4 bytes into an integer reg, plus 2x merging costs
movd   xmm0, eax      ; cheaper than pinsrd xmm0, eax, 0.  Also zeros the rest of the vector

;alternative strategy using an extra OR, probably not better anywhere: I don't think merging AL and AH is cheaper than merging just AH
;two short dep chains instead of one longer one isn't helpful when we're doing 16 bytes
movzx  eax, byte [b4]
mov    ah,  byte [b5]
movzx  edx, byte [b6]
mov    dh,  byte [b7]
shl    edx, 16
or     edx, eax
pinsrd xmm0, edx, 1

;; Then repeat for the next two dwords.
...
pinsrd xmm0, edx, 2

...
pinsrd xmm0, edx, 3

You could even keep going in integer regs up to qwords for movq / pinsrq, but 4 separate dep chains and only one shl per integer reg is probably better.
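
A hedged sketch of that qword variant, assuming rax already holds bytes 0..3 and rdx holds bytes 4..7, merged as shown above:

shl    rdx, 32
or     rax, rdx           ; rax = bytes 7..0
movq   xmm0, rax          ; low qword; zeros the upper half of the vector
; ... build bytes 8..15 into rax the same way, then:
pinsrq xmm0, rax, 1       ; SSE4.1, 64-bit mode only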

Update: AH-merging is not free on Haswell/Skylake. The merging uop might even need to issue in a cycle by itself (i.e. using up all 4 slots of front-end issue bandwidth). See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent.

For other uarches: Why doesn't GCC use partial registers?. Specifically on AMD, and Silvermont, partial-reg writes have a dependency on the full reg. That's exactly what we want here for throughput: no extra merging uop. (This is the case on anything except Intel P6-family and its Sandybridge-family descendant, where partial-register renaming is sometimes helpful but in this case harmful.)


If you can't assume SSE4, then you could use pinsrw (SSE2). Or maybe it would be better to use movd and shuffle vectors together with PUNPCKLDQ / PUNPCKLQDQ. (That link is to an HTML extract from Intel's manuals.)
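
A minimal SSE2-only sketch, assuming the four dwords have already been merged into eax/ebx/ecx/edx as shown earlier (the register choice is arbitrary):

movd       xmm0, eax      ; bytes 0..3
movd       xmm1, ebx      ; bytes 4..7
movd       xmm2, ecx      ; bytes 8..11
movd       xmm3, edx      ; bytes 12..15
punpckldq  xmm0, xmm1     ; low qword of xmm0 = bytes 0..7
punpckldq  xmm2, xmm3     ; low qword of xmm2 = bytes 8..15
punpcklqdq xmm0, xmm2     ; xmm0 = bytes 0..15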

See Agner Fog's Optimizing Assembly guide (and instruction tables/microarch guide) to decide what sequence of instructions would actually be good.

Peter Cordes