
With x being some integer stored in ebx, how does one rotate the 4 most significant bits by 1 while preserving the 4 least significant bits? For example, 0xABCDEF12 is rotated to 0xDABCEF12.

kmin
  • Not sure what the most efficient is, but you could rotate that to the low byte, then rotate the low byte, then rotate back. *Oh you seem to not mean bits ... well the logic is the same. – Jester Feb 10 '18 at 00:34
  • You mean like the `ror` instruction on just the 32-bit portion of the register? Otherwise, save the 4 bits to a register, shift them down 4, and OR them back in. – Michael Dorgan Feb 10 '18 at 00:37
  • Yes, you can do `ror ebx, 16` to swap the lower 16 bits and upper 16 bits. Then you could do a `ror bx, 4` to rotate the lower 4 bits into the upper 4 bits of BX. Then you can do another `ror ebx, 16` to swap the upper 16 bits with the lower 16 bits again. – Michael Petch Feb 10 '18 at 01:43

1 Answer


You're talking about nibbles (hex digits), not bits. 4 nibbles is 16 bits, and x86 does have 16-bit operand-size available for rotate, so you just need to put the bits you want to rotate in the low 16 of a register.

bswap  ebx      ; ABCDEF12 -> 12 EF CD AB
ror    bx, 4    ; CDAB -> BCDA   (high half unmodified)
bswap  ebx      ; 12 EF BC DA -> DABCEF12   (partial-register stall on Core2 / Nehalem)

This is efficient on all x86 CPUs other than Intel P6-family, where the partial-register stall sucks (from reading EBX after writing BX).
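To check the hex values in the comments above, here is a minimal C sketch that models the three instructions with portable shifts. The function names are hypothetical and this is only a model of the data flow, not how you would write it for speed; a compiler may or may not turn it back into `bswap` / `ror`.

```c
#include <stdint.h>

/* Portable stand-in for `bswap r32`: reverse the four bytes of x. */
static uint32_t model_bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Stand-in for `ror r16, imm8`, valid for counts 1..15. */
static uint16_t model_ror16(uint16_t w, unsigned count)
{
    return (uint16_t)((w >> count) | (w << (16 - count)));
}

/* Models: bswap ebx ; ror bx, 4 ; bswap ebx */
uint32_t rotate_high_word_bswap(uint32_t ebx)
{
    ebx = model_bswap32(ebx);                    /* ABCDEF12 -> 12EFCDAB */
    uint16_t bx = model_ror16((uint16_t)ebx, 4); /* CDAB -> BCDA */
    ebx = (ebx & 0xFFFF0000u) | bx;              /* writing BX keeps the high half */
    return model_bswap32(ebx);                   /* 12EFBCDA -> DABCEF12 */
}
```

Calling `rotate_high_word_bswap(0xABCDEF12)` should give `0xDABCEF12`, matching the question.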

Also note that bswap r32 on Core2 and earlier Intel P6 CPUs is 2 uops, thus slower than ror r32, imm8. But you'd avoid this sequence on P6-family anyway because of the partial-register stall. On Skylake, for example, bswap is nice for throughput because it runs on p1 / p5, while rotate runs on p0 / p6, so if you're bottlenecked on throughput of this sequence, rather than latency, it can overlap with itself. If you're overlapping mostly with other surrounding code (not running this in a tight loop), then you can choose between ror ebx,16 or bswap ebx to balance execution-port pressure if necessary.
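For comparison, the `ror ebx,16` variant (ror ebx,16 ; ror bx,4 ; ror ebx,16) can be modelled the same way. This is a sketch with a hypothetical function name, meant only to show that swapping halves works as well as swapping bytes for getting the high word into BX.

```c
#include <stdint.h>

/* Stand-in for `ror r32, 16`: swap the two 16-bit halves. */
static uint32_t model_ror32_by_16(uint32_t x)
{
    return (x >> 16) | (x << 16);
}

/* Models: ror ebx, 16 ; ror bx, 4 ; ror ebx, 16 */
uint32_t rotate_high_word_ror16(uint32_t ebx)
{
    ebx = model_ror32_by_16(ebx);             /* ABCDEF12 -> EF12ABCD */
    uint16_t bx = (uint16_t)ebx;
    bx = (uint16_t)((bx >> 4) | (bx << 12));  /* ABCD -> DABC */
    ebx = (ebx & 0xFFFF0000u) | bx;           /* high half unmodified */
    return model_ror32_by_16(ebx);            /* EF12DABC -> DABCEF12 */
}
```

Both variants produce the same result; the choice between them is purely about uop count and execution ports, as described above.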

Of course, if you were doing just this in a tight loop over an array, don't load the whole element in the first place, just ror word [mem+2], 4 to rotate the high word of a dword in memory. (But don't do this right before loading that array element, because it would cause a store-forwarding stall from the 16-bit store at the end of the read-modify-write forwarding to a wider 32-bit load. Memory-destination rotate is only a good idea if the value is staying in memory and that's all you're doing with it for now.)
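As a rough illustration of what `ror word [mem+2], 4` touches, here is a C sketch that works on the byte image of one little-endian dword. Both function names are hypothetical, and the second one exists only to make the result easy to check.

```c
#include <stdint.h>

/* Models `ror word [mem+2], 4`: only bytes 2 and 3 of the
 * little-endian dword at `mem` are read and written. */
void rotate_high_word_in_memory(unsigned char *mem)
{
    uint16_t w = (uint16_t)(mem[2] | (mem[3] << 8));
    w = (uint16_t)((w >> 4) | (w << 12));
    mem[2] = (unsigned char)(w & 0xFF);
    mem[3] = (unsigned char)(w >> 8);
}

/* Hypothetical helper: lay out `value` as x86 would in memory,
 * rotate the high word in place, and reassemble the dword. */
uint32_t rotate_via_memory(uint32_t value)
{
    unsigned char mem[4];
    for (int i = 0; i < 4; i++)
        mem[i] = (unsigned char)(value >> (8 * i));
    rotate_high_word_in_memory(mem);
    return (uint32_t)mem[0] | ((uint32_t)mem[1] << 8)
         | ((uint32_t)mem[2] << 16) | ((uint32_t)mem[3] << 24);
}
```

The 16-bit store to bytes 2 and 3 is what a later 32-bit load of the same element would fail to forward from, which is the stall mentioned above.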


Alternatively, you could shift, mask, and OR to put bits where they belong. I think this would take more instructions and longer than a 3-cycle latency chain. (Or 4-cycle on Sandybridge pre-Ivybridge, where AX is still renamed separately from RAX, but a merging uop can be inserted without stalling.) But do this anyway if you need it to be efficient on Nehalem.
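A minimal sketch of the shift / mask / OR approach in C, assuming a hypothetical function name. It never writes a partial register, at the cost of more operations:

```c
#include <stdint.h>

/* Rotate the high 16 bits of x right by 4 using only 32-bit
 * shifts, masks and ORs; the low 16 bits pass through unchanged. */
uint32_t rotate_high_word_mask(uint32_t x)
{
    uint32_t low_half = x & 0x0000FFFFu;          /* bits 15..0 untouched */
    uint32_t shifted  = (x >> 4)  & 0x0FFF0000u;  /* bits 31..20 move down to 27..16 */
    uint32_t wrapped  = (x << 12) & 0xF0000000u;  /* bits 19..16 wrap up to 31..28 */
    return wrapped | shifted | low_half;
}
```

In asm this is roughly two shifts, three ANDs, two ORs and a couple of register copies, which is why it looks worse than the 3-instruction sequence except where partial-register stalls dominate.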


AVX512F has variable-count rotate (VPRORVD), but not for 16-bit element size (not even with AVX512BW or AVX512VBMI). Otherwise you could use a count vector that rotated the top word of each dword by 4, but the bottom word by 0.

AVX512VBMI2 (expected in Ice Lake) has SIMD versions of SHLD / SHRD, which you can use as a rotate by giving both data inputs the same register. VPSHRDVW works on word elements:

section .rodata
    rotate_constant:  dw 0, 4

section .text
vpbroadcastd   xmm1, [rotate_constant]   ; 32-bit broadcast of [4, 0]

; rotate the high 16 bits of every dword element in xmm0 (or ymm0 or zmm0)
vpshrdvw       xmm0,xmm0, xmm1

vpshrdvw can't use a broadcast memory operand anyway (unlike the dword and qword versions), and if it could it would be a 16-bit broadcast, not 32-bit.
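To make the per-lane behaviour concrete, here is a scalar C sketch of one dword's worth of `vpshrdvw xmm0, xmm0, xmm1`. It assumes my reading of the instruction (each word lane concatenates the second source above the destination, shifts right by the count modulo 16, and keeps the low 16 bits); the function names are hypothetical.

```c
#include <stdint.h>

/* One word lane of VPSHRDVW as assumed above:
 * low 16 bits of (src2:dst) >> (count & 15). */
static uint16_t model_vpshrdvw_lane(uint16_t dst, uint16_t src2, uint16_t count)
{
    uint32_t concat = ((uint32_t)src2 << 16) | dst;
    return (uint16_t)(concat >> (count & 15));
}

/* Applies the lane model to the two words of one dword, with the
 * same counts as `rotate_constant: dw 0, 4` (low word 0, high word 4). */
uint32_t rotate_dword_like_vpshrdvw(uint32_t dword)
{
    static const uint16_t counts[2] = { 0, 4 };
    uint16_t lanes[2] = { (uint16_t)dword, (uint16_t)(dword >> 16) };
    for (int i = 0; i < 2; i++)
        lanes[i] = model_vpshrdvw_lane(lanes[i], lanes[i], counts[i]);
    return ((uint32_t)lanes[1] << 16) | lanes[0];
}
```

Because both data inputs are the same lane, the double shift degenerates into a rotate, and a count of 0 leaves the low word alone.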

Peter Cordes
  • Or how about `bswap ebx ; ror bx, 4 ; bswap ebx`? It's a whopping 2 bytes shorter! – David Wohlferd Feb 10 '18 at 09:55
  • @DavidWohlferd: Yeah, that's good except on Core2 and earlier, where `bswap` is 2 uops / 4c latency even for 32-bit operand size. (But partial-register stalls mean this sucks there anyway.) 64-bit operand-size `bswap` is still 2c latency on Skylake, but 32-bit is 1 uop / 1c, and runs on different ports than rotate, so it's actually an excellent choice. It's good on AMD, too. I was thinking that `bswap` reversing the bytes within the part you rotate would be a problem, but no: `ABCDEF12` -> `12EFCDAB` -> rotate to `BCDA` -> bswap back to `DABC` as desired, so it does work. – Peter Cordes Feb 10 '18 at 10:13
  • @PeterCordes as always shows that his knowledge is not limited ;) – Gilgamesz Mar 04 '18 at 12:37