
I am writing a code library in x86-64 assembly language to provide all conventional bitwise, shift, logical, compare, arithmetic and math functions for s0128, s0256, s0512, s1024, s2048, and s4096 signed-integer types and f0128, f0256, f0512, f1024, f2048, and f4096 floating-point types.

Now I am writing some type conversion routines, and have run into something that should be trivial but takes a lot more instructions than I would expect. I feel like I must be missing something (some instructions) to make this easier, but so far no luck.

The low 128 bits of the s0256 result are simply a copy of the s0128 input argument, and all the bits in the upper 128 bits of the s0256 result must be set to the most-significant bit of the s0128 input argument.

Simple, huh? But here is the best I can figure out so far to convert s0128 to s0256. Ignore the first 4 lines (they're just argument error checks) and the last 2 lines (returning from the function with no error (rax == 0)). The 5 lines in the middle are the algorithm in question. Try to avoid conditional jump instructions.

.text
.align 64
big_m63:
.quad  -63, -63                       # two shift counts for vpshaq instruction

big_s0256_eq_s0128:    # (s0256* arg0, const s0128* arg1); # s0256 = s0256(s0128)
  orq        %rdi, %rdi               # is arg0 a valid address ???
  jz         error_argument_invalid   # nope
  orq        %rsi, %rsi               # is arg1 a valid address ???
  jz         error_argument_invalid   # nope

  vmovapd    (%rsi), %xmm0            # ymm0 = arg1.ls64 : arg1.ms64 : 0 : 0
  vmovhlps   %xmm0, %xmm0, %xmm1      # ymm1 = arg1.ms64 : arg1.ms64 : 0 : 0
  vpshaq     big_m63, %xmm1, %xmm1    # ymm1 = arg1.sign : arg1.sign : 0 : 0
  vperm2f128 $32, %ymm1, %ymm0, %ymm0 # ymm0 = arg1.ls64 : arg1.ms64 : sign : sign
  vmovapd    %ymm0, (%rdi)            # arg0 = arg1 (sign-extended to 256-bits)

  xorq       %rax, %rax               # rax = 0 == no error
  ret                                 # return from function

This routine is also non-optimal in that every instruction requires the result of the previous instruction, which prevents parallel execution of any instructions.

Is there a better instruction to right-shift with sign extension? I cannot find an instruction like vpshaq that accepts an immediate byte to specify shift-count, though I don't know why (many SIMD instructions have immediate 8-bit operands for various purposes). Also, Intel does not support vpshaq. Oops!
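
One Intel-and-AMD-compatible alternative I can imagine (untested) builds the 128-bit sign mask out of 32-bit shifts, since vpsrad does accept an immediate count even though no 64-bit packed arithmetic shift in baseline AVX does. Starting from the xmm0 loaded above, it would replace the vmovhlps/vpshaq pair:

  vpshufd    $0xff, %xmm0, %xmm1       # xmm1 = top dword (bits 127:96) copied into all 4 dwords
  vpsrad     $31, %xmm1, %xmm1         # shift each dword right by 31 -> 128 bits of arg1.sign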

But look! StephenCanon has a brilliant solution to this problem below! Awesome! That solution has one more instruction than the above, but the vpxor instruction can be put after the first vmovapd instruction and should effectively take no more cycles than the 5 instruction version above. Bravo!

For completeness and easy comparison, here is the code with the latest StephenCanon enhancement:

.text
.align 64
big_s0256_eq_s0128:    # (s0256* arg0, const s0128* arg1); # s0256 = s0256(s0128)
  orq        %rdi, %rdi               # is arg0 a valid address ???
  jz         error_argument_invalid   # nope
  orq        %rsi, %rsi               # is arg1 a valid address ???
  jz         error_argument_invalid   # nope

  vmovapd    (%rsi), %xmm0            # ymm0 = arg1.ls64 : arg1.ms64 : 0 : 0
  vpxor      %xmm2, %xmm2, %xmm2      # ymm2 = 0 : 0 : 0 : 0
  vmovhlps   %xmm0, %xmm0, %xmm1      # ymm1 = arg1.ms64 : arg1.ms64 : 0 : 0
  vpcmpgtq   %xmm1, %xmm2, %xmm1      # ymm1 = arg1.sign : arg1.sign : 0 : 0
  vperm2f128 $32, %ymm1, %ymm0, %ymm0 # ymm0 = arg1.ls64 : arg1.ms64 : sign : sign
  vmovapd    %ymm0, (%rdi)            # arg0 = arg1 (sign-extended to 256-bits)

  xorq       %rax, %rax               # rax = 0 == no error
  ret                                 # return from function

I'm not certain, but not needing to read those two 64-bit shift-counts from memory might also speed the code up slightly. Nice.

  • use `test %rdi, %rdi` / `jz` to branch on a register being zero. That [can macro-fuse into one test-and-branch uop on both AMD and Intel](https://stackoverflow.com/questions/33721204/test-whether-a-register-is-zero-with-cmp-reg-0-vs-or-reg-reg/33724806#33724806), and avoids putting an extra cycle of latency into the dep chain leading to the load. Or better, require your caller to pass valid args, so you simply segfault on bad pointers. – Peter Cordes Oct 21 '20 at 08:10
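
(For illustration only, applying that suggestion to the prologue above would read roughly as follows:)

  test       %rdi, %rdi               # sets ZF if arg0 is a null pointer, without rewriting %rdi
  jz         error_argument_invalid
  test       %rsi, %rsi               # same check for arg1
  jz         error_argument_invalid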

1 Answer


You're over-complicating things. Once you have the sign in rax, just do two 64b stores from there instead of trying to assemble the result in ymm0. One less instruction and a much shorter dependency chain.
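
(The GPR version isn't spelled out here; a minimal sketch of the idea, assuming the same argument registers as the question's routine, might look like this:)

  movq       (%rsi), %rdx             # low 64 bits of the s0128 source
  movq       8(%rsi), %rax            # high 64 bits; its top bit is the sign
  movq       %rdx, (%rdi)             # copy the low 128 bits into the destination
  movq       %rax, 8(%rdi)
  sarq       $63, %rax                # rax = 0 or -1: the 64-bit sign word
  movq       %rax, 16(%rdi)           # two 64-bit stores of the sign word
  movq       %rax, 24(%rdi)

(The surrounding function would still need to clear %rax afterwards to keep the rax == 0 no-error convention.)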

As the destination type gets larger, of course, it makes sense to use wider stores (AVX). With AVX2 you can use vpbroadcastq to do the splat more efficiently, but it looks like you're targeting baseline AVX?
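
(A rough sketch of that splat, assuming the 64-bit sign word already sits in the low lane of %xmm1:)

  vpbroadcastq %xmm1, %ymm1            # AVX2: replicate the sign word into all four 64-bit lanes

Wide stores of %ymm1 could then fill the upper portions of the larger destination types.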

I should also note that once you get to ~512b integers, for most algorithms the cost of super-linear operations like multiplication so completely dominates the running time that squeezing every last cycle out of operations like sign extension rapidly starts to lose value. It's a good exercise, but ultimately not the most productive use of your time once your implementations are "good enough".


After further thought, I have the following suggestion:

vmovhlps  %xmm0, %xmm0, %xmm1 // could use a permute instead to stay in integer domain.
vpxor     %xmm2, %xmm2, %xmm2
vpcmpgtq  %xmm1, %xmm2, %xmm2 // generate sign-extension without shift

This has the virtues of (a) not requiring a constant load and (b) working on both Intel and AMD. The xor to generate zero looks like an extra instruction, but in practice this zeroing idiom doesn’t even require an execute slot on recent processors.


FWIW, if targeting AVX2, I might write it like this:

vmovdqa (%rsi),        %xmm0 // { x0, x1, 0,  0  }
vpermq   $0x5f, %ymm0, %ymm1 // { 0,  0,  x1, x1 }
vpxor    %ymm2, %ymm2, %ymm2 // { 0,  0,  0,  0  }
vpcmpgtq %ymm1, %ymm2, %ymm2 // { 0,  0,  s,  s  } s = sign extension
vpor     %ymm2, %ymm0, %ymm0 // { x0, x1, s,  s  }
vmovdqa  %ymm0,       (%rdi)

Unfortunately, I don’t think that vpermq is available on AMD.

Stephen Canon
  • I was not aware that `AVX2` was available yet. The CPUs I have to test with are the `FX8150` and `FX8350`. I'm willing to adopt any instruction these CPUs have. Yes, the data-types I am supporting are `s0128`, `s0256`, `s0512`, `s1024`, `s2048`, `s4096` integers and `f0128`, `f0256`, `f0512`, `f1024`, `f2048`, `f4096` floating-point. For larger sizes the best approach changes as the number of 64-bit chunks to fill with the sign-bit becomes large. That is why I tried to do as much as possible in the `SIMD` `ymm` registers, which hopefully is more efficient for the larger data-types. – honestann Jan 12 '14 at 17:04
  • @honestann: right, for the big data types you'll want to splat out the signbit with AVX as in your example. For the specific case of 128->256 (which you happened to use here), it a bit tidier to just store out of the GPRs. AVX2 is available in the Intel "Haswell" microarchitecture, but isn't available in AMD parts yet, so it's not an option for you. – Stephen Canon Jan 12 '14 at 17:12
  • @StephenCanon: Also I was a bit worried about causing the cache circuitry in the CPU hassles and delays by repetitively writing out different values into a single 64-byte line of the cache. With all the merging that has to be done, I worry that spitting bits and pieces of a single cache-line to the cache (especially on consecutive instructions) might force the cache circuitry to impose additional delays. I'd hate to be the poor bastard who has to design the cache circuitry to handle such situations! – honestann Jan 12 '14 at 17:31
  • Most caches are extremely adept at handling adjacent 64b stores. You do run into trouble on some architectures if you issue *overlapping* stores, and you do need to be careful about storing and then immediately reloading data when the access sizes are different, but for your usage here, you'd be pretty safe. – Stephen Canon Jan 12 '14 at 17:34
  • BTW, if you are in a position to easily verify that the `vpshaq` instruction does or does not work properly on your CPU, that would help me a lot. On my CPUs, only the low 64 bits of the source operand get shifted. The documentation says both 64-bit portions of the source `xmm` register should be shifted, and the instruction name indicates both should (as a "packed" instruction). If you have Intel CPUs, it would be good to know whether they shift both 64-bit portions or only one. – honestann Jan 12 '14 at 17:39
  • 1
    The shift amount for each lane needs to be present in the corresponding lane of the shift count vector. So if you want to shift without first doing the `movhlps`, you need to have {x,63} for your shift count vector. Right now your code is using memory you haven’t set for the high lane of the shift count, though it’s likely to be zero. Be warned that `vpshaq` isn’t available on Intel procs, it’s an AMD extension. – Stephen Canon Jan 12 '14 at 21:25
  • Doh! Yes, my mistake! I thought the `vpshaq` instruction only took one shift specification, but now I see the `vpshaq` instruction takes one 64-bit shift-count field for each 64-bit field it shifts. I corrected the code. Though it looks a lot nicer now, with only 5 instructions, every instruction has to wait for the result of the previous instruction, so it probably has to stall for a couple cycles before it starts each of the 5 instructions. Hmmmm... I wonder whether there is a good way using instructions available on both Intel and AMD CPUs. `movhlps` is needed so we can create 128 bits of sign. – honestann Jan 12 '14 at 21:51