Conditional move (cmov) for AVX vector registers based on scalar integer condition?

Question

For 64-bit registers, there is the CMOVcc A, B instruction, which writes B to A only if condition cc is satisfied:

; Do rax <- rdx iff rcx == 0
test rcx, rcx
cmove rax, rdx

However, I wasn't able to find anything equivalent for AVX. I still want to move depending on the value of RFLAGS, just with larger operands:

; Do ymm1 <- ymm2 iff rcx == 0
test rcx, rcx
cmove ymm1, ymm2  (invalid)

Is there an AVX equivalent for cmov? If not, how can I achieve this operation in a branchless way?

There is no such instruction. You can achieve the desired effect using blend instructions; you just have to create a bit mask indicating the desired condition instead of setting flags. — fuz, Feb 02 '21 at 18:09
@fuz Alright, I've already strongly suspected this - thanks for confirming! Yeah, during my research I've also taken a look at the blend instructions, but could not come up with an efficient solution for generating the bit mask itself, yet. But since this turns out to probably be the way to go, I will give it another look. — janw, Feb 02 '21 at 18:35
Broadcasting a flag from rflags into a vector is annoying, hopefully this is used in a context where it can be avoided (for example basing the mask on whether elements of a vector are zero) — harold, Feb 02 '21 at 18:52
@janw With some more context it might be possible to suggest a solution for your specific case. Also, can you use AVX2 or just AVX? — fuz, Feb 02 '21 at 18:53
@fuz Sure. My use case is oblivious (constant-time) memory access: given 32-byte blocks `B_0, ..., B_n` with addresses `a_0, ..., a_n`, I want to load a specific block `B_x` with address `a_x`. However, I also need to access all other blocks `B_i` and discard the results. So, if `a_i == a_x`, I copy the loaded value to another register; else, I ignore it. In summary, the bit mask would need to depend on the value of a general-purpose register. — janw, Feb 02 '21 at 19:32
AVX2 instructions are also fine, but AVX512 is not available in my case (if there is a clean solution for AVX512, this would still be interesting). — janw, Feb 02 '21 at 19:33
@janw you can broadcast the block address into all elements of a YMM register using `vpbroadcastq`. Also broadcast the address of the desired block. Then, compare with `vpcmpeqq` to get a -1 where the address matches and a 0 where it doesn't. — fuz, Feb 02 '21 at 19:39
@fuz Oh, this looks like a good approach, and a nice workaround. I will try this, thank you! (and I finally got to ask my first XY problem, as it looks... ;) ) — janw, Feb 02 '21 at 19:47
@fuz I implemented the proposed solution, and it works well. Do you mind turning this into an answer? I think the approach is still generic enough to fit the question as it is (although it supports fewer conditions than `cmov`). Else I will self answer :) — janw, Feb 03 '21 at 19:43
@janw Feel free to self-answer. I'm currently busy with other stuff. — fuz, Feb 03 '21 at 20:35

janw · Answer 1 · 2021-05-02T20:47:30.640

While there is no vectorized version of cmov, one can achieve equivalent functionality using a bit mask and blending.

Assume we have two 256-bit vectors value1 and value2, which reside in the corresponding vector registers ymm1 and ymm2:

align 32
value1: dq 1.0, 2.0, 3.0, 4.0
value2: dq 5.0, 6.0, 7.0, 8.0

; Operands for our conditional move
vmovdqa ymm1, [rel value1]
vmovdqa ymm2, [rel value2]

We want to compare two registers rcx and rdx:

; Values to compare
mov rcx, 1
mov rdx, 2

If they are equal, we want to copy ymm2 into ymm1 (and thus select value2), else we want to keep ymm1 and thus value1.

Equivalent (invalid) notation using cmov:

cmp rcx, rdx
cmove ymm1, ymm2  (invalid)

First, we load rcx and rdx into vector registers and broadcast them, so they are copied to all 64-bit chunks of the respective register (. denotes concatenation). The broadcasts go into the scratch registers ymm3 and ymm4, so that the operands in ymm1 and ymm2 are not overwritten:

vmovq xmm0, rcx          ; xmm0 <- 0 . rcx
vpbroadcastq ymm3, xmm0  ; ymm3 <- rcx . rcx . rcx . rcx
vmovq xmm0, rdx          ; xmm0 <- 0 . rdx
vpbroadcastq ymm4, xmm0  ; ymm4 <- rdx . rdx . rdx . rdx

Then, we generate a mask using vpcmpeqq:

; If rcx == rdx:  ymm0 <- ffffffffffffffff.ffffffffffffffff.ffffffffffffffff.ffffffffffffffff
; If rcx != rdx:  ymm0 <- 0000000000000000.0000000000000000.0000000000000000.0000000000000000
vpcmpeqq ymm0, ymm3, ymm4

Finally, we blend ymm2 into ymm1, using the mask in ymm0:

; If rcx == rdx: ymm1 <- ymm2
; If rcx != rdx: ymm1 <- ymm1
vpblendvb ymm1, ymm1, ymm2, ymm0
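
As a cross-check of the semantics, here is a minimal sketch in portable C that models the compare and blend steps lane by lane. It is a scalar model, not AVX code: a `uint64_t[4]` array stands in for a YMM register, and the names `eq_mask` and `blend_if_equal` are made up for illustration.

```c
#include <stdint.h>

/* Models vpcmpeqq on two broadcast inputs: every 64-bit lane gets the same
   result, all-ones if a == b and zero otherwise. */
uint64_t eq_mask(uint64_t a, uint64_t b)
{
    return (uint64_t)0 - (uint64_t)(a == b);
}

/* Models the blend: take src where the mask is set, keep dst where it is
   clear. vpblendvb selects per byte on the top bit of each mask byte; with
   an all-ones or all-zero mask that is the same as this bitwise select. */
void blend_if_equal(uint64_t dst[4], const uint64_t src[4],
                    uint64_t a, uint64_t b)
{
    uint64_t m = eq_mask(a, b);
    for (int i = 0; i < 4; i++)
        dst[i] = (src[i] & m) | (dst[i] & ~m);
}
```

With a = 1 and b = 2 (the values loaded into rcx and rdx earlier), `dst` is left unchanged; with equal inputs it becomes a copy of `src`.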

Thanks to @fuz, who outlined this approach in the comments!

You can broadcast *after* comparing, and combine the integer values into 0 / non-0 before copying to an XMM reg. `xor rcx, rdx` / `vmovq xmm0, rcx` / `vpxor xmm3, xmm3, xmm3` (hoistable) / `vpcmpeqq xmm0, xmm0, xmm3` (-1 if RCX==RDX, else 0) / `vpbroadcastq ymm0, xmm0` / blend two other vectors using ymm0. As well as saving 1 total instruction, it makes some of them cheaper (less competition for the same execution port, e.g. scalar xor isn't SIMD at all) and off the critical path (xor-zeroing). And in a loop, you can prepare a zeroed vector outside the loop. — Peter Cordes, May 03 '21 at 05:13

score 2 · Accepted Answer · answered May 03 '21 at 05:46

Given this branchy code (which will be efficient if the condition predicts well):

    cmp rcx, rdx
    jne  .nocopy
     vmovdqa  ymm1, ymm2       ;; copy if RCX==RDX
.nocopy:

We can do it branchlessly by creating a 0 / -1 vector based on the compare condition, and blending on it. Some optimizations vs. the other answer:

- Broadcast after the XMM compare, so you don't need to broadcast both inputs. This saves an instruction, and makes the compare only XMM (saves a uop on Zen1).
- Reduce the integer inputs to one integer if you can do it cheaply, so you only need to copy one thing from integer to XMM regs. Scalar xor can run on any of the integer ALU ports, while vmovd/q xmm, reg can only run on a single execution port on Intel: port 5, the same one needed by vector shuffles like vpbroadcastq ymm, xmm.

As well as saving 1 total instruction, it makes some of them cheaper (less competition for the same execution port, e.g. scalar xor isn't SIMD at all) and off the critical path (xor-zeroing). And in a loop, you can prepare a zeroed vector outside the loop.

;; inputs: RCX, RDX.  YMM1, YMM2
;; output: YMM0

   xor      rcx, rdx        ; 0 or non-0.
   vmovq    xmm0, rcx
         vpxor xmm3, xmm3, xmm3   ; can be done any time, e.g. outside a loop
   vpcmpeqq xmm0, xmm0, xmm3      ; 0 if RCX!=RDX,  -1 if RCX==RDX

   vpbroadcastq ymm0, xmm0
   vpblendvb    ymm0, ymm1, ymm2, ymm0   ; ymm0 = (rcx==rdx) ? ymm2 : ymm1

Destroying the old RCX means you might need a mov, but this is still worth it.
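
Applied to the use case from the question's comments (read every 32-byte block, keep only the wanted one), the xor / compare-with-zero / broadcast / blend sequence could be modelled roughly as follows. This is a sketch in portable C under some assumptions: it compares block indexes rather than addresses, treats each block as four `uint64_t` lanes in a flat array, and the name `oblivious_load` is hypothetical. A compiler is free to turn such code into branches, so real constant-time code still needs its generated assembly inspected.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies block number `wanted` (4 x 64 bits) from `blocks` into `out` while
   reading every block, so the sequence of memory accesses does not depend
   on `wanted`. If `wanted` is out of range, `out` stays all-zero. */
void oblivious_load(uint64_t out[4], const uint64_t *blocks,
                    size_t n, size_t wanted)
{
    for (int j = 0; j < 4; j++)
        out[j] = 0;

    for (size_t i = 0; i < n; i++) {
        /* xor reduces the two inputs to one value: 0 iff i == wanted */
        uint64_t diff = (uint64_t)i ^ (uint64_t)wanted;
        /* compare against zero: all-ones if equal, else 0 */
        uint64_t m = (uint64_t)0 - (uint64_t)(diff == 0);
        /* the same mask is applied to all four lanes (broadcast + blend) */
        for (int j = 0; j < 4; j++)
            out[j] = (blocks[4 * i + j] & m) | (out[j] & ~m);
    }
}
```

For three blocks holding 1..4, 5..8 and 9..12, calling `oblivious_load(out, blocks, 3, 1)` leaves 5, 6, 7, 8 in `out`.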

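A condition like rcx > rdx (unsigned) could be done with cmp rdx, rcx / sbb rax,rax to materialize a 0 / -1 integer (which you can broadcast without needing vpcmpeqq). For rcx >= rdx, use cmp rcx, rdx / sbb rax,rax instead, which gives -1 when rcx < rdx, and swap the two blend sources.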
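
To make the direction of that compare explicit, here is a small C model of the `cmp rdx, rcx` / `sbb rax, rax` pair (the function name `borrow_mask` is made up for this sketch). CF after the cmp is the borrow out of rdx - rcx, which is set when rdx is below rcx as unsigned numbers; sbb of a register with itself then yields 0 - CF.

```c
#include <stdint.h>

/* Models `cmp rdx, rcx` / `sbb rax, rax`.
   CF = (rdx < rcx) unsigned; the result is 0 when CF = 0 and all-ones
   (-1) when CF = 1. */
uint64_t borrow_mask(uint64_t rcx, uint64_t rdx)
{
    uint64_t cf = (uint64_t)(rdx < rcx);
    return (uint64_t)0 - cf;
}
```

The mask is all-ones only when rcx is strictly above rdx; equal inputs produce zero.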

A signed-greater-than condition is more of a pain; you might end up wanting 2x vmovq for vpcmpgtq, instead of cmp/setg/vmovd / vpbroadcastb. Especially if you don't have a convenient register to setg into to avoid a possible false dependency. setg al / read EAX isn't a problem for partial register stalls: CPUs new enough to have AVX2 don't rename AL separately from the rest of RAX. (Only Intel ever did that, and doesn't in Haswell.) So anyway, you could just setcc into the low byte of one of your cmp inputs.

Note that vblendvps and vblendvpd only care about the high bit (the sign bit) of each dword or qword element. If you have two correctly sign-extended integers, and subtracting them won't overflow, the sign bit of c - d is set exactly when c < d, so it is directly usable as your blend control: just broadcast it. FP blends between integer SIMD instructions like vpaddd have an extra 1 cycle of bypass latency on input and output, on Intel CPUs with AVX2 (and maybe similar on AMD), but the instruction you save will also have latency.

With unsigned 32-bit numbers, you're likely to have them already zero-extended to 64-bit in integer regs. In that case, sub rcx, rdx could set the MSB of RCX identically to how cmp ecx, edx would set CF. (And remember that the FLAGS condition for jb / cmovb is CF == 1)

;; unsigned 32-bit compare, with inputs already zero-extended
   sub   rcx, rdx               ; sets MSB = (ecx < edx)
   vmovq xmm0, rcx
   vpbroadcastq   ymm0, xmm0

   vblendvpd      ymm0, ymm1, ymm2, ymm0   ; ymm0 = ecx<edx ? ymm2 : ymm1

But if your inputs are already 64-bit, and you don't know that their range is limited, you'd need a 65-bit result to fully capture a 64-bit subtraction result.

That's why the condition for jl is SF != OF, not just a-b < 0 because a-b is done with truncating math. And the condition for jb is CF == 1 (instead of the MSB).
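
A quick way to convince yourself of the range limitation is to look at the MSB of the truncated difference directly. The helper below is a hypothetical C sketch (`msb_of_sub` is not from any library) that returns the bit vblendvpd would look at after `sub rcx, rdx`.

```c
#include <stdint.h>

/* MSB of the truncated 64-bit difference a - b, i.e. the top bit that
   `sub rcx, rdx` leaves in RCX. */
int msb_of_sub(uint64_t a, uint64_t b)
{
    return (int)((a - b) >> 63);
}
```

For zero-extended 32-bit inputs this bit equals the unsigned a < b: `msb_of_sub(3, 5)` is 1 and `msb_of_sub(5, 3)` is 0. With full-range 64-bit inputs it can disagree: `msb_of_sub(UINT64_MAX, 0)` is 1 although UINT64_MAX is not below 0, and `msb_of_sub(0, 0x8000000000000001)` is 0 although 0 is below the second operand.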

Well, those are some nice optimizations, and interesting insights in addition! Honestly, I don't know how I missed the fact that I could do the integer comparison directly in the 64-bit registers, and save a broadcast instruction on top. — janw, May 03 '21 at 07:13
@janw: `xor` to get 0/non-0 is not something you usually need on x86 since the ISA has FLAGS; it comes up as part of `(int)(x!=y)` on MIPS / RISC-V, though. ([How to do less than or equal in Assembly Language(MIPS)?](https://stackoverflow.com/a/65974123)). Some good optimizations are obvious in hindsight, and I've certainly overlooked my own share of those in other cases. :P — Peter Cordes, May 03 '21 at 07:27
