4

Let's say you want to find the first occurrence of a value¹ in a sorted array. For small arrays (where things like binary search don't pay off), you can achieve this by simply counting the number of values less than that value: the result is the index you are after.
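
For concreteness, here is a minimal C sketch of that counting idea (the function name is just for illustration):

#include <stddef.h>

// The index of the first element >= key in a sorted array equals the
// number of elements that compare less than key.
size_t lower_bound_count(const int *arr, size_t n, int key)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (arr[i] < key);   // no branch on the comparison result
    return count;
}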

In x86 you can use adc (add with carry) for an efficient branch-free² implementation of that approach (with the start pointer in rdi, the length in rsi, and the value to search for in edx):

  xor eax, eax
  lea rdi, [rdi + rsi*4]  ; pointer to end of array = base + length
  neg rsi                 ; we loop from -length to zero

loop:
  cmp [rdi + 4 * rsi], edx
  adc rax, 0              ; only a single uop on Sandybridge-family even before BDW
  inc rsi
  jnz loop

The answer ends up in rax. If you unroll that (or if you have a fixed, known input size), only the cmp; adc pair of instructions gets repeated, so the overhead approaches 2 simple instructions per comparison (plus the load, which sometimes micro-fuses with the cmp).

However, this only works for unsigned comparisons, where the carry flag holds the result of the comparison. Is there any equivalently efficient sequence for counting signed comparisons? Unfortunately, there doesn't seem to be an "add 1 if less than" instruction: adc, sbb and the carry flag are special in that respect.

I am interested in the general case where the elements have no specific order, and also in the case where the array is sorted, when the sortedness assumption leads to a simpler or faster implementation.


¹ Or, if the value doesn't exist, the first greater value. I.e., this is the so-called "lower bound" search.

² Branch-free approaches necessarily do the same amount of work each time (in this case, examining the entire array), so this approach only makes sense when the arrays are small and the cost of a branch misprediction is large relative to the total search time.

BeeOnRope
  • `cmp; jge` is slower for you than going through entire array? – zch Jan 20 '19 at 23:26
  • @zch - yes, in the cases I am interested in the arrays are small and the element position unpredictable so a single mispredict ends up slower than branch-free approaches. – BeeOnRope Jan 20 '19 at 23:41
  • I think you can do this with a carry flag if you store numbers in your array with sign bit inverted. – zch Jan 21 '19 at 00:04
  • @zch - yes, something like that crossed my mind. I can adjust the needle value, but unfortunately I cannot change the format of the numbers in the array. – BeeOnRope Jan 21 '19 at 00:55
  • "I'm only interested in non-SIMD solutions here." → Why? – Veedrac Jan 21 '19 at 02:12
  • @Veedrac - at a high level, just to limit the scope of the question and to keep the "signed" vs "unsigned" part relevant. I think the ideal SIMD solution is straightforward: `cmpgt` (possibly with adjustment to account for unsigned values if you need unsigned), `pmovmsk` and `popcnt` or something like that, with the usual tricks of one overlapping unaligned load and masking out the overlapped results somehow to allow for arrays which are not a multiple of the vector size. I might ask another question about that if it interests you. – BeeOnRope Jan 21 '19 at 02:18
  • I'll note that you seem to have misconceptions about binary search; I strongly recommend reading https://www.pvk.ca/Blog/2012/07/03/binary-search-star-eliminates-star-branch-mispredictions/. "Linear search should only be used when it’s expected to bail out quickly. The situations in which that property is most useful involve uncached, medium-size vectors." – Veedrac Jan 21 '19 at 02:22
  • @Veedrac - I'm well aware of that post, it is excellent! I don't think it contradicts anything I've said. Paul is actually considering "branchy" linear search versus both branchy and branchless binary search and reaches the conclusion you quoted. There is also the branchless linear search I'm talking about here. It wins for very small arrays with predictable sizes (branchless binary search loses because of the data-dependency between the accesses, while branchy search loses because even one mispredict is longer than the other searches take in total). – BeeOnRope Jan 21 '19 at 02:26
  • You haven't ever said what 'very small' means. 8 elements? 32? – Veedrac Jan 21 '19 at 02:37
  • @Veedrac in the range of 0 to 32. – BeeOnRope Jan 21 '19 at 02:45
  • Even a `cmov` version would be slower I suppose? – Margaret Bloom Jan 21 '19 at 10:10
  • @BeeOnRope Then doing a binary search over 8-elem chunks and then linear search within that should be a significant bit faster, if I understand Paul Khuong's measurements right. – Veedrac Jan 21 '19 at 14:54
  • @MargaretBloom - do you mean a `cmov` version of this linear search, or of a binary search? – BeeOnRope Jan 23 '19 at 01:24

4 Answers

4

PCMPGT + PADDD or PSUBD is probably a really good idea for most CPUs, even for small sizes, maybe with a simple scalar cleanup. Or even just purely scalar, using movd loads, see below.

For scalar integer, avoiding XMM regs, use SETCC to create a 0/1 integer from any flag condition you want. xor-zero a tmp register (potentially outside the loop) and SETCC into the low 8 of that, if you want to use 32 or 64-bit ADD instructions instead of only 8-bit.

cmp/adc reg,0 is basically a peephole optimization for the below / carry-set condition. AFAIK, there is nothing as efficient for signed-compare conditions. At best 3 uops for cmp/setcc/add, vs. 2 for cmp/adc. So unrolling to hide loop overhead is even more important.

See the bottom section of What is the best way to set a register to zero in x86 assembly: xor, mov or and? for more details about how to zero-extend SETCC r/m8 efficiently but without causing partial-register stalls. And see Why doesn't GCC use partial registers? for a reminder of partial-register behaviour across uarches.


Yes, CF is special for a lot of things. It's the only condition flag that has set/clear/complement (stc/clc/cmc) instructions¹. There's a reason that bt/bts/etc. instructions set CF, and that shift instructions shift into it. And yes, ADC/SBB can add/sub it directly into another register, unlike any other flag.

OF can be read similarly with ADOX (Intel since Broadwell, AMD since Ryzen), but that still doesn't help us because it's strictly OF, not the SF!=OF signed-less-than condition.

This is typical for most ISAs, not just x86. (AVR and some others can set/clear any condition flag because they have an instruction that takes an immediate bit-position in the status register. But they still only have ADC/SBB for directly adding the carry flag to an integer register.)

ARM 32-bit can do a predicated addlt r0, r0, #1 using any condition-code, including signed less-than, instead of an add-with-carry with immediate 0. ARM does have ADC-immediate which you could use for the C flag here, but not in Thumb mode (where it would be useful to avoid an IT instruction to predicate an ADD), so you'd need a zeroed register.

AArch64 can do a few predicated things, including increment with cinc with arbitrary condition predicates.

But x86 can't. We only have cmovcc and setcc to turn conditions other than CF==1 into integers. (Or with ADOX, for OF==1.)

Footnote 1: Some status flags in EFLAGS like interrupts IF (sti/cli), direction DF (std/cld), and alignment-check (stac/clac) have set/clear instructions, but not the condition flags ZF/SF/OF/PF or the BCD-carry AF.


cmp [rdi + 4 * rsi], edx will un-laminate even on Haswell/Skylake because of the indexed addressing mode: it doesn't have a read/write destination register (so it's not like add reg, [mem]).

If tuning only for Sandybridge-family, you might as well just increment a pointer and decrement the size counter, although keeping the indexed addressing mode does save back-end (unfused-domain) uops, which can matter for RS-size effects.

In practice you'd want to unroll with a pointer increment.

You mentioned sizes from 0 to 32, so we need to skip the loop if RSI = 0. The code in your question is just a do{}while which doesn't do that. NEG sets flags according to the result, so we can JZ on that. You'd hope that it could macro-fuse because NEG is exactly like SUB from 0, but according to Agner Fog's tables NEG can't macro-fuse with a JCC. So that costs us another uop in the startup if you really do need to handle size=0.


Using integer registers

The standard way to implement integer += (a < b) or any other flag condition is what compilers do (Godbolt):

xor    edx,edx            ; can be hoisted out of a short-running loop, but compilers never do that
                          ; (an interrupt-handler saving/restoring RDX would destroy the known-zero status)
cmp/test/whatever         ; flag-setting code here
setcc  dl                 ; zero-extended to a full register because of earlier xor-zeroing
add    eax, edx

Sometimes compilers (especially gcc) will use setcc dl / movzx edx,dl, which puts the MOVZX on the critical path. This is bad for latency, and mov-elimination doesn't work on Intel CPUs when they use (part of) the same register for both operands.

For small arrays, if you don't mind having only an 8-bit counter, you could just use 8-bit add so you don't have to worry about zero-extension inside the loop.

; slower than cmp/adc: 5 uops per iteration so you'll definitely want to unroll.

; requires size<256 or the count will wrap
; use the add eax,edx version if you need to support larger size

count_signed_lt:          ; (int *arr, size_t size, int key)
  xor    eax, eax

  lea    rdi, [rdi + rsi*4]
  neg    rsi              ; we loop from -length to zero
  jz    .return           ; if(-size == 0) return 0;

       ; xor    ecx, ecx        ; tmp destination for SETCC
.loop:
  cmp    [rdi + 4 * rsi], edx
  setl   cl               ; false dependency on old RCX on CPUs other than P6-family
  add    al, cl
       ; add    eax, ecx        ; boolean condition zero-extended into RCX if it was xor-zeroed

  inc    rsi
  jnz    .loop

.return:
  ret

Alternatively using CMOV, making the loop-carried dep chain 2 cycles long (or 3 cycles on Intel before Broadwell, where CMOV is 2 uops):

  ;; 3 uops without any partial-register shenanigans (or 4 because of un-lamination)
  ;;  but creates a 2 cycle loop-carried dep chain
  cmp    [rdi + 4 * rsi], edx
  lea    ecx, [rax + 1]        ; tmp = count+1
  cmovl  eax, ecx              ; count = arr[i]<key ? count+1 : count

So at best (with loop unrolling and a pointer-increment allowing cmp to micro-fuse) this takes 3 uops per element instead of 2.

SETCC is a single uop, so the setl loop above is 5 fused-domain uops per iteration. That's much worse than the adc loop on Sandybridge/IvyBridge, and it still runs at worse than 1 element per clock on later SnB-family. (Some ancient CPUs had slow setcc, like Pentium 4, but it's efficient on everything we still care about.)

When unrolling, if you want this to run faster than 1 cmp per clock, you have two choices: use separate registers for each setcc destination, creating multiple dep chains for the false dependencies, or use one xor ecx,ecx inside the loop to break the loop-carried false dependency into multiple short dep chains that only couple the setcc results of nearby loads (probably coming from the same cache line). You'll also need multiple accumulators because add latency is 1c.

Obviously you'll need to use a pointer-increment so cmp [rdi], edx can micro-fuse with a non-indexed addressing mode, otherwise the cmp/setcc/add is 4 uops total, and that's the pipeline width on Intel CPUs.

There's no partial-register stall from the caller reading EAX after writing AL, even on P6-family, because we xor-zeroed it first. Sandybridge won't rename it separately from RAX because add al,cl is a read-modify-write, and IvB and later never rename AL separately from RAX (only AH/BH/CH/DH). CPUs other than P6 / SnB-family don't do partial-register renaming at all, only partial flags.

The same applies for the version that reads ECX inside the loop. But an interrupt-handler saving/restoring RCX with push/pop would destroy its xor-zeroed status, leading to partial-register stalls every iteration on P6-family. This is catastrophically bad, so that's one reason compilers never hoist the xor-zeroing. They usually don't know if a loop will be long-running or not, and won't take the risk. By hand, you'd probably want to unroll and xor-zero once per unrolled loop body, rather than once per cmp/setcc.
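
To make that concrete, here's a rough C-level sketch of the unrolled idea (my own illustration, not tuned asm): separate accumulators keep the add chains short, and the compiler takes care of the setcc temporaries and xor-zeroing.

#include <stddef.h>

// Sketch only: unroll by 4 with separate accumulators so each compare/add
// chain is independent; the tail loop handles sizes that aren't a multiple of 4.
size_t count_signed_lt_unrolled(const int *arr, size_t n, int key)
{
    size_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c0 += (arr[i + 0] < key);
        c1 += (arr[i + 1] < key);
        c2 += (arr[i + 2] < key);
        c3 += (arr[i + 3] < key);
    }
    for (; i < n; i++)          // tail
        c0 += (arr[i] < key);
    return (c0 + c1) + (c2 + c3);
}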


You can use SSE2 or MMX for scalar stuff

Both are baseline on x86-64. Since you're not gaining anything (on SnB-family) from folding the load into the cmp, you might as well use a scalar movd load into an XMM register. MMX has the advantage of smaller code-size, but requires EMMS when you're done. It also allows unaligned memory operands, so it's potentially interesting for simpler auto-vectorization.

Until AVX512, we only have comparison for greater-than available, so it would take an extra movdqa xmm,xmm instruction to do key > arr[i] without destroying key, instead of arr[i] > key. (This is what gcc and clang do when auto-vectorizing).

AVX would be nice, for vpcmpgtd xmm0, xmm1, [rdi] to do key > arr[i], like gcc and clang use with AVX. But that's a 128-bit load, and we want to keep it simple and scalar.

We can decrement key and use (arr[i]<key) = (arr[i] <= key-1) = !(arr[i] > key-1). We can count elements where the array is greater-than key-1, and subtract that from the size. So we can make do with just SSE2 without costing extra instructions.

If key was already the most-negative number (so key-1 would wrap), then no array elements can be less than it. This does introduce a branch before the loop if that case is actually possible.

 ; signed version of the function in your question
 ; using the low element of XMM vectors
count_signed_lt:          ; (int *arr, size_t size, int key)
                          ; actually only works for size < 2^32
  dec    edx                 ; key-1
  jo    .key_eq_int_min
  movd   xmm2, edx    ; not broadcast, we only use the low element

  movd   xmm1, esi    ; counter = size, decrement toward zero on elements >= key
      ;;  pxor   xmm1, xmm1   ; counter
      ;;  mov    eax, esi     ; save original size for a later SUB

  lea    rdi, [rdi + rsi*4]
  neg    rsi          ; we loop from -length to zero

.loop:
  movd     xmm0, [rdi + 4 * rsi]
  pcmpgtd  xmm0, xmm2    ; xmm0 = arr[i] gt key-1 = arr[i] >= key = not less-than
  paddd    xmm1, xmm0    ; counter += 0 or -1
    ;;  psubd    xmm1, xmm0    ; -0  or  -(-1)  to count upward

  inc      rsi
  jnz      .loop

  movd   eax, xmm1       ; size - count(elements > key-1)
  ret

.key_eq_int_min:
  xor    eax, eax       ; no array elements are less than the most-negative number
  ret

This should be the same speed as your loop on Intel SnB-family CPUs, plus a tiny bit of extra overhead outside. It's 4 fused-domain uops, so it can issue at 1 per clock. A movd load uses a regular load port, and there are at least 2 vector ALU ports that can run PCMPGTD and PADDD.

Oh, but on IvB/SnB the macro-fused inc/jnz requires port 5, while PCMPGTD / PADDD both only run on p1/p5, so port 5 throughput will be a bottleneck. On HSW and later the branch runs on port 6, so we're fine for back-end throughput.

It's worse on AMD CPUs where a memory-operand cmp can use an indexed addressing mode without a penalty. (And on Intel Silvermont, and Core 2 / Nehalem, where memory-source cmp can be a single uop with an indexed addressing mode.)

And on Bulldozer-family, a pair of integer cores share a SIMD unit, so sticking to integer registers could be an even bigger advantage. That's also why int<->XMM movd/movq has higher latency, again hurting this version.


Other tricks:

Clang for PowerPC64 (included in the Godbolt link) shows us a neat trick: zero or sign-extend to 64-bit, subtract, and then grab the MSB of the result as a 0/1 integer that you add to counter. PowerPC has excellent bitfield instructions, including rldicl. In this case, it's being used to rotate left by 1, and then zero all bits above that, i.e. extracting the MSB to the bottom of another register. (Note that PowerPC documentation numbers bits with MSB=0, LSB=63 or 31.)

If you don't disable auto-vectorization, it uses Altivec with a vcmpgtsw / vsubuwm loop, which I assume does what you'd expect from the names.

# PowerPC64 clang 9-trunk -O3 -fno-tree-vectorize -fno-unroll-loops -mcpu=power9
# signed int version

# I've added "r" to register names, leaving immediates alone, because clang doesn't have `-mregnames`

 ... setup
.LBB0_2:                  # do {
    lwzu   r5, 4(r6)         # zero-extending load and update the address register with the effective-address.  i.e. pre-increment
    extsw  r5, r5            # sign-extend word (to doubleword)
    sub    r5, r5, r4        # 64-bit subtract
    rldicl r5, r5, 1, 63    # rotate-left doubleword immediate then clear left
    add    r3, r3, r5        # retval += MSB of (int64_t)arr[i] - key
    bdnz .LBB0_2          #  } while(--loop_count);

I think clang could have avoided the extsw inside the loop if it had used an arithmetic (sign-extending) load. The only lwa that updates the address register (saving an increment) seems to be the indexed form lwaux RT, RA, RB, but if clang put 4 in another register it could use it. (There doesn't seem to be a lwau instruction.) Maybe lwaux is slow or maybe it's a missed optimization. I used -mcpu=power9 so even though that instruction is POWER-only, it should be available.

This trick could sort of help for x86, at least for a rolled-up loop. It takes 4 uops this way per compare, not counting loop overhead. Despite x86's pretty bad bitfield extract capabilities, all we actually need is a logical right-shift to isolate the MSB.

count_signed_lt:          ; (int *arr, size_t size, int key)
  xor     eax, eax
  movsxd  rdx, edx

  lea     rdi, [rdi + rsi*4]
  neg     rsi          ; we loop from -length to zero

.loop:
  movsxd   rcx, dword [rdi + 4 * rsi]   ; 1 uop, pure load
  sub      rcx, rdx                     ; (int64_t)arr[i] - key
  shr      rcx, 63                      ; extract MSB
  add      eax, ecx                     ; count += MSB of (int64_t)arr[i] - key

  inc      rsi
  jnz      .loop

  ret

This doesn't have any false dependencies, but neither does 4-uop xor-zero / cmp / setl / add. The only advantage here is that this is 4 uops even with an indexed addressing mode. Some AMD CPUs may run MOVSXD through an ALU as well as a load port, but Ryzen has the same latency for it as for regular loads.

If you have fewer than 64 iterations, you could do something like this if only throughput matters, not latency. (But you can probably still do better with setl)

.loop:
  movsxd   rcx, dword [rdi + 4 * rsi]   ; 1 uop, pure load
  sub      rcx, rdx                     ; (int64_t)arr[i] - key
  shld     rax, rcx, 1    ; 3 cycle latency

  inc rsi  / jnz .loop

  popcnt   rax, rax                     ; turn the bitmap of compare results into an integer

But the 3-cycle latency of shld makes this a showstopper for most uses, even though it's only a single uop on SnB-family. The rax->rax dependency is loop-carried.
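
For reference, a C rendering of that bitmap idea (a sketch only, assuming n <= 64; the helper name and the use of GCC/clang's __builtin_popcountll are my choices):

#include <stddef.h>
#include <stdint.h>

// Collect one compare result per bit: the MSB of the 64-bit difference is set
// exactly when arr[i] < key. Then popcount once at the end.
size_t count_lt_bitmap(const int *arr, size_t n, int key)
{
    uint64_t bits = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t msb = (uint64_t)((int64_t)arr[i] - key) >> 63;  // the sub/shift
        bits = (bits << 1) | msb;                                // the shld step
    }
    return (size_t)__builtin_popcountll(bits);                   // the popcnt step
}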

Peter Cordes
  • Excellent answer! I had assumed the base-case was 3 uops for `cmp; setcc; add` but as you point out there is complication around zeroing the `setcc` destination, interrupts and false dependencies. Ugh! The `sub + shld` solution is pretty interesting. You could I suppose use multiple accumulators to hide the `shld` latency. All of your other points are good. I hadn't considered the "only gt" issue in SSE when it comes to clobbering the dest. – BeeOnRope Jan 23 '19 at 02:50
  • @BeeOnRope: I think the best bet is just to vectorize without unrolling, with 128-bit vectors, with startup / cleanup designed for small sizes. (With AVX2 `vpmaskmovd` we can avoid loading past the end of the buffer without doing too much tricky handling for odd sizes and sizes less than 4). With AVX1 for `vpcmpgtd xmm0, xmm4, [rdi]`, it's pretty nice. Working on code for another answer. :P Scalar might still win for sizes of 3 or 4, but SIMD will probably win for size >= 8. If we can assume size<256, we can hsum with `psadbw` instead of unpack / paddd for one step. – Peter Cordes Jan 23 '19 at 03:06
  • `vpmaskmovd` is an interesting idea. Another idea is just to load in a way that is guaranteed not to cross a page. I have been considering the best way to handle these small (smaller than a vector) arrays: there seems to be a lot of solutions, including extra-aligned loads, various unaligned loads, masked loads and various conditional combinations of those (for example, always just load the pointer as passed, except if near the end of the page in which case do a adjusted load with the array ending up at the end of the vector). I wonder if there is one overall best strategy? – BeeOnRope Jan 23 '19 at 03:17
  • BTW, Agner shows `VPMASKMOVD/Q` having one more cycle of latency than `VMASKMOVPS/D`. Seems unlikely? – BeeOnRope Jan 23 '19 at 03:21
  • @BeeOnRope: I didn't try `vmaskmovps` vs. `vpmaskmovd`. But this hacky NASM version https://godbolt.org/z/3XlE-t of loop hand-vectorized with AVX1 doesn't slow down on SKL when I use a VPMASKMOVD + vpcmpgt in the cleanup instead of a plain 16-byte load, even for small element counts like 8. (I think my `_start` that calls it with `call` is suffering from some kind of weird effect, possibly alignment-related.) If it does actually need to do fault-suppression, it's about 22x slower (like in the source in that Godbolt link, where it starts from `arr+4` near the end of a page with `size=23`). – Peter Cordes Jan 23 '19 at 06:26
  • So fault-suppression from VPMASKMOVD is fine for correctness as long as buffers close to the start of unmapped pages are very rare in practice (on SKL). (Related: I was playing around with it for search/replace, to see if it avoided dirtying a page. It does, but a fully-masked store to a page that's COW mapped to the zero page needs very slow fault-suppression! It was actually faster than an always-store loop when the page had been dirtied first, but those performance pot-holes from slow fault-suppression, or page-dirtying suppression, are nasty.) – Peter Cordes Jan 23 '19 at 06:31
  • if I understood your last comment correctly, you mean that fully-masked store to the COW zero page needs a fault-suppression (I would expect this), but are you also saying it is *slower* than other fault-suppressions (e.g., unmasked write to unmapped page), or are you just pointing this out since unlike the occasional page-crossing suppression, which is rare and only happens once "at the end", in your search/replace scenario you might even write the entire zero page (masked out) so it could be really slow? – BeeOnRope Jan 23 '19 at 17:46
  • Does the avoidance of dirtying cost anything? That is, if you write to a RW page which is clean, is there some kind of assist taken to avoid the dirtying, or does it run at full speed? – BeeOnRope Jan 23 '19 at 17:46
  • @BeeOnRope: I don't think it's slower, I was just saying that *every* store in the whole loop will suffer from this if looping over a big array that's for any reason actually read-only in HW. Re: avoiding dirtying: I wrote a test program (https://godbolt.org/z/Wna0wD), but didn't finish an edit to an SO answer about what I found. I don't think I ever actually tested the dirty bit, just `mmap(MAP_ANONYMOUS)` and then either writing or not-writing it once before looping. Without writing it, it's the same as if I'd used PROT_READ. I think my earlier comment was mis-remembering what I'd tested. – Peter Cordes Jan 23 '19 at 18:43
3

There's a trick to convert a signed comparison to an unsigned comparison and vice versa by toggling the top bit:

bool signedLessThan(int a, int b)
{
    return ((unsigned)a ^ INT_MIN) < ((unsigned)b ^ INT_MIN); // or add 0x80000000U to both
}

It works because two's complement keeps values in the same linear order; toggling the sign bit just swaps the signed and unsigned halves of the range. So the simplest way may be XORing both values before the comparison:

  xor eax, eax
  xor edx, 0x80000000     ; adjusting the search value
  lea rdi, [rdi + rsi*4]  ; pointer to end of array = base + length
  neg rsi                 ; we loop from -length to zero

loop:
  mov ecx, [rdi + 4 * rsi]
  xor ecx, 0x80000000
  cmp ecx, edx
  adc rax, 0              ; only a single uop on Sandybridge-family even before BDW
  inc rsi
  jnz loop

If you can modify the array, then just do the conversion once before checking.
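
For example, a hedged sketch of that one-time pre-conversion (treating the elements as raw 32-bit words; it assumes in-place modification really is allowed):

#include <stddef.h>
#include <stdint.h>

// Flip the sign bit of every element once up front; after this (and the same
// adjustment to the needle), the plain unsigned cmp/adc loop gives the signed answer.
void flip_sign_bits(uint32_t *arr, size_t n)
{
    for (size_t i = 0; i < n; i++)
        arr[i] ^= 0x80000000u;
}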


In ADX there's ADOX, which uses OF as the carry input. Unfortunately a signed comparison needs SF as well as OF (the signed-less-than condition is SF != OF), so you can't just use it like this

  xor ecx, ecx
loop:
  cmp [rdi + 4 * rsi], edx
  adox rax, rcx            ; rcx=0; ADOX is not available with an immediate operand

and you'd have to do some more bit manipulation to correct the result.

phuclv
2

In the case where the array is guaranteed to be sorted, one could use cmovl with an "immediate" value representing the correct value to add. There are no immediates for cmovl, so you'll have to load them into registers beforehand.

This technique makes sense when unrolled, for example:

; load constants
  mov r11, 1
  mov r12, 2
  mov r13, 3
  mov r14, 4

loop:
  xor ecx, ecx
  cmp [rdi +  0], edx
  cmovl rcx, r11
  cmp [rdi +  4], edx
  cmovl rcx, r12
  cmp [rdi +  8], edx
  cmovl rcx, r13
  cmp [rdi + 12], edx
  cmovl rcx, r14
  add rax, rcx
  ; update rdi, test loop condition, etc
  jcc loop

You have 2 uops per comparison, plus overhead. There is a 4-cycle (BDW and later) dependency chain between the cmovl instructions, but it is not carried.

One disadvantage is that you have to set up the 1,2,3,4 constants outside of the loop. It also doesn't work as well if not unrolled (you need to amortize the add rax, rcx accumulation).
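
In C terms, each unrolled block boils down to something like the sketch below (my illustration, assuming a sorted array and a size that's a multiple of 4); compilers can express the ifs as cmov:

#include <stddef.h>

// Within each block of a sorted array, if arr[i+k] < key then every earlier
// element in the block is also < key, so the last true comparison gives the
// number of matching elements in the block.
size_t count_lt_sorted_blocks(const int *arr, size_t n, int key)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i += 4) {
        size_t c = 0;
        if (arr[i + 0] < key) c = 1;
        if (arr[i + 1] < key) c = 2;
        if (arr[i + 2] < key) c = 3;
        if (arr[i + 3] < key) c = 4;
        count += c;
    }
    return count;
}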

BeeOnRope
  • If the first 3 are false but the last is true, you'll add 4, not 1. Am I missing something? Or is this still assuming a sorted array, like you mentioned as one of the use-cases in the question? That requires at least a comment in this answer, IMO. I suspect if you're going to introduce the tiny-case-overhead of unrolling by 4, it's going to be worth using SSE2 `movdqu` + `pcmpgtd`, unless you're tuning for CPUs with slow unaligned loads. – Peter Cordes Jan 23 '19 at 06:38
  • @PeterCordes - yes, you are right, this assumes a sorted array. That was my original motivating use case, but then I made the title a bit more general and used the sorted search as my example in the text, so I agree it wasn't clear whether you could assume sorted arrays or not! I'll clarify it. – BeeOnRope Jan 23 '19 at 17:35
1

Assuming the array is sorted, you could make separate code branches for positive and negative needles. You will need a branch instruction at the very start, but after that, you can use the same branch-free implementation you would use for unsigned numbers. I hope that is acceptable.

needle >= 0:

  • go through the array in ascending order
  • begin by counting every negative array element
  • proceed with the positive numbers just like you would in the unsigned scenario

needle < 0:

  • go through the array in descending order
  • begin by skipping every positive array element
  • proceed with the negative numbers just like you would in the unsigned scenario

Unfortunately, with this approach you cannot unroll your loops. An alternative would be to go through the array twice: once with the needle, and again to find the number of positive or negative elements (using a 'needle' that matches the minimum signed integer); a C sketch of this combined counting follows the list below.

  • (unsigned) count the elements < needle
  • (unsigned) count the elements >= 0x80000000
  • add the results
  • if needle < 0, subtract the array length from the result
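
Here is a hedged C sketch of that combined counting (the function name is mine; it folds both passes into one loop with two counters, just like the assembly further down):

#include <stddef.h>
#include <stdint.h>

// count(e <u needle) + count(e has sign bit set) equals the signed count of
// elements below needle, except it overcounts by exactly n when the needle
// itself is negative, so subtract n in that case.
size_t count_signed_lt_twopass(const uint32_t *arr, size_t n, uint32_t needle)
{
    size_t below = 0, negative = 0;
    for (size_t i = 0; i < n; i++) {
        below    += (arr[i] < needle);        // unsigned compare, adc-friendly
        negative += (arr[i] >= 0x80000000u);  // element is negative as a signed int
    }
    size_t result = below + negative;
    if (needle >= 0x80000000u)                // needle is negative as a signed int
        result -= n;
    return result;
}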

There's probably a lot that could be optimized in my code below. I'm quite rusty at this.

                          ; NOTE: no need to initialize eax here!
  lea rdi, [rdi + rsi*4]  ; pointer to end of array = base + length
  neg rsi                 ; we loop from -length to zero

  mov ebx, 80000000h      ; minimum signed integer (need this in the loop too)
  bt  edx, 31             ; CF = sign bit: set if needle negative
  sbb rax, rax            ; -1 if needle negative, otherwise zero
  and rax, rsi            ; -length if needle negative, otherwise zero

loop:
  cmp [rdi + 4 * rsi], edx
  adc rax, 0              ; +1 if element < needle
  cmp [rdi + 4 * rsi], ebx
  cmc
  adc rax, 0              ; +1 if element >= 0x80000000
  inc rsi
  jnz loop
Ruud Helderman
  • This requires branching to switch from the `arr[i] < 0` loop to the `arr[i] < needle` loop (e.g. for positive `needle`). If the array contents stay constant (and you're using on the same array repeatedly, not different arrays of the same size), then that's an option. But probably worse than the `xor eax,eax` / `cmp` / `setl al` to create an integer from a signed-compare result like a compiler will make. 2 extra instructions vs. the unsigned loop, or 1 if you hoist the xor-zero for the tmp out of the loop. If not for defeating loop-unrolling, could be ok. (I'm writing an answer with this.) – Peter Cordes Jan 21 '19 at 19:24
  • @PeterCordes I added an alternate approach without the 'switch', but yeah, I fear it still may not meet OP's performance requirements. – Ruud Helderman Jan 21 '19 at 20:52
  • That's creative, but a lot worse than the normal loop using `setcc` like compilers make. e.g. gcc makes a 6-uop loop for the obvious C implementation of `count += (arr[i] < key);`. Your loop is 8 uops on Intel Sandybridge-family, or 7 on AMD. (And 9 uops on Intel P6-family where `adc-0` isn't special-cased as 1 uop instead of 2). IDK if you could avoid the CMC by using `sbb` or a CMOV, but loading twice isn't great. And neither is having a 2-cycle loop-carried dependency through RAX. Unrolling can't hide that. – Peter Cordes Jan 21 '19 at 21:15
  • And BTW, your setup outside the loop would be slightly more efficient as `xor eax,eax` / `test edx,edx` / `cmovs eax, esi`. You could use `cmp` with an immediate `80000000h` inside the loop. (Or for an extra tmp reg, use ECX not EBX. EBX is call-preserved in most calling conventions.) – Peter Cordes Jan 21 '19 at 21:21
  • Oh yes, I think you could avoid the CMC by doing `cmp ebx, [rdi + 4 * rsi]`. [CMP r/m32](https://www.felixcloutier.com/x86/cmp) has 2 opcodes, letting you have the memory operand on the left or right. I think you'd need to adjust the `0x80000000` to `0x7fffffff` because you want carry for any element with its MSB set. Alternatively, you could `mov ecx, [rdi + 4 * rsi]` and `cmp ecx,edx / adc`, then `shr ecx,31` / `add eax,ecx` to extract the high bit with a shift instead of through CF. That's still 5 uops per element, worse than 3 or 4 in my answer, but it's as good as GCC's loop. – Peter Cordes Jan 22 '19 at 17:50
  • I might be missing something about the first solution (switching from one loop type to another), but if I were willing to accept a branch, why not just a `cmp; jge` with the `jge` terminating early with the correct result? I.e., the classic "branchy" linear search. That should be *one* macro-fused uop! The downside is the branch misprediction. I guess the interesting part of your approach is that the cutover point is not at the found element, but at the switch between negative and positive, and for some arrays that may be very predictable (e.g., those with no negative elements at all!). – BeeOnRope Jan 23 '19 at 02:36