
I am trying to find the leading bit of an unsigned 64-bit int. I am using BSR as my processor doesn't have the LZCNT instruction. My problem is that once the input is exactly 2^32, it returns 2^64 as the leading bit value, then cycles back through the same outputs up to 2^64.

This is my code:

unsigned long int LeadingBit(unsigned long int a) {
  if(a==0)
    return 0;
  unsigned long int nlb;
  asm (
       "BSR %1, %0   \n"
       : "=r" (nlb)
       : "mr" (a)
       : "cc"
       );
  return 1<<nlb;

}

The intention of this code is to take a 64-bit integer as input and return the value of the position of the leading 1, e.g. a = 65 (binary 1000001) returns 64 (binary 1000000).
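
For reference, a plain-C sketch of that intended behaviour (just a specification of the result, not the fast BSR version being asked about):

// Reference only: isolate the highest set bit by clearing low bits until one remains.
unsigned long int LeadingBit_reference(unsigned long int a) {
  unsigned long int r = 0;
  while (a) {
    r = a;          // remember the value before clearing another bit
    a &= a - 1;     // clear the lowest set bit
  }
  return r;         // e.g. 65 -> 64, 0 -> 0
}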

  • By the way `__builtin_clzll`, despite having `clz` in the name, does not require LZCNT – harold Aug 23 '19 at 21:54
  • You haven't said what your problem is. Does it fail to compile? Produce the wrong answer for certain values? – David Wohlferd Aug 23 '19 at 22:06
  • I think what you have wrong is that `return 1<<nlb;` does the shift as an `int`, which is undefined behaviour once `nlb` is 32 or more; you want `return 1ULL<<nlb;` – Michael Petch Aug 23 '19 at 22:22
  • Thank you for the edit. That does help clarify what you are trying to accomplish. However it still doesn't tell me what your problem is. For instance, knowing that it only works for certain values would be consistent with Michael's suggestion. A compile error might suggest something else. FYI: It may be possible to avoid doing a separate check for a==0. BSR sets the ZF if the operand is zero, and inline asm has a way of [outputting flags](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#FlagOutputOperands). You'd still do the `if`, but against a flag output variable. Tiny bit better. – David Wohlferd Aug 23 '19 at 22:38
  • @DavidWohlferd I've implemented Michael's suggestion and it still doesn't resolve the issue. My issue is still the fact that while I'm inputting up to 64bit unsigned ints, it ignores anything past 32 bits. – user11969451 Aug 23 '19 at 22:42
  • Are you printing the result out properly? How are you getting the value? If you fixed the thing I mentioned (undefined behavior because of integer promotion and undefined behaviour with the shift) then there is a problem with how you are getting the input or how you are outputting the value. Can you show us a complete example that fails. – Michael Petch Aug 23 '19 at 22:48
  • @MichaelPetch Incredible! Thank you so much! I figured it out based on your update. – user11969451 Aug 23 '19 at 23:04
  • You don't need inline asm for this. OTOH, if you're using asm to work around compiler missed optimizations you should xor-zero another output and use `bts` inside the same inline asm statement instead of letting the compiler do a bad job compiling `1ULL << bitpos` (usually with `mov $1, %reg` ; `shl %cl, %reg` which costs 3 uops for the shift vs. 1 uop for `bts` on Intel CPUs.) – Peter Cordes Aug 23 '19 at 23:41
  • 2^64 is 0 in a 64 bit register, is that really what it returns? – Alexis Wilke Aug 24 '19 at 00:08
  • @user11969451 : What would also help is if you added to your other code you had that made the program fail. I ask this because apparently it wasn't the only problem you had based on your comments. – Michael Petch Aug 25 '19 at 11:48

3 Answers


As Michael Petch pointed out, this line is the problem:

  return 1<<nlb;

Because the constant 1 has type int, the shift is computed as an int, not as an unsigned long, and shifting an int by 32 or more bits is undefined behaviour. Replace it with:

  return 1UL<<nlb;
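
For completeness, the question's function with only that one change applied (assuming `unsigned long` is 64 bits, as on the asker's x86-64 target):

unsigned long int LeadingBit(unsigned long int a) {
  if (a == 0)
    return 0;
  unsigned long int nlb;
  asm ("BSR %1, %0"
       : "=r" (nlb)
       : "mr" (a)
       : "cc");
  return 1UL << nlb;   // 64-bit shift, so bit positions 32..63 work too
}
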
Florian Weimer

While Michael/Florian's fix is probably the easiest, perhaps it's not the best.

Your existing code (modified with 1UL) compiles down to this:

    xor     eax, eax
    test    rdi, rdi
    je      .L1
    mov     eax, 1
    BSR rdi, rcx   

    sal     rax, cl
.L1:
    ret

Not bad, but instead of testing for zero, and then calling BSR (which also checks for zero), how about:

#include <stdbool.h>   // for bool in C

unsigned long int LeadingBit(unsigned long int a) {
  unsigned long int nlb;
  bool z;
  asm (
       "BSR %[a], %[nlb]"
       : [nlb] "=r" (nlb), "=@ccz" (z)
       : [a] "mr" (a)
       );

  unsigned long int one;
  if (!z)
    one = 1;
  else
    one = 0;

  return one << nlb;
}

Since BSR sets ZF when its source operand is zero, this code uses that flag output to set one to either 0 or 1, depending on whether a was zero. The asm for this is pretty clean (gcc 9.2 -O2):

BSR rdi, rcx
setne   al
movzx   eax, al
sal     rax, cl
ret

The "=@ccz" constraint is described in the docs under Flag Output Operands. Basically it's just saying "The value of this variable is taken from the Z (C)ondition (C)ode."


Edit: I wasn't surprised to see Peter offer observations, but I was surprised that he didn't post an alternative of his own.

Here's my attempt to translate his comments into code. This may be more complicated than OP would be interested in, but for the sake of completeness:

unsigned long int LeadingBit(unsigned long int a) {
  unsigned long int bit;
  unsigned long int res;
  asm (
       "BSR %[a], %[bit]  \n\t"  // sets both (bit) and ZF
       "BTS %[bit], %[res] \n\t" // sets the corresponding bit in (res) and
                                 // carry flag, but *doesn't* change ZF
       "CMOVZ %[zero], %[res]"   // reset res to 0 if BSR detected a==0
       : [bit] "=&r" (bit), [res] "=&r" (res)
       : [a] "mr" (a), [zero] "r" (0UL), "[res]" (0UL)
       : "cc"
       );

  return res;
}

While the BSR/BTS/CMOVZ is pretty straight-forward, that crap with the constraints might be something that maintainers of your code would struggle with.

So, to explain what's happening.

  • "=&r"(res) is going to hold the return value. The & is used to indicate that it can't share a register with any other parameter. To be clear, just because you declare a constraint as "=r", that doesn't mean you will get a unique register, unless you use &. This can be a good thing by reducing the number of registers used, but not so much in this case. If the compiler decided to use the same register for %[zero] and %[res], that would cause the CMOVZ to fail.
  • The "[res]"(0UL) says that on entry to the asm, whatever register is used for %[res] should be initialized to zero. Yes, we can do this in the asm, but by putting it in the C code, it allows the compiler to do this in whatever manner might be most efficient with respect to the surrounding code. Heck, it might already have a zeroed register laying around that it just used for something else.
  • BTS lets you directly set a bit number in a register. It's typically faster than using SAL (that's what Peter was talking about with "3 uops" for SAL vs 1 for BTS), but like BSR, gcc doesn't have an intrinsic for it. And while it does return the existing value of the specified bit in the carry flag (that's what the T part of BTS means), it doesn't modify the Zero flag. This allows us to still do the CMOVZ after the BTS. Updating "part" of the flags register this way can be inefficient on some processors, but not so much with the newer ones.

Here's the output:

    xorl    %edx, %edx
    movq    %rdx, %rax
    BSR %rdi, %rcx  
    BTS %rcx, %rax 
    CMOVZ %rdx, %rax
    ret

I'm a little dubious about that movq (why not xor or movl?), but I'm sure there's a good reason. Something to do with 'aliasing' I assume.

If perf were a sufficiently high priority (although OP never said it was), there's one more thing I can think to do. If LeadingBit might ever be called with a constant, the compiler can typically pre-calculate a lot of the math around it at compile time instead of when you run the program.

However, gcc can't precalculate thru inline asm. If calling LeadingBit with a constant is a possibility, you could wrap the code I show here with if (__builtin_constant_p(a)) to see if a is a constant, and use __builtin_clzll et al to calculate it the "slow" way. I say "slow," but calculating the value (even the slow way) at compile time is going to result in a faster executable. And since (by definition) the value of __builtin_constant_p is known at compile time, only one branch of the if will be generated for any invocation of LeadingBit.
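
A sketch of that wrapper (the names are illustrative; LeadingBit_asm stands for the inline-asm version above, and this assumes a 64-bit unsigned long):

static inline unsigned long int LeadingBit(unsigned long int a) {
  if (__builtin_constant_p(a))
    // Constant path: pure C, so the compiler can fold the whole call away.
    return a ? 1UL << (63 - __builtin_clzll(a)) : 0;
  // Non-constant path: the inline-asm version shown above (hypothetical name).
  return LeadingBit_asm(a);
}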

David Wohlferd
  • Using `sal` at all is pretty bad. 3 uops on Intel SnB-family CPUs. Better would be `bsr` ; `bts` into a zeroed register. And `cmovz dst, orig_input` to conditionally zero it (ZF still set from BSR because BTS only modifies CF). Partial flag stall on P6 family but cheap elsewhere, although `cmov` is 2 uops before Broadwell. – Peter Cordes Aug 24 '19 at 10:26
  • Your version would be ok if compiled with BMI2 for `shlx` (1 uop), though. Or on AMD like this, where `shr reg,cl` is 1 uop but `bts reg,reg` is 2. Except this ends up with a `movzx` unnecessarily in the critical path (instead of xor-zeroing a destination for `setne` in parallel with BSR). Even worse, it defeats mov-elimination by zero-extending from AL into EAX instead of picking a separate source like DL. (Intel can eliminate movzx r32, r8 as well as mov, unlike AMD) – Peter Cordes Aug 24 '19 at 10:29
  • So if you're using inline asm at all to take advantage of the BSR flag results, instead of using `31 - __builtin_clzll()` like a normal person, you might want to fix the compiler's missed-optimization bugs for current microarchitectures by doing more inside the asm. Unless the compiler can take advantage of knowing that your mask has a single bit set later, or if you branch on it being non-zero... Although you can probably still produce that as a flag result in ZF. – Peter Cordes Aug 24 '19 at 10:30
  • I posted my own version of this with an optimization you missed from my comments. If `a==0`, you can use *that* as your zero for cmov. Also a version that avoids cmov by using `btc` to flip a 1 back to a 0 in the `a==0` case, using AMD's documented behaviour of leaving the BSR destination unmodified. – Peter Cordes Aug 25 '19 at 03:47
  • Note that the Intel documentation says that `DEST` is _undefined_ when `SRC` is zero. It looks like it's more like `unmodified` which is an interpretation of _undefined_ (and AMD documents it that way). It looks like gcc/g++ also understand it as documented by AMD. – Alexis Wilke Nov 20 '19 at 00:23

TL:DR: you should only use inline asm if you're going to optimize the whole thing and beat the compiler. Otherwise use intrinsics like __builtin_clzll or GCC-only __builtin_ia32_bsrdi


Your actual problem was unrelated to inline asm.

1 has type int, and unlike most operators, << doesn't promote the left side to match the right side. So you're shifting an int by more than 31 bits, which is undefined behaviour. In practice you get a 32-bit operand-size shift, which you could have noticed by looking at the compiler's asm output. (Generally a good idea whenever you're using inline asm.)
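
A minimal illustration (the bit position 40 is just an example):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    unsigned nlb = 40;              // a bit position >= 32, like BSR returns for large inputs
    uint64_t broken = 1 << nlb;     // int shift: undefined behaviour; in practice a 32-bit
                                    // operand-size shift that masks the count: 1 << 8 = 256
    uint64_t fixed = 1ULL << nlb;   // 64-bit shift: 0x10000000000 as intended
    printf("%llx %llx\n", (unsigned long long)broken, (unsigned long long)fixed);
    return 0;
}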


You don't need inline asm for this; it's easy to express using GNU C builtins.

But if performance is critical, you might want to use inline asm to work around compiler missed-optimizations for current microarchitectures in the shift part, especially if you're compiling without BMI2 for efficient variable-count shifts on Intel CPUs.

Note that bsr is quite slow on AMD CPUs while lzcnt is fast on all CPUs that support it. On Intel, both are fast. But unlike for bsf/tzcnt, the two instructions produce different results even for non-zero inputs, so using rep bsr to get a fast lzcnt on newer CPUs wouldn't be useful the way it sometimes is for tzcnt. (tzcnt runs as bsf on older CPUs, and lzcnt runs as bsr on older CPUs, because the encoding for lzcnt is a REP prefix in front of bsr.)

Unfortunately bsr is quite slow on Atom/Silvermont, too. On Goldmont Plus, for example: 11 uops, 9-cycle latency, and one per 8-cycle throughput.


See my answer on Find most significant bit (left-most) that is set in a bit array for a rundown of how various compilers are dumb with 63-__builtin_clzll(x); which could optimize back to just a bsr, but doesn't.

GCC specifically (not clang) has an intrinsic __builtin_ia32_bsrdi for 64-bit bsr, and both support _bit_scan_reverse for 32-bit.


The obvious way to write this is this portable code that works on any target supported by GCC:

#include <stdint.h>
#include <limits.h>   // CHAR_BIT
#include <assert.h>   // static_assert in C11

uint64_t LeadingBitIsolate_gnuc(uint64_t a)
{
  static_assert( CHAR_BIT * sizeof(unsigned long) == 64, "__builtin_clzll isn't 64-bit operand size");

  uint64_t bit = 63-__builtin_clzll(a);      // BSR
  return a ? (1ULL << bit) : 0;              // ULL is guaranteed to be at least a 64-bit type
}

If a is ever a compile-time constant, constant-propagation through this function works. Using inline asm always defeats that unless you use something like if(__builtin_constant_p(a)) { non-asm version } else { asm version}. (https://gcc.gnu.org/wiki/DontUseInlineAsm)
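
For example, with the pure-builtin version a constant argument can fold away entirely (the value 65 is arbitrary):

uint64_t leading_of_65(void) {
    return LeadingBitIsolate_gnuc(65);   // gcc/clang can evaluate this at compile time to 64
}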

I used uint64_t for portability to x86-64 targets where unsigned long is a 32-bit type. (Linux x32 (an ILP32 ABI), and MS Windows x64). Also to non-x86 targets since this doesn't use any inline asm.

Unfortunately this compiles to pretty bad asm with gcc and clang (Godbolt)

# gcc9.2 -O3 -mtune=skylake
LeadingBitIsolate_gnuc(unsigned long):
        movq    %rdi, %rax
        testq   %rdi, %rdi
        je      .L4

        bsrq    %rdi, %rax
        movl    $63, %ecx
        xorq    $63, %rax      # bit = 63 - bit   in the low 6 bits, 2's complement bithack
        subl    %eax, %ecx     # bit = 63 - bit

        movl    $1, %eax       # 1 uop
        salq    %cl, %rax      # 3 uops
.L4:
        ret

Using BSR64 = __builtin_ia32_bsrdi from my answer on Find most significant bit (left-most) that is set in a bit array we get this from GCC (and similar using cmov from clang and ICC). ICC/MSVC provide intrinsics that return a bool for the ZF output from BSR. test/je has the same cost as je on modern x86, but saving the test instruction in front of cmov does have value.

// See the linked answer for the ICC/MSVC part of this
#ifdef __GNUC__
  #ifdef __clang__
    static inline unsigned BSR64(uint64_t x) {
        return 63-__builtin_clzll(x);
      // gcc/ICC can't optimize this back to just BSR, but clang can and doesn't provide alternate intrinsics
      // update: clang8.0 regressed here but still doesn't do __builtin_ia32_bsrdi
    }
  #else
    #define BSR64 __builtin_ia32_bsrdi
  #endif

    #include <x86intrin.h>
    #define BSR32(x) _bit_scan_reverse(x)
#endif

uint64_t LeadingBitIsolate_gnuc_v2(uint64_t a)
{
  uint64_t bit = BSR64(a);
  return a ? (1ULL << bit) : 0;
}

This compiles better with GCC, at least for the BSR part. But still not for the shift part.

# gcc -O3 -mtune=skylake
LeadingBitIsolate_gnuc_v2(unsigned long):
        movq    %rdi, %rax
        testq   %rdi, %rdi
        je      .L9
        bsrq    %rdi, %rcx
        movl    $1, %eax
        salq    %cl, %rax       # missed optimization: xor-zero + bts would be much better especially with tune=skylake
.L9:
        ret

On the plus side here, we've optimized out most of the BSR overhead by using a BSR intrinsic. And it can still compile to LZCNT on CPUs where that's a better choice, e.g. with clang -march=znver1 (Ryzen). GCC still uses BSR but clang uses 63-lzcnt which will run faster on AMD.

63-__builtin_clzll is portable to non-x86 but __builtin_ia32_bsrdi isn't. (DI = double-word integer = 64-bit).

But the compiler does still know what's happening, and can optimize this into surrounding code, and for compile-time-constant inputs.

E.g. it can branch over other things, too, if the input=0, only testing that once. And it knows that the input only has a single bit set, so in theory if you're using it as a divisor a compiler could be smart enough to AND with res-1 instead of using a slow div instruction. (Division by zero is UB so the compiler can assume that didn't happen, maybe even optimizing away the a==0 ? part of this after inlining.)
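
A sketch of that divisor case (hypothetical caller; assumes a != 0 so the divisor has exactly one bit set):

uint64_t mod_by_leading_bit(uint64_t x, uint64_t a) {
    uint64_t p = LeadingBitIsolate_gnuc_v2(a);   // exactly one bit set when a != 0
    // x % p for a power-of-two p is the same as masking with p-1; a compiler that
    // tracks the single-bit property could make this substitution by itself.
    return x & (p - 1);
}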

Compiling this with BMI2 available for SHLX would make it efficient on modern Intel.


Using inline asm to work around missed optimizations for Sandybridge-family

SnB-family has slower-than-usual variable-count shifts: instead of potential stalls if the flag-result is read, it includes 2 extra uops to check if the count is zero and conditionally merge the shift's flag-result into FLAGS. (shl %cl, %reg has to leave FLAGS unmodified if cl=0. This is the kind of legacy baggage that's part of the "x86 tax" that high performance x86 CPUs have to pay to do superscalar out-of-order execution.)

AMD apparently manages shifts as a single uop without limitations / penalties.

Thus bts into a zeroed register is a better choice for implementing 1ULL << count, especially on Intel CPUs where xor-zeroing is extra cheap, and bts reg,reg is a single uop (instead of 2 on Ryzen).

Compared to the version in @DavidWohlferd's answer (based on my comments), this saves instructions by using a as the zero in the case where a==0, instead of needing an extra zeroed register. And the comments talk about performance implications.

#include <stdint.h>
#include <limits.h>

uint64_t LeadingBitIsolate_asm(uint64_t a)
{
  uint64_t bit;
  uint64_t bts_target = 0;
  asm (
       "BSR %[a], %[bit]  \n\t"    // ZF = (a==0)
       "BTS %[bit], %[res] \n\t"   // sets CF = old bit = 0 but not ZF
         // possible partial-flag stall on some CPUs for reading ZF after writing CF
         // but not SnB-family
       "CMOVZ %[a], %[res]"        // res = a (= 0) if BSR detected a==0
       : [bit] "=r" (bit), [res] "+r"(bts_target)  // no early clobber: input read before either output written
       : [a] "r" (a)
       : "cc"              // redundant but not harmful on x86: flags clobber is always implied
       );

    return bts_target;
}

Reading ZF after BTS writes CF is dangerous for older Intel CPUs, where it will cause a partial-flag stall. See also Problems with ADC/SBB and INC/DEC in tight loops on some CPUs

GCC 9.2 gives us a sequence that's only 4 uops on Skylake (https://agner.org/optimize/). The xor-zeroing needs a front-end uop, but on Sandybridge-family it doesn't need a back-end execution unit (0 unfused-domain uops).

It's 5 uops on Haswell and earlier where cmov is 2 uops / 2c latency. Since you say your CPU doesn't support lzcnt, you might have an IvyBridge or earlier (2 uop cmov), or an old AMD.

Branching might be worth it if input=0 is expected to be rare or never happen, especially if this is part of a critical path for latency. Especially on older Intel where cmov is 2 uops.

# gcc9.2 -O3 -mtune=skylake
LeadingBitIsolate_asm(unsigned long):
        xorl    %eax, %eax
        BSR %rdi, %rdi  
        BTS %rdi, %rax 
        CMOVZ %rdi, %rax
        ret

A "rm"(a) input constraint would be possible, but

  1. we read it twice so it's better to have the compiler load into a register once
  2. clang is dumb and always picks m when it's an option, even if that means spilling a value that was already in a register. clang (LLVM) inline assembly - multiple constraints with useless spills / reloads
  3. bsr has a "false" dependency on existing CPUs, so it's better if the compiler happens to pick the same register for bit and a instead of writing a new register. This is what usually happens if a isn't needed later.

BSR isn't really a false dependency: if the input is zero, the destination is left unmodified. AMD even documents this, but Intel implements it in their hardware while leaving their documentation saying "undefined" contents.

We can't take advantage of this here anyway: we have 65 possible outputs, and BTS into a zeroed reg can only produce 64 different outputs.

You might be tempted to use rcl (rotate-through-carry) on a register holding 1, but first of all rcl %cl, %reg is quite slow, and second it still masks the shift count with & 63, so it can't shift a 1 all the way out into CF.

No partial-flag stalls, and shorter critical path on older Intel

Using BTC and taking advantage of BSR zero-input behaviour we can make a version that's possibly better on older Intel P6-family, like Nehalem and Core 2 (where cmov is 2 uops and partial-flag stalls are a problem).

It doesn't save uops though because it requires an extra xor-zeroed register. It does shorten the critical path latency, and could shorten it even more if we used test/setz in parallel with 3-cycle-latency bsr, instead of using the ZF output from BSR. Or if the compiler already did math on the input, it might already have ZF set appropriately. Unfortunately there's no way to query that.

btc = complement = flip a bit, instead of unconditionally setting. This can create a zero result if the destination register is 1 instead of 0, if we know the bit-index for that case.

xor-zero / set flags / setcc is a standard idiom, generally at least as efficient as set flags / setcc / movzx because xor-zeroing can be even cheaper than movzx, and it's off the critical path for latency. setcc is a 1-uop instruction on all x86-64 CPUs.

// depends on BSR behaviour that only AMD documents, but all real CPUs implement (for now?)
// it seems unlikely that this will change, though.
// The main advantage of this version is on P6-family and early Sandybridge
// where it definitely does work safely.
uint64_t LeadingBitIsolate_asm_p6(uint64_t a)
{
  uint64_t btc_target;
  uint64_t bit = 0;
  //bool input_was_zero;
  asm (
       "xor   %k[res], %k[res]\n\t"  // make sure we avoid P6-family partial-reg stalls with setz + reading full reg by forcing xor-zeroing, not MOV
       "bsr   %[a], %[bit]  \n\t"    // bit=count or unmodified (if a==0)
       "setz  %b[res]\n\t"           // res = (a==0)

       "btc   %[bit], %[res] \n\t"   // flip a bit.  For a==0, flips bit 0 which SETZ set to 1
       : [bit] "+r" (bit), [res] "=&q"(btc_target) // q = reg accessible as low-8, like RAX has AL.  Any x86-64 reg
          // early-clobber: [res] is written before a, but [bit] isn't.
          // ,"=@ccz" (input_was_zero)  // optional GCC6 flag output.  Or "=@ccc" to read from CF (btc result) and avoid partial-flag stalls
       : [a] "r" (a)
       : "cc"
       );

    return btc_target;
    //return input_was_zero;
}

GCC 9 and trunk have a weird register-allocation missed optimization where they produce res in r8 and have to mov it back to RAX.

# gcc8.3 -O3 -mtune=nehalem
LeadingBitIsolate_asm_p6(unsigned long):
        xorl    %edx, %edx        # compiler-generated
        xor   %eax, %eax          # inline asm
        bsr   %rdi, %rdx  
        setz  %al
        btc   %rdx, %rax 

        ret

On Intel CPUs like Core2, Nehalem, and Sandybridge-family, this is 5 uops with no stalls. BSR has 3 cycle latency, the rest have 1 cycle latency. From RDI input to RAX output, the latency is 3 + 1 + 1 = 5 cycles. (setz has to wait for BSR's output. As I mentioned above, test/setz on a before the bsr would allow instruction-level parallelism, spending an extra uop on test to shorten critical-path latency by another cycle.)

On AMD Bulldozer-family / Ryzen, it's a lot more expensive just because of bsr. The setz is still 1 uop, and the btc is 2 uops.

On Atom/Silvermont, it's also expensive because of BSR.

There's no workaround for slow BSR unless you make a version using lzcnt and do runtime dispatching or something. (Probably for a loop that calls this function; call overhead might still be worse than using bsr on a CPU with 8 to 10 uop bsr. Especially Ryzen with its high uop throughput.)


This works fairly obviously for the a!=0 case, doing BTC into a zeroed register to flip the bit at the bit-index found by BSR.

For the a==0 case:

       # starting: rdx=rax=0
        bsr   %rdi, %rdx       # leaves RDX=0 (unmodified) and sets ZF
        setz  %al              # sets RAX=1
        btc   %rdx, %rax       # flips bit 0 of RAX, changing it back to 0
       # rax = 0 (result), rdx = 0 (bitpos)

When I used two "+r" inputs both set to zero in C, GCC decided to xor-zero EAX/RAX, but used a mov %rax, %rdx to copy that to RDX! (Wasting a REX prefix). That's clearly worse on Sandybridge which has elimination for xor-zeroing but not mov-elimination. It's better on AMD Ryzen, though, which has mov-elimination but still needs a back-end uop for xor-zeroing.

Copying a 0 with mov vs. xor-zeroing is basically neutral on P6-family (Core2 / Nehalem), except that only xor-zeroing avoids partial-register stalls for writing AL with setcc and reading RAX with btc. So I put the xor-zeroing inside inline asm to make sure it's actually xor-zeroing, not a mov, that gcc picks for res. (Why doesn't GCC use partial registers?).

Note that the BSR behaviour of leaving the destination unmodified for the input=0 case is only documented by AMD, not officially documented by Intel. But it is implemented by all Intel's hardware, at least any that supports 64-bit mode. IDK about Via Nano, or ancient Intel.

Peter Cordes
  • Hmm. You never do anything by halves, do you? And this is just 3 instructions. Is it wise to recommend a solution based on undocumented behavior in BSR? The early clobber would avoid this. "clang is dumb and always picks m": I'm not seeing this in godbolt. "mr" picks a register. "dangerous": but only from a perf pov. It's not like it gives the wrong answer or damages the cpu. Is putting `xor %k[res]` inside the asm really better than doing it in C? The compiler might already have a reg known to be 0. "missed optimization" one of my iterations here actually generated `mov edx,0`. Who does that? – David Wohlferd Aug 25 '19 at 08:08
  • @DavidWohlferd: yes, xor-zeroing inside the asm is necessary to make sure you avoid partial-register stalls on P6-family, if GCC xor-zeroes one register and uses a `mov` to copy that to another register. (Which it in fact did with an earlier version of this.) That explanation is buried in there somewhere. :P Only the xor-zeroing idiom establishes the AL = RAX flag that allows us to write AL then read RAX without a partial-register stall (about 3 cycles to insert a merging uop on Core2). – Peter Cordes Aug 25 '19 at 08:16
  • Re: early clobber vs. BSR AMD-only documentation: the `btc` version that depends on BSR leaving the dest unmodified still works if you declare early clobbers. Correctness doesn't depend on anything that GCC has a choice to break. And the output dependency can't be a problem because we xor-zero it. – Peter Cordes Aug 25 '19 at 08:20
  • @DavidWohlferd: The `cmov` version without an early clobber allows GCC to sometimes choose to overwrite the input register (which it already has a dep on) instead of writing a new register. Nothing guarantees that the false dep won't be a problem there (e.g. if `a` is needed later), but the cmov version doesn't depend on it for correctness. It's merely a performance risk that it might create a loop-carried dep chain or otherwise hurt ILP. – Peter Cordes Aug 25 '19 at 08:22
  • @DavidWohlferd: and yeah this answer got really complex because there are so many interlocking missed optimizations here. Lack of intrinsics for BSR / missed opts with `63-__builtin_clzll()`, and compilers don't know how to use `bts` / `btc` properly as a peephole optimization. – Peter Cordes Aug 25 '19 at 08:23
  • @DavidWohlferd: re: clang. Perhaps you were looking at ICC, not clang; I had 3 "windows" open in my godbolt link. I simplified it to just clang open, and removed the non-asm functions. https://godbolt.org/z/tE4HQv Both spill when given `"rm"` or `"mr"`. This is a known clang bug that's probably not going to be fixed any time soon, and a showstopper for using `"rm"` in portable performant code that wraps only a few insns. [clang (LLVM) inline assembly - multiple constraints with useless spills / reloads](//stackoverflow.com/q/16850309) – Peter Cordes Aug 25 '19 at 08:28
  • My compiler is gcc, but I get `error: use of unknown builtin '__builtin_ia32_bsrdi`? – theonlygusti Mar 28 '21 at 13:32
  • @theonlygusti: Are you trying to use it in a 32-bit build? It's only available when `bsr` with 64-bit operand-size is available. So with `gcc -m32`, you can only use `__builtin_ia32_bsrsi` (single-`int`, not double-`int`). https://godbolt.org/z/ezvs7ExWx – Peter Cordes Mar 28 '21 at 13:48