
What is the optimal way (fewest/fastest operations) to take an 8-bit, 16-bit, 32-bit or 64-bit number, extract the leading bit, check whether that bit is set, and at the same time store the resulting number with that leading bit removed? (In assembly.)

integerInBitsWithLeadingFlag = 10001000
flag == 1
integer == 0001000 = 1000

I know assembly has tricks here and there, like dividing while keeping the remainder, essentially storing two variables in one result. Maybe there is some way to do something similar here.

The reason I'm asking is because I want to store large numbers in a sequence of 8-bit values, where the leading bit is the flag saying whether or not "more" values should be concatenated together, and the remaining 7 bits are used to calculate the final integer/bigint. If it's better to instead store the flag on the last/trailing bit, then that would be good to include :)

I am new to assembly so I'm not really sure how this could be done.

; assembly pseudocode
start:
  MOV rbx, rax       ; keep a copy, since AND destroys its destination
  AND rbx, 10000000b ; isolate the leading bit (bit 7 of a byte)
  CMP rbx, 10000000b ; compare to see whether that bit was set
  JE matches
  JNE notmatches

matches:
  ; remove the bit from rax to get the integer value.

notmatches:
  ; same: remove the bit from rax to get the integer value.

Is there something that would let me do it along these lines:

start:
  ANDFLAGWITHREMAINDER rax, 10000000
  ; and now, after 1 operation,
  ; you have the flag and the integer.

If not, what is the right way to do this?

– Lance Pollard

Comments:
  • `BTR` - _Bit Test and Reset_ – Jester May 01 '20 at 01:23
  • If you want to do this on top of C/C++ check out [the Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/) for many possible operations that you could use. – Nathan S. May 01 '20 at 01:54
  • 1
    *I want to store large numbers in a sequence of 8-bit values* - That's usually a mistake for BigInteger stuff in general: see codereview [BigInt class in C++](https://codereview.stackexchange.com/a/237764). 30 value bits in a 32-bit chunk can be good for portable stuff (like Python uses internally), or for SIMD to allow you to defer carry: [Can long integer routines benefit from SSE?](https://stackoverflow.com/q/8866973) – Peter Cordes May 01 '20 at 04:13
  • Or are you making a variable-length encoding that's compact for small values, hence the byte granularity? That sounds reasonable, especially for Intel CPUs with BMI2, where [`pext`](https://www.felixcloutier.com/x86/pext) can do the packing very efficiently with a mask that has those flag bits cleared. But unfortunately `pext` is slow on AMD (microcoded, not dedicated HW), and you might need a fallback for CPUs without BMI2. Anyway, I'm picturing length-finding with `andn` and tzcnt to find the first cleared more-bytes bit, then `pext` and `bzhi` in some order to pack and clear high garbage. – Peter Cordes May 01 '20 at 04:16
  • 1
    Or if you make the signal bit the low bit, `shr al, 1` shifts it into CF while bringing the other bits down. If you start from the end of a value, you can shift across registers with `shrd`, or across byte boundaries with partial registers. Hmm, there might be an idea here. Note that saving instructions isn't always the most important. Good ILP and short critical path latencies are important, too. – Peter Cordes May 01 '20 at 04:22

1 Answer


x86 Bit Test and Reset btr eax, 7 does exactly what you asked for: clear bit 7 and set CF = the original value of that bit.

btr reg, imm or reg, reg is 1 uop on Intel, 2 uops on AMD. When followed by a jcc instruction, it can't macro-fuse into a single compare-and-branch uop the way test al, 1<<7 / jnz does, though. (https://agner.org/optimize/). Instruction count is not the only factor in performance. Good ILP and short critical path latencies, especially avoiding unnecessary loop-carried dependency chains, are important, too. But counting front-end uops for the fast path in your code is definitely something to consider.
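For the 8-bit case in the question, a minimal sketch (the register and label names here are just for illustration, not part of the question's format):

    movzx  eax, byte [rdi]    ; load one encoded byte
    btr    eax, 7             ; CF = old bit 7, and bit 7 is now cleared in EAX
    jc     .end_of_number     ; branch on the flag; EAX already holds the 7 value bits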

x86 shifts (like most ISAs) put the last bit shifted out into the Carry Flag. So shr al, 1 sets CF = orig & 1 and updates AL = orig >> 1. Possibly there's a way to combine this with shifting bits across bytes to merge them like with shrd, or with partial-register tricks...
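As one concrete take on that low-bit variant, here's a hedged sketch of a byte-at-a-time decode loop, assuming the flag is stored in the low bit of each byte and a set flag means "more bytes follow" (register choices and the 32-bit accumulator are my assumptions):

    xor    esi, esi           ; result accumulator (up to 4 bytes = 28 value bits here)
    xor    ecx, ecx           ; CL = shift count for placing each 7-bit chunk
.next_byte:
    movzx  eax, byte [rdi]    ; load one encoded byte
    inc    rdi
    shr    eax, 1             ; CF = flag bit, EAX = the 7 value bits
    setc   dl                 ; save the flag: SHL with CL != 0 would clobber CF
    shl    eax, cl            ; move this chunk into position
    or     esi, eax           ; merge it into the result
    add    cl, 7
    test   dl, dl
    jnz    .next_byte         ; loop while the "more bytes" flag was set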

Since you're manipulating bytes, "Why doesn't GCC use partial registers?" is something you might want to understand if you're thinking about ways to combine multiple bitfields into one contiguous larger integer in a register.


"I want to store large numbers in a sequence of 8-bit values"

I hope you're not planning to compute directly with numbers in that format. That sounds reasonable as a compact variable-length serialization format / encoding that can be smaller than int for small values, but still hold up to uint64_t or even larger if necessary.

If overall speed is more important, work in chunks of at least 32 bits so you're getting many more result bits per CPU operation, and so there are fewer unpacking steps to combine into a single contiguous binary integer. (e.g. AVX2 variable shift vpsrlvd to put the top of one dword contiguous with the bottom of the dword in the next higher element, then qword variable shifts to do the same thing into the middle of a 128-bit lane, then a byte shift or shuffle.)

(However, if you use 16-bit chunks, you can do that combining with a 16-bit pmullw multiply by a power of 2 (or by 1). Or AVX512BW variable-count or merge-masked 16-bit shifts. Or, with 8-bit chunks, pmaddubsw can even combine pairs into the bottom of SIMD elements with the right multipliers: 1 and 1<<7 for the low and high bytes of each pair, after masking away the signal bits.)

Bytes are usually a mistake for BigInteger stuff in general. See this codereview answer on a BigInt class in C++ (which unwisely planned to store numbers as arrays of ASCII decimal digits). 30 value bits in a 32-bit chunk can be good for portable stuff (like Python uses internally). You can work in a base like 10^9 if you do a lot of converting to/from decimal strings, or 2^30 if not.

Not using all the bits per limb allows you to defer carry (not normalize) for SIMD: Can long integer routines benefit from SSE?. That could work with 1 spare bit for carry, and the top bit dedicated to your signalling strategy for implicit-length instead of storing a length. (Many x86 SIMD instructions like blendvps or movmskps make the top bit of a dword SIMD element special, or of each byte for integer SIMD like pmovmskb, pshufb, and pblendvb. So you can get a bitmask of the high bits that you can bit-scan with bsf or bsr to find the first set bit.)
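For instance, a minimal sketch of that bit-scan idea with SSE2 (register names are illustrative):

    movdqu   xmm0, [rdi]      ; 16 bytes of the number
    pmovmskb eax, xmm0        ; EAX bit i = high bit of byte i
    bsf      eax, eax         ; byte index of the first set signal bit (only valid if EAX != 0)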


Unpacking ideas:

If you choose the high bit of each byte = 0 within an integer, and 1 to signal the end, that avoids stray set bits that you have to clear.

As mentioned earlier, pmaddubsw is a good bet for SIMD to combine bytes into 16-bit words, with the right 1 and 1<<7 multipliers.

Then another step with pmaddwd can combine words into dwords with multipliers 1 and 1<<14; then you're set up for AVX2 vpsrlv or just shift and blend. All of this takes log2(vector_length) steps instead of vector_length steps.
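A hedged sketch of those two combining steps (SSSE3/SSE2), assuming XMM0 holds 16 data bytes in little-endian chunk order with the signal bits already masked off. Note that pmaddubsw treats its first operand as unsigned and its second as signed, so the {1, 1<<7} multipliers go in the destination register: 0x80 doesn't fit in a signed byte, but the masked data bytes (all <= 0x7F) are fine as signed.

    movdqa    xmm1, [rel mul_bytes]  ; unsigned multipliers {1, 1<<7} per byte pair
    pmaddubsw xmm1, xmm0             ; words = lo + (hi << 7): 14 value bits each
    pmaddwd   xmm1, [rel mul_words]  ; dwords = lo + (hi << 14): 28 value bits each

    section .rodata
    align 16
    mul_bytes: times 8 db 1, 0x80    ; 0x80 = 1<<7, fine as an *unsigned* byte
    mul_words: times 4 dw 1, 1<<14   ; signed word multipliers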

Without SIMD, you can use x += x as a left shift, but make it only shift some bits by doing x += x & mask. This works for scalar, or with paddb if you don't have SSSE3 pmaddubsw. (Or with AVX512BW byte-masked vpaddb for lower latency than pmadd.)

   x  = ... (bits) 0abcdefg 0ABCDEFG      ; within each 16-bit word
   x += ... (hex)  x & 0x...00FF00FF      ; add the low byte of each word to itself
   x  = ... (bits) 0abcdefg ABCDEFG0      ; = the low byte shifted left by 1

That gives you 16-bit chunks each holding 14 contiguous value bits. The chunks are separated by two 0 bits this time, though, so masked addition isn't the most efficient way to proceed. Probably from there, AND with 0xFFFF0000FFFF0000 and right-shift that by 2, then mask the original the other way and blend with an OR, as in the sketch below.
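A scalar sketch of that whole unpack, assuming RAX holds 8 little-endian bytes whose high (signal) bits are already cleared:

    mov    rdx, 0x00FF00FF00FF00FF
    and    rdx, rax
    add    rax, rdx               ; double the low byte of each word: 14 value bits per word, at bits 1..14
    mov    rdx, 0xFFFF0000FFFF0000
    and    rdx, rax               ; isolate the high word of each dword
    shr    rdx, 2                 ; drop its field down next to the low word's field
    mov    rcx, 0x0000FFFF0000FFFF
    and    rax, rcx               ; keep the low-word fields
    or     rax, rdx               ; 28 contiguous value bits per dword, at bits 1..28
    ; the dword-to-qword step is analogous: mask, shift the high half right by 4, OR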


Deserializing this format with BMI2 pext (Parallel bit EXTract)

pext is slow on AMD (microcoded, not dedicated HW, even on Zen2), and BMI2 isn't available everywhere. But on Intel CPUs it's 1 uop, 3 cycle latency, 1/clock throughput. https://uops.info/

Note that you could do this in C with intrinsics: Intel's online intrinsics guide / search. (Portability of non-SIMD scalar intrinsics across compilers can be spotty, but intrinsics for newer instructions like popcnt and BMI are generally fine.) Compilers may not be great at using btr to CSE x&(1<<7) and x &= ~(1<<7) into a single operation, but they should handle this code, even if you write stuff like (~mask) & x instead of using intrinsics. Although compilers will probably do constant propagation and materialize the inverted constant for and instead of using andn.

Given a pointer to an unknown-length number in this format, load up to 8 bytes and extract the up-to-56 value bits from them. (Assumes that a qword load is safe: might load garbage, but not cross into an unmapped page and fault: Is it safe to read past the end of a buffer within the same page on x86 and x64?)

; input pointer in RDI: a set bit 7 indicates end of number
; clobbers RCX, RDX
; result in RAX
;; great on Intel, probably can do better on AMD one byte at a time or with SIMD pmaddubsw 

    mov    rcx, [rdi]                     ; 8 bytes, including possible trailing garbage
    mov    rdx, 0x7F7F7F7F7F7F7F7F
    andn   rsi, rdx, rcx                  ; isolate high bits: (~rdx) & rcx

    blsmsk rax, rsi                       ; get mask up to lowest set bit: (rsi-1) XOR rsi = mask up to (and including) the first signal bit
    and    rax, rcx                       ; clear high garbage
       ; RAX = 0 above the number.  The end-of-number flag is still set but pext only grabs bits 6:0 from each byte.

    pext   rax, rax, rdx                  ; rax = 8x 7-bit fields packed down to low 56
   ; uint64_t result in RAX, from the first 1 to 8 bytes of [rdi],
   ; depending on where the first number-end flag was

If no byte had its high bit set, blsmsk with an all-zero input produces an all-ones output. So we extract all 8 bytes for the number-not-ended case as well as for the case where the top bit of the input is set.

andn and blsmsk are single-uop, single-cycle latency, but they are in the dependency chain leading to pext, so there's no instruction-level parallelism within this one block for one iteration. It's pretty short, though, so if we were doing another iteration of it on another 8 bytes of data, out-of-order exec could overlap the iterations nicely.

It would be cool if we could run pext in parallel with calculating a mask to use on its output instead of its input. But that 7:8 ratio is a problem. We could run pext twice in parallel (with a different mask) to line up the high bits of each byte with where they're needed for blsmsk. Or we could tzcnt to find the position of the lowest set bit, then somehow scale by 7/8: with the flags shifted left by one first (as below), the bit-index is a multiple of 8, so x - (x>>3) turns it into the number of packed value bits, usable as the bit-index for BMI2 bzhi.

If you have a packed stream of numbers in this format, you'll want to find where the next one starts. From the rsi isolated end-flag pattern, you can rsi = tzcnt(rsi) then rsi >>= 3 to find the byte index of the first end-of-number bit.

You need to advance 1 byte more than that index to go past the terminator. You could do lea rdi, [rdi + rsi + 1], but that has extra latency compared to inc rdi / add rdi, rsi because of the 3-component LEA (two + operations).

Or if you left-shift the mask before tzcnt, you get the byte count directly, and as a bonus it treats the no-terminator case as 8 instead of 9.

    add   rsi, rsi               ; left-shift the end-of-number flags to the bottom of the next byte (or out)
    tzcnt rsi, rsi               ; rsi = bit-index of first bit of next number, 64 if RSI=0

 ; shrx  rcx, rcx, rsi           ; shift to the bottom and restart instead of reloading

    shr   esi, 3                 ; bit index -> byte index.  We know it's a small integer so 32-bit operand-size is fine and saves code size
    add   rdi, rsi               ; advance the pointer

This work can run in parallel with blsmsk and pext. Although if we're doing that tzcnt work anyway, perhaps we should use bzhi instead of blsmsk/and. That could be better for throughput but worse for latency: add -> tzcnt is 4 cycles of latency from RSI being ready before an input is ready for bzhi, and all of that is in the critical path leading to pext, vs. blsmsk/and being only 2 cycles.
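For reference, a sketch of the mask-the-output idea from above (continuing with the register assignments of the first block; R8 is new scratch, and this replaces the blsmsk/and):

    pext   rax, rcx, rdx          ; pack all 8 low-7-bit fields; runs in parallel with the tzcnt chain
    add    rsi, rsi               ; shift the signal bits left by one
    tzcnt  rsi, rsi               ; 8*(k+1) = bit index just past the terminator byte (64 if none)
    mov    r8d, esi
    shr    r8d, 3                 ; k+1 = byte length of this number
    sub    esi, r8d               ; 8*(k+1) - (k+1) = 7*(k+1) value bits to keep
    bzhi   rax, rax, rsi          ; zero the packed garbage above the number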


Or if we wanted to loop until we find the end of a single number (more than 8 bytes), RSI still holds the isolated signal bits. That's why I did andn into RSI, instead of into RAX.

    ; ... continuing from above, to keep going until the end of a large number
    ; ... do something with the 56-bit RAX chunks, like overlapping stores into memory?
    ; rsi still holds the isolated signal bits
    add    rdi, 8
    test   rsi, rsi
    jz     .loop                    ; }while(end-of-number flag not found); needs a .loop label at the top of the block above

Or: blsmsk sets CF if its input was zero, so if we could structure our loop with blsmsk at the bottom, we could use that directly as the loop branch (jc to keep looping while no end flag has been seen), perhaps with some loop rotation and peeling of the first/last iterations.


The BMI2 PEXT section is a few random ideas jumbled together, not one coherent fully optimized implementation. Adapt as needed depending on any guarantees you can make. e.g. an upper bound of 8 bytes per number would be helpful.

One major thing to keep an eye on is latency of the loop-carried dep chain involving a pointer increment. If it's high latency to find the start of the next number, out-of-order exec won't be able to interleave the work of many iterations.


Semi-related, another bit-packing / unpacking problem: Packing BCD to DPD: How to improve this amd64 assembly routine?

– Peter Cordes