Get the first bit of the EAX register in x86 assembly language

Question

In x86 assembly language, is it possible to obtain the first bit of a register? I want to obtain the first bit of the eax register and move it into ebx, but I'm not sure how to do this yet.

.stack 2048

.data

ExitProcess proto, exitcode:dword 

.code
start:
mov eax, 3;
;now I want to move the first bit of eax into ebx. How can I obtain the first bit from eax?
invoke  ExitProcess, 0
end start

I know that it's possible to obtain the second half of `eax` using the `ax` register, but I'm not sure how I'd obtain the first bit from a register. — Anderson Green, Mar 06 '13 at 03:33
To get a better answer, it helps if you spell out what modifications you can tolerate in eax and ebx. For example, mov %eax,%ebx certainly moves the first bit into ebx, along with all the other bits. There is no single instruction that moves just 1 bit between registers. — srking, Mar 06 '13 at 05:00
@srking I want to move the first bit from `eax` to `ebx`, and set all of the other bits in `ebx` to 0. `eax` should remain unchanged. — Anderson Green, Mar 06 '13 at 15:52

Ira Baxter · Accepted Answer · 2013-03-06T05:50:09.667

If by "first bit" you mean the least significant bit, then try:

 ...
 mov   ebx, eax
 and   ebx, 01

You apparently don't understand that instructions operate on all the bits in a named register at once, and the "and" instructions combine their operands bit-by-bit.
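
To make the bit-by-bit behaviour concrete, here is a small C model of the `mov`/`and` pair above. It is only a sketch: the function name `low_bit` is made up for illustration, and a `uint32_t` stands in for a 32-bit register.

```c
#include <stdint.h>

/* Hypothetical helper modelling `mov ebx, eax` / `and ebx, 1`.
   `eax` is passed by value, so the caller's copy is left unchanged. */
uint32_t low_bit(uint32_t eax)
{
    uint32_t ebx = eax;   /* mov ebx, eax : copy all 32 bits */
    ebx &= 1u;            /* and ebx, 1   : keep bit 0, clear bits 1..31 */
    return ebx;
}
```

With `eax = 3`, as in the question, `low_bit(3)` yields 1; with `eax = 2` it yields 0.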

The following works, too, and is arguably a more direct interpretation of your request ("get the first bit of eax, and then put in EBX") but it destroys the contents of EAX:

 ...
 and   eax, 1
 mov   ebx, eax

In assembly code, because you have few registers, their contents tend to be precious, so destroying one register's content in computing a new result is generally avoided. (When you can't, you can't, but in this case it is easy to avoid.)

Finally, you could write:

 ...
 mov   ebx, 1
 and   ebx, eax

This works fine, and is just as fast as the other two. I prefer the first because it emphasizes IMHO the value I care about (content of EAX) by virtue of mentioning it first, over the "1", which is just an incidental constant. This kind of style may not seem like it matters much, but if you write a lot of code, especially arcane stuff such as assembler, doing it to maximize later readability is worth a lot.

It is worth your trouble to find the Intel reference manuals and read them carefully to understand what each machine instruction does. That seems like a daunting task because it's a big book; just focus on the instructions you initially seem to need.

On CPUs where `mov` doesn't have zero latency (Intel before IvB, AMD before Ryzen), `mov ebx,1` / `and ebx,eax` has lower latency because the `mov` isn't on the critical path. This may not matter, and not be worth the extra code-size. (2 extra bytes net increase). — Peter Cordes, Apr 28 '18 at 21:23

score 5 · Answer 2 · answered Mar 06 '13 at 03:44


Obtaining a bit from a register involves an `and` operation with a mask that has a 1 in the bit position of interest and 0 in all other bits. Then, optionally, a shift or rotate (right or left) moves the bit into the desired position in the result.
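
As a rough illustration of the mask-then-move idea, the following C sketch isolates an arbitrary bit and brings it down to bit 0. The name `extract_bit` and its parameters are assumptions made for this example, not anything from the x86 instruction set.

```c
#include <stdint.h>

/* Hypothetical helper: isolate bit `n` (0 = least significant) of `value`
   and move it down to bit 0 of the result. Assumes n is in 0..31. */
uint32_t extract_bit(uint32_t value, unsigned n)
{
    uint32_t mask = (uint32_t)1 << n;   /* 1 in position n, 0 elsewhere */
    uint32_t isolated = value & mask;   /* and: every other bit is cleared */
    return isolated >> n;               /* shift right: bit n lands in bit 0 */
}
```

After the `and`, at most one bit is set, so a shift right by `n` and a rotate right by `n` give the same result here.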

Mark Taylor


I'm still a bit confused, since I'm not sure how the `and` operation should be used in this case. Do you have any concrete examples of this (i. e., code samples)? – Anderson Green Mar 06 '13 at 03:52
Search for 'x86 bitwise operations' to find many, many on-line resources. It's a fundamental part of ASM. And also much too complicated for a single [SO] answer. – Mark Taylor Mar 06 '13 at 03:58

Remember that you can always RCR the lowest bit into CF, MOV the bit in CF to a register/memloc (at worst, via PUSHF), then RCL the bit in CF back to EAX. If you want the *highest* bit, RCL into CF and RCR from CF. – mkfs Mar 06 '13 at 08:40
@mkfs: `rcr` with count other than 1 is slow, like 8 uops / 6 cycle latency on Skylake (http://agner.org/optimize). The good ways to extract a single bit are `mov`/`shr`/`and`, or `xor ebx,ebx` / `test al, 1<<6` / `setnz bl`. – Peter Cordes Apr 28 '18 at 21:41

score 0 · Answer 3 · edited Apr 28 '18 at 21:30


You can also go through the carry flag: use the bit-test instruction (`bt`) to copy the specified bit into the carry flag, and then use a rotate-through-carry instruction (`rcl` or `rcr`) to move it from the carry flag into the destination register.

I think this is very efficient in terms of understanding what is going on when it comes to where and how the blasted bit is being used and placed.
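
A minimal C model of that carry-flag route might look like the sketch below. The variable `cf` stands in for the carry flag, the function name `bit_via_carry` is hypothetical, and zeroing `ebx` first is an assumption needed so that only the extracted bit remains in the result.

```c
#include <stdint.h>

/* Hypothetical model of: xor ebx, ebx / bt eax, n / rcl ebx, 1.
   `cf` stands in for the carry flag. Assumes n is in 0..31. */
uint32_t bit_via_carry(uint32_t eax, unsigned n)
{
    uint32_t ebx = 0;                 /* xor ebx, ebx : clear destination */
    uint32_t cf = (eax >> n) & 1u;    /* bt eax, n    : CF = bit n of eax */
    uint32_t carry_out = ebx >> 31;   /* bit rotated out of the top of ebx */
    ebx = (ebx << 1) | cf;            /* rcl ebx, 1   : old CF enters at bit 0 */
    cf = carry_out;                   /* rcl also writes the new CF */
    (void)cf;                         /* not used again in this sketch */
    return ebx;
}
```

The order matters in the real instruction sequence as well: `xor` itself clears the carry flag, so it has to come before `bt`.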

edited Apr 28 '18 at 21:30

Patrick


answered Apr 28 '18 at 19:15

zbethel


You'd have to zero EBX first if you wanted to isolate the low bit of EAX in another register. e.g. `xor ebx,ebx` / `bt eax,0` / `rcl ebx,1`. That's obviously worse than `mov`/`and` for the special case where the bit you want is at the bottom of the register. For the general case, you *could* do that, but `rcl`-by-1 is a 3-uop instruction on Intel CPUs (http://agner.org/optimize/). The "obvious" way to extract a bitfield is `mov ebx,eax` / `shr ebx,6` / `and ebx,1`, which is a total of 3 uops, and is what a compiler would normally use for C bitfields or if you write `(x >> 6) & 1`. – Peter Cordes Apr 28 '18 at 21:44
I'm still a novice, and honestly, I can't wait to get my hands on these optimization methods! That said, since I'm still a novice at assembly, can you explain to me how `rcl` is 3 uops while `shl` is apparently 1 uop? I don't understand how they can be different, seeing how they technically accomplish the same goal. I know that `rcr` or `rcl` rotates the carry flag into the register, and that `shr` shifts right and puts a zero into the vacated bit. Does it have to do with populating the carry flag? – zbethel Apr 29 '18 at 17:49
`rcl` has to read EFLAGS but `shl` only writes flags. `rcl` also has to modify some flags and leave others unmodified (and the set of flags it writes doesn't line up with how Intel CPUs break EFLAGS up into separate groups that are renamed separately to avoid partial-flag problems for most instructions.) See [What is a Partial Flag Stall?](//stackoverflow.com/q/49867597), and Agner Fog's microarch guide if you want full details. Also note that `shl reg, cl` is 3 uops on Intel CPUs, because if `cl` is zero it has to leave the flags unmodified (so it has an input dependency on EFLAGS, too.) – Peter Cordes Apr 30 '18 at 07:13
The other reason why `rcl` is slow is that it's rarely used, so the CPU doesn't have dedicated 65 / 33 / 17 / 9 bit rotate hardware, only 64 / 32 / 16 / 8. That's why `rcl` with count greater than 1 is so slow. CPUs can fairly efficiently shift in a bit from CF and use the usual HW to shift a bit *out* into CF, and set other flags. `adc eax,eax` does that more efficiently, with the same CF and `eax` results as `rcl eax,1`, but different other flags. It's 2 uops on Haswell and earlier (because it's 3 inputs: 2 registers + flags), but `adc reg,reg` is 1 uop on Broadwell and later. – Peter Cordes Apr 30 '18 at 09:37
Understanding *why* different instructions might decode to more or less uops can be pretty deep voodoo, so you should just check Agner Fog's instruction tables and/or profile your code with performance counters for `uops_issued.thread`, or use IACA ([What is IACA and how do I use it?](//stackoverflow.com/q/26021337)) to count uops for you for different microarchitectures if you only care about Intel, not AMD. – Peter Cordes Apr 30 '18 at 09:40

christopher westburry · Answer 4 · 2018-09-05T11:25:26.530


As stated above, `mov` followed by `and` is the quickest solution here, but the fun part about programming is that one can find many solutions to a single problem. As less efficient alternatives, you could use one of the following snippets.

You can solve it with a conditional branch:

xor ebx, ebx ; sets ebx = 0
TEST eax, 1 ; is the lowest bit set?
JZ skip_add ; if low bit is zero, skip next instruction

; if lowest bit was set / AND op resulted not zero
add ebx, 1 ; ebx += 1, or use `INC ebx`

skip_add:
; process the result

Alternatively, you can also use:

xor ebx, ebx ; ebx = 0
shr eax, 1 ; shift right: lowest bit in carry flag now
adc ebx, 0 ; ebx += 0 + carry bit
shl eax, 1 ; get back original value of eax, `shl` and `or` with ebx
or eax, ebx ; or use `push eax` and `pop eax` instead

Another alternative (similar to the other answers, but more costly):

push eax
and eax, 1
xchg ebx, eax ; swap contents, could also use `mov` here
pop eax

Note that all of these solutions leave the value in eax unchanged once they finish, so you can still use the value in eax freely. Also note that the commented values are for eax having the value of 3, as mov eax, 3 was used in the question.

If you already know that ebx is zero, you can skip the xor lines, and if changing eax doesn't matter, you can delete the shl and or instructions as well. So the actual operation is done by about two instructions, as you can see. For the µops, though, see Peter Cordes's comments.
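
To check that the shift-based variant really hands back the original value, here is a small C model of the `shr` / `adc` / `shl` / `or` sequence. This is a sketch under assumptions: `cf` stands in for the carry flag, and the function name `low_bit_via_shift` is invented for the example.

```c
#include <stdint.h>

/* Hypothetical model of the shr / adc / shl / or sequence.
   `*eax` is modified and then restored; the return value plays the role of ebx. */
uint32_t low_bit_via_shift(uint32_t *eax)
{
    uint32_t ebx = 0;           /* xor ebx, ebx */
    uint32_t cf = *eax & 1u;    /* shr eax, 1 : the bit shifted out goes to CF */
    *eax >>= 1;
    ebx += 0u + cf;             /* adc ebx, 0 : ebx = ebx + 0 + CF */
    *eax <<= 1;                 /* shl eax, 1 : bit 0 is now 0 */
    *eax |= ebx;                /* or eax, ebx: put the saved bit back */
    return ebx;
}
```

The final `or` is what restores bit 0. Without it, `shl` alone would leave a zero in the low bit.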

edited Sep 05 '18 at 11:25

answered Sep 04 '18 at 11:32

christopher westburry


`shl` shifts in a zero, and the copy of the bit in EFLAGS was destroyed by `adc` so you couldn't `adc eax,eax` or `rcl`. Perhaps you want `xor ebx,ebx` / `test al,1` / `setnz bl`. And no, none of these are more efficient than `mov ebx, eax` / `and ebx,1`. That's 1 cycle latency (or 2 on older CPUs without mov-elimination), and only 2 uops, while your branchless version is 5 uops (with push/pop), or 3 for my xor-zero/test/setcc. (https://agner.org/optimize/). Your branching version breaks the data dependency, but at the cost of branch mispredicts being likely if the bit is unpredictable. – Peter Cordes Sep 04 '18 at 19:10
true one. I should use `or al, bl` afterwards or `push eax` and `pop eax` to really get the value of eax back. Also I think I get it now, my misconception was thinking of `mov` as a slow instruction, which probably is only true if reading from memory. It's still amazing how many solutions there are for a simple problem, though. But one question: If we could change eax, what's the difference in performance for using `xchg` rather than `mov`: `and eax, 1 / xchg ebx, eax`? – christopher westburry Sep 05 '18 at 11:04

You'd want `or eax, ebx` to avoid [partial-register slowdowns](https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers) on Sandybridge and earlier. `xchg` is terrible. [Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?](https://stackoverflow.com/q/45766444), and it destroys `eax`, unlike mov+and. `mov r32,r32` is extremely well optimized on IvyBridge and later, and on Ryzen. (It still costs a uop, but no latency: [Can x86's MOV really be "free"? Why can't I reproduce this at all?](https://stackoverflow.com/q/44169342)). – Peter Cordes Sep 05 '18 at 15:53
The interesting option is `mov ebx,1` / `and ebx,eax`, which takes the `mov` off the critical path even on older CPUs, but costs extra code-size (5 byte `mov r32,imm32`) – Peter Cordes Sep 05 '18 at 16:36

score -2 · Answer 5 · answered Mar 28 '13 at 07:10


...with the first bit:

test eax, 1
setz ebx

Dirk


Dirk Wolfgang Glomp


setz only works on 8-bit registers, so you'd need `xor ebx,ebx` / `test al,1` / `setz bl`. This is more uops, and can't benefit from mov-elimination on IvB+ / Ryzen+. (I optimized your `test` to the `al,imm8` form rather than the `eax,imm32` form. Unlike most instructions, [`test` doesn't have a `test r/m32, imm8` form](http://felixcloutier.com/x86/TEST.html), so `test eax,1` would use the 5-byte `imm32` form. Or for a register other than eax, the 6-byte `test r/m32, imm32` form.) – Peter Cordes Apr 28 '18 at 21:28
This idea is not bad if the bit you want isn't already the low bit, though. It's probably better than `mov`/`shr`/`and`. (But for repeated use on Intel CPUs with BMI2, `pext` is a win: 1 uop to copy-and-extract a bitfield into another register, given the right mask in a 3rd register. But it's slow on Ryzen.) – Peter Cordes Apr 28 '18 at 21:38
