
I'm making a stack-based language as a fun personal project. I have some signed/unsigned 32-bit values on the stack, and my goal is to write some assembly macros that operate on this stack. Ideally these will be small, since they'll be used a lot. Since I'm new to x86 assembly, I was wondering if you guys had any tips or improvements you could think of. I'd greatly appreciate your time, thanks!

Note: An optimizer is used after the macros are expanded to avoid cases like pop eax; push eax so don't worry about that!

<SIGNED/UNSIGNED-COMPARISONS>
pop eax
cmp dword [esp], eax
setl byte [esp]        ;setle, setg, setge, setb, setbe, seta, setae, sete, setne for other comparisons
and dword [esp], 1

<NEGATE-STACK>
neg dword [esp]

<NOT-STACK>
not dword [esp]

<DEREF-STACK>
mov eax, dword [esp]      ;eax = address
mov eax, dword [eax]      ;eax = *address 
mov dword [esp], eax      ;stack = eax

<SHIFT>
pop ecx
shl dword [esp], cl  ;sar, shr

<ADD-STACK>
pop eax
add dword [esp], eax    ;sub, and, or, xor

<AND-STACK-LOGICAL>
cmp dword [esp], 0
setne byte [esp]
and dword [esp], 1
cmp dword [esp+4], 0
setne byte [esp+4]
and dword [esp+4], 1
pop eax
and dword [esp], eax    ;or

<NOT-STACK-LOGICAL>
cmp dword [esp], 0
sete byte [esp]
and dword [esp], 1

<MULTIPLY-STACK>
pop eax
imul eax, dword [esp]     ;imul works for signed and unsigned
mov dword [esp], eax

<DIVIDE-UNSIGNED-STACK>
pop ebx
pop eax
mov edx, 0
div ebx
push eax    ;edx for modulo

<DIVIDE-SIGNED-STACK>
pop ebx
pop eax
cdq
idiv ebx
push eax    ;edx for modulo
Hrothgar
  • Your compare code is nasty, creating a store-forwarding stall. Use a tmp register like ECX or EDX. `xor ecx,ecx` / cmp / `setcc cl` / `mov [esp], ecx`. Also, `pop cl` is not encodeable, you need to pop the whole register then read CL. – Peter Cordes Jun 23 '20 at 20:14
  • Ah okay. What would the `cmp` be between if I don't have eax? Good to know! I edited my post for `pop cl`. – Hrothgar Jun 23 '20 at 20:20
  • Does `xor ecx,ecx` / `pop eax` / `cmp dword [esp], eax` / `setcc cl` / `mov [esp], ecx` work? – Hrothgar Jun 23 '20 at 20:27
  • I don't think your scheme of trying to do everything on the stack element in-place is going to help. All those memory accesses don't seem efficient, and if code size is the priority, memory operands have larger encodings than registers. You might try for comparison a version where you pop the stack values into registers, do your computations there, and push at the end. – Nate Eldredge Jun 23 '20 at 20:40
  • Along those lines, for `DEREF-STACK`, try `pop eax; push dword [eax]`. That's 3 bytes to your 8. – Nate Eldredge Jun 23 '20 at 20:42
  • All the awkward stuff with the logical operators starts to make one see the benefits of Forth, where `FALSE` was always `0` and `TRUE` was `-1`. Then logical AND, OR, NOT, XOR are precisely the corresponding bitwise operations. – Nate Eldredge Jun 23 '20 at 20:44
  • Also, if again code size is the main priority, you may find that doing your logical operations with jumps can give you smaller code than the conditional sets. A naive implementation of your `AND-STACK-LOGICAL` with jumps gives me 16 bytes; yours is 27, and the best I've managed without jumps is 23. – Nate Eldredge Jun 23 '20 at 20:55
  • @NateEldredge: the branchless `AND-STACK-LOGICAL` in my answer is 17 bytes, doing about the same as what GCC/clang do. I considered `cmov` with another zeroed register, but didn't fully look at what that would do. Note that it has to work on `int` inputs, not boolean, otherwise yes it would be easy, just like with `0` / `1` `bool`s like most C ABIs use. Although real-world compilers do sometimes miss optimizations on them: [Boolean values as 8 bit in compilers. Are operations on them inefficient?](https://stackoverflow.com/q/47243955). If you know your inputs are bools, use bitwise. – Peter Cordes Jun 23 '20 at 21:29
  • @NateEldredge: *memory operands have larger encodings than registers* - true for `[esp]` unfortunately, because it needs a SIB byte; [base=ESP is the ModRM escape code for the presence of a SIB byte](https://stackoverflow.com/questions/52522544/rbp-not-allowed-as-sib-base). But for other simple-register addressing modes, `add eax, [edx]` is actually the same size as `add eax, edx`, differing only in the top 2 bits of the ModRM byte to select a register direct operand instead of register indirect. So in practice for this use case, explicit memory operands are larger. – Peter Cordes Jun 23 '20 at 21:35
  • @NateEldredge: agreed in general branchy is often smaller, though. Maybe one `setcc` and one branch over an xor-zeroing would work well if optimizing purely for size. – Peter Cordes Jun 23 '20 at 21:36
  • @NateEldredge: Yeah, I try to compensate a little with the optimizer, which should (hopefully) remove any stack usage from the start and end of each macro as they squish together, but that doesn't help with stack usage in the middle of these macros. I tried to avoid registers as I believe management of them is more difficult (but faster, as you say). Perhaps I'll look into some sort of hybrid approach. By the bye, I wonder if the guy who made jonesForth knew that, since he chose to make it 1 for true. – Hrothgar Jun 23 '20 at 22:30
  • I guess I don't see any problem with register management here. Your functions are simple enough that you're in no danger of running out of registers, and since the next function won't expect the registers to contain anything in particular, you can use whichever ones you want. – Nate Eldredge Jun 23 '20 at 22:34
  • It might be a good idea to keep the top of the stack in a register. This makes the code a great deal more efficient in general. – fuz Jun 23 '20 at 22:37
  • Oh, I meant if it were to take a larger expression like (1*2/(3+4)-5) and do it in registers, but then I suppose it wouldn't be much of a stack machine, haha. – Hrothgar Jun 23 '20 at 22:39
  • @fuz: Indeed! The general consensus seems to be that I should avoid [esp] and try to use registers as much as possible. – Hrothgar Jun 23 '20 at 22:41

1 Answer


Note: An optimizer is used after the macros are expanded to avoid cases like pop eax; push eax so don't worry about that!

In that case you should try to end with a result in a register that you push, allowing the optimizer to chain register operations without store/reload when you do multiple things to the top of the stack.

Push and pop are 1 byte instructions that decode to 1 uop each, and the stack engine in modern x86 CPUs handles the update of ESP, avoiding a data dependency chain through ESP or an extra uop to add/sub ESP. (Using [esp] explicitly in an addressing mode forces a stack-sync uop on Intel CPUs, though, so mixing push/pop and direct access to the stack can be worse. OTOH, add [esp], value can micro-fuse the load+add operations, where separate pop/add/push couldn't.)

Also note that minimizing code-size in bytes is usually secondary to keeping critical-path latency low (easily a problem for a stack architecture that's already going to be storing / reloading with naive JIT transliteration into machine code, not really optimizing between stack ops like a JVM would). And also to minimizing front-end uop counts. All else equal, smaller code size is usually best, though.

Not all instructions decode to a single uop; e.g. add [mem], reg decodes to two, inc [mem] decodes to 3 because the load+inc can't micro-fuse together. Presumably there won't be enough instruction-level parallelism for back-end uops to matter, so all discussion of uop counts is what Intel calls the "fused domain" for the front-end issue stage that feeds uops into the out-of-order back-end.

Just to be clear, this naive JIT of canned sequences with only minor optimization between them might work for a toy project, but real JIT compilers for stack-based bytecode like Java optimize much more, doing register allocation on a per-function basis. If you want actually good performance, this approach is a dead end. If you just want a hobby project to learn some asm and some compiler-construction, this looks like it might be fine. You'll probably want other ops, including swapping the top two elements (pop/pop / push/push).


Your compare code is nasty, creating a store-forwarding stall when and does a dword load that overlaps with the byte-store in a recent instruction. (https://agner.org/optimize/). Use a tmp register like ECX or EDX.

<SIGNED/UNSIGNED-COMPARISONS>  version 1
pop   ecx                ; 1 byte, 1 uop, maybe optimized out
xor   eax,eax            ; 2 bytes, 1 uop
cmp   dword [esp], ecx   ; 3 bytes (ESP addressing mode needs a SIB), 1 uop (micro-fused load+cmp)
setl  al                 ; 3 bytes (0F two-byte opcode), 1 uop   ;setle, setg, setge, setb, setbe, seta, setae, sete, setne for other comparisons
mov   [esp], eax         ; 3 bytes (ESP addr mode), 1 uop (micro-fused store-address + store-data)

xor-zeroing avoids a possible partial-register penalty for reading EAX after writing AL. Notice that we only write memory once, not store then reload+store again with a memory-destination and. The same optimization applies for <NOT-STACK-LOGICAL>, and somewhat in <AND-STACK-LOGICAL>.

Total of 11 or 12 bytes (11 if the leading pop optimizes out), total of 4 or 5 uops plus 1 stack-sync uop, so 5 or 6. But if we optimize for push/pop, giving up on micro-fusion of cmp+load, in favour of hopefully optimizing away a push/pop pair with the next function:

<SIGNED/UNSIGNED-COMPARISONS>  version 2
pop    ecx                          ; 1 byte, 1 uop, hopefully optimized out
xor    eax,eax       ; EAX=0        ; 2 bytes, 1 uop
pop    edx                          ; 1 byte, 1 uop
cmp    edx, ecx                     ; 2 bytes, 1 uop
setl   al          ; 3 bytes, 1 uop ;setle, setg, setge, setb, setbe, seta, setae, sete, setne for other comparisons
push   eax                          ; 1 byte, 1 uop, hopefully optimized out
  ;; no references to [esp] needing a stack-sync uop

4 to 6 uops if the leading pop and/or trailing push optimize out. 8 to 10 bytes. Being able to optimize out the trailing push also saves an instruction in the next block if it happens. But even more importantly, avoids store/reload latency on the critical path dependency chain.

It's barely worse in the case where it can't optimize away the trailing push, and better if it can, so this is probably good.

If you can select between pop/operate/push versions vs. op [esp] versions based on whether optimizing away the push/pop is possible, that would let you choose the best of both worlds.


<DIVIDE-UNSIGNED-STACK> There's no reason to use EBX here; you can use ECX, like you're already clobbering for shifts, letting you potentially keep something useful in EBX. You can also use xor-zeroing for EDX. For some reason you decided to pop both operands here ahead of div, instead of div dword [esp]. That's fine, as discussed above push/pop can hopefully optimize away.

<DIVIDE-UNSIGNED-STACK>
pop ecx
pop eax
xor edx, edx
div ecx
push eax    ;edx for modulo

Your AND-STACK-LOGICAL should compute in registers, not memory. It's annoying to get a full dword width result and also avoid false dependencies / partial register shenanigans. For example, even GCC chooses to write DL without first xor-zeroing EDX for return a && b;: https://godbolt.org/z/hS9RrA. But fortunately we can choose wisely and mitigate the false dependency for modern CPUs by writing a register that was already part of the dependency chain leading to this instruction. (and eax,edx after writing dl would still lead to a partial-register stall on Nehalem / Core2 or earlier, but those are obsolete.)

<AND-STACK-LOGICAL>
pop   ecx
pop   edx
xor   eax, eax
test  ecx, ecx         ; peephole optimization for cmp reg,0  when you have a value in a reg
setnz al               ; same instruction (opcode) as setne, but the semantic meaning for humans is better for NZ than NE
test  edx, edx
setnz dl
and   eax, edx         ; high garbage in EDX above DL ANDs away to zero.
push  eax
; net effect: stack shrinks by one element (2 pops + 1 push)

This is going to be smaller as well as much better than your AND-STACK-LOGICAL, avoiding multiple store-forwarding stalls.


<DEREF-STACK> could also use some work, as Nate points out:

<DEREF-STACK>
pop eax                   ;eax = address
push dword [eax]          ;push *address 

(or to optimize for push/pop elimination, mov eax, [eax] / push eax.)

Peter Cordes
  • Wow, this is way more than I was hoping for! Thanks a lot for your very insightful and detailed response! I'll make sure to implement your recommendations. I can understand this approach being a "dead end", but I think if my goal was purely performance, I'd move to a more "register allocation" based approach rather than stack based. As you say, since it's educational/a hobby it's fine. – Hrothgar Jun 25 '20 at 16:10
  • As an aside, do you think it'd perform better if I targeted LLVM IR instead of x86 and then ran a professional LLVM optimizer? The only downside I'd see here is that it'd lack any front-end optimizations, since much of the semantic information about the original program would be lost at that point. – Hrothgar Jun 25 '20 at 16:21
  • 1
    @Hrothgar: The LLVM back-end apparently does ok when a front-end hands it a big mess; at least I've read that `rustc` makes pretty bloated LLVM-IR. So yes, I'd expect similar levels of whole-function optimization from LLVM as you'd get when compiling Rust or C. (And maybe even inlining; I think for LTO to work implies that inlining must be possible in the back-end, not just front-end.) It might be more work to generate LLVM-IR, and compile times would likely be slower especially with optimization enabled, but final code quality should be much better for code with more than 1 live var! – Peter Cordes Jun 25 '20 at 19:44
  • 1
    @Hrothgar: LLVM-IR is like an assembly language with unlimited registers, leaving spill/reload decisions for any given target to the back-end optimizer. I haven't used it directly much, so IDK if you could feed it stuff that looked like pop/compute/push or whether you'd have to change your style of code-gen significantly. – Peter Cordes Jun 25 '20 at 19:46