
The x86-64 SysV ABI specifies, among other things, how function parameters are passed in registers (first argument in rdi, then rsi, and so on), and how integer return values are passed back (in rax, with rdx holding the upper half of 128-bit values).

What I can't find, however, is what the high bits of parameter or return value registers should be when passing types smaller than 64-bits.

For example, for the following function:

void foo(unsigned x, unsigned y);

... x will be passed in rdi and y in rsi, but they are only 32 bits wide. Do the high 32 bits of rdi and rsi need to be zero? Intuitively I would assume yes, but the code generated by all of gcc, clang and icc includes specific mov instructions at the start of the callee to zero out the high bits, so it seems the compilers assume otherwise.

Similarly, the compilers seem to assume that the high bits of the return value rax may have garbage bits if the return value is smaller than 64-bits. For example, the loops in the following code:

unsigned gives32();
unsigned short gives16();

long sum32_64() {
  long total = 0;
  for (int i=1000; i--; ) {
    total += gives32();
  }
  return total;
}

long sum16_64() {
  long total = 0;
  for (int i=1000; i--; ) {
    total += gives16();
  }
  return total;
}

... compile to the following in clang (and other compilers are similar):

sum32_64():
...
.LBB0_1:                               
    call    gives32()
    mov     eax, eax
    add     rbx, rax
    inc     ebp
    jne     .LBB0_1


sum16_64():
...
.LBB1_1:
    call    gives16()
    movzx   eax, ax
    add     rbx, rax
    inc     ebp
    jne     .LBB1_1

Note the mov eax, eax after the call returning 32-bits, and the movzx eax, ax after the 16-bit call - both have the effect of zeroing out the top 32 or 48 bits, respectively. So this behavior has some cost - the same loop dealing with a 64-bit return value omits this instruction.

I've read the x86-64 System V ABI document pretty carefully, but I couldn't find whether this behavior is documented in the standard.

What are the benefits of such a decision? It seems to me there are clear costs:

Parameter Costs

Costs are imposed on the callee when dealing with parameter values. Granted, this cost is often zero because the function can effectively ignore the high bits, or the zeroing comes for free since 32-bit operand-size instructions can be used, which implicitly zero the high bits.

However, costs are often very real in the cases of functions that accept 32-bit arguments and do some math that could benefit from 64-bit math. Take this function for example:

uint32_t average(uint32_t a, uint32_t b) {
  return ((uint64_t)a + b) >> 1;
}

This is a straightforward use of 64-bit math to implement a function that would otherwise have to deal carefully with overflow (the ability to transform many 32-bit functions in this way is an often-unnoticed benefit of 64-bit architectures). It compiles to:

average(unsigned int, unsigned int):
        mov     edi, edi
        mov     eax, esi
        add     rax, rdi
        shr     rax, 1
        ret  

Fully 2 out of the 4 instructions (ignoring ret) are needed just to zero out the high bits. This may be cheap in practice with mov-elimination, but still it seems a big cost to pay.

On the other hand, I can't really see a corresponding cost for the callers if the ABI were to specify that the high bits are zero. Because rdi, rsi and the other parameter-passing registers are scratch (i.e., can be overwritten by the caller), you only have a couple of scenarios (we look at rdi, but substitute the parameter register of your choice):

  1. The value passed to the function in rdi is dead (not needed) in the post-call code. In that case, whatever instruction last assigned to rdi simply has to assign to edi instead. Not only is this free, it is often one byte smaller if you avoid a REX prefix.

  2. The value passed to the function in rdi is needed after the function. In that case, since rdi is caller-saved, the caller needs to do a mov of the value to a callee-saved register anyway. You can generally organize it so that the value starts in the callee saved register (say rbx) and then is moved to edi like mov edi, ebx, so it costs nothing.

I can't see many scenarios where the zeroing costs the caller much. One example would be if the last instruction that assigned rdi needed 64-bit math. That seems quite rare, though.

Return value costs

Here the decision seems more neutral. Having callees clear out the junk has a definite cost (you sometimes see mov eax, eax instructions doing exactly this), but if garbage is allowed the cost shifts to the caller. Overall, it seems more likely that the caller can clear the junk for free, so allowing garbage doesn't seem detrimental to performance overall.

I suppose one interesting use-case for this behavior is that functions operating on different sizes can share an identical implementation. For example, all of the following functions:

short sums(short x, short y) {
  return x + y;
}

int sumi(int x, int y) {
  return x + y;
}

long suml(long x, long y) {
  return x + y;
}

Can actually share the same implementation¹:

sum:
        lea     rax, [rdi+rsi]
        ret

¹ Whether such folding is actually allowed for functions that have their address taken is very much open to debate.

BeeOnRope
  • Doesn't appear to be specified for the INTEGER class type in either the i386 or the x86-64 SysV ABI (even the latest revisions). There is a discussion over this on the GCC mailing list as well https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46942 – Michael Petch Nov 08 '16 at 04:27
  • One other thing to note that may be of interest is that in the original i386 ABI this was specified: _Functions pass all integer-valued arguments as words, expanding or padding signed or unsigned bytes and halfwords as needed._ In that ABI these definitions are used: _Within this specification, the term halfword refers to a 16-bit object, the term word refers to a 32-bit object, and the term doubleword refers to a 64-bit object._ The ABI in question (circa 1997) can be found here: http://www.sco.com/developers/devspecs/abi386-4.pdf – Michael Petch Nov 08 '16 at 04:49
  • Great link to the gcc issue. Feel free to summarize it into an answer if you want, or I will at some point. The other link to the i386 document is interesting, in as much as the current "unwritten ABI" behavior per the gcc link seems to follow that convention up to 32 bits (i.e., 8-bit or 16-bit values _are_ zero/sign extended by the _caller_), but the convention for the high 32 bits of 64-bit registers is different (8-, 16- or 32-bit values are _not_ zero/sign extended to 64 bits by the caller). – BeeOnRope Nov 08 '16 at 15:43
  • Possible duplicate of http://stackoverflow.com/questions/36706721/is-a-sign-or-zero-extension-required-when-adding-a-32bit-offset-to-a-pointer-for. TL:DR: yes, as confirmed this year by Michael Matz (one of the ABI maintainers), args can contain high garbage, including XMM registers holding scalar floats. (so don't raise spurious FP exceptions with careless use of packed ops on args of scalar functions.) For integer, narrow function args are sign- or zero-extended to 32-bit by the caller. (This is undocumented behaviour that clang depends on.) Only args, not return values. – Peter Cordes Nov 08 '16 at 23:02
  • My guess is that the main motivation for leaving things to the receiving side is that it might want to sign-extend or zero-extend, and the caller doesn't know which. It's too bad that receivers can't take advantage of the fact that callers usually zero the upper bits for free, but it does make things more robust against prototype mismatches. (i.e. you will never access outside of a 4GB array when indexing it with a `uint32_t` function arg, even if the caller thinks you take a `uint64_t` and passes a larger value.) – Peter Cordes Nov 08 '16 at 23:09
  • I agree the cost for the caller to zero-extend 32-bit args is usually zero, but very often people pass signed integers, and the callee has to sign-extend. Perhaps the idea is that each function has multiple callers (or else if should just be inlined), so the cost in static instruction count / code-size / cache-footprint is lower if any work that needs doing is done in the callee. – Peter Cordes Nov 15 '16 at 20:26
  • I'm not a huge fan of the current design, especially for small functions, but it is robust. (Of course, small functions should be inlined...) Fun fact: IIRC, `mov same, same` always needs an execution unit on Intel CPUs, but MOV between two different architectural registers can be eliminated (IvB and later). – Peter Cordes Nov 15 '16 at 20:27
  • Well yeah, except for the part about 8-bit and 16-bit parameters being zero extended, i.e., the opposite behavior is the convention there? I guess the current state came about largely as a result of existing conditions on the ground wrt compiler behavior, with a dash of i386 influence. – BeeOnRope Nov 15 '16 at 20:30
  • Yeah about `mov same, same` not being eliminated, it is odd to me since obviously something like `mov rax, rax` is a trivial no-op and I would have thought `mov eax, eax` could be handled by the same renaming machinery that also makes `mov eax, ebx` work. AFAIK the latter works by pointing architectural register `rax` to the same underlying physical register as `rbx`, but with a 'zero high bits' indicator so that when `rax` is read, the high bits are zero rather than coming from the physical register. It _seems_ like `mov eax, eax` could be implemented in the same way... – BeeOnRope Nov 15 '16 at 20:39
  • ... i.e., just updating the arch -> phys pointer for `rax` to have the `32-bit zero` bit set, so even slightly easier than the other renaming. The parallel renamer is a highly performance sensitive bit of hardware with pretty intense critical paths though, so perhaps "can do it" boils down to "aren't doing it" just because the different register case is much more common and maybe at the hardware level they aren't nearly as common as I would imagine. – BeeOnRope Nov 15 '16 at 20:41
  • (only just now saw your replies since you didn't notify me with a \@peter). IDK why it isn't eliminated, I agree it's odd. I agree that it's most likely due to some extra complication for the renamer, and handling it wasn't deemed worth it. Remember that mov-elimination usually works even for results that aren't ready yet, but a physical reg does have to be reserved in any case. Still, it's not just the physical reg, it's also forwarding through the bypass network. (None of this explains why the same architectural register is a special case, just that It's Complicated.) – Peter Cordes Nov 21 '16 at 23:55
  • Also, mov-elimination failure isn't horrible for performance in most cases. SnB is pretty fast, and didn't have it at all! Agner Fog's manual says (for IvB) "It fails when the necessary operands are not ready. But typically, move elimination succeeds in more than 80% of the possible cases." IDK if that's strictly true (since it seems incompatible with 80% success), or if it really fails as often as 20%. IACA thinks it never fails. If you're feeling ambitious, it might make a fun project to see what kind of loop you need to construct to make it (almost) always fail. – Peter Cordes Nov 21 '16 at 23:59
  • @PeterCordes I never understood the "necessary operands are not ready" part anyway. Yeah, I can't imagine `mov` elimination is critical since even without it's a 1 uop instruction that can issue on 4 ports usually. It's just a minor bump overall like many of the recent architectural improvements. – BeeOnRope Nov 23 '16 at 02:08
  • @BeeOnRope: It's most useful for latency as part of a loop-carried dep chain, which is the one case where Agner's guide suggests it wouldn't work (but IACA doesn't treat it that way, so I assume it is normally eliminated the way IACA thinks). For throughput, I think it mostly only avoids resource conflicts, and I guess leaves more room in the RS for OOO to see farther ahead. Vector mov-elimination is more useful because MOVDQA can only run on 3 ports. – Peter Cordes Nov 23 '16 at 03:33

1 Answer

It looks like you have two questions here:

  1. Do the high bits of a return value need to be zeroed before returning? (And do the high bits of arguments need to be zeroed before calling?)
  2. What are the costs/benefits associated with this decision?

The answer to the first question is no, there can be garbage in the high bits, and Peter Cordes has already written a very nice answer on the subject.

As for the second question, I suspect that leaving the high bits undefined is overall better for performance. On one hand, zero-extending values beforehand comes at no additional cost when 32-bit operations are used. But on the other hand, zeroing the high bits beforehand is not always necessary. If you allow garbage in the high bits, then you can leave it up to the code that receives the values to only perform zero-extensions (or sign-extensions) when they are actually required.

But I wanted to highlight another consideration: Security

Information leaks

When the upper bits of a result are not cleared, they may retain fragments of other pieces of information, such as function pointers or addresses in the stack/heap. If there ever exists a mechanism to execute higher-privileged functions and retrieve the full value of rax (or eax) afterwards, then this could introduce an information leak. For example, a system call might leak a pointer from the kernel to user space, leading to a defeat of kernel ASLR. Or an IPC mechanism might leak information about another process' address space that could assist in developing a sandbox breakout.

Of course, one might argue that it is not the responsibility of the ABI to prevent information leaks; it is up to the programmer to implement their code correctly. While I do agree, mandating that the compiler zero the upper bits would still eliminate this particular form of information leak.

You shouldn't trust your input

On the other side of things, and more importantly, the compiler should not blindly trust that any received values have their upper bits zeroed out, or else the function may not behave as expected, and this could also lead to exploitable conditions. For example, consider the following:

unsigned char buf[256];
...
__fastcall void write_index(unsigned char index, unsigned char value) {
    buf[index] = value;
}

If we were allowed to assume that index has its upper bits zeroed out, then we could compile the above as:

write_index:  ;; sil = index, dil = value
      ; movzx esi, sil       ; skipped based on assumptions
    mov [buf + rsi], dil
    ret

But if we could call this function from our own code, we could supply a value in rsi outside the [0,255] range and write to memory beyond the bounds of the buffer.

Of course, the compiler would not actually generate code like this, since, as mentioned above, it is the responsibility of the callee to zero- or sign-extend its arguments, rather than that of the caller. This, I think, is a very practical reason to have the code that receives a value always assume that there is garbage in the upper bits and explicitly remove it.

(For Intel IvyBridge and later (mov-elimination), compilers would hopefully zero-extend into a different register to at least avoid the latency, if not the front-end throughput cost, of a movzx instruction.)

Peter Cordes
user1354557
  • Info leaks can only be called that across some kind of privilege boundary, not for function calls within a process. Calling a library function gives it total control over your process, so you're screwed if it's untrustworthy. Your argument about reducing attack surface by having a function make defensive assumptions is good, though. (gcc behaves like that, but clang doesn't. e.g. it would emit `mov [buf+rsi], dil` / `ret`. (If it was going to separately put an address into a reg, it would be with a RIP-relative LEA. Static non-PIC addresses are in the low 2GB in the default code model). – Peter Cordes Nov 10 '16 at 21:14
  • I assume that Linux (the kernel) is careful to zero-extend to fill RAX on return from system calls, presumably even in weird cases like using the 32-bit `int 0x80` ABI from a 64-bit process. On process startup, it zeros all regs except RSP, even though the ABI says they hold garbage at that point. (Dynamically linked processes do get garbage in regs at the start of `_start`, because the dynamic linker runs first, in the context of the process and has no info to leak, so it just follows the ABI and doesn't bother zeroing regs.) – Peter Cordes Nov 10 '16 at 21:16
  • actually clang only assumes zero/sign-extension to 32-bit, not to 64-bit, like I said in my other answer. It would probably use `mov eax, esi` to zero-extend, rather than using an address-size prefix (which would be safe for a static `buf` known to be in the low 31 bits: `mov [buf+esi], dil`) – Peter Cordes Nov 10 '16 at 21:34
  • @PeterCordes and @user1354557 - to clarify though, Peter's answer makes a distinction between return values and parameters: (1) for _return values_, it seems that _all_ bits beyond the size of the returned value can contain garbage. E.g., if you return a `char` value, then bits `[63:8]` may have garbage. For _parameter values_ on the other hand, it seems that they are **extended to 32-bits**, but not to 64-bits? So the rules are fairly complex and it's pretty awful they aren't documented. – BeeOnRope Nov 15 '16 at 19:10
  • About not trusting your inputs - the example is interesting, but it's also a bit contrived. _Usually_ "not trusting your inputs" means that the inputs need to be explicitly validated by user-written code. The example is an unusual case where, because of the size of the arguments, no validation is necessary because of the limited range of `char` - but it may be broken by "garbage bits". In almost every real-world example however, the code is going to need to test buffer sizes and boundaries explicitly, because they can't rely on these implicit guarantees. – BeeOnRope Nov 15 '16 at 19:14
  • @BeeOnRope: yep, that's correct, and agreed that it's really bad that this part of the ABI isn't documented, since clang is already relying on it. Also agreed that this example is a bit weird. You have to trust your caller to some degree, and that not all functions need to treat their args as untrusted user-input – Peter Cordes Nov 15 '16 at 19:15
  • I suppose though it wouldn't even be _possible_ to check the input values in your example. If you have a function like `void foo(unsigned char x)` and you write `if (x > 255)` in an attempt to validate the input, the compiler may just omit it since such a condition is trivially false? It's an interesting case that kind of falls between the cracks of the typical caller/callee contract. – BeeOnRope Nov 15 '16 at 19:16
  • @PeterCordes ... when you say "clang is relying on it", do you mean the extension up to 32-bits of smaller-than-32-bits args? – BeeOnRope Nov 15 '16 at 19:17
  • @BeeOnRope: Yes, like I said in my answer on the other question, clang generates code that assumes that, but gcc doesn't. With an example IIRC. – Peter Cordes Nov 15 '16 at 19:17
  • @BeeOnRope: you can't write such checks in C, the compiler already takes care of following the ABI and not having out-of-range values for variables. If you were hand-writing the asm, you could `movzx eax, al` if you didn't trust the caller to have done that for you. – Peter Cordes Nov 15 '16 at 19:28
  • @BeeOnRope Regarding bounds checks and sanitizing bad input, I would expect something like `volatile unsigned int xl = x; xl &= 0xFF;` to work but I have not tried – user1354557 Nov 15 '16 at 19:45
  • @PeterCordes - right, which is of course my point - the ABI and contracts in general are two-sided: the caller is supposed to follow the rules, and the callee is supposed to follow the rules. Also, the _callee_ often validates that the caller followed the rules, to the extent possible (like checking buffer sizes). In this weird case, you can't check the values even if you wanted to. In the `clang` example, you can have a `char` with values > 255, impossible according to the language spec. Sure, the caller didn't follow the (debated) contract, but I'd expect to be able to check for invalid values. – BeeOnRope Nov 15 '16 at 19:54
  • @user1354557 - in `clang` at least, it works: https://godbolt.org/g/jl6NZk - using volatile causes the mask to be applied. The code it generates is really slow though! – BeeOnRope Nov 15 '16 at 19:55
  • ... so I conclude that on `clang` at least, for security sensitive functions, it's probably best to use 32-bit or larger arguments, or else you may be opening yourself up to subtle exploits with high garbage bits. – BeeOnRope Nov 15 '16 at 19:58
  • @BeeOnRope: yes, I see your point. It kind of occurred to me while writing my last comment, too, that there's no good way to detect ABI violations from within C, e.g. linking code from different compilers together. Or hand-written asm callers. I wonder if clang has an option to control this undocumented-ABI behaviour? – Peter Cordes Nov 15 '16 at 20:02
  • @user1354557: C would benefit enormously from an operator which, given an lvalue whose type has no trap representations, would yield its defined value (if it has one), or any arbitrary valid value (if it doesn't). If programmers are allowed to let things that may or may not be used hold indeterminate values, but can then launder them if they're actually needed, that will allow many optimizations that would be impossible if programmers have to set everything to deterministic values even when code doesn't care what value they hold. – supercat Sep 29 '17 at 21:27