3

Clearly, fixed-width integral types should be used when the size is important.

However, I read (in the Insomniac Games style guide) that `int` should be preferred for loop counters / function args / return codes / etc. when the size isn't important - the rationale given was that fixed-width types can preclude certain compiler optimizations.

Now, I'd like to make a distinction between "compiler optimization" and "a more suitable typedef for the target architecture". The latter has global scope, and my guess is that it has very limited impact unless the compiler can somehow reason about the global performance of the program parameterized by this typedef. The former has local scope, where the compiler would have the freedom to optimize the number of bytes used, and the operations, based on local register pressure / usage, among other things.

Does the standard permit "compiler optimizations" (as we've defined) for non-fixed-width types? Any good examples of this?

If not, and assuming the CPU can operate on smaller types at least as fast as larger types, then I see no harm, from a performance standpoint, in using fixed-width integers sized according to local context. At least that gives the possibility of relieving register pressure, and I'd argue it couldn't be worse.
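To make this concrete, here is a hypothetical sketch (names and sizes made up for illustration) of what I mean by sizing according to local context:

```c
#include <stdint.h>

/* Hypothetical example: the counter can never exceed 100, so a
   fixed-width 8-bit type is "big enough" for the local context.
   The question is whether this can ever hurt compared to int. */
int sum100(const int16_t *a)
{
    int s = 0;
    for (uint8_t i = 0; i < 100; i++)
        s += a[i];
    return s;
}
```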

Abel
  • If you are targeting a wide range of different architectures, then it makes sense to use a type such as `int` whose size might be defined more suitably for the target architecture. For example, making `int` 4 bytes on an 8-bit architecture that has no instructions that can handle 4-byte numbers doesn't make much sense. On the other hand, you've tagged this question x86. If you are only interested in x86, then I think that `int` is always 4 bytes on all 32-bit and 64-bit x86 platforms. So it is equivalent to `int32_t`. Unless of course you're interested in very old x86 archs. – Hadi Brais Feb 22 '19 at 11:53
  • 1
    If the size of `int` is the same as the size of a fixed-width signed integer type for a particular architecture, then it doesn't matter which type you use and the same exact binary code will be produced. – Hadi Brais Feb 22 '19 at 11:55
  • Yes, I've tagged x86 since it's the only architecture I need to deal with, now that gaming consoles have left the PowerPC world. Understood WRT to the last comment - this is what I expected. – Abel Feb 22 '19 at 11:59
  • 1
    Yes. On all modern x86 platforms and all of the 4 major C/C++ compilers as far as I know, `int` is exactly equivalent to `int32_t`. These get resolved into the same type by the compiler frontend and have the same effect on compiler optimizations because they are the same type as far as the compiler backend is concerned. – Hadi Brais Feb 22 '19 at 12:03
  • And hence, doesn't it make sense to use smaller, fixed-width types pretty much everywhere? According to Agner's tables, for the most part, the latency and throughput of common operations for r/m 8/16/32/64 are similar (div aside). Thus, probably a good general strategy to keep register pressure down? – Abel Feb 22 '19 at 12:05
  • 1
    While it's true that it's always better code size-wise and perf-wise to use 32-bit registers instead of 64-bit registers (because you can avoid the REX prefix, see Section 10.2 of the Intel optimization manual), this does *not* apply to 16-bit and 8-bit registers. Using these partial registers may introduce additional uops and false register dependencies. – Hadi Brais Feb 22 '19 at 12:11
  • 1
    See: https://stackoverflow.com/questions/47052342/understanding-partial-register-slowdowns-from-mov-instead-of-movzx-instruction and https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to – Hadi Brais Feb 22 '19 at 12:12
  • "_Does the standard permit 'compiler optimizations' for non-fixed-width types?_" **Sure**, the as-if rule still applies. But you seem to be looking for a wider discussion comparing `int` with the fixed-width aliases (in which many trade-offs are involved), so I'm inclined to call this question too broad. – You Feb 22 '19 at 12:44
  • @HadiBrais The linked question https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to is particularly relevant here. I never knew about this! – Abel Feb 22 '19 at 12:58
  • @You I've been pretty clear that it's purely x86 performance. How is that "too broad"? – Abel Feb 22 '19 at 12:59
  • 1
    The first link also points you to https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers which talks about how GCC allocates registers for types smaller than 4 bytes. See for example https://godbolt.org/z/8tJpUa how the compiler has chosen to allocate 4-byte registers for variables that are smaller than 4 bytes for reasons discussed in my earlier comments. – Hadi Brais Feb 22 '19 at 13:27
  • @HadiBrais Yep, I stumbled onto that. Great answer there. Learned something today! Thank you sir. – Abel Feb 22 '19 at 13:37
  • @Abel - are you asking (for example) whether `int` which happens to be 64-bit on some platform may be faster than `int32_t` on that platform? Or are you asking (for example), on a platform where `int` and `int32_t` are both 32-bit, whether `int` can _still_ be faster? I interpreted it the second way, but most people here are answering the first. – BeeOnRope Feb 23 '19 at 00:09

5 Answers

4

The reason that the rule of thumb is to use an int is that the standard defines this integral type as the natural data type of the CPU (provided that it is sufficiently wide for the range INT_MIN to INT_MAX). That's where the best performance stems from.

levengli
  • And what in the world does "natural data type" mean on x86-based chips? From https://www.agner.org/optimize/instruction_tables.pdf, it seems like there is no "best" data type. Some operations are slower for smaller widths, some are faster. However, when they do differ, it's generally by 1 cycle of latency or so. This is dwarfed by the cost of register spillage. – Abel Feb 22 '19 at 11:16
  • x86 is 32 bits wide, hence the widely held false assumption that `sizeof(int) == sizeof(int32_t)` for all machines on the planet – levengli Feb 22 '19 at 11:19
  • I don't see how that comment adds anything to the discussion, nor do I think many people believe this assumption (or at least, I would hope) – Abel Feb 22 '19 at 11:25
  • 1
    "the standard defines this integral type as the natural data type of the CPU" It does not. This just happens to be true on 16- and 32-bit computers. But not on 8- and 64-bit ones. – Lundin Feb 22 '19 at 12:43
  • (provided that it is sufficiently wide for the range INT_MIN to INT_MAX) – levengli Feb 22 '19 at 12:50
  • But the natural data type of a 64 bit computer is supposedly 64 bits. Yet INT_MAX will be something like 2^31. – Lundin Feb 22 '19 at 13:43
  • The standard doesn't put an upper bound on the number of values that an `int` can represent. It puts a requirement on the compiler to be able to represent *at least* up to that value – levengli Feb 22 '19 at 14:58
  • @Lundin: Most 64-bit ISAs are extensions of 32-bit ISAs, and 32-bit is as natural as 64-bit. For x86-64, 64-bit operand size requires an extra REX.W instruction prefix, so `inc rcx` is 3 bytes vs. 2 for `inc ecx`, because nothing else in the instruction required a REX prefix. But you have a point for Alpha AXP: it was 64-bit from the ground up. It did have 32-bit versions of many important instructions like `add`. `int` being 32-bit in Alpha ABIs tells us that it's not so much "natural size" but "efficient and big enough". Wasting cache footprint on 64-bit `int[]` arrays would be bad. – Peter Cordes Feb 22 '19 at 20:45
3

There are many things wrong with the `int_fast` types - most notably that they can be slower than `int`!

#include <stdio.h>
#include <inttypes.h>
int main(void) {
    printf("%zu\n", sizeof (int_fast32_t));
}

Run this on x86-64 and it prints 8... but that makes no sense: using 64-bit registers in x86-64 mode often requires extra prefixes, and since behaviour on signed overflow is undefined, with a 32-bit `int` it doesn't matter if the upper 32 bits of the 64-bit register are set after arithmetic - the behaviour is still correct.


Even worse than using the signed fast or least types, however, is using a small unsigned integer instead of `size_t` or a signed integer for a loop counter - now the compiler must generate extra code to ensure the correct wraparound behaviour.

Antti Haapala
  • I'd rather say that this isn't a fault of the int_fast types, but the specific standard lib. – Lundin Feb 22 '19 at 12:08
  • @Lundin yes, it is not by necessity that `int_fast32_t` is wrong, but... in this case the "specific standard (not lib)" happens to be the System-V x86-64 ABI – Antti Haapala Feb 22 '19 at 12:09
  • @AnttiHaapala Can you give an example of the latter point? What extra code is needed? – Abel Feb 22 '19 at 13:04
  • @AnttiHaapala Thanks for the godbolt. I don't understand why this is necessary though. Am I missing something in the ABI? – Abel Feb 22 '19 at 13:28
  • @Abel string length can be 64 bits. unsigned int needs wraparound, int on overflow is undefined so all the wraparound issues can be ignored, so actually the entire *int* is elided and it uses a moving pointer instead... – Antti Haapala Feb 22 '19 at 13:29
  • I am not saying that you *should* use `int`, it is the wrong type to use for *that* code, it *should use `size_t`*, but it was the easiest example I could think of :D – Antti Haapala Feb 22 '19 at 13:31
  • @AnttiHaapala Oh, an uint overflow *is* defined? Somehow I missed the whole point of your strlen ha :) – Abel Feb 22 '19 at 13:36
  • The size_t comment is true on _most_ architectures... IIRC avr32 (and others?) has to do a lot more work for size_t. – technosaurus Feb 22 '19 at 13:44
  • @technosaurus but it is the proper type to use for iterating over strings, array indices... – Antti Haapala Feb 22 '19 at 13:47
  • @AnttiHaapala @technosaurus - while I understand it's the "proper" type via some notion of type safety, I'm mostly concerned with performance. As I mentioned in another comment, if we transitively follow type safety, the vast majority of types used would be `(s)size_t` – Abel Feb 22 '19 at 14:11
2

I'm not very familiar with the x86 instruction set, but unless you can guarantee that practically every arithmetic and move instruction also allows additional shifts and (sign) extensions, the assumption that smaller types are "at least as fast" as larger ones is not true.

The complexity of x86 makes it pretty hard to come up with simple examples, so let's consider an ARM microcontroller instead.

Let's define two addition functions which differ only by return type: `add32`, which returns an integer of full register width, and `add8`, which returns only a single byte.

int32_t add32(int32_t a, int32_t b) { return a + b; }
int8_t add8(int32_t a, int32_t b) { return a + b; }

Compiling those functions with -Os gives the following assembly:

add32(int, int):
        add     r0, r0, r1
        bx      lr
add8(int, int):
        add     r0, r0, r1
        sxtb    r0, r0 // Sign-extend single byte
        bx      lr

Notice how the function which returns only a byte is one instruction longer: it has to truncate the 32-bit addition to a single byte.

Here is a link to the code @ compiler explorer: https://godbolt.org/z/ABFQKe

Vinci
  • I'm not sure why this instruction is necessary. – Abel Feb 22 '19 at 13:15
  • Because add8 is only allowed to return a valid value within int8_t range. The compiler has to truncate the value inside r0 after the addition to ensure that. The very same thing is true for function arguments. – Vinci Feb 22 '19 at 17:32
  • 1
    @Abel: I think the ARM calling convention requires narrow args and return values to be zero- or sign-extended to full register width. x86-64 System V does *not* require that for return values, so a valid implementation of `add8` would be the same as `add32`: `lea eax, [rdi + rsi]` / `ret`. But narrow function args are extended to 32-bit by gcc and clang, and clang even depends on that for incoming args, going beyond the on-paper ABI. [Is a sign or zero extension required when adding a 32bit offset to a pointer for the x86-64 ABI?](//stackoverflow.com/a/36760539) – Peter Cordes Feb 22 '19 at 20:33
  • @Abel: If the inputs had been `int8_t` as well, the fact that signed overflow is undefined behaviour would mean that we could skip the `sxtb`. Adding two 8-bit integers correctly sign-extended to 32-bit will produce a correctly sign-extended 32-bit = 8-bit result, if there's no overflow. (I forget if casting a large integer from `int32_t` to `int8_t` counts as signed overflow. GCC obviously is defining the behaviour there, whether ISO C requires it or not.) – Peter Cordes Feb 22 '19 at 20:39
2

However, I read (Insomniac Games style guide), that "int" should be preferred for loop counters

You should rather be using `size_t` whenever iterating over an array. `int` has other problems than performance, such as being signed, and it is also problematic when porting.

From a standard point of view, for a scenario where "n" is the size of an int, there exists no case where int_fastn_t should perform worse than int - if it does, the compiler/standard lib/ABI/system has a fault.

Does the standard permit "compiler optimizations" (as we've defined) for non-fixed-width types? Any good examples of this?

Sure, the compiler might optimize the use of integer types quite wildly, as long as it doesn't affect the observable result - no matter if they are int or int32_t.

For example, a compiler for an 8-bit CPU might optimize int a=1; int b=1; ... c = a + b; to be performed with 8-bit arithmetic, ignoring integer promotions and the actual size of int. It will, however, most likely have to allocate 16 bits of memory to store the result.

But if we give it some rotten code like char a = 0x80; int b = a >> 1;, it will have to do the optimization so that the side effects of integer promotion are taken into account. That is, the result could be 0xFFC0 rather than 0x40 as one might have expected (assuming signed char, 2's complement, arithmetic shift). The a >> 1 part isn't possible to optimize to an 8-bit type because of this - it has to be carried out with 16-bit arithmetic.

Lundin
  • I disagree about `size_t` - but that's because I have the luxury of not needing to support imaginary architectures. Your first argument about `int` doesn't apply, since I've stated using fixed-width integral types is what should be done when size needs to be known (ex: portability). Your last point, if I understand, is that the compiler can make operation/size transformations freely if it can reason about the values of the operands? Additionally, it applies equally-well to fixed and non-fixed types. Thus, at least from these arguments, the "no worse off using fixed" idea still stands, no? – Abel Feb 22 '19 at 13:10
  • @Abel It has nothing to do with portability but type safety. `size_t` is simply the most proper type, and what you get returned/passed to all standard library functions. Indeed the compiler can perform optimizations if it can reason about the outcome. However, `int` plays a special role as it is the resulting type of implicit integer promotion from small integer types. And finally, I'm not raising "arguments", as that's not what this site is for. The general best practice is to use size_t and stdint.h as much as possible, then write code which contains as few implicit promotions as possible. – Lundin Feb 22 '19 at 13:40
  • If we respected "type safety" and transitively follow the consequence, most of the code would be written with `size_t`. A valid point about `int` and reducing implicit promotions - I hate the cursed things. And sorry, I didn't mean to imply "arguments" with a negative connotation - simply just "your points" in the discussion :) – Abel Feb 22 '19 at 14:04
0

I think the question you are trying to ask is:

Is the compiler allowed to make additional optimizations for a non-fixed-width type such as int beyond what it would be allowed for a fixed width type like int32_t that happens to have the same length on the current platform?

That is, you are not interested in the part where the size of the non-fixed width type is allowed to be chosen appropriately for the hardware - you are aware of that and are asking if beyond that additional optimizations are available?

The answer, as far as I am aware or have seen, is no. No, both in the sense that compilers do not actually optimize `int` differently than `int32_t` (on platforms where `int` is 32 bits), and also no in the sense that there are no optimizations allowed by the standard for `int` that are not also allowed for `int32_t`1 (this second part is wrong - see comments).

The easiest way to see this is that the various fixed-width integers are all typedefs of the underlying primitive integer types - so on a platform with 32-bit integers, int32_t will probably be a typedef (perhaps indirectly) of int. So from a behavioral and optimization point of view the types are identical, and as soon as you are in the IR world of the compiler, the original type probably isn't even available without jumping through hoops (i.e., int and int32_t will generate the same IR).

So I think the advice you received was wrong, or at best misleading.


1 Of course the answer to the question "Is it allowed for a compiler to optimize `int` better than `int32_t`?" is yes, since there are no particular requirements on optimization, so a compiler could do something weird like that - or the reverse, such as optimizing `int32_t` better than `int`. I think that's not very interesting though.

BeeOnRope
  • Yup, `int32_t` *is* `int` in normal compilers on normal 32-bit and 64-bit ISAs. `int32_t` is required to be 2's complement with no padding, but signed overflow of it is still UB so all the usual optimizations still apply. (Fun fact: `atomic<int>` is also required to be 2's complement, and overflow is *not* UB. But that's hardly useful because `atomic<int>` is so slow and optimization-defeating.) Anyway, a use-case for `int32_t` is code that uses 2's complement bithacks. – Peter Cordes Feb 23 '19 at 18:24
  • OK, interesting - so there is a difference between `int` and `int32_t` (even when they are the same size) in that the latter has more associated semantics. So I guess this answer is partly wrong: in theory the standard admits more optimizations on `int` since there are fewer restrictions, and perhaps some compiler could take advantage of that even on a non-weird platform. – BeeOnRope Feb 23 '19 at 20:18
  • 1
    A C implementation on a one's complement machine would probably just *not* provide `int32_t`. Types like `int_fast32_t` don't have any of the guarantees I mentioned, which is why they're non-optional but `int32_t` is. – Peter Cordes Feb 23 '19 at 20:48
  • @PeterCordes - right, but I'm also thinking of the case where the non-guarantees are used by a normal machine to implement some clever optimization. Similar to how the overflow non-guarantees are used on machines where overflow is perfectly reasonable to optimize things (even though I think the UB was put in place to accommodate other hardware archs, not to allow this on hardware archs with reasonable overflow semantics). – BeeOnRope Feb 23 '19 at 20:51
  • Given `sizeof(int) = 4` and fixed `INT_MAX` and `INT_MIN`, you can't introduce padding bits or shrink `int` on a per-function basis, or make it one's complement or sign/magnitude. But if you can prove that a private array (local and automatic storage, address doesn't escape) only needs to have 16-bit elements, the as-if rule already allows you to do that for `int` or `int32_t` e.g. on x86 with the x86-64 SysV ABI. An `int32_t` or `int` loop counter can already be optimized to 64-bit to remove the need for redoing sign-extension before array indexing; compilers do that in practice. – Peter Cordes Feb 23 '19 at 22:22
  • 2
    The differences are only in what implementation-defined behaviour choices are allowed. Once an implementation *makes* its choice for `int`, there's no difference in what's UB or not UB between `int32_t` and `int` if they are in fact the same (32-bit 2's complement no padding, like in every normal ABI on modern 32 and 64-bit ISAs). – Peter Cordes Feb 23 '19 at 22:24
  • @Peter, yeah I guess you are right. I suppose an implementation could do something perverse like choose different implementation-defined behaviors for `int` vs `int32_t` which somehow makes `int` faster but this seems extremely unlikely on modern hardware. – BeeOnRope Feb 25 '19 at 02:30
  • Yes, exactly. `int32_t` fits modern hardware perfectly. An implementation could choose right shifts to be arithmetic for one, logical for the other, but otherwise I can't think of anything useful. (And unfortunately ISO C/C++ don't require `int32_t` right shifts to be arithmetic, so a perverse implementation that wants them to be different could go either way. Seems like a missed opportunity to give C programmers a guaranteed way to get arithmetic right shifts.) – Peter Cordes Feb 25 '19 at 02:46