A compiler-generated cmp/jcc on acpi_version == 0 is about as cheap as you're going to get in the general case. It should predict very well, because the branch goes the same way every time, and the cost of the branch itself is quite low even if it's taken every time. (Taken branches cost slightly more than not-taken branches, because of their effect on the front-end fetch/decode stages and because they break up the used parts of the code across more I-cache lines.)
CPUs can't store compare results in any special way that's faster than testing an integer for non-zero. (i.e. a zero / non-zero integer already is how you would store a compare result!)[1]
In the specific case where the two sides of your if are very similar, there might be room for savings, but an easily-predicted compare+branch is very cheap. Programs are full of compare+branch, so modern CPUs have to be very good at running them. Something like 10 to 25% of the instruction count in a normal program is compare&branch, including unconditional jumps (don't quote me on the exact number; I've heard this but couldn't find a reliable source with a quick search). More branches take up slightly more branch-prediction resources, or on average worsen the prediction of other branches, but this is a small effect, too.
The compiler will already hoist the check out of loops after inlining. (By far the most important thing you can do here is make sure small accessor functions like sdt_wrapper::get_table can inline, either by defining them in the .h or by using link-time optimization.) Inline asm can only make things worse (http://gcc.gnu.org/wiki/DontUseInlineAsm) unless you do some super-hack like putting a label in the asm so you can patch the code later.
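To illustrate why inlining matters here, consider a minimal sketch (the names and table layout below are invented for illustration, not the OP's actual code). Because the accessor's body is visible at the call site, the compiler can inline it and hoist the invariant acpi_version check out of the loop, typically compiling the function into one branch followed by two separate loop bodies:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: set once at bootup, then invariant.
static int acpi_version;
static const uint32_t *rsdt_tab;  // 32-bit table entries (ACPI 1.0)
static const uint64_t *xsdt_tab;  // 64-bit table entries (ACPI 2.0+)

// Defined in the header (or visible via LTO), so it can inline.
static inline uint64_t get_entry(size_t i) {
    return (acpi_version == 0) ? rsdt_tab[i] : xsdt_tab[i];
}

uint64_t sum_entries(size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += get_entry(i);  // after inlining, the version check is hoisted out of this loop
    return sum;
}
```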
If you compare so often that you think it would be worth keeping acpi_version in a fixed global register dedicated only to that (a global register variable, which GNU C++ supports, but which would probably not actually be good[2] even if you think it might be), then you could instead make the condition a template parameter for all your code (or a static constexpr, or a CPP macro) and build two versions of your code: one for true and one for false. When you find out the value of the condition at bootup, unmap and reclaim the pages holding the version that will never run, and jump to the version that will. (Or for a non-kernel, i.e. a normal program running in user-space under an OS, it's usually not a perf problem to just leave clean pages mapped, especially if they're untouched (including by runtime relocation).)

if (acpi_version == 0) { rest_of_kernel<0>(...); } else { rest_of_kernel<1>(...); }

(omitting the unmap / free part).
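A minimal sketch of the template-parameter approach (names invented for illustration): inside each instantiation the condition is a compile-time constant, so the branch is folded away entirely, and the single runtime test happens once at the top-level dispatch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Each instantiation is compiled with the condition baked in;
// the if() below is constant-folded per instantiation.
template <bool UseXsdt>
uint64_t get_table_addr(const void *ptos, size_t index) {
    if (UseXsdt)
        return static_cast<const uint64_t*>(ptos)[index];
    else
        return static_cast<const uint32_t*>(ptos)[index];
}

// The only runtime test of acpi_version; everything it calls is branch-free
// with respect to the version.
uint64_t dispatch(int acpi_version, const void *ptos, size_t index) {
    if (acpi_version == 0)
        return get_table_addr<false>(ptos, index);
    else
        return get_table_addr<true>(ptos, index);
}
```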
If rsdt_ptr and xsdt_ptr are invariant, you could at least eliminate that extra level of indirection, if both PointerToOtherSDT arrays (PTOS for short) are in static storage.
Patching the code once at startup
You didn't tag an architecture, but your (broken in too many ways to mention) asm appears to be x86, so I'll talk about that. (All modern x86 CPUs have out-of-order execution and very good branch prediction, though, so there's probably not much to gain here.)
The Linux kernel does this, but it's complex: e.g. something like .pushsection list_of_addresses_to_patch; .quad .Lthis_instance%= ; .popsection to build an array of pointers (as a special linker section) to places that need to be patched, everywhere the asm statement is inlined. One place this trick is used is in patching lock prefixes to nop on uniprocessor machines running a kernel compiled with SMP support. This patching happens once, at bootup. (And it may even be able to patch the lock prefixes back in before hot-adding a CPU, because the mutex counters are still being maintained.)
In fact, Linux even uses asm goto and patches in either a jmp or a nop for invariant conditions like yours that are determined once at bootup, in bool _static_cpu_has(u16 bit) in arch/x86/include/asm/cpufeature.h. To start with, there's a jmp to a block that does a normal runtime check, testing a bit. But it uses .section .altinstructions,"a" / .previous to record where each jmp is, and the patch length / location. It appears cleverly constructed to work with both short (2-byte jmp rel8) and long (5-byte jmp rel32) jumps. So the kernel can patch all the locations where this code ends up, replacing the jmp with either a jmp to the right spot or a nop to fall through to the t_yes: return true label. gcc compiles this fairly nicely when you write if (_static_cpu_has(constant)) { ... }. After patching, at some point after CPU feature detection, you end up with just a NOP and then fall into the loop body. (Or possibly multiple short NOP instructions; I didn't check, but hopefully not!)
This is pretty damn cool, so I'm just going to copy the code because it's fun to see such a creative use of inline asm. I haven't looked for the code that does the patching, but obviously that + the linker script are the other key parts of this. I'm not trying to provide a workable version of it for this case, just show that the technique is possible, and where to find a GPLv2 implementation of it you could copy.
// from Linux 4.16 arch/x86/include/asm/cpufeature.h
/*
 * Static testing of CPU features. Used the same as boot_cpu_has().
 * These will statically patch the target code for additional
 * performance.
 */
static __always_inline __pure bool _static_cpu_has(u16 bit)
{
    asm_volatile_goto("1: jmp 6f\n"
             "2:\n"
             ".skip -(((5f-4f) - (2b-1b)) > 0) * "
                 "((5f-4f) - (2b-1b)),0x90\n"
             "3:\n"
             ".section .altinstructions,\"a\"\n"
             " .long 1b - .\n"         /* src offset */
             " .long 4f - .\n"         /* repl offset */
             " .word %P[always]\n"     /* always replace */
             " .byte 3b - 1b\n"        /* src len */
             " .byte 5f - 4f\n"        /* repl len */
             " .byte 3b - 2b\n"        /* pad len */
             ".previous\n"
             ".section .altinstr_replacement,\"ax\"\n"
             "4: jmp %l[t_no]\n"
             "5:\n"
             ".previous\n"
             ".section .altinstructions,\"a\"\n"
             " .long 1b - .\n"         /* src offset */
             " .long 0\n"              /* no replacement */
             " .word %P[feature]\n"    /* feature bit */
             " .byte 3b - 1b\n"        /* src len */
             " .byte 0\n"              /* repl len */
             " .byte 0\n"              /* pad len */
             ".previous\n"
             ".section .altinstr_aux,\"ax\"\n"
             "6:\n"
             " testb %[bitnum],%[cap_byte]\n"
             " jnz %l[t_yes]\n"
             " jmp %l[t_no]\n"
             ".previous\n"
             : : [feature]  "i" (bit),
                 [always]   "i" (X86_FEATURE_ALWAYS),
                 [bitnum]   "i" (1 << (bit & 7)),
                 [cap_byte] "m" (((const char *)boot_cpu_data.x86_capability)[bit >> 3])
             : : t_yes, t_no);
t_yes:
    return true;
t_no:
    return false;
}
Runtime binary patching for your specific case
In your specific case, the difference between your two versions is which global(?) pointer you're dereferencing, and the type of PTOS. Storing a pointer to the base of the right array (as a void* or char*) is easy with pure C++, but indexing differently is tricky. In your case, it's an array of uint32_t or uint64_t, as a flexible array member at the end of a struct. (Actually uint32_t PTOS[1], because ISO C++ doesn't support flexible array members; but if you're going to use GNU inline-asm syntax anyway, a proper flexible array member like uint32_t PTOS[] is probably a good idea.)
On x86-64, changing the scale factor in an indexed addressing mode from 4 to 8 would do the trick, because a 64-bit load vs. a zero-extending 32-bit load uses the same opcode, with just REX.W=0 (or no REX prefix) vs. REX.W=1 for the operand size. .byte 0x40; mov eax, [rdx + rdi*4] has the same length as mov rax, [rdx + rdi*8]. (The 0x40 byte in the first is a REX prefix with all its bits clear. The second version needs REX.W=1 for 64-bit operand size; the first one zero-extends into RAX by writing EAX. If the first version already needed a REX prefix for a register like r10, it would already have one.) Anyway, patching one to the other would be easy if you knew where all the relevant instructions were located.
If you had an infrastructure for recording places to patch, you'd use it to patch a mov instruction that takes a table pointer and index in registers, and returns a 64-bit value (from a 32- or 64-bit load). (And don't forget a dummy input to tell the compiler you actually read the memory pointed to by the table pointer; otherwise the compiler is allowed to do optimizations that could break your code, like moving stores across the asm statement.) But you'd have to be careful: inline asm can hurt optimization by defeating constant propagation (e.g. for index). At least if you omit volatile, the compiler is allowed to consider it a pure function of its inputs and CSE it.
Hacking this up in pure C++
On x86, the scale factor in an addressing mode has to be encoded into the instruction. Even with a runtime invariant, you (or the compiler) would still need a variable-count shift, or a multiply, to pull this off without self-modifying code (which compilers won't emit).
A variable-count shift costs 3 uops on Intel Sandybridge-family CPUs (http://agner.org/optimize/) because of legacy CISC semantics: count=0 leaves EFLAGS unmodified, so EFLAGS is an input to variable-count shifts. (Unless you let the compiler use BMI2 shlx for flagless shifts.) index += foo ? 0 : index would conditionally double index (a difference of one in the shift count), but doing that branchlessly is not worth it on x86 for a condition that predicts this well. Using a variable-count shift instead of a scaled-index addressing mode could easily be more costly than a well-predicted conditional branch.
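For concreteness, here's the conditional-doubling trick as code (illustrative names, not the OP's; assumes a little-endian machine like x86 for the mask step). Doubling the index turns base + index*4 into base + (2*index)*4, which addresses the same byte as base + index*8, so a single *4 addressing mode serves both element sizes:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Doubles the index (via cmov, not a shift) when entries are 8 bytes wide,
// then always does an 8-byte load and masks off the high half for 32-bit
// entries. Note the 64-bit over-read for 32-bit tables: fine mid-table,
// but it can cross a cache-line or page boundary at the end.
uint64_t get_entry(const char *base, size_t index, bool is64) {
    size_t i = index + (is64 ? index : 0);            // conditional doubling
    uint64_t entry;
    std::memcpy(&entry, base + i * 4, sizeof entry);  // always an 8-byte load
    entry &= is64 ? ~uint64_t(0) : uint64_t(0xFFFFFFFF);
    return entry;
}
```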
uint64_t vs. uint32_t without runtime patching is another problem: one version needs a zero-extending 32-bit load, and the other needs a 64-bit load (unless the upper bytes happen to always be zero for you?). We could always do a 64-bit load and then mask to keep or discard the upper 32 bits, but that needs another constant, and it can suffer a performance penalty if the load crosses a cache-line (or worse, page) boundary. e.g. if a 32-bit value is the last thing in a page, a normal 32-bit load would just load it, but a 64-bit load + mask would need data from the next page.
But with both of these things taken together, it's really not worth it. Just for fun, here's what you could do: source + asm output on the Godbolt compiler explorer
// I'm assuming rsdt_ptr and xsdt_ptr are invariants, for simplicity
static const char *selected_PTOS;
static uint64_t opsize_mask;   // 0x00000000FFFFFFFF or all-ones
static unsigned idx_scale;     // 2 or 3

// set the above when the value of acpi_version is found
void init_acpi_ver(int acpi_version) {
    bool use_xsdt = (acpi_version != 0);
    selected_PTOS = use_xsdt ? (const char*)xsdt_ptr->PTOS : (const char*)rsdt_ptr->PTOS;
    opsize_mask   = use_xsdt ? ~uint64_t(0) : uint64_t(0xFFFFFFFF);
    idx_scale     = use_xsdt ? 3 : 2;   // log2 of the element size
}

// branchless, but slower than branching on a very predictable condition!
SDT::generic_sdt* sdt_wrapper::get_table(size_t index)
{
    const char *addr = selected_PTOS + (index << idx_scale);
    uint64_t entry = *reinterpret_cast<const uint64_t*>(addr);
    entry &= opsize_mask;   // zero-extend if needed
    return reinterpret_cast<SDT::generic_sdt*>(entry);
}
Asm output from Godbolt (with simpler types so it actually compiles)
get_table(unsigned long):
mov ecx, DWORD PTR idx_scale[rip]
mov rax, QWORD PTR selected_PTOS[rip] # the table
sal rdi, cl
mov rax, QWORD PTR [rax+rdi] # load the actual data we want
and rax, QWORD PTR opsize_mask[rip]
ret
With inlining and CSE, the compiler could keep some of those mask and shift-count values in registers, but that's still extra work (and ties up registers).
And BTW, do not make the static vars locals inside a function; that would force the compiler to check every time whether this is the first time the function has executed. The fast path for a static local (all that runs once the dust has settled from the init code) is pretty cheap, but it's about the same cost as what you're trying to avoid: a branch on an integer being non-zero!
int static_local_example() {
    static int x = ext();
    return x;
}
# gcc7.3
movzx eax, BYTE PTR guard variable for static_local_example()::x[rip]
test al, al
je .L11
# x86 loads are always acquire-loads, other ISAs would need a barrier after loading the guard
mov eax, DWORD PTR static_local_example()::x[rip]
ret
A static function pointer (at class or file scope, not a function-local static) is worth considering, but replacing a conditional branch with an unconditional indirect call is unlikely to be a win. And then you have function-call overhead (clobbered registers, arg passing). Compilers would normally try to devirtualize this back into a conditional branch as an optimization!
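For reference, the function-pointer version would look something like this (a sketch with invented names; as argued above, it's probably not faster than the branch it replaces):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical tables, set up before init_dispatch() runs.
static const uint32_t *rsdt_tab;
static const uint64_t *xsdt_tab;

static uint64_t get_rsdt(size_t i) { return rsdt_tab[i]; }
static uint64_t get_xsdt(size_t i) { return xsdt_tab[i]; }

// File-scope pointer, set once at bootup; every call site then does an
// unconditional indirect call instead of a compare-and-branch.
static uint64_t (*get_entry)(size_t);

void init_dispatch(int acpi_version) {
    get_entry = (acpi_version == 0) ? get_rsdt : get_xsdt;
}
```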
Footnote 1: If your condition had been acpi_version == 4, then MIPS could have saved one instruction by storing a 0 / 1 result. Instead of comparing into flags, it has compare-into-register, branch instructions that compare against zero or a register, and a register that always reads as zero. Even on x86, comparing for zero / non-zero saves a byte of code size if the value is already in a register (test eax,eax vs. cmp eax,4). It would save more if the value were the result of an ALU instruction (so ZF would already be set), but it's not.
But most other architectures compare into flags, and you can't load from memory directly into flags. So you'd only want to store a static bool result if acpi_version were expensive to compare, like an integer wider than a register, such as __int128, or int64_t on a 32-bit machine.
Footnote 2: Don't use a global register variable for acpi_version; that would be silly. If it's used everywhere, then hopefully link-time optimization can do a good job of hoisting the comparison out of loops. Branch prediction + speculative execution mean that the CPU doesn't actually have to wait for the load result when you branch on it, and if you read it all the time it will stay hot in L1d cache anyway. (Speculative execution means control dependencies aren't part of the critical path, assuming correct branch prediction.)
PS: if you made it this far and understood everything, then you should consider using binary patching like Linux does for a few frequently-checked conditions. If not, you probably shouldn't!