I've implemented multi-precision addition using the following code:
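    bool carry{};
    std::array<uint64_t, N> r{};
    for (std::size_t i = 0; i < N; ++i) {
        uint64_t aa = a[i];
        // Widen to 128 bits so the sum of two limbs plus the carry cannot overflow.
        __uint128_t res = static_cast<__uint128_t>(aa) + b[i] + carry;
        carry = res >> 64;                  // bit 64 is the carry out
        r[i] = static_cast<uint64_t>(res);  // low 64 bits are the result limb
    }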
And clang++ 6.0 produced the following assembly:
400a49: 4c 01 c1 add %r8,%rcx
400a4c: 66 49 0f 38 f6 c1 adcx %r9,%rax
400a52: 66 49 0f 38 f6 f2 adcx %r10,%rsi
400a58: 66 48 0f 38 f6 d7 adcx %rdi,%rdx
Can anyone explain why clang chooses adcx over adc? As far as I can tell, the two have the same execution time, but the encoding of adc is 3 bytes vs. 6 for adcx.
Update: I played with it a bit more, and the behavior seems quite random. If the arguments are passed by const reference I get adcx (https://godbolt.org/g/noFZNS); if I pass them by value I get adc:
and if the code is not inside a function, just inlined in main, it's a total mess.
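
For anyone who wants to reproduce the comparison from the update, here is a minimal self-contained sketch of the two variants. The names `Limbs`, `add_by_ref` and `add_by_value` and the limb count `N = 4` are my own assumptions for illustration; they are not taken from the godbolt link. `__uint128_t` is a GCC/clang extension, and adcx can only show up when the target supports the ADX extension (for example with `-march=native` on a recent x86-64 CPU), so the exact output will likely depend on the compiler version and flags.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t N = 4;                 // assumed limb count (4 x 64 bits)
using Limbs = std::array<std::uint64_t, N>;  // hypothetical alias, least significant limb first

// Variant 1: operands passed by const reference.
Limbs add_by_ref(const Limbs& a, const Limbs& b) {
    bool carry{};
    Limbs r{};
    for (std::size_t i = 0; i < N; ++i) {
        // 128-bit sum of both limbs plus the incoming carry.
        __uint128_t res = static_cast<__uint128_t>(a[i]) + b[i] + carry;
        carry = res >> 64;                       // carry out of this limb
        r[i] = static_cast<std::uint64_t>(res);  // low 64 bits
    }
    return r;  // the final carry is dropped, as in the original loop
}

// Variant 2: same loop, operands passed by value.
Limbs add_by_value(Limbs a, Limbs b) {
    bool carry{};
    Limbs r{};
    for (std::size_t i = 0; i < N; ++i) {
        __uint128_t res = static_cast<__uint128_t>(a[i]) + b[i] + carry;
        carry = res >> 64;
        r[i] = static_cast<std::uint64_t>(res);
    }
    return r;
}
```

Both functions compute the same result; only the calling convention differs, which makes it easier to diff the generated assembly for the two cases side by side.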