0

I've implemented multi-precision addition using the following code:

    bool carry{};
    std::array<uint64_t, N> r{};

    for (auto i = 0; i < N; ++i) {
       uint64_t aa = a[i];
       __uint128_t res = static_cast<__uint128_t>(aa) + b[i] + carry;
       carry = res >> 64;
       r[i] = res;
    }

And clang++6.0 produced the following assembly:

400a49: 4c 01 c1                add    %r8,%rcx
400a4c: 66 49 0f 38 f6 c1       adcx   %r9,%rax
400a52: 66 49 0f 38 f6 f2       adcx   %r10,%rsi
400a58: 66 48 0f 38 f6 d7       adcx   %rdi,%rdx

Can anyone explain why clang choose to use adcx over adc? As far as I can tell the boto have the same execution time but the encoding of adc is 3 bytes vs 6 for adcx.

Update: I played with it a bit more and it seems that the behavior is quite random. if args are passed as const reference I get adcx https://godbolt.org/g/noFZNS if I pass by value I get adc:

https://godbolt.org/g/RkBWhV

and if the code is not inside a function, just inlined in main, its a total mess.

Ilya Lesokhin
  • 329
  • 2
  • 7
  • What is `carry`, what is `T`, what is `r`, .... [MCVE] may be helpful, as you are asking about particular machine code produced. Also it looks like you posted C source of subtraction, not addition? And from your assembly piece it is not clear why `adc` was not used. Maybe the `for` looping mechanism is using zero flag, so `adcx` helps to preserve it. (depends on `N` definition, if it's `constexpr / define` or variable) – Ped7g Jun 03 '18 at 07:07
  • 2
    Looks like a missed-optimization to me. I think `adc` is a better choice. – Peter Cordes Jun 03 '18 at 07:16
  • @Ped7g you are right I posted the wrong code (sub instead of add). N is 4. the entire code is quite long, so I wasn't sure how to post it. – Ilya Lesokhin Jun 03 '18 at 07:29
  • Well, in theory `adcx` and `adox` can be [interleaved](https://stackoverflow.com/a/29748041/2189500). Perhaps this is a side effect of that? – David Wohlferd Jun 03 '18 at 08:25
  • seems [clang 3.8](https://stackoverflow.com/q/33690791/995714) and [gcc](https://stackoverflow.com/q/6659414/995714) produce `adc` correctly – phuclv Jun 03 '18 at 10:50
  • You should report this on https://bugs.llvm.org/buglist.cgi. Until clang knows how to actually interleave with ADOX, it's pointless to spend extra code-size on ADCX. (I could imagine a rare case where preserving other flags was useful, and recent Intel CPUs seem to be very efficient at partial-flags stuff without even needing a merging uop.) – Peter Cordes Jun 03 '18 at 15:59

1 Answers1

2

This looks like a missed-optimization to me. I think adc is a better choice. On Skylake, they have equal performance characteristics according to some quick throughput testing (with xor eax,eax / times 4 adcx eax,edx in a loop). Agner Fog strangely doesn't list adox/adcx in his instruction tables (http://agner.org/optimize/), on SKL ADC/ADCX/ADOX are all 1 uop for p0/p6, with 1c latency.

If anything, writing all flags instead of just CF is less likely to lead to performance problems.

You should report this on https://bugs.llvm.org/buglist.cgi.

Until clang knows how to actually interleave with ADOX when there are two parallel dep chains, it's pointless to spend extra code-size on ADCX.

I could imagine a rare case where preserving other flags was useful, and recent Intel CPUs seem to be very efficient at partial-flags stuff without even needing a merging uop. But that's very niche and not what's going on here (add clobbers all flags).

Peter Cordes
  • 245,674
  • 35
  • 423
  • 606
  • Honestly I don't understand the value of `adcx/adox` when `adc` has a single-cycle latency. If you try to keep two independent chains of multiprecision additions going, you run into load/store bandwidth limits with at most two load units and one store unit. Well `adcx/adox` were invented before `adc` was turned single-cycle, but AFAIK they were not *implemented* in CPUs before. So at best it would be a communications failure inside Intel. – EOF Jun 03 '18 at 18:27
  • 3
    @EOF: The intended use-case is extended-precision multiply, not *just* adding. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf shows that `mulx` saves some instructions, but `mulx` + `adcx` / `adox` saves even more for a 512x512 or 512x64 bit multiply, avoiding the need for `adc reg, 0` to save carry before an `add`, because you have to deal with the high/low halves. It's not so much the latency that's important, but the total number of instructions for uop throughput and OoO window size. (`mulx` uses p1+p5, `adc/ox` uses p06). – Peter Cordes Jun 03 '18 at 20:39