Producing good add with carry code from clang

Question

I'm trying to produce code (currently using clang++-3.8) that adds two numbers consisting of multiple machine words. To simplify things for the moment I'm only adding 128bit numbers, but I'd like to be able to generalise this.

First some typedefs:

typedef unsigned long long unsigned_word;
typedef __uint128_t unsigned_128;

And a "result" type:

struct Result
{
  unsigned_word lo;
  unsigned_word hi;
};

The first function, f, takes two pairs of unsigned words and returns a result. As an intermediate step it packs each pair of 64-bit words into a 128-bit word before adding them, like so:

Result f (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
  Result x;
  unsigned_128 n1 = lo1 + (static_cast<unsigned_128>(hi1) << 64);
  unsigned_128 n2 = lo2 + (static_cast<unsigned_128>(hi2) << 64);
  unsigned_128 r1 = n1 + n2;
  x.lo = r1 & ((static_cast<unsigned_128>(1) << 64) - 1);
  x.hi = r1 >> 64;
  return x;
}

This actually gets inlined quite nicely like so:

movq    8(%rsp), %rsi
movq    (%rsp), %rbx
addq    24(%rsp), %rsi
adcq    16(%rsp), %rbx

Now, instead I've written a simpler function using the clang multi-precision primitives, as below:

static Result g (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
  Result x;
  unsigned_word carryout;
  x.lo = __builtin_addcll(lo1, lo2, 0, &carryout);
  x.hi = __builtin_addcll(hi1, hi2, carryout, &carryout);
  return x;
}

This produces the following assembly:

movq    24(%rsp), %rsi
movq    (%rsp), %rbx
addq    16(%rsp), %rbx
addq    8(%rsp), %rsi
adcq    $0, %rbx

In this case, there's an extra add. Instead of doing an ordinary add on the lo-words, then an adc on the hi-words, it just adds the hi-words, then adds the lo-words, then does an adc on the hi-word again with an argument of zero.

This may not look too bad, but when you try this with larger words (say 192bit, 256bit) you soon get a mess of ors and other instructions dealing with the carries up the chain, instead of a simple chain of add, adc, adc, ... adc.
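For reference, here is a minimal portable sketch of the carry chain being asked for, written without any builtins. The helper names `add_limbs_portable` and `add_limbs_portable_demo` are made up for illustration and are not from the question. Each limb produces two partial carries that have to be merged with `|`; this is presumably the kind of bookkeeping that shows up as `or` instructions when the compiler does not fuse the chain into add/adc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: adds the N-limb number y into x, least-significant
// limb first, and returns the carry out of the top limb (0 or 1).
template <std::size_t N>
unsigned add_limbs_portable(std::uint64_t (&x)[N], const std::uint64_t (&y)[N]) {
    unsigned carry = 0;
    for (std::size_t i = 0; i < N; ++i) {
        std::uint64_t t = x[i] + y[i];  // may wrap around modulo 2^64
        unsigned c1 = t < y[i];         // wrapped iff the sum is smaller than an operand
        std::uint64_t s = t + carry;    // add the incoming carry; may wrap as well
        unsigned c2 = s < t;            // wrapped iff adding the carry overflowed
        x[i] = s;
        carry = c1 | c2;                // at most one of the two can be set
    }
    return carry;
}

// Small self-check: (2^192 - 1) + 1 wraps to 0 and carries out of the top limb.
inline void add_limbs_portable_demo() {
    std::uint64_t a[3] = {~0ULL, ~0ULL, ~0ULL};
    const std::uint64_t b[3] = {1, 0, 0};
    unsigned carry = add_limbs_portable(a, b);
    assert(carry == 1);
    assert(a[0] == 0 && a[1] == 0 && a[2] == 0);
}
```

Whether a given compiler turns this loop into a clean adc chain is not something this sketch guarantees; it only pins down the arithmetic that any faster version has to match.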

The multi-precision primitives seem to be doing a terrible job at exactly what they're intended to do.

So what I'm looking for is code that I could generalise to any length (no need to do it, just enough so I can work out how to), for which clang produces additions in a manner that is as efficient as what it does with its built-in 128-bit type (which unfortunately I can't easily generalise). I presume this should just be a chain of adcs, but I'm open to arguments and code showing that it should be something else.

This is one of those corner cases that compilers currently suck at. If you really care that much, you'll need to use inline assembly. GMP does a lot of this carry-propagation stuff and it's all in assembly. — Mysticial, Nov 13 '15 at 15:48
I already asked a bounty question on this. http://stackoverflow.com/questions/29029572/multi-word-addition-using-the-carry-flag I suspect you will find the same answer (or lack thereof) that I did. — Z boson, Nov 15 '15 at 15:40

score 23 · Answer 1 · edited May 23 '17 at 12:15

There is an intrinsic to do this: _addcarry_u64. However, only Visual Studio and ICC (at least VS 2013 and 2015 and ICC 13 and ICC 15) do this efficiently. Clang 3.7 and GCC 5.2 still don't produce efficient code with this intrinsic.

Clang in addition has a built-in which one would think does this, __builtin_addcll, but it does not produce efficient code either.

The reason Visual Studio does this is that it does not allow inline assembly in 64-bit mode so the compiler should provide a way to do this with an intrinsic (though Microsoft took their time implementing this).

Therefore, with Visual Studio use _addcarry_u64. With ICC use _addcarry_u64 or inline assembly. With Clang and GCC use inline assembly.

Note that since the Broadwell microarchitecture there are two new instructions, adcx and adox, which you can access with the _addcarryx_u64 intrinsic. Intel's documentation for these intrinsics used to differ from the assembly produced by the compiler, but it appears their documentation is correct now. However, Visual Studio still only appears to produce adcx with _addcarryx_u64, whereas ICC produces both adcx and adox with this intrinsic. But even though ICC produces both instructions, it does not produce optimal code (ICC 15), and so inline assembly is still necessary.

Personally, I think the fact that a non-standard feature of C/C++, such as inline assembly or intrinsics, is required to do this is a weakness of C/C++ but others might disagree. The adc instruction has been in the x86 instruction set since 1979. I would not hold my breath on C/C++ compilers being able to optimally figure out when you want adc. Sure they can have built-in types such as __int128 but the moment you want a larger type that's not built-in you have to use some non-standard C/C++ feature such as inline assembly or intrinsics.

In terms of inline assembly code to do this I already posted a solution for 256-bit addition for eight 64-bit integers in register at multi-word addition using the carry flag.

Here is that code reposted.

#define ADD256(X1, X2, X3, X4, Y1, Y2, Y3, Y4) \
 __asm__ __volatile__ ( \
 "addq %[v1], %[u1] \n" \
 "adcq %[v2], %[u2] \n" \
 "adcq %[v3], %[u3] \n" \
 "adcq %[v4], %[u4] \n" \
 : [u1] "+&r" (X1), [u2] "+&r" (X2), [u3] "+&r" (X3), [u4] "+&r" (X4) \
 : [v1]  "r" (Y1), [v2]  "r" (Y2), [v3]  "r" (Y3), [v4]  "r" (Y4))

If you want to explicitly load the values from memory you can do something like this

//uint64_t dst[4] = {1,1,1,1};
//uint64_t src[4] = {1,2,3,4};
asm (
     "movq (%[in]), %%rax\n"
     "addq %%rax, %[out]\n"
     "movq 8(%[in]), %%rax\n"
     "adcq %%rax, 8%[out]\n"
     "movq 16(%[in]), %%rax\n"
     "adcq %%rax, 16%[out]\n"
     "movq 24(%[in]), %%rax\n"
     "adcq %%rax, 24%[out]\n"
     : [out] "=m" (dst)
     : [in]"r" (src)
     : "%rax"
     );

That produces nearly identical assembly to the following function in ICC

void add256(uint256 *x, uint256 *y) {
    unsigned char c = 0;
    c = _addcarry_u64(c, x->x1, y->x1, &x->x1);
    c = _addcarry_u64(c, x->x2, y->x2, &x->x2);
    c = _addcarry_u64(c, x->x3, y->x3, &x->x3);
        _addcarry_u64(c, x->x4, y->x4, &x->x4);
}

I have limited experience with GCC inline assembly (or inline assembly in general - I usually use an assembler such as NASM) so maybe there are better inline assembly solutions.

So what I'm looking for is code that I could generalize to any length

To answer this question here is another solution using template meta programming. I used this same trick for loop unrolling. This produces optimal code with ICC. If Clang or GCC ever implement _addcarry_u64 efficiently this would be a good general solution.

#include <x86intrin.h>
#include <inttypes.h>

#define LEN 4  // number of 64-bit limbs, e.g. 4 = 256-bit add, 3 = 192-bit add, ...

static unsigned char c = 0;

template<int START, int N>
struct Repeat {
    static void add (uint64_t *x, uint64_t *y) {
        c = _addcarry_u64(c, x[START], y[START], &x[START]);
        Repeat<START+1, N>::add(x,y);
    }
};

template<int N>
struct Repeat<LEN, N> {
    static void add (uint64_t *x, uint64_t *y) {}
};


void sum_unroll(uint64_t *x, uint64_t *y) {
    Repeat<0,LEN>::add(x,y);
}

Assembly from ICC

xorl      %r10d, %r10d                                  #12.13
movzbl    c(%rip), %eax                                 #12.13
cmpl      %eax, %r10d                                   #12.13
movq      (%rsi), %rdx                                  #12.13
adcq      %rdx, (%rdi)                                  #12.13
movq      8(%rsi), %rcx                                 #12.13
adcq      %rcx, 8(%rdi)                                 #12.13
movq      16(%rsi), %r8                                 #12.13
adcq      %r8, 16(%rdi)                                 #12.13
movq      24(%rsi), %r9                                 #12.13
adcq      %r9, 24(%rdi)                                 #12.13
setb      %r10b

Meta programming is a basic feature of assemblers so it's too bad C and C++ (except through template meta programming hacks) have no solution for this either (the D language does).
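One detail of the template version above: the carry lives in a file-scope `c` that is never reset, so the carry out of one call feeds into the next (the ICC listing loads `c(%rip)` before the first adc). Below is a minimal sketch that keeps the carry local and returns it instead. The names `add_carry_chain` and `add_carry_chain_demo` are hypothetical. It assumes an x86-64 target, and it uses `unsigned long long` rather than `uint64_t` because the GCC and Clang headers declare the output parameter of `_addcarry_u64` as `unsigned long long *`.

```cpp
#include <immintrin.h>  // _addcarry_u64 (x86-64 only)
#include <cassert>
#include <cstddef>

// Hypothetical helper: x += y over N 64-bit limbs, least-significant first.
// Returns the carry out of the top limb instead of storing it in a global.
template <std::size_t N>
unsigned char add_carry_chain(unsigned long long (&x)[N],
                              const unsigned long long (&y)[N]) {
    unsigned char c = 0;  // carry in for the lowest limb
    for (std::size_t i = 0; i < N; ++i)
        c = _addcarry_u64(c, x[i], y[i], &x[i]);
    return c;
}

// Small self-check: (2^256 - 1) + 1 wraps to 0 with a carry out.
inline void add_carry_chain_demo() {
    unsigned long long a[4] = {~0ULL, ~0ULL, ~0ULL, ~0ULL};
    const unsigned long long b[4] = {1, 0, 0, 0};
    unsigned char carry = add_carry_chain(a, b);
    assert(carry == 1);
    assert(a[0] == 0 && a[1] == 0 && a[2] == 0 && a[3] == 0);
}
```

Since N is a compile-time constant the loop can be fully unrolled, but whether the result is a clean add/adc chain still depends on the compiler and version, as described above.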

The inline assembly I used above which referenced memory was causing some problems in a function. Here is a new version which seems to work better

void foo(uint64_t *dst, uint64_t *src)
{
    __asm (
        "movq (%[in]), %%rax\n"
        "addq %%rax, (%[out])\n"
        "movq 8(%[in]), %%rax\n"
        "adcq %%rax, 8(%[out])\n"
        "movq 16(%[in]), %%rax\n"
        "adcq %%rax, 16(%[out])\n"   // adcq, not addq: the carry must continue up the chain
        "movq 24(%[in]), %%rax\n"
        "adcq %%rax, 24(%[out])\n"
        :
        : [in] "r" (src), [out] "r" (dst)
        : "%rax", "memory", "cc"     // the asm writes through dst and changes the flags
    );
}

It would be nice to have things like division with remainder, add with carry, bit rotates, etc... — Jason, Nov 16 '15 at 23:40
@Jason, yeah, I have been wondering if C could be extended for such things. I like C because I find it maps closely to assembly well without writing assembly. Some claim C is totally abstract with no connection to hardware. Of course that's not true. E.g. it assumes a binary machine (it won't work for a ternary computer) and that machines may have different word sizes (char, short, int, ...). C produces ideal assembly for a "simple computer" such as the one defined in Hacker's Delight with no flags register. It's strange that C has the complex type but no SIMD type like OpenCL C does. — Z boson, Nov 17 '15 at 09:38
@Jason: compilers have been smart enough for a long time to CSE an `x/y; x%y` into a single `div` instruction, using both results. Rotate is more problematic, but these days there's an idiom for rotates that compiles to a single rotate instruction without any undefined behaviour even for count=0 or count=type-width (the masking optimizes away). http://stackoverflow.com/questions/776508/best-practices-for-circular-shift-rotate-operations-in-c. But still, I agree that C makes some things unnecessarily difficult or impossible without resorting to compiler-specific extensions. — Peter Cordes, Nov 26 '15 at 12:14
@PeterCordes, good link about rotate! [But as to divide the compiler is not always so smart. Sometimes you have to give it a little help](http://stackoverflow.com/questions/22556599/conditional-tests-in-primality-by-trial-division). — Z boson, Nov 26 '15 at 12:20
@PeterCordes, i'm going to digress a bit here. Can you tell me why C has a complex type and not a SIMD type? What I mean is IMHO C should add types which are likely to map to hardware. I'm not aware of hardware that has built in complex math support. But most hardware has built-in SIMD hardware. Why does C not have e.g. float4? That makes a lot more sense than the complex type to me. Not to mention that with SIMD the ideal way to pack it is xxxxyyyy....and so the built-in complex type is not SIMD efficient. Leave custom non-hardware types to C++. — Z boson, Nov 26 '15 at 12:29
@PeterCordes, I should ask that as a question on SO but I don't know how to formulate the question to get past SO strict filter. — Z boson, Nov 26 '15 at 12:29
@Zboson: http://stackoverflow.com/questions/27977522/how-are-c-data-types-supported-directly-by-most-computers apparently made it past the filters, so maybe try to take that angle. Interesting point. Now that you mention it, maybe C's complex types made it in because of Fortran having them. The standard library could still have complex math functions with complex args passed in as two separate doubles, but maybe it's less efficient that way in some cases? — Peter Cordes, Nov 26 '15 at 12:37
GNU C's vector types are supported across a few compilers, with architecture-independent support for some ops, but not good shuffles. I wouldn't be surprised if there have been proposals to add vector types to an ISO C standard. Apparently none that got in, though. Maybe different architectures having such different selections of shuffles available has been a problem? Lack of support for writing shuffles was the thing I noticed when dabbling a tiny bit with GNU C vectors. — Peter Cordes, Nov 26 '15 at 12:40
@PeterCordes, yeah I started writing a question about this and even referenced that link you mentioned about types but I abandoned the question because I thought it would be too broad. I agree that shuffles are one of the weak points in GCC vector types. I think Clang does it better. It can emulate OpenCL vector types which I think are great. e.g. `v.xyzw` — Z boson, Nov 26 '15 at 12:47
Forgive my ignorance... Why are your `Y`'s using `"r"`? The Intel manual says `r/m`, so shouldn't `"g"` work as well? Maybe even better since the compiler does not need to generate a move to/from a register for `dest`. — jww, Aug 20 '17 at 21:57

score 2 · Answer 2 · answered May 09 '18 at 23:06

On Clang 6, both __builtin_addcll and __builtin_add_overflow produce the same, optimal disassembly.

Result g(unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
  Result x;
  unsigned_word carryout;
  x.lo = __builtin_addcll(lo1, lo2, 0, &carryout);
  x.hi = __builtin_addcll(hi1, hi2, carryout, &carryout);
  return x;
}

Result h(unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
  Result x;
  unsigned_word carryout;
  carryout = __builtin_add_overflow(lo1, lo2, &x.lo);
  carryout = __builtin_add_overflow(hi1, carryout, &hi1);
  __builtin_add_overflow(hi1, hi2, &x.hi);
  return x;
}

Assembly for both:

add rdi, rdx
adc rsi, rcx
mov rax, rdi
mov rdx, rsi
ret

Ok, for adding two 64bit integers, I also got optimal code with that. However, I did not manage to get anything optimal for adding 3 or more limbs. — chtz, May 09 '18 at 23:18
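For anyone who wants to try the three-or-more-limb case from the comment above, here is a minimal sketch that extends the `__builtin_add_overflow` pattern of `h` to N limbs. The names `add_overflow_n` and `add_overflow_n_demo` are hypothetical. It is only meant to be correct; per the comment, it should not be expected to compile to a clean adc chain on Clang 6.

```cpp
#include <cassert>
#include <cstddef>

typedef unsigned long long unsigned_word;

// Hypothetical generalisation of h(): x += y over N limbs, least-significant
// limb first. Returns true if the addition carried out of the top limb.
template <std::size_t N>
bool add_overflow_n(unsigned_word (&x)[N], const unsigned_word (&y)[N]) {
    bool carry = false;
    for (std::size_t i = 0; i < N; ++i) {
        unsigned_word t;
        // First add the two limbs, then fold in the incoming carry.
        bool c1 = __builtin_add_overflow(x[i], y[i], &t);
        bool c2 = __builtin_add_overflow(t, static_cast<unsigned_word>(carry), &x[i]);
        carry = c1 || c2;  // the two overflows cannot both happen for one limb
    }
    return carry;
}

// Small self-check: (2^192 - 1) + 1 wraps to 0 and reports a carry.
inline void add_overflow_n_demo() {
    unsigned_word a[3] = {~0ULL, ~0ULL, ~0ULL};
    const unsigned_word b[3] = {1, 0, 0};
    bool carry = add_overflow_n(a, b);
    assert(carry);
    assert(a[0] == 0 && a[1] == 0 && a[2] == 0);
}
```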

score 1 · Answer 3 · answered May 09 '18 at 22:54

Starting with clang 5.0 it is possible to get good results using __uint128_t-addition and getting the carry bit by shifting:

inline uint64_t add_with_carry(uint64_t &a, const uint64_t &b, const uint64_t &c)
{
    __uint128_t s = __uint128_t(a) + b + c;
    a = s;
    return s >> 64;
}

In many situations clang still does strange operations (I assume because of possible aliasing?), but usually copying one variable into a temporary helps.

Usage examples with

template<int size> struct LongInt
{
    uint64_t data[size];
};

Manual usage:

void test(LongInt<3> &a, const LongInt<3> &b_)
{
    const LongInt<3> b = b_; // need to copy b_ into local temporary
    uint64_t c0 = add_with_carry(a.data[0], b.data[0], 0);
    uint64_t c1 = add_with_carry(a.data[1], b.data[1], c0);
    uint64_t c2 = add_with_carry(a.data[2], b.data[2], c1);
}

Generic solution:

template<int size>
void addTo(LongInt<size> &a, const LongInt<size> b)
{
    __uint128_t c = __uint128_t(a.data[0]) + b.data[0];
    a.data[0] = c; // store the lowest limb; the loop below only writes limbs 1..size-1
    for(int i=1; i<size; ++i)
    {
        c = __uint128_t(a.data[i]) + b.data[i] + (c >> 64);
        a.data[i] = c;
    }
}

On Godbolt, all examples above compile to only mov, add and adc instructions (starting with clang 5.0, and with at least -O2).

The examples don't produce good code with gcc (up to 8.1, which at the moment is the highest version on godbolt). And I did not yet manage to get anything usable with __builtin_addcll ...
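To check the arithmetic of the `__uint128_t` approach independently of the generated assembly, here is a small self-contained sketch. It uses the same loop shape as the generic solution, but runs limb 0 through the loop too and returns the final carry. The names `add_to_with_carry` and `add_to_with_carry_demo`, and the returned carry, are assumptions added for illustration and are not part of the answer. The trick relies on the fact that two 64-bit values plus a carry of at most 1 never exceed 65 bits, so `c >> 64` is always 0 or 1.

```cpp
#include <cassert>
#include <cstdint>

template <int size> struct LongInt {
    std::uint64_t data[size];
};

// Hypothetical variant of addTo: a += b, returning the carry out of the top limb.
// Requires a compiler that provides __uint128_t (GCC and Clang on 64-bit targets).
template <int size>
std::uint64_t add_to_with_carry(LongInt<size> &a, const LongInt<size> b) {
    __uint128_t c = 0;
    for (int i = 0; i < size; ++i) {
        // Low 64 bits of c are the limb result, bit 64 is the carry for the next limb.
        c = __uint128_t(a.data[i]) + b.data[i] + (c >> 64);
        a.data[i] = static_cast<std::uint64_t>(c);
    }
    return static_cast<std::uint64_t>(c >> 64);
}

// Small self-check: (2^192 - 1) + 1 wraps to 0 with a carry out of 1.
inline void add_to_with_carry_demo() {
    LongInt<3> a = {{~0ULL, ~0ULL, ~0ULL}};
    LongInt<3> b = {{1, 0, 0}};
    std::uint64_t carry = add_to_with_carry(a, b);
    assert(carry == 1);
    assert(a.data[0] == 0 && a.data[1] == 0 && a.data[2] == 0);
}
```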
