
GCC has 128-bit integers. Using these, I can get the compiler to use the mul instruction (or imul with only one operand). For example

uint64_t x,y;
unsigned __int128 z = (unsigned __int128)x*y;

produces mul. I have used this to create a 128x128 to 256 function (see the end of this question, before the update, for the code, if you're interested).

Now I want to do 256-bit addition, and I have not found a way to get the compiler to use ADC except by writing assembly. I could use a separate assembly file, but I want inline functions for efficiency. The compiler already produces an efficient 128x128 to 256 function (for the reason I explained at the start of this question), so I don't see why I should rewrite that in assembly as well (or any other function which the compiler already implements efficiently).

Here is the inline assembly function I have come up with:

#define ADD256(X1, X2, X3, X4, Y1, Y2, Y3, Y4) \
 __asm__ __volatile__ ( \
 "addq %[v1], %[u1] \n" \
 "adcq %[v2], %[u2] \n" \
 "adcq %[v3], %[u3] \n" \
 "adcq %[v4], %[u4] \n" \
 : [u1] "+&r" (X1), [u2] "+&r" (X2), [u3] "+&r" (X3), [u4] "+&r" (X4) \
 : [v1]  "r" (Y1), [v2]  "r" (Y2), [v3]  "r" (Y3), [v4]  "r" (Y4)) 

(Probably not every output needs an early-clobber modifier, but I get the wrong result without at least the last two.) (Editor's note: the last output isn't written until all inputs have been read, and would be safe to not declare as early-clobber.)
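
For example, with the int256 union defined further down, a call site might look like this (a hypothetical usage sketch, not part of the original post; the field names come from the typedefs below):

int256 a, b;
/* ... fill in a and b ... */
/* a += b, with the carry rippling from x1 (lowest) up to x4 (highest) */
ADD256(a.x1, a.x2, a.x3, a.x4, b.x1, b.x2, b.x3, b.x4);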

And here is a function which does the same thing in C

void add256(int256 *x, int256 *y) {
    uint64_t t1, t2;
    t1 = x->x1; x->x1 += y->x1;
    t2 = x->x2; x->x2 += y->x2 + ((x->x1) < t1);
    t1 = x->x3; x->x3 += y->x3 + ((x->x2) < t2);
                x->x4 += y->x4 + ((x->x3) < t1);
}
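
(Editor's note: as a comment below points out, this plain-C version can drop a carry when the addition of the incoming carry itself wraps around. A correct plain-C step has to test both additions. Here is a minimal sketch of such a helper, not part of the original post; the name addc_u64 is made up:)

/* Full add-with-carry step in plain C: *out = a + b + c_in, returns the carry out. */
static inline unsigned char addc_u64(unsigned char c_in, uint64_t a, uint64_t b, uint64_t *out) {
    uint64_t sum = a + b;
    unsigned char c_out = (sum < a);        /* carry from a + b */
    sum += c_in;
    c_out |= (sum < (uint64_t)c_in);        /* carry from adding the carry-in */
    *out = sum;
    return c_out;
}

Chained four times across the limbs, this computes the same 256-bit sum as the intrinsic version in the answer below, though whether a compiler turns it into add/adc is another question.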

Why is assembly necessary for this? Why can't the compiler compile the add256 function to use the carry flag? Is there a way to coerce the compiler to do this (e.g. can I change add256 so that it does)? What is someone supposed to do with a compiler which does not support inline assembly (write all the functions in assembly)? Why is there no intrinsic for this?

Here is the 128x128 to 256 function

void muldwu128(int256 *w, uint128 u, uint128 v) {
   uint128 t;
   uint64_t u0, u1, v0, v1, k, w1, w2, w3;

   u0 = u >> 64L;
   u1 = u;
   v0 = v >> 64L;
   v1 = v;

   t = (uint128)u1*v1;
   w3 = t;
   k = t >> 64L;

   t = (uint128)u0*v1 + k;
   w2 = t;
   w1 = t >> 64L;
   t = (uint128)u1*v0 + w2;
   k = t >> 64L;

   w->hi = (uint128)u0*v0 + w1 + k;
   w->lo = (t << 64L) + w3;

}
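
A quick usage sketch (hypothetical, not part of the original post, using the typedefs given below); on little-endian x86-64 the low 128 bits of the product land in w.lo (fields x1:x2) and the high 128 bits in w.hi (fields x3:x4):

int256 w;
uint128 u = ((uint128)3 << 64) | 5;   /* arbitrary example operands */
uint128 v = (uint128)7 << 64;
muldwu128(&w, u, v);                  /* w = full 256-bit product u * v */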

Some typedefs:

typedef          __int128  int128;
typedef unsigned __int128 uint128;

typedef union {
    struct {
        uint64_t x1;
        uint64_t x2;
         int64_t x3;
         int64_t x4;
    };
    struct {
        uint128 lo;
         int128 hi;
    };
} int256;

Update:

My question is largely a duplicate of these questions:

  1. get-gcc-to-use-carry-logic-for-arbitrary-precision-arithmetic-without-inline-assembly
  2. efficient-128-bit-addition-using-carry-flag
  3. multiword-addition-in-c.

Intel has a good article (New Instructions Support Large Integer Arithmetic) which discusses large integer arithmetic and the three new instructions MULX, ADCX, ADOX. They write:

intrinsic definitions of mulx, adcx and adox will also be integrated into compilers. This is the first example of an “add with carry” type instruction being implemented with intrinsics. The intrinsic support will enable users to implement large integer arithmetic using higher level programming languages such as C/C++.

The intrinsics are

unsigned __int64 umul128(unsigned __int64 a, unsigned __int64 b, unsigned __int64 * hi);
unsigned char _addcarry_u64(unsigned char c_in, unsigned __int64 a, unsigned __int64 b, unsigned __int64 *out);
unsigned char _addcarryx_u64(unsigned char c_in, unsigned __int64 a, unsigned __int64 b, unsigned __int64 *out);

Incidentally, MSVC already has a _umul128 intrinsic. So even though MSVC does not have __int128, the _umul128 intrinsic can be used to generate mul and therefore a full 64x64 to 128-bit multiplication.
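
With MSVC that looks something like the following (a sketch; _umul128 is declared in <intrin.h>, returns the low 64 bits of the product, and writes the high 64 bits through the pointer; the wrapper name is made up here):

#include <intrin.h>

void mul64x64_to_128(unsigned __int64 x, unsigned __int64 y,
                     unsigned __int64 *hi, unsigned __int64 *lo) {
    *lo = _umul128(x, y, hi);   /* a single mul: hi:lo = x * y */
}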

MULX is in BMI2 (Haswell). The ADCX and ADOX instructions are available since Broadwell, as the ADX extension. It's too bad there is no intrinsic for ADC, which has been available since the 8086 in 1979. That would solve the inline assembly problem.

(Editor's note: Intel's intrinsics guide does define _addcarry_u64 for baseline x86-64, but perhaps not all compilers implemented it. However, gcc typically compiles it and/or _addcarryx inefficiently, often spilling CF to an integer with setc instead of ordering instructions better.)

GCC's __int128 codegen will use mulx if BMI2 is enabled (e.g. using -mbmi2 or -march=haswell).

Edit:

I tried Clang's add-with-carry builtins, as suggested by Lưu Vĩnh Phúc:

void add256(int256 *x, int256 *y) {
    unsigned long long carryin=0, carryout;
    x->x1 = __builtin_addcll(x->x1, y->x1, carryin, &carryout); carryin = carryout;
    x->x2 = __builtin_addcll(x->x2, y->x2, carryin, &carryout); carryin = carryout;
    x->x3 = __builtin_addcll(x->x3, y->x3, carryin, &carryout); carryin = carryout;
    x->x4 = __builtin_addcll(x->x4, y->x4, carryin, &carryout);  
}

but this does not generate ADC, and the code it produces is more complicated than I expected.
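
(Editor's note: GCC's integer overflow builtins, mentioned in the comments below, are another way to express the carry explicitly. A hedged sketch follows; __builtin_add_overflow is real (GCC 5+, Clang), but the function name add256_ovf is made up here, and there is no guarantee a given compiler turns this into a clean add/adc chain either.)

void add256_ovf(int256 *x, int256 *y) {
    uint64_t s1, s2, s3, s4;
    unsigned c1, c2, c3;

    c1  = __builtin_add_overflow((uint64_t)x->x1, (uint64_t)y->x1, &s1);
    c2  = __builtin_add_overflow((uint64_t)x->x2, (uint64_t)y->x2, &s2);
    c2 |= __builtin_add_overflow(s2, (uint64_t)c1, &s2);    /* fold in carry from limb 1 */
    c3  = __builtin_add_overflow((uint64_t)x->x3, (uint64_t)y->x3, &s3);
    c3 |= __builtin_add_overflow(s3, (uint64_t)c2, &s3);    /* fold in carry from limb 2 */
    s4  = (uint64_t)x->x4 + (uint64_t)y->x4 + c3;           /* top limb: carry out is discarded */

    x->x1 = s1; x->x2 = s2; x->x3 = (int64_t)s3; x->x4 = (int64_t)s4;
}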

  • The simple answer is probably that nobody has had this itch bad enough for it to be cured. The compilers mostly do math well enough so hardly anybody needs "more", and there are already multiprecision math packages built that serve well enough for most folks. Here's your chance to improve GCC and save the next guy from this complaint :-} – Ira Baxter Mar 13 '15 at 10:53
  • 1
    GCC recently added integer overflow builtins. Perhaps you could trick the compiler into generating something nice with those: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html. – Ulfalizer Mar 13 '15 at 12:47
  • GCC 4.9.1 does spit out `adc`s for me when adding `unsigned __int128`s by the way. – Ulfalizer Mar 13 '15 at 12:50
  • 1
    @Ulfalizer yes but only for the carry between the 64 bit parts of the 128 bit additions. I couldn't get it to propagate carry between the 128 bit parts using `adc`. – Jester Mar 13 '15 at 13:55
  • @Jester: Yeah, maybe it's not smart enough to chain multiple `adc`s together... – Ulfalizer Mar 13 '15 at 14:02
  • @IraBaxter, I find it hard to believe that nobody has had the itch to optimize for multi-word addition with carry. It's not like the FLAGS register is new technology. I'm more inclined to think that it's not easy to do in C. Maybe there are consequences to using a conditional test to calculate a carry and using FLAGS instead (not 1 to 1). – Z boson Mar 16 '15 at 08:19
  • @lurker, what is needed is `MUL` (or `IMUL` with one operand for signed) and `ADC` (the compiler does the rest). The reason I started my question with `__int128` was to show that that's sufficient for `MUL`. You only need twice the natural word size (so `int64_t` for 32-bit mode). The only thing holding back int512 multiplication is 256-bit addition (with `ADC`). So if I could get the compiler to generate `ADC` then I could do arithmetic on integers larger than int128 efficiently in C without using assembly. – Z boson Mar 16 '15 at 08:22
  • @Ulfalizer, I looked briefly into `__builtin_uaddll_overflow` with clang (since gcc 4.9 does not support it). It's fine to get the carry but I'm still not sure how to get it to generate `ADC`. I'll keep playing with those builtins (thanks!). I did find an intrinsic for the `ADCX` instruction, `_addcarry_u64`, which would do what I want (and is even better) but ADCX is only available for Broadwell whereas `ADC` has been around since 1979. – Z boson Mar 16 '15 at 13:02
  • 2
    @Zboson: "add with carry" code generation requires the compiler to understand you are in effect doing multiprecision arithmetic (no obvious notation to say so, just because you named a type int1024 doesn't mean the compiler gets your intent), or for you to express that an add produced *two* results, a sum and carry, and that you want to use that carry in another add operation, e.g, loworderwordA+=lowerorderwordB; highorderwordA+=highorderwordB+lastcarry(); There is no notation in C that allows you express the last carry. ... – Ira Baxter Mar 17 '15 at 11:36
  • .. Could somebody make a nonstandard extension to GCC to do this? Sure, but hasn't happened yet. And lots of complications. First, the standard doesn't define "carry". Second, your machine (arguably) may be weird; Unisys mainframes don't add, they subtract (less logic per bit), so you might get a borrow bit, not a carry; now what would "carry" mean? What happens if my statement includes "two" plus operators; carry refers to which? What happens if the code generator produces an instruction invisible to you, that affects the carry bit (a – Ira Baxter Mar 17 '15 at 11:43
  • 1
    @Zboson: as far as the itch is concerned, people have built just fine multiprecision packages. (GNU has quite a nice one, I hear, but I don't remember its name). Apparantly they weren't offended enough to fix the compiler, but their implementation is hard to beat. So this is really a question of, "is it worth it to explicitly support this feature, for something pretty rarely used?" Usually the answer is no, especially if there in a workable alternative. – Ira Baxter Mar 17 '15 at 11:50
  • 1
    @IraBaxter, I'm wondering if the C designers considered this. E.g. PDP-11 was 16-bit and had `ADC`. So having efficient multi-word addition then would seem to me something that was a priority. Writing inline assembly is obviously not portable which is one of the goals of C. It's strange to me that C can't do this. It's like they assumed the CPU did not have a FLAGS register. – Z boson Mar 17 '15 at 11:51
  • The point is that the C language didn't assume there was a flags register or any kind of condition bits. You can define a perfectly nice machine that has no flag bits; just fuse the compare with a conditional jmp, that's what you pretty much want the machine to do anyway. The existence of FLAGS is statistical (lots of PCs on the planet) but not guaranteed. Any solution that assumes FLAGS is machine-specific. That's OK as long as you understand that, but it automatically makes it non-portable. Intel's new instructions and the added intrinsics to support them are a perfect example of this. – Ira Baxter Mar 17 '15 at 11:55
  • @IraBaxter, I mean assuming a machine may have FLAGS (or is likely to). C should not just be portable it should also be efficient (otherwise you have to use assembly). But I guess this case is so rare as you say that efficiency was not relevant. I'm just surprised to find a case where C fails so clearly in efficiency for something so old. I am used to that with SIMD but even then auto-vectorization works fine sometimes. – Z boson Mar 17 '15 at 13:18
  • You can't have it both ways. The C language is designed to be largely hardware independent. You can't have that and insist on carry bit support. – Ira Baxter Mar 17 '15 at 16:40
  • @IraBaxter, in this case I agree that I can't have it both ways. I wonder how many other cases there are such as this. – Z boson Mar 18 '15 at 08:16
  • Probably lots and lots (I can imagine at least one case per each specific machine instruction!). The law of diminishing returns suggests these won't be fixed. – Ira Baxter Mar 18 '15 at 11:13
  • 1
    Clang has [`__builtin_addc`](http://clang.llvm.org/docs/LanguageExtensions.html#multiprecision-arithmetic-builtins) for multiprecision arithmetics (http://programmers.stackexchange.com/a/199542/98103). But actually you don't need to worry about that and can simply write the code in C because Clang and ICC will optimize those carry-by-comparisons into `adc` as you can see [here](http://goo.gl/Qvapc1) – phuclv Mar 19 '15 at 09:37
  • if you want signed multiprecision then all the "limbs" in the type must be unsigned except the highest significant one – phuclv Mar 19 '15 at 09:40
  • @LưuVĩnhPhúc, looks like you got a winning answer! Why don't you write it up? So Clang can do this but GCC can't? – Z boson Mar 19 '15 at 09:48
  • 1
    @LưuVĩnhPhúc, the assembly from clang does not seem to be optimal. It should be 1 `add` and 3 `adc` but I count 6 `add` and 3 `adc`. Additionally, the `adc` only add zero + the carry. That's not optimal. – Z boson Mar 19 '15 at 09:52
  • @LưuVĩnhPhúc, I tried using ` __builtin_addcll` (e.g. `x->x1 = __builtin_addcll(x->x1, y->x1, carryin, &carryout); carryin = carryout;`) but the results are not really what I expect. Maybe I'm not doing something right. – Z boson Mar 19 '15 at 10:06
  • @LưuVĩnhPhúc, I edited the end of my answer with a function using `__builtin_addcll`. It does not generate `adc`. – Z boson Mar 19 '15 at 10:50
  • 1
    Related: http://stackoverflow.com/questions/15696540/get-gcc-to-use-carry-logic-for-arbitrary-precision-arithmetic-without-inline-ass – Iwillnotexist Idonotexist Mar 19 '15 at 12:42
  • 4
    The makers of GMP have given up on compelling GCC (albeight an antique version) to emit `adc`: https://gmplib.org/manual/Assembly-Carry-Propagation.html – Iwillnotexist Idonotexist Mar 19 '15 at 12:44
  • @IwillnotexistIdonotexist, exactly (in regards to the GMP link), it seems the target hardware for C was like the RISC processor in Hacker's Delight (which has no FLAGS register). And I don't believe it makes any sense to design a practical language without some target hardware (or set of hardware). But C was designed before RISC processors existed (I think) so this is strange. – Z boson Mar 19 '15 at 12:54
  • I've never tried looking carefully at Clang and ICC's output before. Indeed they don't generate optimal assembly. Maybe we need to wait until compilers are smarter, otherwise the only choice is inline assembly – phuclv Mar 19 '15 at 14:31
  • @LưuVĩnhPhúc, the only option for `adc` (in this case) is inline assembly. But for Broadwell and beyond intrinsics such as `_addcarry_u64` can be used and if you look at the example source code in Intel's document I linked to you which compares 512x512 multiplication using `mul` and `adc` vs. `mulx` and `adcx` you can see that the new instructions are quite an improvement. – Z boson Mar 19 '15 at 15:26
  • 2
    Unfortunately, the situation in straight C is actually worse than you think: your posted C code for add256 doesn't work in some cases. Specifically, examine the case where x->x1 = 1, x->x2=x->x3=x->x4 = 0, and y->y1=y->y2=y->y3=y->y4 = 0xFFffFFffFFffFFff. The carry should propagate to make the result all zeroes, but t2 ends up being zero, making it impossible to propagate further. The only possible fixes in straight C are both ugly and slow. – user3535668 Oct 06 '16 at 19:14
  • @user3535668 It's not that ugly, just define your own addcarry: `inline unsigned char notbuiltin_addcarry_u64(unsigned char c, unsigned long long a, unsigned long long b, unsigned long long *dest) { auto sum = a + b; unsigned char carry = sum < a; if (c) carry += (++sum == 0); *dest = sum; return carry; }` See https://godbolt.org/z/az96eG – jorgbrown Jan 04 '21 at 10:28

1 Answer


I found a solution with ICC 13.0.01 using the _addcarry_u64 intrinsic

void add256(uint256 *x, uint256 *y) {
    unsigned char c = 0;
    c = _addcarry_u64(c, x->x1, y->x1, &x->x1);
    c = _addcarry_u64(c, x->x2, y->x2, &x->x2);
    c = _addcarry_u64(c, x->x3, y->x3, &x->x3);
        _addcarry_u64(c, x->x4, y->x4, &x->x4);
}

produces

L__routine_start_add256_0:
add256:
        xorl      %r9d, %r9d                                    #25.9
        movq      (%rsi), %rax                                  #22.9
        addq      %rax, (%rdi)                                  #22.9
        movq      8(%rsi), %rdx                                 #23.9
        adcq      %rdx, 8(%rdi)                                 #23.9
        movq      16(%rsi), %rcx                                #24.9
        adcq      %rcx, 16(%rdi)                                #24.9
        movq      24(%rsi), %r8                                 #25.9
        adcq      %r8, 24(%rdi)                                 #25.9
        setb      %r9b                                          #25.9
        ret                                                     #26.1

I compiled with -O3. I don't know how to enable adx with ICC. Maybe I need ICC 14?

That's exactly one addq and three adcq, as I expect.

With Clang, the result using -O3 -madx is a mess:

add256(uint256*, uint256*):                  # @add256(uint256*, uint256*)
movq    (%rsi), %rax
xorl    %ecx, %ecx
xorl    %edx, %edx
addb    $-1, %dl
adcq    %rax, (%rdi)
addb    $-1, %cl
movq    (%rdi), %rcx
adcxq   %rax, %rcx
setb    %al
movq    8(%rsi), %rcx
movb    %al, %dl
addb    $-1, %dl
adcq    %rcx, 8(%rdi)
addb    $-1, %al
movq    8(%rdi), %rax
adcxq   %rcx, %rax
setb    %al
movq    16(%rsi), %rcx
movb    %al, %dl
addb    $-1, %dl
adcq    %rcx, 16(%rdi)
addb    $-1, %al
movq    16(%rdi), %rax
adcxq   %rcx, %rax
setb    %al
movq    24(%rsi), %rcx
addb    $-1, %al
adcq    %rcx, 24(%rdi)
retq

Without enabling -madx in Clang the result is not much better.

Edit: Apparently MSVC already has _addcarry_u64. I tried it and it's as good as ICC (one add and three adc).

  • ADCX is in ADX, not BMI2, so ICC can't emit ADCX when I tried. GCC seems to be not able to understand the intrinsic `_addcarry_u64` https://gcc.godbolt.org/ – phuclv Mar 25 '15 at 08:13
  • @LưuVĩnhPhúc, you're right. I don't know how to enable `ADX` with ICC (`-madx` does not work with ICC 13). However, I can enable it in Clang and the result from Clang is still a mess. – Z boson Mar 25 '15 at 08:20
  • 1
    By the way, on GCC added `_addcarry_u64()` in 5.1. But it's bugged. As of 5.2, it's still bugged: http://coliru.stacked-crooked.com/a/28a776c89af0588c It seems that anything that involves saving the carry-bit across loop iterations is borked. – Mysticial Nov 16 '15 at 17:05
  • @Mysticial, good to know. Just to confirm, with ICC15 you don't observe the same thing as I did with ICC13? ICC13 produced `adc` just like MSVC. – Z boson Nov 17 '15 at 08:30
  • ICC15 never generates adcx/adox for `_addcarry_u64`. You need to use `_addcarryx_u64`. The same applies to VS2015 except that it also never generates adox. – Mysticial Nov 17 '15 at 15:13
  • @Mysticial, okay, thanks again. I see I misread (something I do too much lately) your previous comment. I have to correct an answer then. – Z boson Nov 17 '15 at 20:37
  • 1
    Btw, GCC 5.3 fixes the bug. But the code that it generates for those intrinsics is so hilariously bad that you might as well just avoid them. – Mysticial Apr 22 '16 at 23:30
  • 2
    Spoke too soon. `_subborrow_u64` seems to have inconsistent behavior between MSVC/ICC and GCC. MSVC and ICC does `src1 - src2`. GCC does `src2 - src1`. Intel's intrinsic reference says `src2 - src1`. lol... – Mysticial Apr 23 '16 at 05:50
  • For GCC, maybe you can use the integer overflow builtins? https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html – Till Kolditz Sep 07 '16 at 15:30
  • Just for the record: if you want to use the _addcarry_u64 intrinsic, `#include <immintrin.h>` to get it. The `-madx` flag shouldn't matter because adc has existed since AMD's very first 64-bit x86 CPUs. – jorgbrown Jan 04 '21 at 09:08