Two uint64_t numbers multiplication and division on coefficient in the C program are wrong compiled with GCC

Question

It is well known that fixed point arithmetics is equavalent to integer arithmetics with some constraints. In this way I caught the issue. Let's see:

#include <stdio.h>
#include <stdint.h>

#define USE_C_FUNC 1

#if USE_C_FUNC != 0
uint64_t fixmd(uint64_t a, uint64_t b)
{
    uint64_t c = a * b / 10000000000LL;
    return c;
}
#else
uint64_t fixmd(uint64_t a, uint64_t b)
{
    uint64_t c;

    asm (
    "mov %1, %%rax\n"
    "mul %2\n"
    "mov $10000000000, %%r8\n"
    "div %%r8\n"
    "mov %%rax, %0\n"
    : "=r" (c) 
    : "r" (a), "r" (b)
    : "rax", "rdx", "r8"
    );
    return c;
}
#endif

void main(void)
{
    uint64_t x = 12589254118LL;
    uint64_t y = fixmd(x, x);
    printf("x=0x%llX=%llu, y=0x%llX=%llu\n", x, x, y, y);
}

In case #define USE_C_FUNC 1 program prints WRONG result:

x=0x2EE60C5E6=12589254118, y=0x410F8719=1091536665

In case #define USE_C_FUNC 0 program (with inline assembly) prints RIGHT result:

x=0x2EE60C5E6=12589254118, y=0x3B0AB8254=15848931924

I had compiling this example without any optimization (with gcc option -O0).

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/10/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 10.2.1-6' --with-bugurl=file:///usr/share/doc/gcc-10/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-10 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-10-Km9U7s/gcc-10-10.2.1/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-10-Km9U7s/gcc-10-10.2.1/debian/tmp-gcn/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-mutex
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.2.1 20210110 (Debian 10.2.1-6)

objdump -D uint64 for WRONG result for function fixmd() is:

0000000000001135 <fixmd>:
    1135:   55                      push   %rbp
    1136:   48 89 e5                mov    %rsp,%rbp
    1139:   48 89 7d e8             mov    %rdi,-0x18(%rbp)
    113d:   48 89 75 e0             mov    %rsi,-0x20(%rbp)
    1141:   48 8b 45 e8             mov    -0x18(%rbp),%rax
    1145:   48 0f af 45 e0          imul   -0x20(%rbp),%rax
    114a:   48 ba bf d5 ed bd ce    movabs $0xdbe6fecebdedd5bf,%rdx
    1151:   fe e6 db 
    1154:   48 f7 e2                mul    %rdx
    1157:   48 89 d0                mov    %rdx,%rax
    115a:   48 c1 e8 21             shr    $0x21,%rax
    115e:   48 89 45 f8             mov    %rax,-0x8(%rbp)
    1162:   48 8b 45 f8             mov    -0x8(%rbp),%rax
    1166:   5d                      pop    %rbp
    1167:   c3                      retq

objdump -D uint64 for GOOD result for function fixmd() is:

0000000000001135 <fixmd>:
    1135:   55                      push   %rbp
    1136:   48 89 e5                mov    %rsp,%rbp
    1139:   48 89 7d e8             mov    %rdi,-0x18(%rbp)
    113d:   48 89 75 e0             mov    %rsi,-0x20(%rbp)
    1141:   48 8b 4d e8             mov    -0x18(%rbp),%rcx
    1145:   48 8b 75 e0             mov    -0x20(%rbp),%rsi
    1149:   48 89 c8                mov    %rcx,%rax
    114c:   48 f7 e6                mul    %rsi
    114f:   49 b8 00 e4 0b 54 02    movabs $0x2540be400,%r8
    1156:   00 00 00 
    1159:   49 f7 f0                div    %r8
    115c:   48 89 c1                mov    %rax,%rcx
    115f:   48 89 4d f8             mov    %rcx,-0x8(%rbp)
    1163:   48 8b 45 f8             mov    -0x8(%rbp),%rax
    1167:   5d                      pop    %rbp
    1168:   c3                      retq

In the code for fixmd() in C the problem is quite clean: gcc uses signed multiplication (imul) for unsigned values and substitute division with multiplication on the coefficient (with overflow) and binary shifting. Seems, it is the bug, is not it?

The bug is in your expectation. `gcc` compiles correctly according to the C language rules, and is even nice enough to optimize your code by not using a slow division when it can easily compute the result of a division by a constant by instead multiplying with the multiplicative inverse. Remember to thank your compiler! — EOF, May 27 '21 at 15:49
Thank you. But I cannot imagine how I can thank my compiler: I am not sure it understand me... — Tedim, May 27 '21 at 19:22
Anyway the explanation of this problem was not quite satisfying because such mistakes are difficult to discover. I think it is not the standard or my expectation is guilty but hardware. Today Memory Management/Protection Unit (MMU/MPU) is the part of many processors and allow isolate program within its allocated memory and generate signal SIGSEGV when the memory borders are broken. In the case of integer division by zero the signal is SIGFPE. But there is absent signal for integer overflow. — Tedim, May 29 '21 at 13:12

dbush · Answer 1 · 2021-05-27T18:06:32.720

5

This is not a bug. Your expectations are incorrect.

Unsigned integer arithmetic is performed modulo the largest value the type can hold +1. Loosely speaking, this means that for a 64 bit type that anything that overflows 64 bits is truncated.

In your particular case, you're multiplying 12589254118 * 12589254118. The arithmetic result of this is 158,489,319,247,579,957,924. This is larger than what will fit in a 64 bit type, so the result is modulo 2⁶⁴ which gives us 10,915,366,657,903,544,996, and dividing by 10000000000 gives us 1,091,536,665 which matches what the C code generates.

gcc supports 128 bit types, so you can fix this by casting one of the operands to __int128 to perform the math with 128 bits.

uint64_t c = (unsigned __int128)a * b / 10000000000LL;

edited May 27 '21 at 18:06

answered May 27 '21 at 15:45

dbush

162,826
18
167
209

You'd normally want `unsigned __int128` to do unsigned division. (Fun fact: `div r64` is faster than `idiv r64` on Intel CPUs before Ice Lake, although GCC's signed `__int128` helper function will use unsigned `div` for the low part: [Why is \_\_int128\_t faster than long long on x86-64 GCC?](https://stackoverflow.com/a/63034921)). – Peter Cordes May 27 '21 at 18:01
Anyway, using inline will still be faster: GCC / clang helper functions only IIRC check for the upper half of the 128-bit dividend being 0, not just being less than the divisor (unsigned `d:a / x` only faults if `d>=x`), so they won't end up using a single `div` instruction for large product. And of course there's the function-call and checking overhead. So while it is possible to achieve correctness with pure GNU C, this is one of the rare use-cases for inline asm. (AFAIK, there isn't an intrinsic / built-in for division without the guard-rails, allowing you to use a custom high half.) – Peter Cordes May 27 '21 at 18:04
Thank you. It is my omit because I thought I was within uint64_t digit bit net. Yes, "unsigned __int128" helps to show the decision using only C language, but is not quite useful in practice because the compiled code calls __divti3 function which is complex and contains div instruction. In comparison with "mul %2", "mov $10000000000, %%r8", and "div %%r8", the "unsigned __int128" version is very very ineffective. – Tedim May 27 '21 at 19:18
@Tedim *because the compiled code calls __divti3 function which is complex* - have you tried this with optimization enabled? – tum_ May 27 '21 at 22:14
2

@tum_: Read my previous comments and the linked Q&A; GCC doesn't try to prove that `a * (unsigned __int128)b / 123456` can be done with a single widening-`mul` ; narrowing-`div` without the `div` faulting with a #DE if the 128/64-bit division result doesn't fit in a 64-bit register. (The C abstract machine wouldn't because it's doing 128 / 128-bit => 128-bit division). GCC just calls the libgcc helper function, and inside *that* function, checks for the high halves being zero. This is true even with optimization. – Peter Cordes May 28 '21 at 00:38
1

@tum_: _have you tried this with optimization enabled?_ Yes. "__divti3" is called anyway and consist of 126 instructions including some div-s. I can confirm that Peter Cordes says. – Tedim May 28 '21 at 06:20

Two uint64_t numbers multiplication and division on coefficient in the C program are wrong compiled with GCC

1 Answers1