
I'm writing math code that needs to multiply large numbers fast. It breaks down to multiplications of an array of integers by a single integer. In C++ this looks like this (on unsigned ints):

void muladd(unsigned* r, const unsigned* a, unsigned len, unsigned b) {
   unsigned __int64 of = 0;  // overflow
   unsigned i = 0;  // loop variable
   while (i < len) {
      of += (unsigned __int64)a[i] * b + r[i];
      r[i] = (unsigned)of;
      of >>= 32;
      ++i;
   }
   r[i] = (unsigned)of;  // save overflow
}
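
For reference, here is the same routine converted to 64-bit limbs, a sketch assuming GCC/Clang's unsigned __int128 (on MSVC one would use _umul128 instead); the name muladd64 is mine:

void muladd64(unsigned long long* r, const unsigned long long* a,
              unsigned long long len, unsigned long long b) {
   unsigned __int128 of = 0;  // overflow, as in the 32-bit version
   for (unsigned long long i = 0; i < len; ++i) {
      of += (unsigned __int128)a[i] * b + r[i];  // 64x64 -> 128-bit product
      r[i] = (unsigned long long)of;             // store the low limb
      of >>= 64;                                 // high limb becomes the carry
   }
   r[len] = (unsigned long long)of;  // save overflow
}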

I unrolled this loop manually, converted it to 64-bit, and worked on the compiler's .asm output to optimize it further. The main .asm loop now looks like this:

mov   rax, rdi                             ; rdi = b
mul   QWORD PTR [rbx+r10*8-64]             ; rdx:rax = a[i] * b; r10 = i
mov   rsi, QWORD PTR [r14+r10*8-64]        ; r14 = r; rsi = r[i]
add   rax, rsi
adc   rdx, 0
add   rax, r11                             ; r11 = of (low part)
adc   rdx, 0
mov   QWORD PTR [r14+r10*8-64], rax        ; save result
mov   r11, rdx

; this repeats itself 8 times with different offsets

When I benchmark this, I find that it takes about 6.3 cycles on average per multiplication on my Core2 Quad.

My question is: can I speed this up somehow? Unfortunately, I see no way to avoid one of the additions, and the multiplication always needs RDX:RAX, so I need to move the data around and cannot really "multiply in parallel".

Any ideas anyone?

Update: After some more testing, I've managed to bring the speed up to about 5.4 cycles per 64-bit MUL (that includes all add, move and loop overhead). I guess this is about the best you can get on a Core2, since the Core2 does not have a very fast MUL instruction: it has a throughput of one per 3 cycles and a latency of 6 (resp. 7) cycles. Sandy Bridge will be much better, with a throughput of one per cycle and a latency of 3 (resp. 4) cycles.
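
(A back-of-the-envelope check using those numbers: at one MUL issuing every 3 cycles, the multiplies alone put a floor of 3 cycles per limb on the loop, so 5.4 cycles per MUL means all the adds, stores and loop overhead together cost only about 2.4 extra cycles per limb.)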

Regarding the much lower number for GMP: I got that from their source code, and it seems to me that it is a theoretical number. But what's sure is that it was calculated for an AMD K9 CPU, and from what I have read, I gather that AMDs have a faster MUL unit than the (older) Intel chips.

cxxl
  • You might want to take a look at some of the assembly routines in GMP. They have a function that does exactly this and is written in assembly for most processors, including x64. – Mysticial Nov 14 '11 at 16:28
  • GMP really has good support for a fast mul_basecase, and apparently it takes them some 2.35 cycles per MUL, very nice. If I understood it correctly, they multiply two vectors interleaved; that seems to keep the dependencies low and improve overflow handling. – cxxl Nov 14 '11 at 20:24
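
To make the "two vectors interleaved" idea concrete: multiplying by two limbs of b at once decomposes into two single-limb passes, and the asm versions fuse those passes into one loop so the two independent MUL chains can overlap. A minimal C++ sketch of the decomposition, assuming GCC/Clang's unsigned __int128 and zero-initialized high limbs (the names addmul_1/addmul_2 follow GMP's naming, but the code is only an illustration):

#include <cstdint>
#include <cstddef>

// addmul_1: r[0..len-1] += a[0..len-1] * b; returns the carry-out limb.
static uint64_t addmul_1(uint64_t* r, const uint64_t* a, size_t len, uint64_t b) {
   unsigned __int128 of = 0;
   for (size_t i = 0; i < len; ++i) {
      of += (unsigned __int128)a[i] * b + r[i];
      r[i] = (uint64_t)of;
      of >>= 64;
   }
   return (uint64_t)of;
}

// addmul_2: r += a * (b0 + b1 * 2^64). Requires r[len] == r[len+1] == 0 on
// entry (true inside a basecase multiplication that fills r as it goes).
static void addmul_2(uint64_t* r, const uint64_t* a, size_t len,
                     uint64_t b0, uint64_t b1) {
   r[len]     = addmul_1(r,     a, len, b0);  // pass 1: the b0 products
   r[len + 1] = addmul_1(r + 1, a, len, b1);  // pass 2: b1 products, shifted one limb
}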

4 Answers


I once wrote a loop rather like this, with a minimal amount of processing on a lot of data, and the result was that the loop was limited by memory speed.

I'd try prefetching a[i] and r[i].

If you're using GCC, use the builtin function __builtin_prefetch(); in assembler, use the PREFETCHT0 instruction.

http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html

When this works, the results can be dramatic. So long as the loop is a thousand or so iterations long, I'd prefetch a[i+64] and r[i+64] as a starting point and see how much difference it makes on your CPU. You may need to try larger prefetch distances.
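
Applied to the question's loop, a sketch assuming GCC/Clang (__builtin_prefetch) and a prefetch distance of 64 limbs; both the distance and the once-per-cache-line stride would need tuning:

void muladd_prefetch(unsigned long long* r, const unsigned long long* a,
                     unsigned long long len, unsigned long long b) {
   unsigned __int128 of = 0;
   for (unsigned long long i = 0; i < len; ++i) {
      if (i % 8 == 0) {                      // once per 64-byte cache line
         __builtin_prefetch(&a[i + 64]);     // read prefetch, 64 limbs ahead
         __builtin_prefetch(&r[i + 64], 1);  // write prefetch (rw = 1)
      }
      of += (unsigned __int128)a[i] * b + r[i];
      r[i] = (unsigned long long)of;
      of >>= 64;
   }
   r[len] = (unsigned long long)of;
}

Prefetching a little past the end of the arrays is harmless on x86; the instruction never faults.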

camelccc
  • I tried that. The result was that it made no difference on my Core2 Quad. Skimming through the CPU manuals and Agner Fog's guides, I get the idea that today's processors have good prefetch logic that detects simple loops pretty well, so no manual prefetching is necessary. – cxxl Feb 28 '12 at 15:51

I'd just like to point out that cycle-counting is rather useless, as your instructions will be converted to microcode that will be executed out of order or paused depending on everything else the CPU is doing. If you have a fast routine, which you do, it's not really fruitful to try and shave off a theoretical cycle unless you know your routine will always run in complete isolation.

Tobias
  • The OP benchmarked his code and obviously got repeatable results. He did not count theoretical cycles; he actually measured practical cycles. The way instructions are translated to microcode and reordered is predictable and pretty well known (see www.agner.org). Furthermore, _complete_ isolation isn't needed for optimizing; an OS running the code in the background will usually not shave off more than a few percent, if at all. – Gunther Piez Feb 18 '12 at 12:21
  • This should be a comment, not a reply. – Mahmoud Al-Qudsi Feb 20 '12 at 07:23

Does r contain anything important before the call?

If it does, and you're accumulating onto it, then stop reading now.

If it doesn't (i.e. you're always accumulating onto zeros), and assuming you're invoking this function on arrays significantly larger than cache sizes, then I'd be looking for a way to eliminate the need to read from r and to convert the "save result" MOV to a MOVNT (_mm_stream_ps in intrinsics).

This can significantly improve performance. How? Currently your caches are fetching cache lines from a, fetching cache lines from r, and writing cache lines back to r. With so-called streaming stores you'd just fetch cache lines from a and write through straight to r: much less bus traffic. If you look at any modern CRT's memcpy implementation, it will switch to using streaming stores above some cache-size-related threshold (and run almost twice as fast as a memcpy using conventional moves).
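
To make the idea concrete, a sketch of the non-accumulating variant, assuming SSE2's MOVNTI via _mm_stream_si64 (the 64-bit integer counterpart of the _mm_stream_ps mentioned above) and a GCC/Clang unsigned __int128; the name mul_stream is mine:

#include <immintrin.h>

// r = a * b without ever reading r: the result limbs are written with
// non-temporal stores, so r's cache lines are never fetched.
void mul_stream(unsigned long long* r, const unsigned long long* a,
                unsigned long long len, unsigned long long b) {
   unsigned __int128 of = 0;
   for (unsigned long long i = 0; i < len; ++i) {
      of += (unsigned __int128)a[i] * b;  // no r[i] read any more
      _mm_stream_si64((long long*)&r[i], (long long)(unsigned long long)of);
      of >>= 64;
   }
   _mm_stream_si64((long long*)&r[len], (long long)(unsigned long long)of);
   _mm_sfence();  // order the streamed stores before later reads of r
}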

timday
  • That is very interesting. The `r` is empty on call of the function, but gets slowly filled up. Plus, after the function is done, I expect that it will be used for something (since it's the result :)). I expected MOVNT not to be advantageous, since we are filling `r` in a sequential manner. Agner Fog writes "the method of storing data without caching is advantageous if, and only if, a level-2 cache miss can be expected" ( http://www.agner.org/optimize/optimizing_cpp.pdf ). I think an L2 cache miss can be ruled out in 99% of cases. – cxxl Feb 29 '12 at 14:09

Looks like your routine could benefit from SSE. PMULLD and PADDD seem like relevant instructions. I'm not sure why your compiler does not produce SSE from that.

Jens Björnhager
  • That works for 32-bit x 32-bit multiplies. But not for 64-bit x 64-bit multiplies. – Mysticial Nov 14 '11 at 16:43
  • Do you really need qword multiplication when you only keep the most significant dword? – Jens Björnhager Nov 14 '11 at 16:51
  • I save RAX back to memory and RDX is used as carry (through R11) and is added into the next element. Unfortunately, I need QWORD MUL. – cxxl Nov 14 '11 at 18:19
  • Shucks. SSE doesn't seem to handle packed quads, what a pity. Still, if you somehow could split those quad multiplications into doubles and do four at a time, perhaps you could gain some performance. – Jens Björnhager Nov 14 '11 at 18:36
  • Again, if GMP is the answer, they have SSE2 code that uses PMULUDQ, but comments suggest it takes about 4.9 cycles per QWORD. So I better try the interleaved MULs. – cxxl Nov 14 '11 at 20:27
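
For anyone curious what the PMULUDQ primitive mentioned above buys: it performs two 32x32 -> 64-bit multiplies per instruction. A minimal SSE2 demo of just the primitive (the hard part, propagating carries between the 64-bit lanes, is not shown; the name mul32_demo is mine):

#include <cstdint>
#include <emmintrin.h>

// Multiplies a[0] and a[2] by b in one PMULUDQ, yielding two full 64-bit
// products. PMULUDQ uses the even 32-bit elements of each operand.
void mul32_demo(const uint32_t a[4], uint32_t b, uint64_t out[2]) {
   __m128i va   = _mm_loadu_si128((const __m128i*)a);
   __m128i vb   = _mm_set1_epi32((int)b);
   __m128i prod = _mm_mul_epu32(va, vb);  // { a[0]*b, a[2]*b }
   _mm_storeu_si128((__m128i*)out, prod);
}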