Understanding FMA instructions performance

Question

I'm trying to understand how I can max out the number of operations I can get on my CPU. I'm writing a simple matrix multiplication program, and I have a Skylake processor. I was looking at the Wikipedia page for the FLOPS information on this architecture, and I'm having difficulty understanding it.

From my understanding, FMA instructions take three floating-point inputs, right? And they allow mixing adds and multiplies between them. But what happens when I only add two floats? Does it simply multiply one of them by one? Can I add three floats in one cycle, or will that be split? I saw that Skylake has 32 FLOPs/cycle for single-precision inputs, but what is the meaning of "two 8-wide FMA instructions"?

Thank you in advance for the explanations

This question becomes more interesting if you compare Haswell and Skylake. Haswell can only do one AVX add per clock cycle but two FMA operations per clock cycle. This means that you can double your addition throughput by using two FMA operations multiplying by 1.0. On the other hand, the latency for FMA is 5 cycles whereas addition is 3 cycles on Haswell, so you have to use 10 parallel accumulators (5 cycles of latency × 2 per cycle) to get the maximum throughput WITH FMA, whereas you only need 3 with addition. On Skylake, addition and FMA have the same latency and throughput, so there is no reason to use FMA for addition. — Z boson, Feb 14 '17 at 10:14
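The "parallel accumulators" idea in the comment above can be sketched in plain C. This is only an illustration under stated assumptions: the function name `dot_multi_acc` and the constant `NACC` are hypothetical, `NACC` is set to 10 to match the Haswell FMA figure quoted in the comment, and the code is scalar for readability (in a real kernel each accumulator would typically be a 256-bit vector, and whether the compiler emits FMA instructions depends on its flags).

```c
#include <stddef.h>

/* Hypothetical accumulator count: 5 cycles of FMA latency x 2 FMAs per cycle,
 * the Haswell figure from the comment above. */
#define NACC 10

/* Dot product that keeps NACC independent running sums, so consecutive
 * multiply-adds do not have to wait on each other's results. */
float dot_multi_acc(const float *a, const float *b, size_t n)
{
    float acc[NACC] = {0.0f};
    size_t i = 0;

    /* Main loop: each acc[k] forms its own dependency chain. */
    for (; i + NACC <= n; i += NACC) {
        for (size_t k = 0; k < NACC; ++k) {
            acc[k] += a[i + k] * b[i + k];
        }
    }

    /* Tail: leftover elements go into the first accumulator. */
    for (; i < n; ++i) {
        acc[0] += a[i] * b[i];
    }

    /* Combine the partial sums once, at the end. */
    float sum = 0.0f;
    for (size_t k = 0; k < NACC; ++k) {
        sum += acc[k];
    }
    return sum;
}
```

With a single accumulator, every multiply-add would depend on the previous one and the loop would run at one operation per latency period. Note that splitting the sum this way changes the order of the floating-point additions, so the result can differ slightly from a single-accumulator loop.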

score 8 · Accepted Answer · answered Jan 08 '17 at 00:16

FMA calculates ± a*b ± c in a single operation, with a single rounding error. That's what it does, nothing else. Calculating a + b + c cannot be done using an FMA instruction; you need two dependent ADD operations for that.

Depending on the compiler, you may have to turn on a compiler option to allow the use of FMA instructions, because they don't give results identical to a multiply followed by an add. And you may have to rearrange your code in some cases. For example, a*b + c*d + e will be calculated as x = a*b; y = FMA (c, d, x); z = y + e, but e + a*b + c*d will be calculated as x = FMA (a, b, e); z = FMA (c, d, x). Similarly, the basic butterfly of a radix-2 FFT takes ten floating-point operations when written with separate multiplies and adds (four multiplies and six adds), and can be rewritten as six fused operations.

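"Two 8-wide FMA instructions" means that each FMA instruction operates on 256-bit vector registers holding 8 floats each, and that the core can execute two such instructions in the same cycle. Counting each FMA as two floating-point operations (a multiply and an add), that gives 2 instructions × 8 lanes × 2 operations = 32 FLOPs/cycle.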

One way to make it clear to the compiler that it's ok to use the fused multiply-add assembly instruction is to use the `fma`, `fmaf`, `fmal` functions in the source code, but then if the compiler is set to generate backwards-compatible code and to respect the difference between fma and “`*` followed by `+`”, these functions will be compiled as expensive sequences of many instructions, either like https://sourceware.org/bugzilla/attachment.cgi?id=6017 or like https://sourceware.org/ml/libc-hacker/2010-10/msg00005.html — Pascal Cuoq, Jan 08 '17 at 23:52
It would be awesome if there was a fast single rounding mode `a + b + c` instruction. This would make `double-double` addition fast which currently is much slower than `double-double` multiplication with FMA. http://stackoverflow.com/a/30643684/2542702 — Z boson, Feb 14 '17 at 10:18
