It seems to me that I don't completely understand the concept of FLOPS. In the CUDA samples, there is a matrix multiplication example (0_Simple/matrixMul). In this example, the number of FLOPs (floating-point operations) per matrix multiplication is calculated via the formula:
double flopsPerMatrixMul = 2.0 * (double)dimsA.x * (double)dimsA.y * (double)dimsB.x;
So this means that, in order to multiply a matrix A (n x m) by a matrix B (m x k), we need to perform 2*n*m*k floating-point operations.
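If I understand the sample correctly, this count is then divided by the measured kernel time to report GFLOP/s. Here is a minimal, self-contained sketch of that calculation; the dimensions, the timing value, and the struct are hypothetical stand-ins of mine, not necessarily the sample's:

#include <cstdio>

// Hypothetical stand-in for the matrix dimensions used in the sample
// (in the sample, A is dimsA.y rows by dimsA.x columns).
struct Dims { double x, y; };

int main() {
    Dims dimsA{320.0, 320.0};       // hypothetical sizes, not the sample's defaults
    Dims dimsB{640.0, 320.0};       // dimsB.y must equal dimsA.x for A*B to exist
    double msecPerMatrixMul = 1.5;  // hypothetical measured kernel time in milliseconds

    // The FLOP count from the sample's formula.
    double flopsPerMatrixMul = 2.0 * dimsA.x * dimsA.y * dimsB.x;

    // FLOPs divided by seconds, scaled to GFLOP/s.
    double gigaFlops = (flopsPerMatrixMul * 1.0e-9) / (msecPerMatrixMul / 1000.0);
    std::printf("Performance = %.2f GFLOP/s\n", gigaFlops);
    return 0;
}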
However, in order to calculate one element of the resulting matrix C (n x k), one has to perform m multiplications and (m-1) additions. So the total number of operations (to calculate all n*k elements) is m*n*k multiplications and (m-1)*n*k additions.
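To make my counting concrete, here is a minimal CPU sketch of the naive algorithm with explicit counters (plain C++, not the CUDA kernel from the sample); it confirms m*n*k multiplications and (m-1)*n*k additions:

#include <cstdio>
#include <vector>

// Naive matrix multiplication C = A * B with explicit FLOP counters.
// A is n x m, B is m x k, C is n x k (row-major).
int main() {
    const int n = 4, m = 3, k = 5;  // small sizes, just for the check
    std::vector<double> A(n * m, 1.0), B(m * k, 1.0), C(n * k, 0.0);

    long long mults = 0, adds = 0;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < k; ++j) {
            double sum = A[i * m + 0] * B[0 * k + j];  // first product, no addition yet
            ++mults;
            for (int p = 1; p < m; ++p) {
                sum += A[i * m + p] * B[p * k + j];    // one multiply and one add
                ++mults;
                ++adds;
            }
            C[i * k + j] = sum;
        }
    }

    std::printf("mults = %lld (m*n*k = %d)\n", mults, m * n * k);
    std::printf("adds  = %lld ((m-1)*n*k = %d)\n", adds, (m - 1) * n * k);
    return 0;
}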
Of course, we could count the additions as m*n*k as well (for example, if the accumulator starts at zero, each of the m inner iterations performs one addition), and the total number of operations then becomes 2*n*m*k, half of them multiplications and half additions.
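With the accumulator initialized to zero, the same sketch counts exactly two operations per inner iteration:

#include <cstdio>
#include <vector>

// Same check as above, but with sum starting at zero: every one of the m
// inner iterations now performs one multiplication and one addition.
int main() {
    const int n = 4, m = 3, k = 5;
    std::vector<double> A(n * m, 1.0), B(m * k, 1.0), C(n * k, 0.0);

    long long ops = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < k; ++j) {
            double sum = 0.0;
            for (int p = 0; p < m; ++p) {
                sum += A[i * m + p] * B[p * k + j];  // 1 multiply + 1 add
                ops += 2;
            }
            C[i * k + j] = sum;
        }

    std::printf("ops = %lld, 2*n*m*k = %d\n", ops, 2 * n * m * k);
    return 0;
}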
But, I guess, multiplication is more computationally expensive than addition. Why are these two types of operations lumped together? Is this always the case in computer science? How can one take two different types of operations into account?
Sorry for my English)