
This question concerns packed, single-precision floating-point operations with XMM/YMM registers on Haswell.

So according to the awesome, awesome table put together by Agner Fog, I know that MUL can be done on either port p0 or p1 (with reciprocal throughput of 0.5), while ADD can be done only on port p1 (with reciprocal throughput of 1). I can accept this limitation, BUT I also know that FMA can be done on either port p0 or p1 (with reciprocal throughput of 0.5). So it is confusing to me why a plain ADD would be limited to only p1, when FMA can use either p0 or p1 and performs both an ADD and a MUL. Am I misunderstanding the table? Or can someone explain why that would be?

That is, if my reading is correct, why wouldn't Intel just use the FMA op as the basis for both plain MUL and plain ADD, thereby increasing the throughput of ADD as well as MUL? Alternatively, what would stop me from using two simultaneous, independent FMA ops to emulate two simultaneous, independent ADD ops? What are the penalties associated with doing ADD-by-FMA? Obviously, a greater number of registers is used (2 registers for ADD vs. 3 for ADD-by-FMA), but other than that?

codechimp
    Pure speculation: The FPU on port-0 for Haswell can only handle 5-cycle instructions. It doesn't have "early-out" logic that lets it handle both 3 and 5-cycle instructions. FP-add is a 3-cycle instruction, therefore it can't go into port-0. – Mysticial Mar 04 '15 at 19:21
  • As a long overdue update: Intel did end up using the FMA for ADDs as well - on Skylake that is. Skylake reduces the FMA latency to 4 cycles. That seems to have been enough of a trade-off for them to do away with the dedicated 3-cycle FP-ADD and shove it into the 4-cycle FMA hardware. So now we have dual-issue FP-ADD as well. – Mysticial Jun 30 '17 at 19:53

1 Answer


You're not the only one confused as to why Intel did this. Agner Fog, in his microarchitecture manual, writes about Haswell:

It is strange that there is only one port for floating point addition, but two ports for floating point multiplication.

On Agner's message board he also writes

There are two execution units for floating point multiplication and for fused multiply-and-add, but only one execution unit for floating point addition. This design appears to be suboptimal since floating point code typically contains more additions than multiplications.

That thread continues with more information on the subject, which I suggest you read but won't quote here.

He also discusses it in his answer to flops-per-cycle-for-sandy-bridge-and-haswell-sse2-avx-avx2:

The latency of FMA instructions on Haswell is 5 and the throughput is 2 per clock. This means that you must keep 10 parallel operations going to get the maximum throughput. If, for example, you want to add a very long list of f.p. numbers, you would have to split it in ten parts and use ten accumulator registers.

This is possible indeed, but who would make such a weird optimization for one specific processor?
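As a sketch of what Agner describes, here is the ten-accumulator idea in plain scalar C (on Haswell each accumulator would be a YMM register and each `+=` an independent FMA, so the 5-cycle latency is hidden; the split into exactly ten chains comes from latency × throughput = 5 × 2):

```c
#include <stddef.h>

/* Sum a long array using 10 independent accumulators so that
 * consecutive adds do not depend on each other.  With a 5-cycle
 * FMA latency and 2-per-clock throughput, 5 * 2 = 10 dependency
 * chains must be in flight to reach peak throughput. */
float sum10(const float *x, size_t n) {
    float acc[10] = {0};
    size_t i = 0;
    for (; i + 10 <= n; i += 10)
        for (int k = 0; k < 10; k++)
            acc[k] += x[i + k];    /* 10 independent chains */
    for (; i < n; i++)             /* handle the tail */
        acc[0] += x[i];
    float s = 0;                   /* combine the partial sums */
    for (int k = 0; k < 10; k++)
        s += acc[k];
    return s;
}
```

Note that this changes the order of the additions, so with floating point the result can differ slightly from a strictly sequential sum.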

His answer there basically answers your question: you can use FMA to double the throughput of addition. In fact, I do this in my throughput tests for addition and indeed see the throughput double.

In summary, for addition: if your calculation is latency-bound, don't use FMA, use ADD. But if it's throughput-bound, you can try using FMA (by setting the multiplier to 1.0), though you will probably need many AVX registers to do this.

I unrolled 10 times to get maximum throughput here: loop-unrolling-to-achieve-maximum-throughput-with-ivy-bridge-and-haswell

Z boson
  • "who would make such a weird optimization for one specific processor?" - Prime95 does it. And I've done it as well. It's not difficult at all when all your intrinsics go through custom macros. – Mysticial Mar 05 '15 at 09:07
  • @Mysticial, yeah, I did it for my throughput tests as well. But I have not done it for anything useful yet. I guess for my GEMM code, but then I already unrolled 8x anyway, and going from 8x to 10x barely makes a difference. – Z boson Mar 05 '15 at 09:09
  • Thanks for the feedback. I did not see Agner's comment on this issue; I've only been studying his table. I will take a look at his other notes. I understand the point about the latency/throughput trade-off, though I'm still learning the fine art of managing the two. What I was more unsure of was whether there would be some non-intuitive port conflict or precision error. – codechimp Mar 05 '15 at 17:54
  • @Z boson, thanks for your link on loop unrolling. But I had already found it yesterday, prior to posting this question, and have already bookmarked it for study. Intuitively, I understand the value of unrolling, but the technical details of managing latency and throughput are not similarly intuitive to me, yet. – codechimp Mar 05 '15 at 18:05