This question is about packed, single-precision floating-point ops with XMM/YMM registers on Haswell.
So according to the awesome, awesome table put together by Agner Fog, I know that MUL can be done on either port p0 or p1 (with a reciprocal throughput of 0.5), while ADD is done only on port p1 (with a reciprocal throughput of 1). I can accept this limitation, BUT I also know that FMA can be done on either port p0 or p1 (with a reciprocal throughput of 0.5). So it is confusing to me why a plain ADD would be limited to p1 only, when FMA can use either p0 or p1 and it does both an ADD and a MUL. Am I misunderstanding the table? Or can someone explain why that would be?
That is, if my reading is correct, why wouldn't Intel just use the FMA op as the basis for both plain MUL and plain ADD, and thereby increase the throughput of ADD as well as MUL? Alternatively, what would stop me from using two simultaneous, independent FMA ops to emulate two simultaneous, independent ADD ops? What are the penalties associated with doing ADD-by-FMA? Obviously, a greater number of registers is used (2 registers for ADD vs. 3 registers for ADD-by-FMA), but other than that?