According to Agner's instruction tables, a single floating-point division is slower than a reciprocal op followed by a multiply op. (This seems to be common among the x86 architectures measured.)
This is an excerpt from the table for the Piledriver architecture:
Instruction      Operands  Ops  Latency  Reciprocal throughput  Execution pipes  Domain
MULSS MULSD      x,x/m     1    5-6      0.5                    P01              fma
MULPS MULPD      x,x/m     1    5-6      0.5                    P01              fma
VMULPS VMULPD    y,y,y/m   2    5-6      1                      P01              fma
DIVSS DIVPS      x,x/m     1    9-24     5-10                   P01              fp
VDIVPS           y,y,y/m   2    9-24     9-20                   P01              fp
DIVSD DIVPD      x,x/m     1    9-27     5-10                   P01              fp
VDIVPD           y,y,y/m   2    9-27     9-18                   P01              fp
RCPSS/PS         x,x/m     1    5        1                      P01              fp
The 4th column is latency, in cycles. So the multiply ops take 5-6 cycles, the single-precision division ops take 9-24 (9-27 for double precision), and the reciprocal op takes 5. Since 24 > 6 + 5, I'm wondering why two separate ops can be faster than a single op that gives essentially the same result.
I suspect the answer involves accuracy. Perhaps division is much more accurate than reciprocal plus multiply. If so, how do the errors compare? Is the relationship linear, for example: since division is roughly twice as slow as reciprocal + multiply in the worst case, is it also twice as accurate?
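To put a number on the error, something like the following sketch could be used. It assumes an x86 compiler that provides the SSE intrinsics header `<xmmintrin.h>`, where `_mm_div_ss`, `_mm_rcp_ss` and `_mm_mul_ss` map to DIVSS, RCPSS and MULSS. The function names `div_exact`, `div_approx` and `max_rel_error` are made up for illustration, and the choice of divisors in [1, 2) is arbitrary. For reference, Intel's documentation bounds the relative error of RCPSS by 1.5 * 2^-12; I'm assuming AMD parts behave similarly.

```c
#include <math.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_div_ss, _mm_rcp_ss, _mm_mul_ss */

/* a / b using a single DIVSS. */
static float div_exact(float a, float b) {
    return _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(a), _mm_set_ss(b)));
}

/* a * (1/b) using RCPSS (approximate reciprocal) followed by MULSS. */
static float div_approx(float a, float b) {
    __m128 r = _mm_rcp_ss(_mm_set_ss(b));
    return _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(a), r));
}

/* Largest relative error of div_approx against div_exact,
   computing 1/b for n evenly spaced divisors b in [1, 2). */
static double max_rel_error(int n) {
    double worst = 0.0;
    for (int i = 0; i < n; ++i) {
        float b = 1.0f + (float)i / (float)n;
        double exact = (double)div_exact(1.0f, b);
        double approx = (double)div_approx(1.0f, b);
        double err = fabs(approx - exact) / fabs(exact);
        if (err > worst) {
            worst = err;
        }
    }
    return worst;
}
```

Calling `max_rel_error(1000)` from a `main` and printing the result would show how far the two-op version strays from DIVSS on a given machine. The exact figure depends on the CPU, so I haven't quoted one here.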