
I have the following code:

#include <immintrin.h>

void division_approximate(float a[], float b[], float c[], int n) {
    // c[i] = a[i] * (1 / b[i]);
    for (int i = 0; i < n; i+=8) {
        __m256 b_val = _mm256_loadu_ps(b + i);
        b_val = _mm256_rcp_ps(b_val);
        __m256 a_val = _mm256_loadu_ps(a + i);
        a_val = _mm256_mul_ps(a_val, b_val);
        _mm256_storeu_ps(c + i, a_val);
    }
}

void division(float a[], float b[], float c[], int n) {
    // c[i] = a[i] / b[i];
    for (int i = 0; i < n; i+=8) {
        __m256 b_val = _mm256_loadu_ps(b + i);
        __m256 a_val = _mm256_loadu_ps(a + i);
        a_val = _mm256_div_ps(a_val, b_val);
        _mm256_storeu_ps(c + i, a_val);
    }
}

I would expect division_approximate to be faster than division, but both functions take almost the same time on my AMD Ryzen 7 4800H. I don't understand why; I would expect division_approximate to be significantly faster. The issue reproduces with both GCC and Clang. Compiled with -O3 -march=core-avx2.

UPDATE

Here is the assembly generated by GCC 9.3 for both inner loops (disassembly from the debugger):

division
0x555555555c38 <division+88>    vmovups 0x0(%r13,%rax,4),%ymm3
0x555555555c3f <division+95>    vdivps (%r14,%rax,4),%ymm3,%ymm0
0x555555555c45 <division+101>   vmovups %ymm0,(%rbx,%rax,4)
0x555555555c4a <division+106>   add    $0x8,%rax
0x555555555c4e <division+110>   cmp    %eax,%r12d
0x555555555c51 <division+113>   jg     0x555555555c38 <division+88>

division_approximate
0x555555555b38 <division_approximate+88>        vrcpps (%r14,%rax,4),%ymm0
0x555555555b3e <division_approximate+94>        vmulps 0x0(%r13,%rax,4),%ymm0,%ymm0
0x555555555b45 <division_approximate+101>       vmovups %ymm0,(%rbx,%rax,4)
0x555555555b4a <division_approximate+106>       add    $0x8,%rax
0x555555555b4e <division_approximate+110>       cmp    %eax,%r12d
0x555555555b51 <division_approximate+113>       jg     0x555555555b38 <division_approximate+88>

Both codes take almost exactly the same amount of time to execute (318 ms vs 319 ms) for n = 256 * 1024 * 1024.
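For context, a minimal timing harness for these two functions might look like the following. This is an illustrative sketch only; the exact harness, allocation, and initialization used for the numbers above are not shown here.

#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void division(float a[], float b[], float c[], int n);               // from above
void division_approximate(float a[], float b[], float c[], int n);   // from above

// Illustrative sketch of a timing harness, not the exact benchmark code.
static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    int n = 256 * 1024 * 1024;                       // 1 GiB per array
    float *a = malloc(sizeof(float) * n);
    float *b = malloc(sizeof(float) * n);
    float *c = malloc(sizeof(float) * n);
    for (int i = 0; i < n; i++) { a[i] = 1.0f + i % 7; b[i] = 2.0f + i % 5; }

    double t0 = seconds();
    division(a, b, c, n);
    double t1 = seconds();
    division_approximate(a, b, c, n);
    double t2 = seconds();

    printf("division:             %.0f ms\n", (t1 - t0) * 1e3);
    printf("division_approximate: %.0f ms\n", (t2 - t1) * 1e3);
    free(a); free(b); free(c);
    return 0;
}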

Bogi
  • How big is `n`? Did you measure if you are not bound by memory throughput? – chtz May 25 '21 at 13:16
  • `n` is huge, like 256 MB. I don't think it is bound by memory throughput, do you know a way to test this? – Bogi May 25 '21 at 13:58
  • Zen2 has one per 3.5 cycle `vdivps ymm` throughput. (https://uops.info). Are you achieving that with your exact division function? If not, you're bottlenecking on memory. – Peter Cordes May 25 '21 at 15:42
  • 256MB won't even fit into L3 cache, I guess. You can run your code on a 256kB/25.6kB/2.56kB array but 1000/10000/100000 times (make sure the array is not too small, or you may get other dependencies -- of course you can scale this differently as well). – chtz May 25 '21 at 15:42
  • (And yes, division throughput is one of the few use-cases for `rcpps`. If you can't usefully combine this pass with other work, to keep other vector ALUs busy while `vdivps` works ([Floating point division vs floating point multiplication](https://stackoverflow.com/a/45899202)), then spending more uops to get a less accurate result faster can be worth it. But ideally improve your computational intensity by doing more with your data in each pass over it, and/or cache-block your algorithms to do each step of a calc over the first 128k or so (half typical L2 size) before moving on). – Peter Cordes May 25 '21 at 15:47
  • BTW, my earlier comment about vdivps usually being worth it was for the case where you need a Newton iteration to make it sufficiently accurate. But you're not doing that. If raw rcpps + mulps are accurate enough (like 11-bit precision IIRC), then the extra front-end throughput cost is a lot smaller, and it can be lower latency (in this case, that just means less for OoO exec to hide). – Peter Cordes May 25 '21 at 21:10
  • You can improve the loop to use non temporal streaming store with `_mm256_stream_ps` (be aware of alignment). This can be up to 30% faster (on processors doing write allocate). – Jérôme Richard May 26 '21 at 17:48
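A sketch of the non-temporal-store variant suggested in the last comment (illustrative, not the code actually benchmarked; `division_approximate_stream` is a hypothetical name, and it assumes `c` is 32-byte aligned and `n` is a multiple of 8):

#include <immintrin.h>

// Sketch: same approximate division, but with streaming stores so c[] is not
// read into cache first (no read-for-ownership on write-allocating CPUs).
// Assumes c is 32-byte aligned and n is a multiple of 8.
void division_approximate_stream(float a[], float b[], float c[], int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 b_val = _mm256_rcp_ps(_mm256_loadu_ps(b + i));
        __m256 a_val = _mm256_mul_ps(_mm256_loadu_ps(a + i), b_val);
        _mm256_stream_ps(c + i, a_val);   // non-temporal store, bypasses the cache
    }
    _mm_sfence();   // order the NT stores before later loads/stores observe the data
}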

1 Answer


(256 * 1024 * 1024 floats) * 4 (bytes per float) / 0.318 s / 1000^3 * 4 (memory streams) is about 13.5 GB/s of DRAM bandwidth, or about 10.1 GB/s of useful stream bandwidth. (Assuming stores actually cost read+write bandwidth for the RFO; as Jérôme points out, _mm256_stream_ps could make the stores cost once, not twice.)

IDK if that's good or bad for single-threaded triad bandwidth on your Zen 2, but that's only (256 * 1024 * 1024 / 8) / 0.318 / 1000^3 = ~0.1055 vectors (of 8 floats) per nanosecond; something Zen 2's vdivps could keep up with at about 0.37 GHz (0.1055 vec/ns * 3.5 cycles/vec ≈ 0.37 cycles/ns, aka GHz). I assume your CPU is faster than that :P

So very obviously a memory bottleneck, nowhere near Zen2's one per 3.5 cycle vdivps ymm throughput. (https://uops.info/). Use a much smaller array (that fits in L1 or at least L2 cache) and loop over it multiple times.
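For example (a sketch with illustrative sizes, not a prescription): 2048 floats per array is 8 KiB each, 24 KiB total, which fits in a 32 KiB L1d, and the repeat count below keeps the total element count the same as the original run.

#include <immintrin.h>

void division(float a[], float b[], float c[], int n);               // from the question
void division_approximate(float a[], float b[], float c[], int n);   // from the question

// Sketch: loop many times over a buffer small enough to stay in L1d, so the
// measurement reflects divider/ALU throughput instead of DRAM bandwidth.
enum { SMALL_N = 2048, REPEATS = 128 * 1024 };   // 2048 * 131072 == 256 * 1024 * 1024 elements total

void benchmark_in_cache(float a[], float b[], float c[]) {
    for (int r = 0; r < REPEATS; r++)
        division(a, b, c, SMALL_N);              // or division_approximate(a, b, c, SMALL_N)
}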


Try to avoid writing loops like this in real-world code. The computational intensity (work done each time you load data into L1 cache or registers) is very low. Do this as part of another pass, or use cache-blocking: do it for a small part of the input, then consume that small part of the output while it's still hot in cache. (That's much better than using _mm256_stream_ps to bypass cache.)
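A minimal sketch of what that cache-blocking could look like (BLOCK and consume_block are hypothetical placeholders, not part of the question's code; the block size here is chosen so the a/b/c chunks together stay well within a Zen 2 core's 512 KiB L2):

#include <immintrin.h>

void division(float a[], float b[], float c[], int n);   // the loop from the question
void consume_block(const float c[], int len);             // hypothetical next pipeline stage

// Sketch of cache-blocking: divide one chunk, then immediately run the next
// processing step on that chunk while it is still hot in L2 (or even L1).
// 16K floats = 64 KiB per array, so the a/b/c chunks together are 192 KiB.
enum { BLOCK = 16 * 1024 };   // multiple of 8, matching the vector-width assumption above

void divide_and_consume(float a[], float b[], float c[], int n) {
    for (int start = 0; start < n; start += BLOCK) {
        int len = (n - start < BLOCK) ? (n - start) : BLOCK;
        division(a + start, b + start, c + start, len);
        consume_block(c + start, len);   // use the results before they are evicted
    }
}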

When mixed with other operations (lots of FMAs / mul / add), vdivps is often a better choice than rcpps + a Newton iteration (which is normally needed for acceptable accuracy: raw rcpps is only about 11-bit, IIRC). vdivps is only a single uop, vs. separate rcpps and mulps uops. (Although a separate vmovups load is needed anyway for one of the inputs, and Zen has no problem folding a memory source operand into a single uop.) See also Floating point division vs floating point multiplication re: front-end throughput vs. divide-unit bottlenecks if you're just dividing, not mixing it with other operations.
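For reference, a sketch of what rcpps plus one Newton iteration looks like (not code from the question; `division_refined` is a hypothetical name, FMA is assumed, which both -march=core-avx2 and -march=znver2 enable, and n is assumed to be a multiple of 8):

#include <immintrin.h>

// Sketch: approximate a/b with one Newton-Raphson refinement of vrcpps.
// x1 = x0 * (2 - b*x0), written in FMA form as x1 = x0 + x0*(1 - b*x0).
// One iteration roughly doubles the bits of precision of the initial estimate.
void division_refined(float a[], float b[], float c[], int n) {
    const __m256 one = _mm256_set1_ps(1.0f);
    for (int i = 0; i < n; i += 8) {
        __m256 bv = _mm256_loadu_ps(b + i);
        __m256 x0 = _mm256_rcp_ps(bv);                 // low-precision reciprocal estimate
        __m256 e  = _mm256_fnmadd_ps(bv, x0, one);     // e  = 1 - b*x0
        __m256 x1 = _mm256_fmadd_ps(x0, e, x0);        // x1 = x0 + x0*e
        __m256 av = _mm256_loadu_ps(a + i);
        _mm256_storeu_ps(c + i, _mm256_mul_ps(av, x1));
    }
}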

Of course it's great if you can avoid division entirely, e.g. hoist a reciprocal out of a loop and just multiply, but modern CPUs have good enough divider HW that there's often not much to gain from rcpps even if you aren't memory bottlenecked. e.g. when evaluating a polynomial approximation as the ratio of two polynomials, the number of FMAs is usually enough to hide the throughput cost of vdivps, and a Newton iteration would cost more FMA uops.
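For example, dividing a whole array by a loop-invariant scalar is better written as one scalar division hoisted out of the loop and a multiply inside it (a sketch; `divide_by_scalar` is a hypothetical name, and results can differ from true per-element division in the last bit):

#include <immintrin.h>

// Sketch: c[i] = a[i] / d for a loop-invariant divisor d.
// Compute 1/d once (a single exact scalar division), then only multiply in the loop.
// Assumes n is a multiple of 8, like the loops in the question.
void divide_by_scalar(float a[], float c[], float d, int n) {
    const __m256 inv_d = _mm256_set1_ps(1.0f / d);   // hoisted reciprocal
    for (int i = 0; i < n; i += 8) {
        __m256 av = _mm256_loadu_ps(a + i);
        _mm256_storeu_ps(c + i, _mm256_mul_ps(av, inv_d));
    }
}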


Also, why -march=core-avx2 when you don't have an Intel "core" microarchitecture? Use -march=native or -march=znver2. Unless you're intentionally benchmarking binaries tuned for Intel running on your AMD CPU.

Peter Cordes