
I need to find the index of the first value that is X% or more below the last rolling-maximum peak.

The peak is a rolling maximum of the elements in one array (highs), while the values are in another array (lows). The arrays have equal length, every value is guaranteed to be <= the corresponding item in the peaks array, and there are no 0, NaN, or infinity elements. since is guaranteed to be less than till.
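For example, with made-up numbers and a 15% trail (so the internal fraction is 1 - 0.15 = 0.85):

    double highs[] = {10, 12, 11, 13, 12.5};
    double lows[]  = { 9, 11, 10.5, 11.5, 10.9};
    // rolling max of highs:  10, 12, 12, 13, 13
    // lows[i] / rolling max: 0.90, 0.92, 0.875, 0.88, 0.84
    // the first ratio <= 0.85 is at index 4, so 4 is the answer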

Iterative implementation is straightforward:

inline
size_t trail_max_intern(double *highs,
        double *lows,
        double max,
        double trail,
        size_t since,
        size_t till)
{
    for (; since < till; ++since) {
        // keep the rolling maximum of the peaks up to date
        if (max < highs[since]) {
            max = highs[since];
        }

        // stop at the first value that dropped to the given fraction of the peak
        if (lows[since] / max <= trail) {
            break;
        }
    }

    return since;
}

size_t trail_max_iter(double *highs, double *lows, double trail, size_t since, size_t till)
{
    trail = 1 - trail;
    return trail_max_intern(highs, lows, highs[since], trail, since, till);
}

This is basically a linear search with a slightly modified condition. Because of the task context (arbitrary since, till, and trail values), no other data structure or algorithm can be used.

To get some speedup I thought I'd use the AVX2 vector extension and see what happens. My result looks like this:

size_t trail_max_vec(double *highs, double *lows, double trail, size_t since, size_t till)
{
    double max = highs[since];
    trail = 1 - trail;

    if (till - since > 4) {
        __m256d maxv = _mm256_set1_pd(max);
        __m256d trailv = _mm256_set1_pd(trail);

        // stop early enough that each 4-element load stays within [since, till)
        for (size_t last = till - 3; since < last; since += 4) {
            __m256d chunk = _mm256_loadu_pd(highs + since); // load peak block

            // peak rolling maximum computation
            maxv = _mm256_max_pd(maxv, chunk);
            // propagating the maximum value to places 2, 3, and 4
            maxv = _mm256_max_pd(maxv, _mm256_permute4x64_pd(maxv, 0b10010000));
            maxv = _mm256_max_pd(maxv, _mm256_permute4x64_pd(maxv, 0b01000000));
            maxv = _mm256_max_pd(maxv, _mm256_permute4x64_pd(maxv, 0b00000000));

            // divide lows by rolling maximum
            __m256d res = _mm256_div_pd(_mm256_loadu_pd(lows + since), maxv);
            // and if it is lower than the given fraction return its index
            int found = _mm256_movemask_pd(_mm256_cmp_pd(res, trailv, _CMP_LE_OQ));
            if (found) {
                return since + __builtin_ctz(found);
            }

            maxv = _mm256_set1_pd(maxv[3]);
        }

        max = maxv[3];
    }

    // make sure trailing elements are seen
    return trail_max_intern(highs, lows, max, trail, since, till);
}

It yields correct results, but it is ~2 times slower than the iterative version. I'm definitely doing something wrong, but I can't figure out what.

So my question is, what is wrong with my approach and how do I fix it?

P. S. Complete source with benchmarks is available at https://godbolt.org/z/e5YrTo

Daniel
  • Your problem description is incomprehensible. That said, did you know the following fun fact: divisions tend to be much slower than multiplications? – EOF Feb 07 '21 at 15:04
  • @EOF is it better now? I do, although it is used in both versions and I'm not sure if it is the culprit. – Daniel Feb 07 '21 at 15:08
  • I'm still not convinced I understand what you're trying to do. What I *do* see is that you're not using the same algorithm for your vectorized function. You're recomputing the prefix maximum within the function, rather than loading the precomputed prefix maximum from the function parameter. Since you need to shuffle the values around, you're wasting time there, potentially more than you save on memory access time. I'm not sure your compiler can figure out a way to replace the division with a multiplication in the scalar code, and it *probably* doesn't in the vector code. – EOF Feb 07 '21 at 15:21
  • @EOF "prefix maximum" is something that is being updated as soon as a higher peak is seen, it is not constant. I got the message about the division and I'll try to replace div with mul manually, thanks. – Daniel Feb 07 '21 at 15:24
  • @EOF replacing division with multiplication has improved the speed somewhat for both versions, but it did not affect the difference between the iterative & vectorized versions. The vectorized version is still ~2x slower. – Daniel Feb 07 '21 at 15:36

2 Answers


This creates a very long dependency chain:

        // peak rolling maximum computation
        maxv = _mm256_max_pd(maxv, chunk);
        // propagating the maximum value to places 2, 3, and 4
        maxv = _mm256_max_pd(maxv, _mm256_permute4x64_pd(maxv, 0b10010000));
        maxv = _mm256_max_pd(maxv, _mm256_permute4x64_pd(maxv, 0b01000000));
        maxv = _mm256_max_pd(maxv, _mm256_permute4x64_pd(maxv, 0b00000000));

        // ...

        maxv = _mm256_set1_pd(maxv[3]);

i.e., you have 8 instructions all depending on the result of the previous instruction. Assuming each has a latency of 3 cycles, this alone requires 24 cycles for 4 elements (the remaining operations can likely happen within that time, especially if you compare lows[since] <= trail * max -- assuming max > 0).

To reduce the dependency chain, you should first compute the "local" rolling maximum within chunk and compute the maximum of that with maxv afterwards:

        chunk = _mm256_max_pd(chunk, _mm256_movedup_pd(chunk));                  // [c0, c0c1, c2, c2c3]
        chunk = _mm256_max_pd(chunk, _mm256_permute4x64_pd(chunk, 0b01010100));  // [c0, c0c1, c0c1c2, c0c1c2c3]

        __m256d max_local = _mm256_max_pd(maxv, chunk);

To compute the next maxv you can either broadcast max_local[3] (giving a total latency of ~6 cycles), or first broadcast chunk[3] and compute the maximum of that with maxv (this would leave just one maxpd in the dependency chain, and you would clearly be limited by throughput). Benchmarking this on Godbolt produced too much noise to decide which one is better in your case.
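In intrinsics, the two variants could look roughly like this (using the names from the snippet above; untested):

        // Variant A: carry the full rolling max forward -- the preceding maxpd
        // and this vpermpd are both on the loop-carried dependency chain
        maxv = _mm256_permute4x64_pd(max_local, 0b11111111);  // broadcast max_local[3]

        // Variant B: broadcast chunk[3] (independent of maxv), so only one maxpd
        // stays on the loop-carried chain; the lows are still compared against max_local
        maxv = _mm256_max_pd(maxv, _mm256_permute4x64_pd(chunk, 0b11111111));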

Furthermore, you could consider working on larger chunks (i.e., load two consecutive __m256ds, compute a local rolling maximum for them, etc.).
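For example, an 8-elements-per-iteration step might look roughly like this (untested sketch; c0/c1/max0/max1 are just placeholder names):

        __m256d c0 = _mm256_loadu_pd(highs + since);
        __m256d c1 = _mm256_loadu_pd(highs + since + 4);

        // local prefix max within each 4-element chunk
        c0 = _mm256_max_pd(c0, _mm256_movedup_pd(c0));
        c0 = _mm256_max_pd(c0, _mm256_permute4x64_pd(c0, 0b01010100));
        c1 = _mm256_max_pd(c1, _mm256_movedup_pd(c1));
        c1 = _mm256_max_pd(c1, _mm256_permute4x64_pd(c1, 0b01010100));

        // combine with the running max: rolling max for elements 0..3, then 4..7
        __m256d max0 = _mm256_max_pd(maxv, c0);
        __m256d max1 = _mm256_max_pd(_mm256_permute4x64_pd(max0, 0b11111111), c1);

        // then divide/compare lows+since against max0 and lows+since+4 against max1,
        // and carry a broadcast of max1[3] into the next iteration as maxv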

chtz
  • Brilliant! Replacing the "horizontal" rolling max with "local" rolling max + comparison has made the vectorized version 2x faster. Thank you very much :) Although, could you please elaborate on the second approach to computing the maxv for the next loop? I'm not sure I follow why should chunk[3] be broadcasted and then compared again if maxv[3] should already be the "global" rolling maximum if I'm getting it right. Or am I getting it wrong? – Daniel Feb 07 '21 at 18:11
  • Computing `_mm256_set1_pd(max_local[3])` requires one `maxpd` and one `vpermpd` instruction to compute the next `maxv` from the previous one, i.e., each iteration will take at least the combined latency of those instructions. If the remaining operations take at least that long, you are fine; otherwise the next iteration will have to wait until the previous `vpermpd` has finished. – chtz Feb 07 '21 at 18:19
  • Compiling without optimization (without -O3) makes the vectorized version ~1.5x slower. I suspect this is related to the way the compiler handles those _mm* functions, but I'm not sure. Could you please recommend some reading on vectorization and how to make it work properly? It seems that just googling things doesn't help :( – Daniel Feb 07 '21 at 19:07
  • @Daniel: Benchmarking without optimization is pointless and meaningless in general ([Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?](https://stackoverflow.com/q/53366394)), but yes it's a well-known fact that intrinsics slow down even more than usual. They're often implemented as inline functions, not just macros for __builtin functions, so their return-value objects sometimes get even more extra store/reload and copying than "normal" code. Also we tend to write one intrinsic per line instead of a larger expression, making -O0 worse. – Peter Cordes Feb 07 '21 at 21:16
  • @Daniel: If you want to compare against scalar code, compile the scalar source with `-O3 -march=native -fno-tree-vectorize`. – Peter Cordes Feb 07 '21 at 21:17
  • @Daniel: I posted an answer with one part unfinished, but if that case doesn't come up in testing (or often in general) then speed looks good: like 3x faster than scalar (with scalar optimized to use multiply when there's a new max). – Peter Cordes Feb 07 '21 at 23:31

Especially in the scalar version, I'd calculate max * trail once when finding a new max, then the other condition is just lows[since] <= threshold, not involving any computation.
This is even better than just replacing the division with a multiplication by a precomputed reciprocal (as suggested in the comments). Division has higher latency, and can be a bottleneck if you exceed the divider unit's limited throughput (which is worst for 256-bit double vectors).

OTOH, GCC is optimizing your current if to a maxsd, so new-max is branchless. Another alternative is if (lows[since] * (1./trail) <= max) to trick the compiler into using maxsd and mulsd every iteration, instead of conditionally skipping the multiply in if (lows[since] <= max * trail). But for your test microbenchmark on Godbolt, using a threshold seems best even though that means more branching. (But less work per iteration in the common case, although GCC isn't unrolling and it's a small test whose data might be too "easy".)
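A rough sketch of that scalar threshold version (the function name is made up; same interface and trail convention as trail_max_iter above):

size_t trail_max_thresh(double *highs, double *lows, double trail, size_t since, size_t till)
{
    trail = 1 - trail;                      // same convention as trail_max_iter
    double max = highs[since];
    double thresh = max * trail;            // recomputed only when a new max appears

    for (; since < till; ++since) {
        if (highs[since] > max) {
            max = highs[since];
            thresh = max * trail;
        }
        if (lows[since] <= thresh) {        // no division or multiplication in the common case
            break;
        }
    }

    return since;
}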


In branchless SIMD that's not helpful (see note 1) because you'd have to do all the work every time anyway, but if a new max is rare-ish (e.g. highs[] large and uniformly distributed) it could be worth branching on that, in order to do much less shuffling and maxing work in the common case. Especially if you're tuning to run on a CPU with hyperthreading / SMT; the other logical core's uops can keep the core busy while it recovers from a branch mispredict. (If new maxes are very common, though, like if highs is on average increasing, not uniform, this will be bad.)

note 1: assuming you're already following @chtz's answer, which shows how to keep most of that work off the critical path (loop carried dependency chain) latency bottleneck.


The common / fast case will be blazing through vectors just checking two conditions (highs > max or lows <= thresh). If either of those things is true, you run the "expensive" code to do the correct scan order within a vector, and then check thresholds. You do still need to correctly handle the case where highs[1] is a new max and lows[1..3] are not below that new threshold, even if they were below the old max. And also the case where lows[0] is below the old threshold, but not below the new threshold. And of course multiple new maxes in one vector are possible.

        __m256d maxv = _mm256_set1_pd(max);
        __m256d thresh = _mm256_mul_pd(maxv, _mm256_set1_pd(trail));
        for() {
            __m256d vhi = _mm256_loadu_pd(highs + since);
            __m256d vlo = _mm256_loadu_pd(lows + since);
    
            __m256d vtrough = _mm256_cmp_pd(vlo, thresh, _CMP_LE_OQ);  // possible candidate
            __m256d newmax = _mm256_cmp_pd(vhi, maxv, _CMP_GT_OQ);     // definite new max

            // __m256d special_case = _mm256_or_pd(vtrough, newmax);
            // The special case needs trough separately, and it's only slightly more expensive to movmskpd twice.
            // AVX512 could use kortest between two compare results.

            unsigned trough = _mm256_movemask_pd(vtrough);
            // add/jnz can macro-fuse on Intel, unlike or/jnz, so use that to check for either one being non-zero.  AMD Zen doesn't care.
            if (trough + _mm256_movemask_pd(newmax)) {
                if (trough) {
                    // ... full work to verify or reject the candidate, hopefully rare.
                    // Perhaps extract this to a helper function
                    int found = ...;
                    // This becomes trivial (found = trough) in the no-new-max case, but that may be rare enough not to be worth special-casing here.
                    if (found) {
                        return since + __builtin_ctz(found);
                    }
                    maxv = _mm256_permute4x64_pd(maxv, _MM_SHUFFLE(3,3,3,3));
                } else {
                    // just a new max, no possible trough even with the lowest threshold
                    // horizontal max-broadcast of the current chunk, replacing old maxv (no data dependency on it)
                    maxv = _mm256_max_pd(vhi, _mm256_shuffle_pd(vhi, vhi, 0b0101));   // in-lane swap of pairs so both elements of each 128-bit lane hold that lane's max
                    maxv = _mm256_max_pd(maxv, _mm256_permute4x64_pd(maxv, _MM_SHUFFLE(1,0,3,2)));    // swap low and high halves.  vperm2f128 would also work, but slower on Zen1.
                }

                thresh = _mm256_mul_pd(maxv, _mm256_set1_pd(trail));
            }
        }

Probably not worth further special-casing the no-new-max case to just use the current maxv and thus the current trough you already computed, unless you expect finding a trough to be at least as common as finding a new max. (If this were a single array that changed gradually (not separate high and low arrays), that would be common: it would be rare for a single vector to include both a new max and a trough.)

maxv = _mm256_set1_pd(maxv[3]); probably compiles efficiently, but some compilers are dumber than others when inventing shuffles to implement set1. Also, vector[] is specific to how GNU C implementations (gcc/clang) define __m256d, and isn't portable.
Use maxv = _mm256_permute4x64_pd(maxv, _MM_SHUFFLE(3,3,3,3)); to broadcast the high element with a vpermpd ymm, ymm, imm8 instruction.

Just to see if it compiled, and to get a rough idea of speed, I put it on Godbolt (with int found = trough; which mis-handles the case where there's a new max and a possible trough candidate).
On Godbolt's Skylake-server CPUs, benchmark results are obviously noisy from load on AWS, but the vectorized version varied from 33 ms to 55 ms for 100 reps, vs. ~95 ms for the scalar version. Before the scalar version was optimized to use lows[since] <= thresh (updating thresh = max*trail; only in the new-max case), its times bounced around from ~120 to ~127 ms. (Probably CPU frequency and other warm-up effects make the first one tested much more variable for such a short timed interval.)

I used -march=haswell instead of -mavx2 because GCC's default tune settings are poor for modern CPUs with 256-bit possibly-unaligned loads (good for Sandybridge, not Haswell or Zen 2): Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?.

The actual asm for the loop is efficient like I expected:

# gcc -O3 -march=haswell, inner loop
        vmovupd ymm0, YMMWORD PTR [rdi+rdx*8]
        vmovupd ymm7, YMMWORD PTR [rsi+rdx*8]
        vcmppd  ymm3, ymm0, ymm2, 30
        vcmppd  ymm1, ymm7, ymm4, 18
        vmovmskpd       ecx, ymm3
        vmovmskpd       r8d, ymm1
        add     ecx, r8d                # destroys movemask(newmax) and sets FLAGS
        je      .L16

        test    r8d, r8d
        jne     .L36                    # jump if trough, else fall through to new max.  In this hacky version, finding non-zero trough always exits the loop because it ignores the possibility of also being a new max.  The asm will *probably* look similar once handled properly, though.

        vshufpd ymm1, ymm0, ymm0, 1
        vmaxpd  ymm0, ymm0, ymm1
        vpermpd ymm2, ymm0, 78
        vmaxpd  ymm2, ymm0, ymm2       # new max broadcast into YMM2
        vmulpd  ymm4, ymm2, ymm6       # update thresh from max

.L16:
        add     rdx, 4
        cmp     r9, rdx
        ja      .L19

Note that the common case (just comparing) has no loop-carried data dependency (except the pointer increment).

The new-max case also has no data dependency on the old max, only a control dependency (which branch prediction + speculative execution can speculate through, if we're seeing a predictable run of vectors with a new max but not a trough candidate.)

So the complex case code that handles a trough candidate doesn't have to worry too much about the length of the data dependencies, although the sooner a new max is ready the sooner out-of-order exec can start checking branch conditions on later iterations.


Also note that you can safely do the last 0..2 iteration with SIMD if you want, because it's safe to redo elements you've already looked at. max is monotonically increasing (and thus so is threshold) so you'll never accept a lows[] that you shouldn't have. So you can do one final vector that ends at the last element of your range, as long as the total size is >= 4 so you don't read outside of the array.

In this case it may not be worth it (because each SIMD iteration is so much more expensive than scalar), except maybe just doing a quick check to see if there's a possible new max and/or low candidate before running scalar cleanup.
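A sketch of what that quick check could look like (reusing maxv and thresh from the loop above, and assuming at most 4 leftover elements and an overall length >= 4):

    // one overlapping load ending at till-1: if the tail contains neither a
    // trough candidate nor a new max, the remaining thresholds can't change,
    // so nothing in [since, till) can trigger and scalar cleanup can be skipped
    __m256d vhi = _mm256_loadu_pd(highs + till - 4);
    __m256d vlo = _mm256_loadu_pd(lows + till - 4);
    int any = _mm256_movemask_pd(_mm256_cmp_pd(vlo, thresh, _CMP_LE_OQ))
            | _mm256_movemask_pd(_mm256_cmp_pd(vhi, maxv, _CMP_GT_OQ));
    if (!any) {
        return till;        // not found
    }
    // otherwise fall through to the scalar loop over [since, till)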

Peter Cordes
  • Thank you for your deep and detailed answer. It took me quite a while to wrap my head around it and I'm still not sure that I understand 100% of it. Both your and @chtz answers are definitely something that requires a good understanding of the low-level platform details and thinking in a "vectorized" manner. Could you please recommend some reading that would help me gain a better understanding of vectorization? – Daniel Feb 08 '21 at 19:44
  • @Daniel: There's some info in https://stackoverflow.com/tags/sse/info, especially the slides + text from a talk: [SIMD at Insomniac Games (GDC 2015)](https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/). Other than that, look at some examples of hand-vectorized loops, and understand CPU architecture (latency vs. throughput http://www.lighterra.com/papers/modernmicroprocessors/), e.g. [this Q&A](https://stackoverflow.com/q/45113527) about unrolling a dot product with multiple accumulators to hide FP FMA latency. – Peter Cordes Feb 08 '21 at 20:15
  • there's a lot to dig in, thank you. – Daniel Feb 08 '21 at 20:42