20

There is _mm_div_ps for floating-point division, and there is _mm_mullo_epi16 for integer multiplication. But is there something for integer division (on 16-bit values)? How can I perform such a division?

fogbit
  • 1,851
  • 6
  • 26
  • 39
  • Not enough people needed it, so... – Marc Glisse May 29 '13 at 20:27
  • Nope it doesn't exist. It's probably a combination of not enough people needing it along with the difficulty (and die-space) of an integer division unit. – Mysticial May 29 '13 at 20:49
  • Do you want to divide by a constant or by a variable ? – Paul R May 29 '13 at 21:12
  • 2
    A quick way to check: compile `typedef short vec __attribute__((vector_size(16))); vec f(vec x){ return x/7; } vec g(vec x,vec y){ return x/y; }` with gcc and see how it emulates division by a constant but lowers the full division to scalars. If you can find better code, please help by filing a bug report with gcc. (clang should work the same, but I haven't checked) – Marc Glisse May 29 '13 at 21:56
  • @PaulR: I need to divide by a variable, int16 / int16 – fogbit May 30 '13 at 05:19
  • I added an edit that does this with SSE. However, now that I think about it, it might be more efficient just to store the shorts to an array, do the division without SSE, and then load them back to the register. –  May 31 '13 at 08:11
  • 5
    See also: [libdivide](http://libdivide.com/) – rwong Apr 17 '15 at 07:52

4 Answers

10

Please see Agner Fog's vectorclass. He has implemented a fast algorithm to do integer division with SSE/AVX for 8-bit, 16-bit, and 32-bit words (but not 64-bit): http://www.agner.org/optimize/#vectorclass

Look in the file vectori128.h for the code and a description of the algorithm, as well as in his well-written manual VectorClass.pdf.

Here is a fragment describing the algorithm from his manual:

"Integer division. There are no instructions in the x86 instruction set and its extensions that are useful for integer vector division, and such instructions would be quite slow if they existed. Therefore, the vector class library is using an algorithm for fast integer division. The basic principle of this algorithm can be expressed in this formula:

a / b ≈ (a * (2^n / b)) >> n

This calculation goes through the following steps:

1. find a suitable value for n
2. calculate 2^n / b
3. calculate necessary corrections for rounding errors
4. do the multiplication and shift-right and apply corrections for rounding errors

This formula is advantageous if multiple numbers are divided by the same divisor b. Steps 1, 2 and 3 need only be done once, while step 4 is repeated for each value of the dividend a. The mathematical details are described in the file vectori128.h. (See also T. Granlund and P. L. Montgomery: Division by Invariant Integers Using Multiplication, Proceedings of the SIGPLAN."...
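
To make the principle concrete, here is a minimal scalar sketch of the same idea for unsigned 16-bit values (an illustration of the formula only, not Agner Fog's actual implementation, which also handles signed values and vectorizes step 4):

#include <stdint.h>

/* Steps 1-3, done once per divisor: with n = 32 and
   m = floor(2^32 / b) + 1, the multiply-and-shift below is exact for
   every 16-bit dividend, so no extra correction is needed at this
   width. m must be 64-bit so that b == 1 (m = 2^32 + 1) still fits. */
typedef struct { uint64_t m; } udiv16_t;

static udiv16_t udiv16_prepare(uint16_t b) {
    udiv16_t d = { (((uint64_t)1 << 32) / b) + 1 };
    return d;
}

/* Step 4, repeated per dividend: one multiply and one shift. */
static uint16_t udiv16_do(uint16_t a, udiv16_t d) {
    return (uint16_t)(((uint64_t)a * d.m) >> 32);
}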

Edit: near the end of the file vectori128.h he shows how to do short division by a scalar variable: "It takes more time to compute the parameters used for fast division than to do the division. Therefore, it is advantageous to use the same divisor object multiple times. For example, to divide 80 unsigned short integers by 10:

short x = 10;
uint16_t dividends[80], quotients[80];         // numbers to work with
Divisor_us div10(x);                           // make divisor object for dividing by 10
Vec8us temp;                                   // temporary vector
for (int i = 0; i < 80; i += 8) {              // loop for 8 elements per iteration
    temp.load(dividends+i);                    // load 8 elements
    temp /= div10;                             // divide each element by 10
    temp.store(quotients+i);                   // store 8 elements
}

"

Edit: integer division by a vector of shorts

#include <stdio.h>
#include "vectorclass.h"

int main() {    
    short numa[] = {10, 20, 30, 40, 50, 60, 70, 80};
    short dena[] = {10, 20, 30, 40, 50, 60, 70, 80};

    Vec8s num = Vec8s().load(numa);
    Vec8s den = Vec8s().load(dena);

    Vec4f num_low = to_float(extend_low(num));
    Vec4f num_high = to_float(extend_high(num));
    Vec4f den_low = to_float(extend_low(den));
    Vec4f den_high = to_float(extend_high(den));

    Vec4f qf_low = num_low/den_low;
    Vec4f qf_high = num_high/den_high;
    Vec4i q_low = truncate_to_int(qf_low);
    Vec4i q_high = truncate_to_int(qf_high);

    Vec8s q = compress(q_low, q_high);
    for (int i = 0; i < 8; i++) {
        printf("%d ", q.extract(i));           // read one element from the vector
    }
    printf("\n");
}
  • 1
    This is only really useful for division by a constant - the OP has said that he wants to divide by a variable, in which case I think the only option is to unpack, convert to float and do the division in float. – Paul R May 30 '13 at 10:40
  • @PaulR, I added an edit in response to your comment. Towards the end of the file vectori128.h he gives an example of how to do divisions of shorts by variables. I have not gone through the details because integer division has not been a bottleneck in my code yet (the times I have had to use it I did it by shifting bits right, which is very fast). –  May 30 '13 at 14:07
  • @PaulR, so I looked into this more carefully. The fast method in the VectorClass works only for scalar divisors. For vector/vector integer division I think you're correct: the only solution is to convert to floats, do the division, and convert back to shorts. I added an edit doing that. I guess you could also save the shorts to an array, do scalar division, and then load them back as well. –  May 31 '13 at 08:05
  • 2
    Yes, that pretty much sums it up - you can do division by a (single) constant without resorting to floating point, but anything else requires conversion to floating point and back again. – Paul R May 31 '13 at 09:53
  • @PaulR, maybe we have a different definition of a constant. When I hear constant I think of a compile-time constant. I think by constant you mean a scalar value. His method works well for a variable scalar as well as a compile-time constant scalar. Dividing by a compile-time constant goes even faster. The distinction is described on page 13 of his manual if you care. –  May 31 '13 at 10:33
  • OK - I've never tried this method at run-time (only compile-time) but it seems that the set-up cost is relatively high unless you are going to process a significant number of elements with the same scalar divisor, so I guess it's still a scalar constant in the sense that ideally it doesn't vary within an inner loop. If you have a divisor that might vary throughout the inner loop iteration then the method becomes quite inefficient. – Paul R May 31 '13 at 10:42
  • 1
    @PaulR, I think we agree now. –  May 31 '13 at 10:49
10

Math says that it is indeed possible to go faster.

Agner Fog's method (http://www.agner.org/optimize/#vectorclass) works great if division is done with a single divisor. It has further benefits if the divisor is known at compile time, or if it doesn't change often at runtime.

However, when performing SIMD division on __m128i elements such that no information is available about either the divisor or the dividend at compile time, we have no option but to convert to float and perform the computation. On the other hand, using _mm_div_ps will not provide us with amazing speed improvements: this instruction has a variable latency of 11 to 14 cycles on most microarchitectures, and can go up to 38 cycles on Knights Landing. On top of that, the instruction is not fully pipelined and has a reciprocal throughput of 3-6 cycles, depending on the microarchitecture.
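
For reference, here is a hypothetical sketch of that baseline (the "plain division" path the benchmarks below compare against): sign-extend, convert, divide with _mm_div_ps, truncate, and re-pack:

#include <stdint.h>
#include <emmintrin.h>

static inline __m128i div_epi16_divps(__m128i a, __m128i b) {
    // Sign-extend the odd (high) and even (low) 16-bit lanes to 32 bits.
    __m128i a_hi = _mm_srai_epi32(a, 16);
    __m128i a_lo = _mm_srai_epi32(_mm_slli_epi32(a, 16), 16);
    __m128i b_hi = _mm_srai_epi32(b, 16);
    __m128i b_lo = _mm_srai_epi32(_mm_slli_epi32(b, 16), 16);
    // Divide in single precision and truncate back to integers.
    __m128i q_hi = _mm_cvttps_epi32(
        _mm_div_ps(_mm_cvtepi32_ps(a_hi), _mm_cvtepi32_ps(b_hi)));
    __m128i q_lo = _mm_cvttps_epi32(
        _mm_div_ps(_mm_cvtepi32_ps(a_lo), _mm_cvtepi32_ps(b_lo)));
    // Merge the quotients back into 16-bit lanes.
    return _mm_or_si128(_mm_slli_epi32(q_hi, 16),
                        _mm_and_si128(q_lo, _mm_set1_epi32(0xFFFF)));
}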

However, we can avoid _mm_div_ps and use _mm_rcp_ss instead.

Unfortunately, __m128 _mm_rcp_ss (__m128 a) is fast only because it provides an approximation. Namely (taken from the Intel Intrinsics Guide):

Compute the approximate reciprocal of the lower single-precision (32-bit) floating-point element in a, store the result in the lower element of dst, and copy the upper 3 packed elements from a to the upper elements of dst. The maximum relative error for this approximation is less than 1.5*2^-12.

Therefore, to benefit from _mm_rcp_ss, we need to compensate for the loss due to the approximation. There is great work done in this direction, available in Improved division by invariant integers by Niels Möller and Torbjörn Granlund:

Due to lack of efficient division instructions in current processors, the division is performed as a multiplication using a precomputed single-word approximation of the reciprocal of the divisor, followed by a couple of adjustment steps.

To calculate 16-bit signed integer division, we only need one adjustment step, and we base our solution accordingly.
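
Before moving to intrinsics, a hypothetical scalar rendition of the adjustment step may help (the vector code below performs these exact operations, lane by lane):

#include <stdio.h>
#include <stdint.h>
#include <math.h>
#include <xmmintrin.h>

/* x0 ~ 1/b from rcpss carries ~12 correct bits; one adjustment step
   x1 = x0 * (2 - b*x0) roughly squares the accuracy. The constant
   2.00000051757f (taken from the code below) is a slightly biased "2"
   that nudges x1 upward, so exact quotients survive truncation. */
static float refined_rcp(float b) {
    float x0 = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(b)));
    return x0 * (2.00000051757f - b * x0);
}

int main(void) {
    int16_t a = 32767, b = 7;
    int16_t q = (int16_t)truncf((float)a * refined_rcp((float)b));
    printf("%d / %d = %d (expected %d)\n", a, b, q, a / b);
    return 0;
}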

SSE2

static inline __m128i _mm_div_epi16(const __m128i &a_epi16, const __m128i &b_epi16) {
    //
    // Setup the constants.
    //
    const __m128  two     = _mm_set1_ps(2.00000051757f);
    const __m128i lo_mask = _mm_set1_epi32(0xFFFF);
    //
    // Convert to two 32-bit integers
    //
    const __m128i a_hi_epi32       = _mm_srai_epi32(a_epi16, 16);
    const __m128i a_lo_epi32_shift = _mm_slli_epi32(a_epi16, 16);
    const __m128i a_lo_epi32       = _mm_srai_epi32(a_lo_epi32_shift, 16);
    const __m128i b_hi_epi32       = _mm_srai_epi32(b_epi16, 16);
    const __m128i b_lo_epi32_shift = _mm_slli_epi32(b_epi16, 16);
    const __m128i b_lo_epi32       = _mm_srai_epi32(b_lo_epi32_shift, 16);
    //
    // Convert to 32-bit floats
    //
    const __m128 a_hi = _mm_cvtepi32_ps(a_hi_epi32);
    const __m128 a_lo = _mm_cvtepi32_ps(a_lo_epi32);
    const __m128 b_hi = _mm_cvtepi32_ps(b_hi_epi32);
    const __m128 b_lo = _mm_cvtepi32_ps(b_lo_epi32);
    //
    // Calculate the reciprocal
    //
    const __m128 b_hi_rcp = _mm_rcp_ps(b_hi);
    const __m128 b_lo_rcp = _mm_rcp_ps(b_lo);
    //
    // Calculate the inverse
    //
    #ifdef __FMA__
        const __m128 b_hi_inv_1 = _mm_fnmadd_ps(b_hi_rcp, b_hi, two);
        const __m128 b_lo_inv_1 = _mm_fnmadd_ps(b_lo_rcp, b_lo, two);
    #else
        const __m128 b_mul_hi   = _mm_mul_ps(b_hi_rcp, b_hi);
        const __m128 b_mul_lo   = _mm_mul_ps(b_lo_rcp, b_lo);
        const __m128 b_hi_inv_1 = _mm_sub_ps(two, b_mul_hi);
        const __m128 b_lo_inv_1 = _mm_sub_ps(two, b_mul_lo);
    #endif
    //
    // Compensate for the loss
    //
    const __m128 b_hi_rcp_1 = _mm_mul_ps(b_hi_rcp, b_hi_inv_1);
    const __m128 b_lo_rcp_1 = _mm_mul_ps(b_lo_rcp, b_lo_inv_1);
    //
    // Perform the division by multiplication
    //
    const __m128 hi = _mm_mul_ps(a_hi, b_hi_rcp_1);
    const __m128 lo = _mm_mul_ps(a_lo, b_lo_rcp_1);
    //
    // Convert back to integers
    //
    const __m128i hi_epi32 = _mm_cvttps_epi32(hi);
    const __m128i lo_epi32 = _mm_cvttps_epi32(lo);
    //
    // Zero-out the unnecessary parts
    //
    const __m128i hi_epi32_shift = _mm_slli_epi32(hi_epi32, 16);
    #ifdef __SSE4_1__
        //
        // Blend the bits, and return
        //
        return _mm_blend_epi16(lo_epi32, hi_epi32_shift, 0xAA);
    #else
        //
        // Blend the bits, and return
        //
        const __m128i lo_epi32_mask = _mm_and_si128(lo_epi32, lo_mask);
        return _mm_or_si128(hi_epi32_shift, lo_epi32_mask);
    #endif
}

This solution can work using SSE2 only and will make use of FMA if available. However, it could be that using plain division is as fast as (or even faster than) using the approximation.
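
For reference, a hypothetical call site for the function above:

#include <stdint.h>
#include <emmintrin.h>

// Divide eight int16 lanes element-wise: q[i] = a[i] / b[i].
void divide_eight(const int16_t a[8], const int16_t b[8], int16_t q[8]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)q, _mm_div_epi16(va, vb));
}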

In the presence of AVX this solution can be improved, as the high and low parts can be processed at the same time using one AVX register.

Validation

As we are dealing with 16 bits only, we can easily validate the correctness of the solution in a few seconds using brute-force testing:

#include <cstdio>
#include <cstdint>
#include <algorithm>
#include <iostream>
#include <immintrin.h>

using std::cout;
using std::endl;

void print_epi16(__m128i a)
{
    int i; int16_t tmp[8];
    _mm_storeu_si128( (__m128i*) tmp, a);

    for (i = 0; i < 8; i += 1) {
        printf("%8d ", (int) tmp[i]);
    }
    printf("\n");
}

bool run_mm_div_epi16(const int16_t *a, const int16_t *b)
{
    const size_t n = 8;
    int16_t result_expected[n];
    int16_t result_obtained[n];
    //
    // Derive the expected result
    //
    for (size_t i = 0; i < n; i += 1) {
        result_expected[i] = a[i] / b[i];
    }
    //
    // Now perform the computation
    //
    const __m128i va = _mm_loadu_si128((__m128i *) a);
    const __m128i vb = _mm_loadu_si128((__m128i *) b);
    const __m128i vr = _mm_div_epi16(va, vb);
    _mm_storeu_si128((__m128i *) result_obtained, vr);
    //
    // Check for array equality
    //
    bool eq = std::equal(result_obtained, result_obtained + n, result_expected);
    if (!eq) {
        cout << "Testing of _mm_div_epi16 failed" << endl << endl;
        cout << "a: ";
        print_epi16(va);
        cout << "b: ";
        print_epi16(vb);
        cout << endl;
        cout << "results_obtained: ";
        print_epi16(vr);
        cout << "results_expected: ";
        print_epi16(_mm_loadu_si128((__m128i *) result_expected));
        cout << endl;
    }
    return eq;
}

void test_mm_div_epi16()
{
    const int n = 8;
    bool correct = true;
    //
    // Brute-force testing
    //
    int16_t a[n];
    int16_t b[n];

    for (int32_t i = INT16_MIN; correct && i <= INT16_MAX; i += n) {
        for (int32_t j = 0; j < n; j += 1) {
            a[j] = (int16_t) (i + j);
        }
        for (int32_t j = INT16_MIN; correct && j < 0; j += 1) {
            const __m128i jv = _mm_set1_epi16((int16_t) j);
            _mm_storeu_si128((__m128i *) b, jv);
            correct = correct && run_mm_div_epi16(a, b);
        }
        for (int32_t j = 1; correct && j <= INT16_MAX; j += 1) {
            const __m128i jv = _mm_set1_epi16((int16_t) j);
            _mm_storeu_si128((__m128i *) b, jv);
            correct = correct && run_mm_div_epi16(a, b);
        }
    }
    if (correct) {
        cout << "Done!" << endl;
    } else {
        cout << "_mm_div_epi16 can not be validated" << endl;
    }
}

AVX2

Having the solution above, the AVX2 implementation is straightforward:

static inline __m256i _mm256_div_epi16(const __m256i &a_epi16, const __m256i &b_epi16) {
    //
    // Setup the constants.
    //
    const __m256 two = _mm256_set1_ps(2.00000051757f);
    //
    // Convert to two 32-bit integers
    //
    const __m256i a_hi_epi32       = _mm256_srai_epi32(a_epi16, 16);
    const __m256i a_lo_epi32_shift = _mm256_slli_epi32(a_epi16, 16);
    const __m256i a_lo_epi32       = _mm256_srai_epi32(a_lo_epi32_shift, 16);
    const __m256i b_hi_epi32       = _mm256_srai_epi32(b_epi16, 16);
    const __m256i b_lo_epi32_shift = _mm256_slli_epi32(b_epi16, 16);
    const __m256i b_lo_epi32       = _mm256_srai_epi32(b_lo_epi32_shift, 16);
    //
    // Convert to 32-bit floats
    //
    const __m256 a_hi = _mm256_cvtepi32_ps(a_hi_epi32);
    const __m256 a_lo = _mm256_cvtepi32_ps(a_lo_epi32);
    const __m256 b_hi = _mm256_cvtepi32_ps(b_hi_epi32);
    const __m256 b_lo = _mm256_cvtepi32_ps(b_lo_epi32);
    //
    // Calculate the reciprocal
    //
    const __m256 b_hi_rcp = _mm256_rcp_ps(b_hi);
    const __m256 b_lo_rcp = _mm256_rcp_ps(b_lo);
    //
    // Calculate the inverse
    //
    const __m256 b_hi_inv_1 = _mm256_fnmadd_ps(b_hi_rcp, b_hi, two);
    const __m256 b_lo_inv_1 = _mm256_fnmadd_ps(b_lo_rcp, b_lo, two);
    //
    // Compensate for the loss
    //
    const __m256 b_hi_rcp_1 = _mm256_mul_ps(b_hi_rcp, b_hi_inv_1);
    const __m256 b_lo_rcp_1 = _mm256_mul_ps(b_lo_rcp, b_lo_inv_1);
    //
    // Perform the division by multiplication
    //
    const __m256 hi = _mm256_mul_ps(a_hi, b_hi_rcp_1);
    const __m256 lo = _mm256_mul_ps(a_lo, b_lo_rcp_1);
    //
    // Convert back to integers
    //
    const __m256i hi_epi32 = _mm256_cvttps_epi32(hi);
    const __m256i lo_epi32 = _mm256_cvttps_epi32(lo);
    //
    // Blend the low and the high-parts
    //
    const __m256i hi_epi32_shift = _mm256_slli_epi32(hi_epi32, 16);
    return _mm256_blend_epi16(lo_epi32, hi_epi32_shift, 0xAA);
}

We can use the same method described above to perform validation of the code.

Performance

We can evaluate the performance using flops per cycle (F/C) as the metric. In this scenario, we want to see how many divisions we can perform per cycle. For that purpose we define two vectors a and b and perform point-wise division. Both a and b are populated with random data using xorshift32, initialized with uint32_t state = 3853970173;
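
For reference, the usual Marsaglia xorshift32 step looks as follows (an assumed variant; the exact shift constants used in the benchmark are not stated above):

#include <stdint.h>

// One xorshift32 step (13/17/5 shifts), seeded with
// state = 3853970173 as described above.
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}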

I use RDTSC to measure the cycles, performing 15 repetitions with a warm cache and using the median as the result. To avoid the effects of frequency scaling and resource sharing on the measurements, Turbo Boost and Hyper-Threading are disabled. The test machine is an Intel Xeon CPU E3-1285L v3 3.10GHz (Haswell) with 32GB of RAM and 25.6 GB/s bandwidth to main memory, running Debian GNU/Linux 8 (jessie), kernel 3.16.43-2+deb8u3. The gcc version is 4.9.2-10. Results are given below:
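
A minimal sketch of such a cycle measurement, assuming __rdtsc from <x86intrin.h> and omitting the serialization (cpuid/rdtscp) a careful harness would add:

#include <stdint.h>
#include <x86intrin.h>

// Time one invocation of a kernel in reference cycles; the harness
// repeats this 15 times with a warm cache and takes the median.
static uint64_t measure_cycles(void (*kernel)(void)) {
    uint64_t start = __rdtsc();
    kernel();
    return __rdtsc() - start;
}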

Pure SSE2 Implementation

We compare plain division against the algorithm proposed above:

===============================================================
= Compiler & System info
===============================================================
Current CPU          : Intel(R) Xeon(R) CPU E3-1285L v3 @ 3.10GHz
CXX Compiler ID      : GNU
CXX Compiler Path    : /usr/bin/c++
CXX Compiler Version : 4.9.2
CXX Compiler Flags   : -O3 -std=c++11 -msse2 -mno-fma

--------------------------------------------------------------------------------
|   Size    | Division F/C |  Division B/W   | Approx. F/C | Approximation B/W |
--------------------------------------------------------------------------------
|       128 |  0.5714286   |   26911.45 MB/s |  0.5019608  |   23634.21 MB/s   |
|       256 |  0.5714286   |   26909.17 MB/s |  0.5039370  |   23745.44 MB/s   |
|       512 |  0.5707915   |   26928.14 MB/s |  0.5039370  |   23763.79 MB/s   |
|      1024 |  0.5707915   |   26936.33 MB/s |  0.5039370  |   23776.85 MB/s   |
|      2048 |  0.5709507   |   26938.51 MB/s |  0.5039370  |   23780.25 MB/s   |
|      4096 |  0.5708711   |   26940.56 MB/s |  0.5039990  |   23782.65 MB/s   |
|      8192 |  0.5708711   |   26940.16 MB/s |  0.5039370  |   23781.85 MB/s   |
|     16384 |  0.5704735   |   26921.76 MB/s |  0.4954040  |   23379.24 MB/s   |
|     32768 |  0.5704537   |   26921.26 MB/s |  0.4954639  |   23382.13 MB/s   |
|     65536 |  0.5703147   |   26914.53 MB/s |  0.4943539  |   23330.13 MB/s   |
|    131072 |  0.5691680   |   26860.21 MB/s |  0.4929539  |   23264.40 MB/s   |
|    262144 |  0.5690618   |   26855.60 MB/s |  0.4929187  |   23262.22 MB/s   |
|    524288 |  0.5691378   |   26858.75 MB/s |  0.4929488  |   23263.56 MB/s   |
|   1048576 |  0.5677474   |   26794.14 MB/s |  0.4918968  |   23214.34 MB/s   |
|   2097152 |  0.5371243   |   25348.39 MB/s |  0.4700511  |   22183.07 MB/s   |
|   4194304 |  0.5128146   |   24200.82 MB/s |  0.4529809  |   21377.28 MB/s   |
|   8388608 |  0.5036971   |   23770.36 MB/s |  0.4438345  |   20945.84 MB/s   |
|  16777216 |  0.5005390   |   23621.14 MB/s |  0.4409909  |   20811.32 MB/s   |
|  33554432 |  0.4992792   |   23561.90 MB/s |  0.4399777  |   20763.49 MB/s   |
--------------------------------------------------------------------------------

We can observe that plain division is slightly faster than the proposed approximation. In this scenario, we can conclude that using the SSE2 approximation is suboptimal on the Haswell microarchitecture.

However, if we run the same tests on an older machine, such as an Intel(R) Xeon(R) CPU X5680 @ 3.33GHz (Westmere), we can already see the benefit of the approximation:

===============================================================
= Compiler & System info
===============================================================
Current CPU          : Intel(R) Xeon(R) CPU X5680  @ 3.33GHz
CXX Compiler ID      : GNU
CXX Compiler Path    : /usr/bin/c++
CXX Compiler Version : 4.8.5
CXX Compiler Flags   : -O3 -std=c++11 -msse2 -mno-fma

--------------------------------------------------------------------------------
|   Size    | Division F/C |  Division B/W   | Approx. F/C | Approximation B/W |
--------------------------------------------------------------------------------
|       128 |  0.2857143   |   14511.41 MB/s |  0.3720930  |   18899.89 MB/s   |
|       256 |  0.2853958   |   14512.51 MB/s |  0.3715530  |   18898.91 MB/s   |
|       512 |  0.2853958   |   14510.53 MB/s |  0.3715530  |   18896.44 MB/s   |
|      1024 |  0.2853162   |   14511.81 MB/s |  0.3700759  |   18824.00 MB/s   |
|      2048 |  0.2853162   |   14511.04 MB/s |  0.3708130  |   18860.31 MB/s   |
|      4096 |  0.2852964   |   14511.16 MB/s |  0.3711826  |   18879.27 MB/s   |
|      8192 |  0.2852666   |   14510.23 MB/s |  0.3713172  |   18886.39 MB/s   |
|     16384 |  0.2852616   |   14509.86 MB/s |  0.3712920  |   18885.60 MB/s   |
|     32768 |  0.2852244   |   14507.93 MB/s |  0.3712709  |   18884.86 MB/s   |
|     65536 |  0.2851003   |   14501.41 MB/s |  0.3701114  |   18826.14 MB/s   |
|    131072 |  0.2850711   |   14499.95 MB/s |  0.3685017  |   18743.58 MB/s   |
|    262144 |  0.2850745   |   14500.47 MB/s |  0.3684799  |   18742.78 MB/s   |
|    524288 |  0.2848062   |   14486.66 MB/s |  0.3681040  |   18723.63 MB/s   |
|   1048576 |  0.2846679   |   14479.64 MB/s |  0.3671284  |   18674.02 MB/s   |
|   2097152 |  0.2840133   |   14446.52 MB/s |  0.3664623  |   18640.01 MB/s   |
|   4194304 |  0.2745241   |   13963.13 MB/s |  0.3488823  |   17745.24 MB/s   |
|   8388608 |  0.2741900   |   13946.39 MB/s |  0.3476036  |   17680.37 MB/s   |
|  16777216 |  0.2740689   |   13940.32 MB/s |  0.3477076  |   17685.97 MB/s   |
|  33554432 |  0.2746752   |   13970.75 MB/s |  0.3482017  |   17711.36 MB/s   |
--------------------------------------------------------------------------------

It would be even nicer to see how this behaves on even older machines, say Nehalem (assuming it has RCP support).

SSE4.1 + FMA Implementation

We compare the plain division against the algorithm proposed above, enabling FMA and SSE4.1:

===============================================================
= Compiler & System info
===============================================================
Current CPU          : Intel(R) Xeon(R) CPU E3-1285L v3 @ 3.10GHz
CXX Compiler ID      : GNU
CXX Compiler Path    : /usr/bin/c++
CXX Compiler Version : 4.9.2
CXX Compiler Flags   : -O3 -std=c++11 -msse4.1 -mfma

--------------------------------------------------------------------------------
|   Size    | Division F/C |  Division B/W   | Approx. F/C | Approximation B/W |
--------------------------------------------------------------------------------
|       128 |  0.5714286   |   26884.20 MB/s |  0.5423729  |   25506.41 MB/s   |
|       256 |  0.5701559   |   26879.92 MB/s |  0.5412262  |   25503.95 MB/s   |
|       512 |  0.5701559   |   26904.68 MB/s |  0.5423729  |   25584.65 MB/s   |
|      1024 |  0.5704735   |   26911.46 MB/s |  0.5429480  |   25622.57 MB/s   |
|      2048 |  0.5704735   |   26915.03 MB/s |  0.5433802  |   25640.09 MB/s   |
|      4096 |  0.5703941   |   26917.72 MB/s |  0.5435965  |   25651.63 MB/s   |
|      8192 |  0.5703544   |   26915.85 MB/s |  0.5436687  |   25656.76 MB/s   |
|     16384 |  0.5699972   |   26898.44 MB/s |  0.5262583  |   24834.54 MB/s   |
|     32768 |  0.5699873   |   26898.93 MB/s |  0.5262076  |   24833.21 MB/s   |
|     65536 |  0.5698882   |   26894.48 MB/s |  0.5250567  |   24778.35 MB/s   |
|    131072 |  0.5697024   |   26885.50 MB/s |  0.5224302  |   24654.59 MB/s   |
|    262144 |  0.5696950   |   26884.72 MB/s |  0.5223095  |   24649.49 MB/s   |
|    524288 |  0.5696937   |   26885.37 MB/s |  0.5223308  |   24650.21 MB/s   |
|   1048576 |  0.5690340   |   26854.14 MB/s |  0.5220133  |   24634.71 MB/s   |
|   2097152 |  0.5455717   |   25746.56 MB/s |  0.5041949  |   23794.65 MB/s   |
|   4194304 |  0.5125461   |   24188.11 MB/s |  0.4756604  |   22447.05 MB/s   |
|   8388608 |  0.5043430   |   23800.67 MB/s |  0.4659974  |   21991.51 MB/s   |
|  16777216 |  0.5017375   |   23677.94 MB/s |  0.4614457  |   21776.58 MB/s   |
|  33554432 |  0.5005865   |   23623.50 MB/s |  0.4596277  |   21690.63 MB/s   |
--------------------------------------------------------------------------------

FMA + SSE4.1 does give us some improvement, but it is not good enough.

AVX2 + FMA Implementation

Finally, we can see a real benefit comparing AVX2 plain division against the approximation method:

===============================================================
= Compiler & System info
===============================================================
Current CPU          : Intel(R) Xeon(R) CPU E3-1285L v3 @ 3.10GHz
CXX Compiler ID      : GNU
CXX Compiler Path    : /usr/bin/c++
CXX Compiler Version : 4.9.2
CXX Compiler Flags   : -O3 -std=c++11 -march=haswell

--------------------------------------------------------------------------------
|   Size    | Division F/C |  Division B/W   | Approx. F/C | Approximation B/W |
--------------------------------------------------------------------------------
|       128 |  0.5663717   |   26672.73 MB/s |  0.9481481  |   44627.89 MB/s   |
|       256 |  0.5651214   |   26653.72 MB/s |  0.9481481  |   44651.56 MB/s   |
|       512 |  0.5644983   |   26640.36 MB/s |  0.9463956  |   44660.99 MB/s   |
|      1024 |  0.5657459   |   26689.41 MB/s |  0.9552239  |   45044.21 MB/s   |
|      2048 |  0.5662151   |   26715.40 MB/s |  0.9624060  |   45405.33 MB/s   |
|      4096 |  0.5663717   |   26726.27 MB/s |  0.9671783  |   45633.64 MB/s   |
|      8192 |  0.5664500   |   26732.42 MB/s |  0.9688941  |   45724.83 MB/s   |
|     16384 |  0.5699377   |   26896.04 MB/s |  0.9092624  |   42909.11 MB/s   |
|     32768 |  0.5699675   |   26897.85 MB/s |  0.9087077  |   42883.21 MB/s   |
|     65536 |  0.5699625   |   26898.59 MB/s |  0.9001456  |   42480.91 MB/s   |
|    131072 |  0.5699253   |   26896.38 MB/s |  0.8926057  |   42124.09 MB/s   |
|    262144 |  0.5699117   |   26895.58 MB/s |  0.8928610  |   42137.13 MB/s   |
|    524288 |  0.5698622   |   26892.87 MB/s |  0.8928002  |   42133.63 MB/s   |
|   1048576 |  0.5685829   |   26833.13 MB/s |  0.8894302  |   41974.25 MB/s   |
|   2097152 |  0.5558453   |   26231.90 MB/s |  0.8371921  |   39508.55 MB/s   |
|   4194304 |  0.5224387   |   24654.67 MB/s |  0.7436747  |   35094.81 MB/s   |
|   8388608 |  0.5143588   |   24273.46 MB/s |  0.7185252  |   33909.08 MB/s   |
|  16777216 |  0.5107452   |   24103.19 MB/s |  0.7133449  |   33664.28 MB/s   |
|  33554432 |  0.5101245   |   24074.10 MB/s |  0.7125114  |   33625.03 MB/s   |
--------------------------------------------------------------------------------

Conclusion

This method can definitely provide a speed-up over plain division. How much speed-up can actually be attained really depends on the underlying architecture, as well as on how the division interacts with the rest of the application logic.

Alen Stojanov
  • 960
  • 8
  • 14
  • If you're checking for FMA support, you should probably also check for SSE4.1 and use `pmovzx` to sign-extend, at least for the low half of the 128-bit vector. BTW, SSE shift counts saturate, so I think you can `srai_epi16(a_epi16, 16)` and then `punpckl/h` with that, so sign-extend to 32-bit integers, and set up for re-packing with `packs/usdw`. (If you want truncation instead of saturation, mask first. shift/and/OR or shift / `pblendw` is more efficient than and/and/pack, but unpacking with shuffles might still be a win.) – Peter Cordes Jul 21 '18 at 22:04
  • Agreed. I tried to do a pure `SSE2` version and the `FMA` check was a left-over from a previous benchmark. Obviously, going upwards in regards to ISAs, one can do a better job than the initial code. I should probably edit the code and introduce `AVX2` version with benchmark, to have a complete answer. – Alen Stojanov Jul 22 '18 at 01:26
  • Forgot to say earlier: if division is part of a larger algorithm, [`divps` is often better than rcpps + Newton iteration for overall throughput on modern CPUs](https://stackoverflow.com/questions/4125033/floating-point-division-vs-floating-point-multiplication/45899202#45899202). (Other than KNL, where you're intended to use AVX512ER.) If you don't bottleneck on the not-fully-pipelined divider, 128-bit `divps` is just a single uop, vs. several. rcpps+Newton has similar latency to divps. OoO exec hides latency anyway if it's not loop-carried. – Peter Cordes Jul 22 '18 at 06:54
  • @PeterCordes - thanks for the suggestions. I have updated my answer adding performance profiles comparing the plain division against the approximation step. – Alen Stojanov Jul 23 '18 at 01:03
3

For 8-bit division, it can be implemented by creating a magic-number table.

See "Hacker's Delight", page 238

signed:

__m128i _mm_div_epi8(__m128i a, __m128i b)
{
    __m128i abs_b = _mm_abs_epi8(b);

    static const uint16_t magic_number_table[129] =
    {
        0x0000, 0x0000, 0x8080, 0x5580, 0x4040, 0x3380, 0x2ac0, 0x24c0, 0x2020, 0x1c80, 0x19c0, 0x1760, 0x1560, 0x13c0, 0x1260, 0x1120,
        0x1010, 0x0f20, 0x0e40, 0x0d80, 0x0ce0, 0x0c40, 0x0bb0, 0x0b30, 0x0ab0, 0x0a40, 0x09e0, 0x0980, 0x0930, 0x08e0, 0x0890, 0x0850,
        0x0808, 0x07d0, 0x0790, 0x0758, 0x0720, 0x06f0, 0x06c0, 0x0698, 0x0670, 0x0640, 0x0620, 0x05f8, 0x05d8, 0x05b8, 0x0598, 0x0578,
        0x0558, 0x0540, 0x0520, 0x0508, 0x04f0, 0x04d8, 0x04c0, 0x04b0, 0x0498, 0x0480, 0x0470, 0x0458, 0x0448, 0x0438, 0x0428, 0x0418,
        0x0404, 0x03f8, 0x03e8, 0x03d8, 0x03c8, 0x03b8, 0x03ac, 0x03a0, 0x0390, 0x0388, 0x0378, 0x0370, 0x0360, 0x0358, 0x034c, 0x0340,
        0x0338, 0x032c, 0x0320, 0x0318, 0x0310, 0x0308, 0x02fc, 0x02f4, 0x02ec, 0x02e4, 0x02dc, 0x02d4, 0x02cc, 0x02c4, 0x02bc, 0x02b4,
        0x02ac, 0x02a8, 0x02a0, 0x0298, 0x0290, 0x028c, 0x0284, 0x0280, 0x0278, 0x0274, 0x026c, 0x0268, 0x0260, 0x025c, 0x0258, 0x0250,
        0x024c, 0x0248, 0x0240, 0x023c, 0x0238, 0x0234, 0x022c, 0x0228, 0x0224, 0x0220, 0x021c, 0x0218, 0x0214, 0x0210, 0x020c, 0x0208,
        0x0202
    };

    uint8_t load_den[16];
    _mm_storeu_si128((__m128i*)load_den, abs_b);

    uint16_t mul[16];

    for (size_t i = 0; i < 16; i++)
    {
        uint16_t cur_den = load_den[i];
        mul[i] = magic_number_table[cur_den];
    }
    // For denominator 1 the magic number would be 0x10080, which overflows 16 bits, so it is special-cased below.
    __m128i one = _mm_set1_epi8(1);
    __m128i is_one = _mm_cmpeq_epi8(abs_b, one);

    // -128/-128 is a special case where magic number does not work.
    __m128i v80 = _mm_set1_epi8(0x80);
    __m128i is_80_a = _mm_cmpeq_epi8(a, v80);
    __m128i is_80_b = _mm_cmpeq_epi8(b, v80);
    __m128i is_80 = _mm_and_si128(is_80_a, is_80_b);

    // __m128i zero = _mm_setzero_si128();
    // __m128i less_a = _mm_cmpgt_epi8(zero, a);
    // __m128i less_b = _mm_cmpgt_epi8(zero, b);
    // __m128i  sign = _mm_xor_si128(less_a, less_b);
    __m128i abs_a = _mm_abs_epi8(a);
#if 0
    __m128i p = _mm_unpacklo_epi8(abs_a, zero);
    __m128i q = _mm_unpackhi_epi8(abs_a, zero);
    __m256i c = _mm256_castsi128_si256(p);
    c = _mm256_insertf128_si256(c, q, 1);
#else
    // Thanks to Peter Cordes
    __m256i c = _mm256_cvtepu8_epi16(abs_a);
#endif
    __m256i magic = _mm256_loadu_si256((const __m256i*)mul);
    __m256i high = _mm256_mulhi_epu16(magic, c);
    __m128i v0l = _mm256_extractf128_si256(high, 0);   // low 8 words
    __m128i v0h = _mm256_extractf128_si256(high, 1);   // high 8 words
    __m128i res = _mm_packus_epi16(v0l, v0h);
    __m128i div = _mm_blendv_epi8(res, abs_a, is_one);
    // __m128i neg = _mm_sub_epi8(zero, div);
    // __m128i select = _mm_blendv_epi8(div, neg, sign);
    __m128i select = _mm_sign_epi8(div, _mm_or_si128(_mm_xor_si128(a, b), one));
    return _mm_blendv_epi8(select, one, is_80);
}

unsigned:

__m128i _mm_div_epu8(__m128i n, __m128i den)
{
    static const uint16_t magic_number_table[256] =
    {
        0x0001, 0x0000, 0x8000, 0x5580, 0x4000, 0x3340, 0x2ac0, 0x04a0, 0x2000, 0x1c80, 0x19a0, 0x0750, 0x1560, 0x13c0, 0x0250, 0x1120,
        0x1000, 0x0f10, 0x0e40, 0x0d80, 0x0cd0, 0x0438, 0x03a8, 0x0328, 0x0ab0, 0x0a40, 0x09e0, 0x0980, 0x0128, 0x00d8, 0x0890, 0x0048,
        0x0800, 0x07c8, 0x0788, 0x0758, 0x0720, 0x06f0, 0x06c0, 0x0294, 0x0668, 0x0640, 0x021c, 0x05f8, 0x05d8, 0x01b4, 0x0194, 0x0578,
        0x0558, 0x013c, 0x0520, 0x0508, 0x04f0, 0x04d8, 0x04c0, 0x04a8, 0x0094, 0x0480, 0x006c, 0x0458, 0x0448, 0x0034, 0x0024, 0x0014,
        0x0400, 0x03f4, 0x03e4, 0x03d4, 0x03c8, 0x03b8, 0x03ac, 0x039c, 0x0390, 0x0384, 0x0378, 0x036c, 0x0360, 0x0354, 0x014a, 0x0340,
        0x0334, 0x032c, 0x0320, 0x0318, 0x010e, 0x0304, 0x02fc, 0x02f4, 0x02ec, 0x02e4, 0x02dc, 0x02d4, 0x02cc, 0x02c4, 0x02bc, 0x02b4,
        0x02ac, 0x02a4, 0x02a0, 0x0298, 0x0290, 0x028c, 0x0284, 0x007e, 0x0278, 0x0072, 0x026c, 0x0066, 0x0260, 0x025c, 0x0254, 0x0250,
        0x004a, 0x0244, 0x0240, 0x023c, 0x0036, 0x0032, 0x022c, 0x0228, 0x0224, 0x001e, 0x001a, 0x0016, 0x0012, 0x000e, 0x000a, 0x0006,
        0x0200, 0x00fd, 0x01fc, 0x01f8, 0x01f4, 0x01f0, 0x01ec, 0x01e8, 0x01e4, 0x01e0, 0x01dc, 0x01d8, 0x01d6, 0x01d4, 0x01d0, 0x01cc,
        0x01c8, 0x01c4, 0x01c2, 0x01c0, 0x01bc, 0x01b8, 0x01b6, 0x01b4, 0x01b0, 0x01ae, 0x01ac, 0x01a8, 0x01a6, 0x01a4, 0x01a0, 0x019e,
        0x019c, 0x0198, 0x0196, 0x0194, 0x0190, 0x018e, 0x018c, 0x018a, 0x0188, 0x0184, 0x0182, 0x0180, 0x017e, 0x017c, 0x017a, 0x0178,
        0x0176, 0x0174, 0x0172, 0x0170, 0x016e, 0x016c, 0x016a, 0x0168, 0x0166, 0x0164, 0x0162, 0x0160, 0x015e, 0x015c, 0x015a, 0x0158,
        0x0156, 0x0154, 0x0152, 0x0051, 0x0150, 0x014e, 0x014c, 0x014a, 0x0148, 0x0047, 0x0146, 0x0144, 0x0142, 0x0140, 0x003f, 0x013e,
        0x013c, 0x013a, 0x0039, 0x0138, 0x0136, 0x0134, 0x0033, 0x0132, 0x0130, 0x002f, 0x012e, 0x012c, 0x012a, 0x0029, 0x0128, 0x0126,
        0x0025, 0x0124, 0x0122, 0x0021, 0x0120, 0x001f, 0x011e, 0x011c, 0x001b, 0x011a, 0x0019, 0x0118, 0x0116, 0x0015, 0x0114, 0x0013,
        0x0112, 0x0110, 0x000f, 0x010e, 0x000d, 0x010c, 0x000b, 0x010a, 0x0009, 0x0108, 0x0007, 0x0106, 0x0005, 0x0104, 0x0003, 0x0102
    };

    static const uint16_t shift_table[256] =
    {
        0x0001, 0x0100, 0x0100, 0x0080, 0x0100, 0x0040, 0x0040, 0x0020, 0x0100, 0x0080, 0x0020, 0x0010, 0x0020, 0x0040, 0x0010, 0x0020,
        0x0100, 0x0010, 0x0040, 0x0080, 0x0010, 0x0008, 0x0008, 0x0008, 0x0010, 0x0040, 0x0020, 0x0080, 0x0008, 0x0008, 0x0010, 0x0008,
        0x0100, 0x0008, 0x0008, 0x0008, 0x0020, 0x0010, 0x0040, 0x0004, 0x0008, 0x0040, 0x0004, 0x0008, 0x0008, 0x0004, 0x0004, 0x0008,
        0x0008, 0x0004, 0x0020, 0x0008, 0x0010, 0x0008, 0x0040, 0x0008, 0x0004, 0x0080, 0x0004, 0x0008, 0x0008, 0x0004, 0x0004, 0x0004,
        0x0100, 0x0004, 0x0004, 0x0004, 0x0008, 0x0008, 0x0004, 0x0004, 0x0010, 0x0004, 0x0008, 0x0004, 0x0020, 0x0004, 0x0002, 0x0040,
        0x0004, 0x0004, 0x0020, 0x0008, 0x0002, 0x0004, 0x0004, 0x0004, 0x0004, 0x0004, 0x0004, 0x0004, 0x0004, 0x0004, 0x0004, 0x0004,
        0x0004, 0x0004, 0x0020, 0x0008, 0x0010, 0x0004, 0x0004, 0x0002, 0x0008, 0x0002, 0x0004, 0x0002, 0x0020, 0x0004, 0x0004, 0x0010,
        0x0002, 0x0004, 0x0040, 0x0004, 0x0002, 0x0002, 0x0004, 0x0008, 0x0004, 0x0002, 0x0002, 0x0002, 0x0002, 0x0002, 0x0002, 0x0002,
        0x0100, 0x0001, 0x0004, 0x0008, 0x0004, 0x0010, 0x0004, 0x0008, 0x0004, 0x0020, 0x0004, 0x0008, 0x0002, 0x0004, 0x0010, 0x0004,
        0x0008, 0x0004, 0x0002, 0x0040, 0x0004, 0x0008, 0x0002, 0x0004, 0x0010, 0x0002, 0x0004, 0x0008, 0x0002, 0x0004, 0x0020, 0x0002,
        0x0004, 0x0008, 0x0002, 0x0004, 0x0010, 0x0002, 0x0004, 0x0002, 0x0008, 0x0004, 0x0002, 0x0080, 0x0002, 0x0004, 0x0002, 0x0008,
        0x0002, 0x0004, 0x0002, 0x0010, 0x0002, 0x0004, 0x0002, 0x0008, 0x0002, 0x0004, 0x0002, 0x0020, 0x0002, 0x0004, 0x0002, 0x0008,
        0x0002, 0x0004, 0x0002, 0x0001, 0x0010, 0x0002, 0x0004, 0x0002, 0x0008, 0x0001, 0x0002, 0x0004, 0x0002, 0x0040, 0x0001, 0x0002,
        0x0004, 0x0002, 0x0001, 0x0008, 0x0002, 0x0004, 0x0001, 0x0002, 0x0010, 0x0001, 0x0002, 0x0004, 0x0002, 0x0001, 0x0008, 0x0002,
        0x0001, 0x0004, 0x0002, 0x0001, 0x0020, 0x0001, 0x0002, 0x0004, 0x0001, 0x0002, 0x0001, 0x0008, 0x0002, 0x0001, 0x0004, 0x0001,
        0x0002, 0x0010, 0x0001, 0x0002, 0x0001, 0x0004, 0x0001, 0x0002, 0x0001, 0x0008, 0x0001, 0x0002, 0x0001, 0x0004, 0x0001, 0x0002
    };

    static const uint16_t mask_table[256] =
    {
        0x0000, 0xffff, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0xffff, 0x0000, 0x0000, 0x0000, 0xffff, 0x0000, 0x0000, 0xffff, 0x0000,
        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0xffff, 0xffff, 0xffff, 0x0000, 0x0000, 0x0000, 0x0000, 0xffff, 0xffff, 0x0000, 0xffff,
        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0xffff, 0x0000, 0x0000, 0xffff, 0x0000, 0x0000, 0xffff, 0xffff, 0x0000,
        0x0000, 0xffff, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0xffff, 0x0000, 0xffff, 0x0000, 0x0000, 0xffff, 0xffff, 0xffff,
        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0xffff, 0x0000,
        0x0000, 0x0000, 0x0000, 0x0000, 0xffff, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0xffff, 0x0000, 0xffff, 0x0000, 0xffff, 0x0000, 0x0000, 0x0000, 0x0000,
        0xffff, 0x0000, 0x0000, 0x0000, 0xffff, 0xffff, 0x0000, 0x0000, 0x0000, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff,
        0x0000, 0xffff, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
        0x0000, 0x0000, 0x0000, 0xffff, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0xffff, 0x0000, 0x0000, 0x0000, 0x0000, 0xffff, 0x0000,
        0x0000, 0x0000, 0xffff, 0x0000, 0x0000, 0x0000, 0xffff, 0x0000, 0x0000, 0xffff, 0x0000, 0x0000, 0x0000, 0xffff, 0x0000, 0x0000,
        0xffff, 0x0000, 0x0000, 0xffff, 0x0000, 0xffff, 0x0000, 0x0000, 0xffff, 0x0000, 0xffff, 0x0000, 0x0000, 0xffff, 0x0000, 0xffff,
        0x0000, 0x0000, 0xffff, 0x0000, 0xffff, 0x0000, 0xffff, 0x0000, 0xffff, 0x0000, 0xffff, 0x0000, 0xffff, 0x0000, 0xffff, 0x0000
    };

    uint8_t load_den[16];
    _mm_storeu_si128((__m128i*)load_den, den);

    uint16_t mul[16];
    uint16_t mask[16];
    uint16_t shift[16];

    for (size_t i = 0; i < 16; i++)
    {
        const uint16_t cur_den = load_den[i];
        mul[i] = magic_number_table[cur_den];
        mask[i] = mask_table[cur_den];
        shift[i] = shift_table[cur_den];
    }
#if 0
    __m128i a = _mm_unpacklo_epi8(n, _mm_setzero_si128());
    __m128i b = _mm_unpackhi_epi8(n, _mm_setzero_si128());
    __m256i c = _mm256_castsi128_si256(a);
    c = _mm256_insertf128_si256(c, b, 1);
#else
    // Thanks to Peter Cordes
    __m256i c = _mm256_cvtepu8_epi16(n);
#endif
    __m256i magic = _mm256_loadu_si256((const __m256i*)mul);
    __m256i high = _mm256_mulhi_epu16(magic, c);
    __m256i low = _mm256_mullo_epi16(magic, c);
    __m256i low_down = _mm256_srli_epi16(low, 8);
    __m256i high_up = _mm256_slli_epi16(high, 8);
    __m256i low_high = _mm256_or_si256(low_down, high_up);
    __m256i target_up = _mm256_mullo_epi16(c, _mm256_loadu_si256((const __m256i*)shift));
    __m256i cal1 = _mm256_sub_epi16(target_up, low_high);
    __m256i cal2 = _mm256_srli_epi16(cal1, 1);
    __m256i cal3 = _mm256_add_epi16(cal2, low_high);
    __m256i cal4 = _mm256_srli_epi16(cal3, 7);
    __m256i res = _mm256_blendv_epi8(high, cal4, _mm256_loadu_si256((const __m256i*)mask));

    __m128i v0l = _mm256_extractf128_si256(res, 0);    // low 8 words
    __m128i v0h = _mm256_extractf128_si256(res, 1);    // high 8 words

    return _mm_packus_epi16(v0l, v0h);
}
sugwan kim
  • 31
  • 3
  • 2
    If you're going to use AVX2, you can and should use [VPMOVZXBW](http://felixcloutier.com/x86/PMOVZX.html) to create `c` with one instruction instead of unpacking lo/hi separately and shuffling that together. `_mm256_cvtepu8_epi16` – Peter Cordes Dec 01 '18 at 18:30
  • 1
    You can also optimize the sign calculation: since you only need the high bit to be correct for `blendv`, you can `__m128i sign = _mm_xor_si128(a,b);`. Or even better, conditionally negate using [`vpsignb`](http://felixcloutier.com/x86/PSIGNB:PSIGNW:PSIGND.html). Except that `a^b` will be zero for `a==b`, where the result should be +1 not 0. So `(a^b) | 1` will work, because you already have a `set1(1)` vector that will make it non-zero without affecting the sign bit. So `__m128i select = _mm_sign_epi8(div, a^b | one);`. (`vblendv` is 2 uops and only runs on the shuffle port on Haswell) – Peter Cordes Dec 01 '18 at 19:24
  • @Peter Thank you for your advice. – sugwan kim Dec 01 '18 at 20:27
1

This is for newcomers, with thanks to the original solution: Agner Fog's subroutine library (asmlib) has worked magic for me in optimization.

Here is the situation where you divide by the same variable value multiple times (as in a big loop):

#include <asmlib.h>

unsigned int a, b;
unsigned int divisor = any_random_value;    // runtime divisor
div_u32 OptimumDivision(divisor);           // precompute the reciprocal once
unsigned int q_a = a / OptimumDivision;     // fast division via multiplication
unsigned int q_b = b / OptimumDivision;

That's for unsigned int. If you need negative values, use div_i32 instead, which was faster in my tests even though the manual says the opposite.

I get around 3x the performance or more.

Mohamad Elnaqeeb
  • 471
  • 4
  • 13