
I'm looking to understand SSE2's capabilities a little more, and would like to know if one could make a 128-bit wide integer that supports addition, subtraction, XOR and multiplication?

jww
Erkling
  • The only 128-bit operations are OR, XOR and shift. Add and Subtract top out at 64-bits and the newer multiply allows up to 32-bits. In order to implement a 128-bit addition you would need to manually deal with the carry flag and lose all performance benefit of doing it in the first place. – BitBank Aug 30 '12 at 15:54
  • @BitBank: There is AND, and ANDNOT too, but your point is still valid - there are no 128 bit *arithmetic* operations in SSE2. – Paul R Aug 30 '12 at 16:01
  • Technically, you *can*. But there are no non-bitwise instructions to do so. So you'd have to emulate everything - at which point it isn't gonna be any better than just using carry-flags on x64... – Mysticial Aug 30 '12 at 16:28
  • Thank you for your answers ( well, comments! ) very much, a pity, for a second I thought we were already holding 128-bit processors in our hands. But, by any chance, do any later versions of SSE have these functions in 128-bits? – Erkling Aug 31 '12 at 13:22
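For reference, the scalar emulation the comments describe is only a few operations. A minimal sketch in plain C (the u128 struct and function name are made up for illustration):

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128; /* hypothetical 128-bit type */

static u128 add_u128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    /* unsigned addition wraps, so the low sum is smaller than an
       operand exactly when a carry occurred */
    r.hi = a.hi + b.hi + (r.lo < a.lo);
    return r;
}
```

On x86-64 a compiler can turn this into an add plus an add-with-carry using the carry flag, as the comments note.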

1 Answer


SIMD is meant to work on multiple small values at the same time, so there is no carry propagated into the next element; you must handle that yourself. SSE2 has no carry flag, but the carry of an unsigned addition is easy to compute as carry = sum < a (or equivalently carry = sum < b). Worse yet, SSE2 doesn't have 64-bit comparisons either, so you must build one out of 32-bit comparisons.
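The sign-bit trick for the missing unsigned comparison can be shown in scalar code first (a sketch, not part of the original answer):

```c
#include <stdint.h>

/* unsigned x < y done with a signed comparison after flipping the
   top bit of each operand -- the same trick the SSE code applies to
   every 32-bit lane with _mm_set1_epi32(0x80000000) */
static int ult_via_signed(uint64_t x, uint64_t y)
{
    return (int64_t)(x ^ 0x8000000000000000ULL)
         < (int64_t)(y ^ 0x8000000000000000ULL);
}
```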

Here is untested, unoptimized C code based on the idea above:

#include <emmintrin.h>
#include <stdbool.h>

inline bool lessthan(__m128i a, __m128i b){
    // flip the sign bit of every 32-bit lane so the signed 32-bit
    // compares below behave as unsigned ones
    a = _mm_xor_si128(a, _mm_set1_epi32(0x80000000));
    b = _mm_xor_si128(b, _mm_set1_epi32(0x80000000));
    __m128i t = _mm_cmplt_epi32(a, b);
    __m128i u = _mm_cmpgt_epi32(a, b);
    // combine the per-lane results into an unsigned 64-bit a < b
    // for the low 64-bit elements; bit 0 holds the answer
    __m128i z = _mm_or_si128(t, _mm_shuffle_epi32(t, 177));
    z = _mm_andnot_si128(_mm_shuffle_epi32(u, 245), z);
    return _mm_cvtsi128_si32(z) & 1;
}

inline __m128i addi128(__m128i a, __m128i b)
{
    // lane 0 is the low 64 bits, lane 1 the high 64 bits
    __m128i sum = _mm_add_epi64(a, b);
    // unsigned carry out of the low half: sum.lo < a.lo
    if (lessthan(sum, a))
    {
        __m128i ONE = _mm_set_epi64x(1, 0); // add 1 to the high lane
        sum = _mm_add_epi64(sum, ONE);
    }

    return sum;
}

As you can see, the code requires many more instructions, and even after optimization it may still be much slower than a simple ADD/ADC pair on x86-64 (or four instructions on x86).
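For comparison, GCC and Clang expose a built-in 128-bit type that compiles to exactly that pair on x86-64 (a compiler-specific sketch, not standard C):

```c
#include <stdint.h>

/* GCC/Clang extension: unsigned __int128 addition compiles to a
   single ADD plus ADC on x86-64 */
static unsigned __int128 add128_builtin(unsigned __int128 a,
                                        unsigned __int128 b)
{
    return a + b;
}
```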


SSE2 does help, though, if you have multiple 128-bit integers to add in parallel. You need to arrange the values so that all the low halves sit in one vector and all the high halves in another; then all the low parts can be added at once, and likewise all the high parts.
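A sketch of that arrangement, assuming lane 0 of each vector belongs to one 128-bit number and lane 1 to another, with low halves in one vector and high halves in another (untested, in the same spirit as the code above):

```c
#include <emmintrin.h>

/* Add two pairs of 128-bit integers at once. alo/blo hold the two
   low 64-bit halves, ahi/bhi the two high halves. */
static void add2_u128(__m128i alo, __m128i ahi,
                      __m128i blo, __m128i bhi,
                      __m128i *rlo, __m128i *rhi)
{
    __m128i lo = _mm_add_epi64(alo, blo);

    /* unsigned 64-bit (lo < alo) per lane, built from 32-bit compares;
       flip sign bits so signed 32-bit compares act as unsigned */
    __m128i bias  = _mm_set1_epi32(0x80000000);
    __m128i lt32  = _mm_cmplt_epi32(_mm_xor_si128(lo, bias),
                                    _mm_xor_si128(alo, bias));
    __m128i eq32  = _mm_cmpeq_epi32(lo, alo);
    __m128i hi_lt = _mm_shuffle_epi32(lt32, _MM_SHUFFLE(3, 3, 1, 1));
    __m128i hi_eq = _mm_shuffle_epi32(eq32, _MM_SHUFFLE(3, 3, 1, 1));
    __m128i lo_lt = _mm_shuffle_epi32(lt32, _MM_SHUFFLE(2, 2, 0, 0));
    __m128i carry = _mm_or_si128(hi_lt, _mm_and_si128(hi_eq, lo_lt));

    *rlo = lo;
    /* carry is 0 or all-ones per 64-bit lane; subtracting -1 adds 1 */
    *rhi = _mm_sub_epi64(_mm_add_epi64(ahi, bhi), carry);
}
```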


phuclv
  • You can do 64-bit equality comparison with SSE4.1, but I still don't think it will be any faster than simple scalar code – phuclv Apr 07 '14 at 15:22
  • I think you mean PCMPGTQ (Compare Packed Signed 64-bit data For Greater Than) from SSE4.2 not SSE4.1. – Z boson Mar 02 '15 at 15:33
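For completeness, even with SSE4.2's PCMPGTQ (a signed compare) an unsigned 64-bit comparison still needs the sign-bit flip; a sketch, assuming GCC/Clang syntax:

```c
#include <nmmintrin.h> /* SSE4.2 */

/* unsigned 64-bit a > b per lane via the signed PCMPGTQ */
__attribute__((target("sse4.2")))
static __m128i cmpgt_epu64(__m128i a, __m128i b)
{
    /* flip the sign bits so the signed compare acts as unsigned */
    __m128i sign = _mm_set1_epi64x((long long)0x8000000000000000ULL);
    return _mm_cmpgt_epi64(_mm_xor_si128(a, sign),
                           _mm_xor_si128(b, sign));
}
```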