
I'm looking to understand SSE2's capabilities a little more, and would like to know if one could make a 128-bit wide integer that supports addition, subtraction, XOR and multiplication?

jww
Erkling
  • The only 128-bit operations are OR, XOR and shift. Add and Subtract top out at 64-bits and the newer multiply allows up to 32-bits. In order to implement a 128-bit addition you would need to manually deal with the carry flag and lose all performance benefit of doing it in the first place. – BitBank Aug 30 '12 at 15:54
  • @BitBank: There is AND, and ANDNOT too, but your point is still valid - there are no 128 bit *arithmetic* operations in SSE2. – Paul R Aug 30 '12 at 16:01
  • Technically, you *can*. But there are no non-bitwise instructions to do so. So you'd have to emulate everything - at which point it isn't gonna be any better than just using carry-flags on x64... – Mysticial Aug 30 '12 at 16:28
  • Thank you for your answers ( well, comments! ) very much, a pity, for a second I thought we were already holding 128-bit processors in our hands. But, by any chance, do any later versions of SSE have these functions in 128-bits? – Erkling Aug 31 '12 at 13:22
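For reference, the scalar emulation the comments describe is only a few operations. A minimal sketch in plain C (the u128 struct and function name are made up for illustration):

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128; /* hypothetical 128-bit type */

static u128 add_u128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    /* unsigned addition wraps, so the low sum is smaller than an
       operand exactly when a carry occurred */
    r.hi = a.hi + b.hi + (r.lo < a.lo);
    return r;
}
```

On x86-64 a compiler can turn this into an add plus an add-with-carry using the carry flag, as the comments note.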

1 Answer


SIMD is meant to work on multiple small values at the same time, so there is no carry propagated into the next element; you must handle that yourself. SSE2 has no carry flag, but the carry of an unsigned addition is easy to compute as carry = sum < a (or equivalently carry = sum < b). Worse yet, SSE2 doesn't have 64-bit comparisons either, so you must build one out of 32-bit comparisons.
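The sign-bit trick for the missing unsigned comparison can be shown in scalar code first (a sketch, not part of the original answer):

```c
#include <stdint.h>

/* unsigned x < y done with a signed comparison after flipping the
   top bit of each operand -- the same trick the SSE code applies to
   every 32-bit lane with _mm_set1_epi32(0x80000000) */
static int ult_via_signed(uint64_t x, uint64_t y)
{
    return (int64_t)(x ^ 0x8000000000000000ULL)
         < (int64_t)(y ^ 0x8000000000000000ULL);
}
```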

Here is untested, unoptimized C code based on the idea above:

#include <emmintrin.h>
#include <stdbool.h>

inline bool lessthan(__m128i a, __m128i b){
    // flip the sign bit of every 32-bit lane so the signed 32-bit
    // compares below behave as unsigned ones
    a = _mm_xor_si128(a, _mm_set1_epi32(0x80000000));
    b = _mm_xor_si128(b, _mm_set1_epi32(0x80000000));
    __m128i t = _mm_cmplt_epi32(a, b);
    __m128i u = _mm_cmpgt_epi32(a, b);
    // combine the per-lane results into an unsigned 64-bit a < b
    // for the low 64-bit elements; bit 0 holds the answer
    __m128i z = _mm_or_si128(t, _mm_shuffle_epi32(t, 177));
    z = _mm_andnot_si128(_mm_shuffle_epi32(u, 245), z);
    return _mm_cvtsi128_si32(z) & 1;
}

inline __m128i addi128(__m128i a, __m128i b)
{
    // lane 0 is the low 64 bits, lane 1 the high 64 bits
    __m128i sum = _mm_add_epi64(a, b);
    // unsigned carry out of the low half: sum.lo < a.lo
    if (lessthan(sum, a))
    {
        __m128i ONE = _mm_set_epi64x(1, 0); // add 1 to the high lane
        sum = _mm_add_epi64(sum, ONE);
    }

    return sum;
}

As you can see, the code requires many more instructions, and even after optimization it may still be much slower than a simple ADD/ADC pair on x86-64 (or four instructions on x86).
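For comparison, GCC and Clang expose a built-in 128-bit type that compiles to exactly that pair on x86-64 (a compiler-specific sketch, not standard C):

```c
#include <stdint.h>

/* GCC/Clang extension: unsigned __int128 addition compiles to a
   single ADD plus ADC on x86-64 */
static unsigned __int128 add128_builtin(unsigned __int128 a,
                                        unsigned __int128 b)
{
    return a + b;
}
```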


SSE2 does help, though, if you have multiple 128-bit integers to add in parallel. You need to arrange the values so that all the low halves sit in one vector and all the high halves in another; then all the low parts can be added at once, and likewise all the high parts.
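A sketch of that arrangement, assuming lane 0 of each vector belongs to one 128-bit number and lane 1 to another, with low halves in one vector and high halves in another (untested, in the same spirit as the code above):

```c
#include <emmintrin.h>

/* Add two pairs of 128-bit integers at once. alo/blo hold the two
   low 64-bit halves, ahi/bhi the two high halves. */
static void add2_u128(__m128i alo, __m128i ahi,
                      __m128i blo, __m128i bhi,
                      __m128i *rlo, __m128i *rhi)
{
    __m128i lo = _mm_add_epi64(alo, blo);

    /* unsigned 64-bit (lo < alo) per lane, built from 32-bit compares;
       flip sign bits so signed 32-bit compares act as unsigned */
    __m128i bias  = _mm_set1_epi32(0x80000000);
    __m128i lt32  = _mm_cmplt_epi32(_mm_xor_si128(lo, bias),
                                    _mm_xor_si128(alo, bias));
    __m128i eq32  = _mm_cmpeq_epi32(lo, alo);
    __m128i hi_lt = _mm_shuffle_epi32(lt32, _MM_SHUFFLE(3, 3, 1, 1));
    __m128i hi_eq = _mm_shuffle_epi32(eq32, _MM_SHUFFLE(3, 3, 1, 1));
    __m128i lo_lt = _mm_shuffle_epi32(lt32, _MM_SHUFFLE(2, 2, 0, 0));
    __m128i carry = _mm_or_si128(hi_lt, _mm_and_si128(hi_eq, lo_lt));

    *rlo = lo;
    /* carry is 0 or all-ones per 64-bit lane; subtracting -1 adds 1 */
    *rhi = _mm_sub_epi64(_mm_add_epi64(ahi, bhi), carry);
}
```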


phuclv
  • You can do 64-bit equality comparison with SSE4.1, but I still don't think it will be any faster than simple scalar code – phuclv Apr 07 '14 at 15:22
  • I think you mean PCMPGTQ (Compare Packed Signed 64-bit data For Greater Than) from SSE4.2 not SSE4.1. – Z boson Mar 02 '15 at 15:33
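For completeness, even with SSE4.2's PCMPGTQ (a signed compare) an unsigned 64-bit comparison still needs the sign-bit flip; a sketch, assuming GCC/Clang syntax:

```c
#include <nmmintrin.h> /* SSE4.2 */

/* unsigned 64-bit a > b per lane via the signed PCMPGTQ */
__attribute__((target("sse4.2")))
static __m128i cmpgt_epu64(__m128i a, __m128i b)
{
    /* flip the sign bits so the signed compare acts as unsigned */
    __m128i sign = _mm_set1_epi64x((long long)0x8000000000000000ULL);
    return _mm_cmpgt_epi64(_mm_xor_si128(a, sign),
                           _mm_xor_si128(b, sign));
}
```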