bitwise type convertion with AVX2 and range preservation

Question

I want to convert a vector of signed char into a vector of unsigned char. I want to preserve the value range for each type.

I mean the value range of signed char is -128 and +127 when the value range of an unsigned char element is between 0 - 255.

Without intrinsics I can do this almost like that :

#include <iostream>

int main(int argc,char* argv[])
{

typedef signed char schar;
typedef unsigned char uchar;

schar a[]={-1,-2,-3,4,5,6,-7,-8,9,10,-11,12,13,14,15,16,17,-128,19,20,21,22,23,24,25,26,27,28,29,30,31,32};

uchar b[32] = {0};

    for(int i=0;i<32;i++)
        b[i] = 0xFF & ~(0x7F ^ a[i]);

    return 0;

}

So using AVX2 I wrote the following program :

#include <immintrin.h>
#include <iostream>

int main(int argc,char* argv[])
{
    schar a[]={-1,-2,-3,4,5,6,-7,-8,9,10,-11,12,13,14,15,16,17,-128,19,20,21,22,23,24,25,26,27,28,29,30,31,32};

     uchar b[32] = {0};

    __m256i _a = _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(a));
    __m256i _b;
    __m256i _cst1 = _mm256_set1_epi8(0x7F);
    __m256i _cst2 = _mm256_set1_epi8(0xFF);

    _a = _mm256_xor_si256(_a,_cst1);
    _a = _mm256_andnot_si256(_cst2,_a);

// The way I do the convertion is inspired by an algorithm from OpenCV. 
// Convertion from epi8 -> epi16
    _b = _mm256_srai_epi16(_mm256_unpacklo_epi8(_mm256_setzero_si256(),_a),8);
    _a = _mm256_srai_epi16(_mm256_unpackhi_epi8(_mm256_setzero_si256(),_a),8);

    // convert from epi16 -> epu8.
    _b = _mm256_packus_epi16(_b,_a);

_mm256_stream_si256(reinterpret_cast<__m256i*>(b),_b);

return 0;
}

When I display the varaible b it was fully empty. I check also the following situations :

   #include <immintrin.h>
    #include <iostream>

    int main(int argc,char* argv[])

{
    schar a[]={-1,-2,-3,4,5,6,-7,-8,9,10,-11,12,13,14,15,16,17,-128,19,20,21,22,23,24,25,26,27,28,29,30,31,32};

     uchar b[32] = {0};

    __m256i _a = _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(a));
    __m256i _b;
    __m256i _cst1 = _mm256_set1_epi8(0x7F);
    __m256i _cst2 = _mm256_set1_epi8(0xFF);


// The way I do the convertion is inspired by an algorithm from OpenCV. 
// Convertion from epi8 -> epi16
    _b = _mm256_srai_epi16(_mm256_unpacklo_epi8(_mm256_setzero_si256(),_a),8);
    _a = _mm256_srai_epi16(_mm256_unpackhi_epi8(_mm256_setzero_si256(),_a),8);

    // convert from epi16 -> epu8.
    _b = _mm256_packus_epi16(_b,_a);

_b = _mm256_xor_si256(_b,_cst1);
_b = _mm256_andnot_si256(_cst2,_b);


_mm256_stream_si256(reinterpret_cast<__m256i*>(b),_b);

return 0;
}

and :

 #include <immintrin.h>
    #include <iostream>

    int main(int argc,char* argv[])

{
    schar a[]={-1,-2,-3,4,5,6,-7,-8,9,10,-11,12,13,14,15,16,17,-128,19,20,21,22,23,24,25,26,27,28,29,30,31,32};

     uchar b[32] = {0};

    __m256i _a = _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(a));
    __m256i _b;
    __m256i _cst1 = _mm256_set1_epi8(0x7F);
    __m256i _cst2 = _mm256_set1_epi8(0xFF);


// The way I do the convertion is inspired by an algorithm from OpenCV. 
// Convertion from epi8 -> epi16
_b = _mm256_srai_epi16(_mm256_unpacklo_epi8(_mm256_setzero_si256(),_a),8);
_a = _mm256_srai_epi16(_mm256_unpackhi_epi8(_mm256_setzero_si256(),_a),8);

_a = _mm256_xor_si256(_a,_cst1);
_a = _mm256_andnot_si256(_cst2,_a);

_b = _mm256_xor_si256(_b,_cst1);
_b = _mm256_andnot_si256(_cst2,_b);

_b = _mm256_packus_epi16(_b,_a);

_mm256_stream_si256(reinterpret_cast<__m256i*>(b[0]),_b);

return 0;
}

My investigation show me a part of the issue is related to the and_not operation. But I don't find why.

The variable b should contain the following sequence : [127, 126, 125, 132, 133, 134, 121, 120, 137, 138, 117, 140, 141, 142, 143, 144, 145, 0, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160].

Thanks in advance for any help.

can you explain in more detail what you mean by "I want to preserve the value range" . For example what would the signed char value `-2` transform to? — M.M, Feb 04 '16 at 08:54
@M.M: I think he means as opposed to `abs()`, or saturating negatives to 0 or something. From the last paragraph, "b should contain", we can see that he just wants to add 128. — Peter Cordes, Feb 04 '16 at 09:26
I compiled a program similar to your original code that did not use intrinsics, and the clang/llvm optimizer was clever enough to rewrite the code to use avx instructions to do this with packed operations. Are you sure you can actually do a better job that your compiler here? — jcoder, Feb 04 '16 at 10:37

score 0 · Accepted Answer · answered Feb 04 '16 at 03:16

Yeah, the "andnot" definitely looks sketchy. Since _cst2 values are set to 0xFF, this operation will AND your _b vector with zero. I think you mixed up the order of arguments. It's the first argument that gets inverted. See the reference.

I don't understand the rest of the guff with conversions etc either. You just need this:

__m256i _a, _b;
_a = _mm256_stream_load_si256( reinterpret_cast<__m256i*>(a) );
_b = _mm256_xor_si256( _a, _mm256_set1_epi8( 0x7f ) );
_b = _mm256_andnot_si256( _b, _mm256_set1_epi8( 0xff ) );
_mm256_stream_si256( reinterpret_cast<__m256i*>(b), _b );

An alternative solution is to just add 128, but I'm not certain of the implications of overflow in this case:

__m256i _a, _b;
_a = _mm256_stream_load_si256( reinterpret_cast<__m256i*>(a) );
_b = _mm256_add_epi8( _a, _mm256_set1_epi8( 0x80 ) );
_mm256_stream_si256( reinterpret_cast<__m256i*>(b), _b );

One final important thing is that your a and b arrays MUST have 32-byte alignment. If you are using C++11 you can use alignas:

alignas(32) signed char a[32] = { -1,-2,-3,4,5,6,-7,-8,9,10,-11,12,13,14,15,16,17,
                                 -128,19,20,21,22,23,24,25,26,27,28,29,30,31,32 };
alignas(32) unsigned char b[32] = {0};

Otherwise you will need to use non-aligned load and store instructions, i.e. _mm256_loadu_si256 and _mm256_storeu_si256. But those don't have the same non-temporal cache properties as the stream instructions.

Hello Thank you very much for you answer. I did a convertion to unsigned short because I wasn't sure if I could make every without changing the type. You are also right about the alignement. Thank you very much for your help :) — John_Sharp1318, Feb 04 '16 at 04:03
NT loads from normal (writeback) memory are not helpful, so it's unlikely the OP actually needs them, but you're right that streaming stores do have to be aligned. Also, adding `0x80` is correct, I checked. It does exactly the same thing as subtracting `0x80` (`-128`). So converting on the fly is even cheaper: it can be done with a `vpaddb dest, ymm7, m256` instruction when the compiler folds the load into the add as a memory operand. (subtracting in the other order wouldn't work.) — Peter Cordes, Feb 04 '16 at 09:18

score 0 · Answer 2 · edited May 23 '17 at 12:07

You're just talking about adding 128 to each byte, right? That shifts the range from [-128..127] to [0..255]. The trick for adding 128 when you can only use 8bit operands is to subtract -128.

However, adding 0x80 works as well, when the result is truncated to 8 bits. (because of two's complement). Adding is good, because it doesn't matter which order the operands are in, so the compiler can use a load-and-add instruction (folding the memory operand into the load).

Adding/subtracting -128, with the carry/borrow stopped by the element boundary, is equivalent to xor (aka carryless add). Using pxor could be a small advantage on Intel Core2 through Broadwell, since Intel must have thought it was worth it to add paddb/w/d/q hardware on port0 for Skylake (giving them one per 0.333c throughput like pxor). (Thanks @harold for pointing this out). Both instructions only require SSE2.

XOR is also potentially useful for SWAR unaligned cleanup, or for SIMD architectures that don't have a byte-size add/subtract operation.

You shouldn't use _a for your variable name. _ names are reserved. I tend to use names like veca or va, and preferably something more descriptive for temporaries. (Like a_unpacked).

__m256i signed_bytes = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
__m256i unsigned_bytes = _mm256_add_epi8(signed_bytes, _mm256_set1_epi8(-128));

Yes, it's that simple, you don't need two's-complement bithacks. For one thing, your way needs two separate 32B masks, which increases your cache footprint. (But see What are the best instruction sequences to generate vector constants on the fly? You (or the compiler) could generate the vector of -128 bytes using 3 instructions, or a broadcast-load from a 4B constant.)

Only use _mm256_stream_load_si256 for I/O (e.g. reading from video RAM). Don't use it for reading from "normal" (writeback) memory; it doesn't do what you think it does. (I don't think it has any particular downside, though. It just works like a normal vmovdqa load). I put some links about that in another answer I wrote recently.

Streaming stores are useful to normal (writeback) memory regions. However, they're a good idea only if you're not going to read that memory again any time soon. If that's the case, you should probably do this conversion from signed to unsigned on the fly in the code that reads this data, because it's super-cheap. Just keep your data in one format or the other, and convert on the fly in code that needs it the other way. Only needing one copy of it in cache is a huge win compared to saving one instruction in some loops.

Also google "cache blocking" (aka loop tiling) and read about optimizing your code to work in small chunks to increase computational density. (Do as much stuff as possible with data while it's in cache.)

Good to know. Thank you for the informations. Actually this is code is only an experiment. The code will be use for process images that why I use the instruction _mm256_stream_load_si256. The aim is to move signed char to unsigned char in order to process an histogram. But I'll take a closer look to your post an the cache blocking. — John_Sharp1318, Feb 04 '16 at 13:21
I just think isn't it better to do a substration of 128 rather than an addition ? I mean : `__m256i unsigned_bytes = _mm256_sub_epi8(signed_bytes, _mm256_set1_epi8(128));` rather than : `__m256i unsigned_bytes = _mm256_add_epi8(signed_bytes, _mm256_set1_epi8(-128));` On the intel intrinsics guide web page it's wrote there no Throughput for the substration while there is a Throughput of 0.5 for the addition. — John_Sharp1318, Feb 04 '16 at 13:51
@Jonny_S: So are you literally using video driver APIs to map USWC video RAM into your process? If not, then don't use stream_load. And also, convert between signed and unsigned ranges on the fly in your histogram code. **re: add vs. sub**: `vpaddb` and `vpsubb` have identical throughput, latency, and execution unit requirements on all CPUs, because that's the only sane HW design. IDK why the intrinsics guide lists it with a `-`, but that doesn't mean it has unlimited throughput! Look at http://agner.org/optimize/ for better instruction tables (and how to understand the implications). — Peter Cordes, Feb 04 '16 at 17:03
You could also xor with -128, which has a higher throughput on Haswell (probably doesn't matter much in this context, but well) — harold, Feb 05 '16 at 16:56
@PeterCordes no they can't, or at least, I have them listed as p15 on SnB, IvB and Haswell, they only became p015 in Skylake — harold, Feb 06 '16 at 09:18
@harold: thanks, IDK what I was looking at before, but I rechecked the table and you're right. I thought I looked at more than just the Skylake page in the spreadsheet before making a generalization about the whole SnB-family. >.< Thanks again for the xor suggestion. — Peter Cordes, Feb 06 '16 at 09:36

bitwise type convertion with AVX2 and range preservation

2 Answers2

Linked