
How could I port my code to AVX2 and get the same result as before?

Is it possible to use __m256i in the LongNumInit, LongNumPrint functions instead of uint8_t *L, or some similar type of variable?

My knowledge of AVX is quite limited; I have investigated quite a bit, but I do not understand very well how to transform my code. Any suggestion or explanation is welcome.

I'm really interested in this code in AVX2.

void LongNumInit(uint8_t *L, size_t N )
{
  for(size_t i = 0; i < N; ++i){
      L[i] = myRandom()%10;
  }

}
void LongNumPrint( uint8_t *L, size_t N, uint8_t *Name )
{
  printf("%s:", Name);
  for ( size_t i=N; i>0;--i )
  {
    printf("%d", L[i-1]);
  }
  printf("\n");
}
int main (int argc, char **argv)
{
  int i, sum1, sum2, sum3, N=10000, Rep=50;

  seed = 12345;

  // obtain parameters at run time
  if (argc>1) { N    = atoi(argv[1]); }
  if (argc>2) { Rep  = atoi(argv[2]); }
  
 // Create Long Nums
  unsigned char *V1= (unsigned char*) malloc( N);
  unsigned char *V2= (unsigned char*) malloc( N);
  unsigned char *V3= (unsigned char*) malloc( N);
  unsigned char *V4= (unsigned char*) malloc( N);

  LongNumInit ( V1, N ); LongNumInit ( V2, N ); LongNumInit ( V3, N );
   
//Print last 32 digits of Long Numbers
  LongNumPrint( V1, 32, "V1" );
 LongNumPrint( V2, 32, "V2" );
  LongNumPrint( V3, 32, "V3" );
  LongNumPrint( V4, 32, "V4" );

  free(V1); free(V2); free(V3); free(V4);
  return 0;
}

The result that I obtain in my initial code is this:

V1:59348245908804493219098067811457
V2:24890422397351614779297691741341
V3:63392771324953818089038280656869
V4:00000000000000000000000000000000
  • As in https://stackoverflow.com/questions/61062060/why-compiler-doesnt-vectorize-this-function, please first explain why you are trying to implement big-integer arithmetic with single decimal digits. – chtz Apr 12 '20 at 00:55
  • If you want this to be fast, don't use BCD / base 10. I wrote a CodeReview.SE answer about this: [BigInt class in C++](https://codereview.stackexchange.com/a/237764) – Peter Cordes Apr 12 '20 at 01:34
  • Carry propagation makes it hard to vectorize. See [Can long integer routines benefit from SSE?](https://stackoverflow.com/q/8866973) for ways to store your data that make it practical, specifically partial-word arithmetic where your temporaries might not be "normalized", letting you do lazy carry handling. – Peter Cordes Apr 12 '20 at 01:40

1 Answer


This is a terrible format for BigInteger in general, see https://codereview.stackexchange.com/a/237764 for a code review of the design flaws in using one decimal digit per byte for BigInteger, and what you could/should do instead.

And see Can long integer routines benefit from SSE? for @Mysticial's notes on ways to store your data that make SIMD for BigInteger math practical, specifically partial-word arithmetic where your temporaries might not be "normalized", letting you do lazy carry handling.


But apparently you're just asking about this code, the random-init and print functions, not how to do math between two numbers in this format.

We can vectorize both of these quite well. My LongNumPrintName() is a drop-in replacement for yours.

For LongNumInit I'm just showing a building-block that stores two 32-byte chunks and returns an incremented pointer. Call it in a loop. (It naturally produces 2 vectors per call so for small N you might make an alternate version.)

LongNumInit

What's the fastest way to generate a 1 GB text file containing random digits? generates space-separated random ASCII decimal digits at about 33 GB/s on 4GHz Skylake, including overhead of write() system calls to /dev/null. (This is higher than DRAM bandwidth; cache blocking for 128kiB lets the stores hit in L2 cache. The kernel driver for /dev/null doesn't even read the user-space buffer.)

It could easily be adapted into an AVX2 version of void LongNumInit(uint8_t *L, size_t N ). My answer there uses an AVX2 xorshift128+ PRNG (vectorized with 4 independent PRNGs in the 64-bit elements of a __m256i), like the one in AVX/SSE version of xorshift128+. That should be of similar randomness quality to your rand() % 10.

It breaks that up into decimal digits via a multiplicative inverse to divide and modulo by 10 with shifts and vpmulhuw, as in Why does GCC use multiplication by a strange number in implementing integer division?. (It actually uses GNU C native vector syntax to let GCC determine the magic constant and emit the multiplies and shifts, giving convenient syntax like v16u dig1 = v % ten; and v /= ten;.)
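For illustration only (this helper is not part of the original code), here is roughly what the compiler generates for the unsigned 16-bit divide and modulo by 10, written out with intrinsics. The 0xCCCD multiplier and shift-by-3 are the standard fixed-point inverse of 10, exact for all 16-bit unsigned inputs:

#include <immintrin.h>

// Sketch: x/10 for 16 unsigned 16-bit elements, via mulhi (vpmulhuw) and a shift.
// (x * 0xCCCD) >> 19  ==  x / 10  for all x in [0, 65535]
static inline __m256i div10_epu16(__m256i x)
{
    __m256i q = _mm256_mulhi_epu16(x, _mm256_set1_epi16((short)0xCCCD));
    return _mm256_srli_epi16(q, 3);
}

// Sketch: x % 10  ==  x - (x/10)*10
static inline __m256i mod10_epu16(__m256i x)
{
    __m256i q = div10_epu16(x);
    return _mm256_sub_epi16(x, _mm256_mullo_epi16(q, _mm256_set1_epi16(10)));
}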

You can use _mm256_packus_epi16 to pack two vectors of 16-bit digits into 8-bit elements instead of turning the odd elements into ASCII ' ' and the even elements into ASCII '0'..'9'. (So change vec_store_digit_and_space to pack pairs of vectors instead of ORing with a constant, see below)

Compile this with gcc, clang, or ICC (or hopefully any other compiler that understands the GNU C dialect of C99, and Intel's intrinsics).

See https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html for the __attribute__((vector_size(32))) part, and https://software.intel.com/sites/landingpage/IntrinsicsGuide/ for the _mm256_* stuff. Also https://stackoverflow.com/tags/sse/info.

#include <immintrin.h>

// GNU C native vectors let us get the compiler to do stuff like %10 each element
typedef unsigned short v16u __attribute__((vector_size(32)));

// returns p + size of stores.  Caller should use outpos = f(vec, outpos)
// p must be aligned
__m256i* vec_store_digits(__m256i vec, __m256i *restrict p)
{
    v16u v = (v16u)vec;
    v16u ten = (v16u)_mm256_set1_epi16(10);

    v16u divisor = (v16u)_mm256_set1_epi16(6554);  // ceil((2^16-1) / 10.0)
    v16u div6554 = v / divisor;      // Basically the entropy from the upper two decimal digits: 0..65.
    // Probably some correlation with the modulo-based values, especially dig3, but we do this instead of
    // dig4 for more ILP and fewer instructions total.

    v16u dig1 = v % ten;
    v /= ten;
    v16u dig2 = v % ten;
    v /= ten;
    v16u dig3 = v % ten;
    //  dig4 would overlap much of the randomness that div6554 gets

    // __m256i assignment through p is an aligned store
    // pack pairs of 16-bit digit vectors down to 8-bit elements
    p[0] = _mm256_packus_epi16((__m256i)div6554, (__m256i)dig1);
    p[1] = _mm256_packus_epi16((__m256i)dig2, (__m256i)dig3);
    return p + 2;  // always a constant number of full vectors
}

The logic in random_decimal_fill_buffer that inserts newlines can be totally removed because you just want a flat array of decimal digits. Just call the above function in a loop until you've filled your buffer.
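For example, a minimal sketch of such a loop, building on vec_store_digits above and assuming a hypothetical avx2_prng_next() that returns 32 bytes of random bits per call (e.g. the vectorized xorshift128+ from the linked answer):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Hypothetical placeholder: 32 bytes of random bits per call, e.g. the AVX2
// xorshift128+ PRNG from the linked answer.  Not defined here.
__m256i avx2_prng_next(void);

// Sketch: fill buf[0 .. padded_N) with decimal digits, 64 digits per call to
// vec_store_digits (two 32-byte aligned stores).  buf must be 32-byte aligned
// and padded_N a multiple of 64 (see the padding discussion below).
void LongNumInit_avx2(uint8_t *buf, size_t padded_N)
{
    __m256i *outp = (__m256i*)buf;
    __m256i *end  = (__m256i*)(buf + padded_N);
    while (outp < end) {
        outp = vec_store_digits(avx2_prng_next(), outp);
    }
}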

Handling small sizes (less than a full vector):

It would be convenient to pad your malloc up to the next multiple of 32 bytes so it's always safe to do a 32-byte load without checking for maybe crossing into an unmapped page.

And use C11 aligned_alloc to get 32-byte aligned storage. So for example, aligned_alloc(32, (size+31) & -32). This lets us just do full 32-byte stores even if N is odd. Logically only the first N bytes of the buffer hold our real data, but it's convenient to have padding we can scribble over to avoid any extra conditional checks for N being less than 32, or not a multiple of 32.
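For example (a sketch; alloc_digits is just an illustrative helper name, not from the original code):

#include <stdint.h>
#include <stdlib.h>

// Sketch: 32-byte aligned allocation, padded up to the next multiple of 32 (C11).
static uint8_t *alloc_digits(size_t N)
{
    size_t padded = (N + 31) & -(size_t)32;   // round N up to a multiple of 32
    return aligned_alloc(32, padded);
}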

Unfortunately ISO C and glibc are missing aligned_realloc and aligned_calloc. MSVC does actually provide those (see Why is there no 'aligned_realloc' on most platforms?), allowing you to sometimes allocate more space at the end of an aligned buffer without copying it. A "try_realloc" would be ideal for C++, which might need to run copy constructors if non-trivially-copyable objects change address. Non-expressive allocator APIs that force sometimes-unnecessary copying are a pet peeve of mine.


LongNumPrint

Taking a uint8_t *Name arg is bad design. If the caller wants to printf a "something:" string first, they can do that. Your function should just do what printf "%d" does for an int.

Since you're storing your digits in reverse printing order, you'll want to byte-reverse into a tmp buffer and convert 0..9 byte values to '0'..'9' ASCII character values by ORing with '0'. Then pass that buffer to fwrite.

Specifically, use alignas(32) char tmpbuf[8192]; as a local variable.

You can work in fixed-size chunks (like 1kiB or 8kiB) instead of allocating a potentially huge buffer. You probably still want to go through stdio (instead of calling write() directly and managing your own I/O buffering). With an 8kiB buffer, an efficient fwrite might just pass that on to write() directly instead of memcpy-ing into the stdio buffer. You might want to play around with tuning this, but keeping the tmp buffer comfortably smaller than half of L1d cache will mean it's still hot in cache when it's re-read after you wrote it.

Cache blocking makes the loop bounds a lot more complex but it's worth it for very large N.

Byte-reversing 32 bytes at a time:

You could avoid this work by deciding that your digits are stored in MSD-first order, but then if you did want to implement addition it would have to loop from the end backwards.

Your function could be implemented with SIMD _mm_shuffle_epi8 to reverse 16-byte chunks, starting from the end of your digit array and writing to the beginning of your tmp buffer.

Or better, use two 16-byte loads (vmovdqu / vinserti128) to feed _mm256_shuffle_epi8 for a byte-reverse within lanes, setting up for 32-byte stores.

On Intel CPUs, vinserti128 decodes to a load+ALU uop, but it can run on any vector ALU port, not just the shuffle port. So two 128-bit loads are more efficient than 256-bit load -> vpshufb -> vpermq, which would probably bottleneck on shuffle-port throughput if data was hot in cache. Intel CPUs can do up to 2 loads + 1 store per clock cycle (or in IceLake, 2 loads + 2 stores). We'll probably bottleneck on the front-end if there are no memory bottlenecks, so in practice not saturating load+store and shuffle ports. (https://agner.org/optimize/ and https://uops.info/)

This function is also simplified by the assumption that we can always read 32 bytes from L without crossing into an unmapped page. But after a 32-byte reverse for small N, the first N bytes of the input become the last N bytes in a 32-byte chunk. It would be most convenient if we could always safely do a 32-byte load ending at the end of a buffer, but it's unreasonable to expect padding before the object.

#include <immintrin.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>
#include <stdint.h>

// one vector of 32 bytes of digits, reversed and converted to ASCII
static inline
void ASCIIrev32B(void *dst, const void *src)
{
    __m128i hi = _mm_loadu_si128(1 + (const __m128i*)src);  // unaligned loads
    __m128i lo = _mm_loadu_si128(src);
    __m256i v = _mm256_set_m128i(lo, hi);    // reverse 128-bit hi/lo halves

    // compilers will hoist constants out of inline functions
    __m128i byterev_lane = _mm_set_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);      
    __m256i byterev = _mm256_broadcastsi128_si256(byterev_lane);  // same in each lane
    v = _mm256_shuffle_epi8(v, byterev);               // in-lane reverse

    v = _mm256_or_si256(v, _mm256_set1_epi8('0'));     // digits to ASCII
    _mm256_storeu_si256(dst, v);                       // Will usually be aligned in practice.
}

// Tested for N=32; could be bugs in the loop bounds for other N
// returns bytes written, like fwrite: N means no error, 0 means error in all fwrites
size_t LongNumPrint( uint8_t *num, size_t N)
{
    // caller can print a name if it wants

    const int revbufsize = 8192;      // 8kiB on the stack should be fine
    alignas(32) char revbuf[revbufsize];

    if (N<32) {
        // TODO: maybe use a smaller revbuf for this case to avoid touching new stack pages
        ASCIIrev32B(revbuf, num);   // the data we want is at the *end* of a 32-byte reverse
        return fwrite(revbuf+32-N, 1, N, stdout);
    }

    size_t bytes_written = 0;
    const uint8_t *inp = num+N;  // start with last 32 bytes of num[]
    do {
        size_t chunksize = (inp - num >= revbufsize) ? revbufsize : inp - num;

        const uint8_t *inp_stop = inp - chunksize + 32;   // leave one full vector for the end
        uint8_t *outp = revbuf;
        while (inp > inp_stop) {        // may run 0 times
            inp -= 32;
            ASCIIrev32B(outp, inp);
            outp += 32;
        }
        // reverse first (lowest address) 32 bytes of this chunk of num
        // into last 32 bytes of this chunk of revbuf
        // if chunksize%32 != 0 this will overlap, which is fine.
        ASCIIrev32B(revbuf + chunksize - 32, inp_stop - 32);
        bytes_written += fwrite(revbuf, 1, chunksize, stdout);
        inp = inp_stop - 32;
    } while ( inp > num );

    return bytes_written;
    // caller can putchar('\n') if it wants
}


// wrapper that prints name and newline
void LongNumPrintName(uint8_t *num, size_t N, const char *name)
{
    printf("%s:", name);
    //LongNumPrint_scalar(num, N);
    LongNumPrint(num, N);
    putchar('\n');
}

// main() included on Godbolt link that runs successfully

This compiles and runs (on Godbolt) with gcc -O3 -march=haswell and produces identical output to your scalar loop for the N=32 that main passes. (I used rand() instead of MyRandom(), so we could test with the same seed and get the same numbers, using your init function.)
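That main isn't reproduced here, but a caller could look roughly like this (a sketch, not the exact Godbolt main; it assumes LongNumInit from the question and LongNumPrintName from above):

#include <stdint.h>
#include <stdlib.h>

// Sketch of a caller.  LongNumInit is the scalar init from the question,
// LongNumPrintName is the wrapper defined above.
int main(void)
{
    size_t N = 10000;
    size_t padded_N = (N + 31) & -32;          // pad so whole-vector loads/stores are safe
    uint8_t *V1 = aligned_alloc(32, padded_N); // C11: 32-byte aligned
    if (!V1) return 1;

    LongNumInit(V1, N);              // scalar init from the question (or an AVX2 version)
    LongNumPrintName(V1, 32, "V1");  // print the last 32 digits, like the question's main
    free(V1);
    return 0;
}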

Untested for larger N, but the general idea of chunksize = min(ptrdiff, 8k) and using that to loop downwards from the end of num[] should be solid.

We could load (not just store) aligned vectors if we converted the first N%32 bytes and passed that to fwrite before starting the main loop. But that probably either leads to an extra write() system call, or to clunky copying inside stdio. (Unless there was already buffered text not printed yet, like Name:, in which case we already have that penalty.)

Note that it's technically C UB to decrement inp past the start of num. So inp -= 32 instead of inp = inp_stop-32 would have that UB for the iteration that leaves the outer loop. I actually avoid that in this version, but it generally works anyway because I think GCC assumes a flat memory model and de facto defines the behaviour of pointer compares enough. And normal OSes reserve the zero page so num definitely can't be within 32 bytes of the start of physical memory (so inp can't wrap to a high address.) This paragraph is mostly left-over from the first totally untested attempt that I thought was decrementing the pointer farther in the inner loop than it actually was.

  • First of all, thank you very much for your response. I've tried to apply what you've explained to me; however, if I haven't misunderstood, the vector has 4 64-bit elements, so shouldn't the LongNumPrint function take uint64_t instead of uint8_t? It's possible that I don't quite understand how to make the function call; could you give me an example of a main that uses these functions? – Desiree Pasion Rodriguez Sanch Apr 14 '20 at 16:17
  • @DesireePasionRodriguezSanch: A 32-byte vector can be 32x 8-bit, 16x 16-bit, 8x 32-bit, or 4x 64-bit elements, depending on which instructions (or intrinsics) you use. For example, `_mm256_add_epi64` does 4x 64-bit addition, `_mm256_add_epi8` does 32x 8-bit addition. It's only a matter of whether carry propagates across byte/word/dword boundaries or not. Or for a shuffle like `_mm256_shuffle_epi8`, it's treating the vector as byte elements. – Peter Cordes Apr 14 '20 at 16:39
  • @Desiree - My `LongNumPrint( uint8_t *num, size_t N)` is a drop-in replacement for yours (except for the last arg). The vectorization is wrapped up in `void ASCIIrev32B(void *dst, const void *src)`. Your `main` can still just call it. e.g. https://godbolt.org/z/RP87XP shows it working on Godbolt with your `main`. – Peter Cordes Apr 14 '20 at 16:59
  • When I try to execute this it fails with "illegal instruction". What could this be due to? – Desiree Pasion Rodriguez Sanch Apr 14 '20 at 23:13
  • @DesireePasionRodriguezSanch: Then your CPU doesn't support AVX2. Compiling with `-march=haswell` tells the compiler it can assume the CPU running the program has AVX2, BMI2, etc. You could make a 128-bit version of this that only requires SSSE3 or AVX1; byte-reverse is simpler with 128-bit vectors. – Peter Cordes Apr 14 '20 at 23:51
  • When I compile with `-march=haswell` I get the same error again. – Desiree Pasion Rodriguez Sanch Apr 15 '20 at 00:05
  • @DesireePasionRodriguezSanch: Sorry, I didn't explain that clearly. This code uses AVX2 instructions without checking if the CPU supports them. Yours doesn't. (GCC won't compile AVX2 intrinsics unless you tell it to make code for a CPU that *does* support AVX2, e.g. by using `-march=haswell` or `-mavx2`. If you compile with `-O3 -march=native` to enable everything your CPU supports, but no more, then this code won't compile. Or possibly some other ISA extension was a problem, but I think AMD CPUs that support AVX2 support all the extensions that Haswell has that GCC would use.) – Peter Cordes Apr 15 '20 at 00:17
  • @DesireePasionRodriguezSanch: TL:DR: use `gcc -O3 -march=native`. If you get a `target specific option mismatch` compile error ([like this](//stackoverflow.com/questions/43128698/inlining-failed-in-call-to-always-inline-mm-mullo-epi32-target-specific-opti)), your CPU definitely doesn't support AVX2. Making a `__m128i` version of this code should be straightforward, though. What CPU do you have? – Peter Cordes Apr 15 '20 at 00:18
  • Thank you, I finally fixed the problem. – Desiree Pasion Rodriguez Sanch Apr 15 '20 at 16:56
  • @DesireePasionRodriguezSanch: I'm curious what worked for your system. Did you have to convert the code to use only `__m128i`, or did `-march=native` work after all where `-march=haswell` didn't? Or did you just run it on a new-enough CPU? – Peter Cordes Apr 15 '20 at 16:57