
I've been trying to optimize some simple code, and I'm experimenting with two kinds of optimizations: loop unrolling and memory aliasing.
My original code:

int paint(char *dst, unsigned n, char *src, char bias)
{
    unsigned i;
    for (i=0;i<n;i++) {
        *dst++ = bias + *src++;
    }
    return 0;
}

My optimized code after loop unrolling:

int paint(char *dst, unsigned n, char *src, char bias)
{
    unsigned i;
    for (i=0;i<n;i+=2) {
        *dst++ = bias + *src++;
        *dst++ = bias + *src++;
    }
    return 0;
}

After this, how can I further optimize the code with memory aliasing? And are there other good optimizations for this code? (Like casting the pointers to long pointers to copy more quickly.)

Xavi
  • re: doing a `long` at a time *safely*, see [Why does glibc's strlen need to be so complicated to run quickly?](https://stackoverflow.com/a/57676035) for GNU C `typedef` with an `__attribute__((aligned(1),may_alias))` type. (That could make SWAR possible, if you know / assume the `+` won't carry-out into the next element. But even better to let the compiler auto-vectorize with proper SIMD for most modern targets that have HW support for that.) – Peter Cordes Mar 06 '21 at 02:26
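
(A minimal sketch of the GNU C typedef that comment describes, not taken from the comment itself; the names word_ua, load_word, and store_word are mine, and it assumes GCC/Clang. It only covers the safe unaligned, aliasing-friendly word access; a full word-at-a-time paint() would still need SWAR care so per-byte adds don't carry into neighbouring bytes.)

/* An unaligned, may_alias word type: loads/stores through it are safe at
   any address and don't violate strict-aliasing rules. */
typedef unsigned long __attribute__((aligned(1), may_alias)) word_ua;

static unsigned long load_word(const char *p)    { return *(const word_ua *)p; }
static void store_word(char *p, unsigned long w) { *(word_ua *)p = w; }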

2 Answers


Optimization in C is easier than this.

cc -Wall -W -pedantic -O3 -march=native -flto source.c

That will unroll any loops that need to be unrolled. Doing your own unrolling, using Duff's Device, and similar tricks are outdated and pretty useless.

As for aliasing, your function takes two char* parameters. If they are guaranteed never to point into the same array, you can use the restrict keyword. That lets the optimizer assume more about the code and use vectorized instructions.
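
For example (a minimal sketch, not taken from the answer, assuming the caller really does pass non-overlapping buffers; restrict needs C99 or later):

int paint(char * restrict dst, unsigned n, char * restrict src, char bias)
{
    unsigned i;
    for (i = 0; i < n; i++) {
        /* restrict promises dst[] and src[] never overlap, so the
           compiler may reorder and vectorize the loads and stores */
        *dst++ = bias + *src++;
    }
    return 0;
}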

Check out the assembly produced here: https://godbolt.org/z/xMfebr or https://godbolt.org/z/j1xMYz

Can you manage to do all of that by hand? Probably not.

Zan Lynx
  • Unfortunately no, `gcc -O3` won't unroll unless you also compile with profile-guided optimization (`-fprofile-generate` / run it / `-fprofile-use`). But it will auto-vectorize with SSE or AVX2 or whatever. Note the inner loop in your Godbolt link (https://godbolt.org/z/xMfebr) is `.L4:`, just one `paddb` per iteration, in a loop with 5 uops so it won't even run at 1 cycle per iteration on CPUs before Ice Lake / Zen. (And Ice Lake could be doing 2x 32-byte vector stores per iteration, so this loop only runs half speed vs. what's possible with just SSE2 on Ice Lake). – Peter Cordes Mar 06 '21 at 02:16
  • OTOH, clang *will* unroll small loops by default, unlike GCC. (GCC will unwisely fully-unroll the scalar cleanup, ironically spending most of the code-size of the function optimizing startup / cleanup, not the main loop.) But yes, unrolling by 2 in the scalar C source is at best useless in a loop that can vectorize, at worst can defeat auto-vectorization like it does here for clang. – Peter Cordes Mar 06 '21 at 02:20
  • @PeterCordes Yeah the second godbolt link is with clang and set for AVX-512. – Zan Lynx Mar 06 '21 at 17:53

Are you only concerned about performance? What about correctness?

Judging by the name of your function paint and the variable bias (and using my crystal ball), I guess you need to add with saturation (in case of overflow). This can be done by using intrinsics for paddusb (https://www.felixcloutier.com/x86/paddusb:paddusw): https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=774,433,4179,4179&cats=Arithmetic&text=paddusb
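
(A minimal sketch of that idea, not from the answer itself, using the SSE2 intrinsic _mm_adds_epu8, which compiles to paddusb. It assumes unsigned bytes and unsigned saturation are what's wanted; the name paint_saturating is mine.)

#include <emmintrin.h>   /* SSE2: _mm_adds_epu8, _mm_loadu_si128, _mm_storeu_si128 */
#include <stddef.h>

void paint_saturating(unsigned char *dst, size_t n,
                      const unsigned char *src, unsigned char bias)
{
    __m128i vbias = _mm_set1_epi8((char)bias);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {              /* 16 bytes per paddusb */
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(v, vbias));
    }
    for (; i < n; i++) {                        /* scalar tail, saturate by hand */
        unsigned sum = src[i] + bias;
        dst[i] = (unsigned char)(sum > 255 ? 255 : sum);
    }
}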

Vlad Feinstein