8

I need to write optimized NEON code for a project and I'm perfectly happy to write assembly language, but for portability/maintainability I'm using NEON instrinsics. This code needs to be as fast as possible, so I'm using my experience in ARM optimization to properly interleave instructions and avoid pipe stalls. No matter what I do, GCC works against me and creates slower code full of stalls.

Does anyone know how to have GCC get out of the way and just translate my intrinsics into code?

Here's an example: I have a simple loop which negates and copies floating point values. It works with 4 sets of 4 at a time to allow some time for the memory to load and instructions to execute. There are plenty of registers left over, so it's got no reason to mangle things so badly.

float32x4_t f32_0, f32_1, f32_2, f32_3;
int x;
for (x=0; x<n-15; x+=16)
{
   f32_0 = vld1q_f32(&s[x]);
   f32_1 = vld1q_f32(&s[x+4]);
   f32_2 = vld1q_f32(&s[x+8]);
   f32_3 = vld1q_f32(&s[x+12]);
   __builtin_prefetch(&s[x+64]);
   f32_0 = vnegq_f32(f32_0);
   f32_1 = vnegq_f32(f32_1);
   f32_2 = vnegq_f32(f32_2);
   f32_3 = vnegq_f32(f32_3);
   vst1q_f32(&d[x], f32_0);
   vst1q_f32(&d[x+4], f32_1);
   vst1q_f32(&d[x+8], f32_2);
   vst1q_f32(&d[x+12], f32_3);
} 

This is the code it generates:

vld1.32 {d18-d19}, [r5]
vneg.f32  q9,q9        <-- GCC intentionally causes stalls
add r7,r7,#16
vld1.32 {d22-d23}, [r8]
add r5,r1,r4
vneg.f32 q11,q11   <-- all of my interleaving is undone (why?!!?)
add r8,r3,#256
vld1.32 {d20-d21}, [r10]
add r4,r1,r3
vneg.f32 q10,q10
add lr,r1,lr
vld1.32 {d16-d17}, [r9]
add ip,r1,ip
vneg.f32 q8,q8

More info:

  • GCC 4.9.2 for Raspbian
  • compiler flags: -c -fPIE -march=armv7-a -Wall -O3 -mfloat-abi=hard -mfpu=neon

When I write the loop in ASM code patterned exactly as my intrinsics (without even making use of extra src/dest registers to gain some free ARM cycles), it's still faster than GCC's code.

Update: I appreciate James' answer, but in the scheme of things, it doesn't really help with the problem. The simplest of my functions perform a little better with the cortex-a7 option, but the majority saw no change. The sad truth is that GCC's optimization of intrinsics is not great. When I worked with the Microsoft ARM compiler a few years ago, it consistently created well crafted output for NEON intrinsics while GCC consistently stumbled. With GCC 4.9.x, nothing has changed. I certainly appreciate the FOSS nature of GCC and the greater GNU effort, but there is no denying that it doesn't do as good a job as Intel, Microsoft or even ARM's compilers.

BitBank
  • 8,004
  • 3
  • 23
  • 42
  • I have no idea. Consider reporting a compiler bug or writing assembly directly. Generally, intrinsics are handled like ordinary builtin functions. There is no guarantee that the compiler emits instructions in the same order in which wrote intrinsics. – fuz Jan 20 '16 at 13:55
  • 4
    Useful information would be the version of GCC you are using along with the CPU you are tuning for ( -mcpu=??? or -mtune=??? ). In general the answer is, because GCC believes that the interleaving it is using results in better processor utilization than the interleaving you asked for. Another question would be, how are you detecting that there are stalls? – James Greenhalgh Jan 20 '16 at 13:56
  • @James see above for compiler info. I've written the asm code in my style and it's faster than GCC's code. The target CPU is a Cortex-A7. – BitBank Jan 20 '16 at 14:05
  • The Raspberry Pi supports Neon?! Colour me surprised. – fuz Jan 20 '16 at 14:07
  • how about trying to disable GCCs optimizaiton first? – coredump Jan 20 '16 at 14:09
  • 1
    @coredump - disabling optimization makes it produce even slower code that still messes up the intrinsics. – BitBank Jan 20 '16 at 14:12
  • 1
    If you want to write assembly, write assembly. It's more readable anyway. – Stephen Canon Jan 20 '16 at 14:17
  • @StephenCanon - I would be happy to write ASM code, but as I said, for portability (32/64-bit ARM) and maintainability by future developers who don't know assembly language, intrinsics were chosen. – BitBank Jan 20 '16 at 14:19
  • 3
    Try an explicit `-mcpu=cortex-a7` to change the instruction scheduling model the compiler is using. If you want a more extreme flag to try you can ask GCC not to try any instruction scheduling at all with `-fno-schedule-insns -fno-schedule-insns2` . – James Greenhalgh Jan 20 '16 at 14:20
  • Well, `-mtune=cortex-a7` (Linaro GCC 5.1) makes the output look much like the input... – Notlikethat Jan 20 '16 at 14:28
  • 1
    @JamesGreenhalgh Thanks for the suggestions. Setting -mcpu=cortex-a7 made it generate better code that interleaved instructions much better and sped things up. The -fno-schedule options both made the output slower. – BitBank Jan 20 '16 at 14:28
  • gcc just follows the C standard which allows to optimise code strict in accordance to the abstract machine. Before complaing the compiler does his job, you should first inform yourself what it is allowed to do. If you want full control, use Assembler! And what makes you think neon extensions are more portable than other instructions? Note that as you are heading for speed, ARM64 has quite a different pipeline and inner structure anyway, so all optimisation is CPU-specific anyway (even between different ARMv7A cores thre are some differences). – too honest for this site Jan 20 '16 at 14:34
  • @JamesGreenhalgh - FYI I'm also working with a Dragonboard 410c and ARM64 Linux. I've seen some compiler behavior that led me to believe there may be bugs in the ARM64 version of GCC. If you wouldn't mind, contact me to find out more (bitbank@pobox.com). – BitBank Jan 20 '16 at 14:42
  • @BitBank If it is possible for you, bad (or unexpected) GCC behaviour is best reported through the general GCC development mailing lists gcc@gcc.gnu.org or by raising a bug on the GCC Bugzilla https://gcc.gnu.org/bugzilla/ . It gets more eyes on the question than just mine, and the archived responses are useful to point to in future :-). – James Greenhalgh Jan 20 '16 at 14:44
  • I can't catch up with all the reading but, if needed can't you put `asm volatile("");` between each intrinsic? that would stop moving things around. – auselen Jan 20 '16 at 14:45
  • @Olaf - ARM wanted backwards compatibility for their intrinsics, so even though the instruction mnemonics changed for Aarch64, the original set of intrinsics compile properly on 32-bit and 64-bit ARM compilers. That's what I meant by 32/64-bit compatibility. – BitBank Jan 20 '16 at 14:56
  • You missed the actual point. – too honest for this site Jan 20 '16 at 15:06
  • @BitBank have you found solution to your problem? New GCC version introduced better code? – killdaclick Dec 31 '19 at 15:19
  • @killdaclick - the newer versions of GCC on ARM have gotten better, but I personally mostly use LLVM and it is always ahead of GCC in terms of compiler quality. It hasn't been an issue for the projects I've worked on recently. – BitBank Jan 01 '20 at 16:07

1 Answers1

10

Broadly, the class of optimisation you are seeing here is known as "instruction scheduling". GCC uses instruction scheduling to try to build a better schedule for the instructions in each basic block of your program. Here, a "schedule" refers to any correct ordering of the instructions in a block, and a "better" schedule might be one which avoids stalls and other pipeline hazards, or one which reduces the live range of variables (resulting in better register allocation), or some other ordering goal on the instructions.

To avoid stalls due to hazards, GCC uses a model of the pipeline of the processor you are targeting (see here for details of the specification language used for these, and here for an example pipeline model). This model gives some indication to the GCC scheduling algorithms of the functional units of a processor, and the execution characteristics of instructions on those functional units. GCC can then schedule instructions to minimise structural hazards due to multiple instructions requiring the same processor resources.

Without a -mcpu or -mtune option (to the compiler), or a --with-cpu, or --with-tune option (to the configuration of the compiler), GCC for ARM or AArch64 will try to use a representative model for the architecture revision you are targeting. In this case, -march=armv7-a, causes the compiler to try to schedule instructions as if -mtune=cortex-a8 were passed on the command line.

So what you are seeing in your output is GCC's attempt at transforming your input in to a schedule it expects to execute well when running on a Cortex-A8, and to run reasonably well on processors which implement the ARMv7-A architecture.

To improve on this you can try:

  • Explicitly setting the processor you are targeting (-mcpu=cortex-a7)
  • Disabling instruction scheduling entirely (`-fno-schedule-insns -fno-schedule-insns2)

Note that disabling instruction scheduling entirely may well cause you problems elsewhere, as GCC will no longer be trying to reduce pipeline hazards across your code.

Edit With regards to your edit, performance bugs in GCC can be reported in the GCC Bugzilla (see https://gcc.gnu.org/bugs/ ) just as correctness bugs can be. Naturally with all optimisations there is some degree of heuristic involved and a compiler may not be able to beat a seasoned assembly programmer, but if the compiler is doing something especially egregious it can be worth highlighting.

James Greenhalgh
  • 2,274
  • 16
  • 14