segmentation fault(core dumped) error while using inline assembly

Question

I'm using inline assembly in GCC. I want to rotate a variable's contents 2 bits to the left (I move the variable into the rax register and then rotate it twice). I wrote the code below, but it fails with a segmentation fault (core dumped). I would be grateful if you could help me.

uint64_t X = 13835058055282163712U;
 asm volatile(
            "movq %0 , %%rax\n"
            "rol %%rax\n"
            "rol %%rax\n"
            :"=r"(X)
            :"r"(X)
         );
printf("%" PRIu64 "\n" , X);

You destroy `%rax` without telling the compiler, and you're falsely assuming that the compiler will pick the same register for separate input and output constraints. Look at compiler output for the function to see what the compiler generated around your code. Also note that `asm("rol $2, %0" : "+r"(X))` would do it in one instruction. More importantly, you can do this more efficiently with pure C that optimizes to one rotate instruction, and doesn't block constant-propagation optimization: [Best practices for circular shift (rotate) operations in C++](//stackoverflow.com/q/776508) — Peter Cordes, Feb 15 '20 at 09:36
Not exactly a duplicate of [Why can't local variable be used in GNU C basic inline asm statements?](//stackoverflow.com/q/60227941) but has similar mistakes as the extended asm there. The answers there link to the docs and, more importantly, explain the design philosophy of GNU C inline asm. — Peter Cordes, Feb 15 '20 at 09:41

Chris Hall · Accepted Answer · 2020-02-16T00:35:51.080

The key to understanding inline asm is to understand that each asm statement has two parts:

1. The text of the actual assembler code, in which the compiler will make textual substitutions, but which it does not understand. This is the AssemblerTemplate in the documentation (everything up to the first : in the __asm__()).

2. A description of what the assembler code does, in terms that the compiler does understand. This is the : OutputOperands : InputOperands : Clobbers in the documentation.

This must tell the compiler how the assembler fits in with all the code which the compiler is generating around it. The code generation is busy allocating registers to hold values, deciding what order to do things in, moving things out of loops, eliminating unused fragments of code, discarding values it no longer needs, and so on.

The actual assembler is a black box which takes the inputs described here, produces the outputs described and as a side effect may 'clobber' some registers and/or memory. This must be a complete description of what the assembler does... otherwise the compiler-generated asm around your template will clash with it and rely on false assumptions.

Armed with this information the compiler can decide what registers the assembler can use, and you should let it do that.
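To make the two parts concrete, here is a minimal sketch of the question's fragment with a description that matches what the template actually does. It assumes x86-64 with GCC or Clang and the default AT&T syntax. The function name rotl2_via_rax is made up for this illustration, and the plain C fallback is there only so the snippet still compiles on other targets.

```c
#include <stdint.h>

/* Hypothetical helper: rotate left by 2, deliberately going through %rax
   as the question did, but telling the compiler about it. */
static inline uint64_t rotl2_via_rax(uint64_t in)
{
    uint64_t out;
#if defined(__x86_64__)
    __asm__("movq %1, %%rax\n\t"   /* read the input operand               */
            "rolq $2, %%rax\n\t"   /* rotate the scratch register          */
            "movq %%rax, %0"       /* write the output operand last        */
            : "=r"(out)            /* OutputOperands: %0, write-only       */
            : "r"(in)              /* InputOperands:  %1, read-only        */
            : "rax", "cc");        /* Clobbers: scratch register and flags */
#else
    out = (in << 2) | (in >> 62);  /* assumption: fallback for other targets */
#endif
    return out;
}
```

Because "rax" is listed as a clobber, the compiler will not pick %rax for %0 or %1 and will not keep a live value in it across the statement. This is still more work than necessary; it only shows what an honest description of that particular template looks like.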

So, your fragment:

 asm volatile(
            "movq %0 , %%rax\n"
            "rol %%rax\n"
            "rol %%rax\n"
            :"=r"(X)
            :"r"(X)
         );

has a few "issues":

You may have chosen %rax for the result on the basis that the asm() is like a function, and may be expected to return a result in %rax -- but that isn't so.
You went ahead and used %rax, which the compiler may (well) already have allocated to something else... so you are, effectively, 'clobbering' %rax but you have failed to tell the compiler about it!
You specified "=r"(X) (OutputOperand), which tells the compiler to expect an output in some register, and that output will be the new value of the variable X. The %0 in the AssemblerTemplate will be replaced by the register selected for the output. Sadly, your assembly treats %0 as the input :-( And the output is, in fact, in %rax -- which, as above, the compiler is unaware of.
You also specified "r"(X) (InputOperand), which tells the compiler to arrange for the current value of the variable X to be placed in some register for the assembler to use. This would be %1 in the AssemblerTemplate. Sadly, your assembly does not use this input.

Even though the output and input operands both refer to X, the compiler is not required to make %0 the same register as %1. (This allows it to use the asm block as a non-destructive operation that leaves the original value of an input unmodified. If this isn't how your template works, don't write it that way.)
Generally you don't need the volatile when all the inputs and outputs are properly described by constraints. One of the fine things the compiler will do is discard an asm() if (all of) the output(s) are not used... volatile tells the compiler not to do that (and tells it a number of other things... see the manual).

Apart from that, things were wonderful. The following is safe, and avoids a mov instruction:

 asm("rol %0\n"
     "rol %0\n"   : "+r"(X));

where "+r"(X) says that there is one combined input and output register required, taking the old value of X and returning a new one.
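Since ROL accepts an immediate count, the two one-bit rotates can also be written as a single instruction. The following is a small sketch along those lines, assuming x86-64 with GCC or Clang in AT&T syntax. The name rotl2 is invented for the example, and the C fallback is only there so it compiles on other targets.

```c
#include <stdint.h>

/* Hypothetical helper: rotate x left by 2 bits in place. */
static inline uint64_t rotl2(uint64_t x)
{
#if defined(__x86_64__)
    /* "+r": one register operand that is both read and written. */
    __asm__("rolq $2, %0" : "+r"(x) : : "cc");
#else
    x = (x << 2) | (x >> 62);   /* assumption: fallback for other targets */
#endif
    return x;
}
```

With the value from the question, 13835058055282163712 (0xC000000000000000), rotl2 returns 3: the top two bits wrap around to the bottom two.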

Now, if you don't want to replace X, then assuming Y is to be the result, you could write:

 asm("mov %1, %0\n"
     "rol %0\n"
     "rol %0\n"   : "=r"(Y) : "r"(X));

But it's better to leave it up to the compiler to decide whether it needs to mov or whether it can just let an input be destroyed.

There are a couple of rules about InputOperands which are worth mentioning:

The assembler must not overwrite any of the InputOperands -- the compiler is tracking which values it has in which registers, and is expecting InputOperands to be preserved.
The compiler expects all InputOperands to be read before any OutputOperand is written. This is important when the compiler knows that a given InputOperand is not used again after the asm(), and it can therefore allocate the InputOperand's register to an OutputOperand. There is a thing called earlyclobber ("=&r"(foo)) to deal with this little wrinkle.
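A sketch of where earlyclobber matters: the template below writes %0 in its first instruction, before %2 has been read, so the output must not share a register with that input. It assumes x86-64 with GCC or Clang. The function add_early is a made-up example and not something from the question.

```c
#include <stdint.h>

/* Hypothetical helper: sum = a + b, computed in a two-instruction template. */
static inline uint64_t add_early(uint64_t a, uint64_t b)
{
    uint64_t sum;
#if defined(__x86_64__)
    __asm__("movq %1, %0\n\t"   /* %0 is written here...                    */
            "addq %2, %0"       /* ...while %2 has not been read yet        */
            : "=&r"(sum)        /* '&' keeps %0 out of every input register */
            : "r"(a), "r"(b)
            : "cc");
#else
    sum = a + b;                /* assumption: fallback for other targets */
#endif
    return sum;
}
```

Without the '&', the compiler would be free to give sum the same register as b, and the first mov would then destroy b before the add reads it.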

In the mov-then-rol example that produces Y, if you don't in fact use X again, the compiler could allocate %0 and %1 to the same register! But the (redundant) mov will still be assembled, because the compiler really doesn't understand the AssemblerTemplate. So you are in general better off shuffling values around in the C, not the asm(). See https://gcc.gnu.org/wiki/DontUseInlineAsm and Best practices for circular shift (rotate) operations in C++ (https://stackoverflow.com/q/776508).
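For comparison, a rotate needs no inline asm at all. The sketch below uses the usual masked-shift idiom, which avoids undefined behaviour for a count of 0 or 64 and which GCC and Clang typically recognise and compile to a single rotate instruction. The name rotl64 is chosen for this example.

```c
#include <stdint.h>

/* Rotate x left by n bits; n is reduced modulo 64. */
static inline uint64_t rotl64(uint64_t x, unsigned n)
{
    n &= 63;                              /* keep the count in 0..63      */
    return (x << n) | (x >> (-n & 63));   /* (-n & 63) is 0 when n is 0,  */
                                          /* so no shift by 64 can occur  */
}
```

Unlike an asm statement, the compiler can see through this, so rotl64(X, 2) with a constant X can be folded at compile time.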

So here are four variations on a theme, and the code generated (gcc -O2):

// (1) uses both X and Y in the printf() -- does mov %1, %0 in asm()
void Never_Inline footle(void)               Dump of assembler code for function footle:
{                                              mov    $0x492782,%edi   # address of format string
  unsigned long  X, Y ;                        xor    %eax,%eax
                                               mov    $0x63,%esi       # X = 99
  X = 99 ;                                     rol    %rsi            # 1st asm
  __asm__("\t rol  %0\n"                       rol    %rsi
          "\t rol  %0\n" : "+r"(X)             mov    %rsi,%rdx       # 2nd asm, compiler using it as a copy-and-rotate
      ) ;                                      rol    %rdx
                                               rol    %rdx
  __asm__("\t mov  %1, %0\n"                   jmpq   0x4010a0 <printf@plt>  # tailcall printf
          "\t rol  %0\n"
          "\t rol  %0\n" : "=r"(Y) : "r"(X)
      ) ;

  printf("%lx %lx\n", X, Y) ;
}

// (2) uses both X and Y in the printf() -- does Y = X in 'C'
void Never_Inline footle(void)               Dump of assembler code for function footle:
{                                              mov    $0x492782,%edi
  unsigned long  X, Y ;                        xor    %eax,%eax
                                               mov    $0x63,%esi
  X = 99 ;                                     rol    %rsi       # 1st asm
  __asm__("\t rol  %0\n"                       rol    %rsi
          "\t rol  %0\n" : "+r"(X)             mov    %rsi,%rdx  # compiler-generated mov
      ) ;                                      rol    %rdx       # 2nd asm
                                               rol    %rdx
  Y = X ;                                      jmpq   0x4010a0 <printf@plt>
  __asm__("\t rol  %0\n"
          "\t rol  %0\n" : "+r"(Y)
      ) ;

  printf("%lx %lx\n", X, Y) ;
}

// (3) uses only Y in the printf() -- does mov %1, %0 in asm()
void Never_Inline footle(void)               Dump of assembler code for function footle:
{                                              mov    $0x492782,%edi
  unsigned long  X, Y ;                        xor    %eax,%eax
                                               mov    $0x63,%esi
  X = 99 ;                                     rol    %rsi
  __asm__("\t rol  %0\n"                       rol    %rsi
          "\t rol  %0\n" : "+r"(X)             mov    %rsi,%rsi   # redundant instruction because of mov in the asm template
      ) ;                                      rol    %rsi
                                               rol    %rsi
  __asm__("\t mov  %1, %0\n"                   jmpq   0x4010a0 <printf@plt>
          "\t rol  %0\n"
          "\t rol  %0\n" : "=r"(Y) : "r"(X)
      ) ;

  printf("%lx\n", Y) ;
}

// (4) uses only Y in the printf() -- does Y = X in 'C'
void Never_Inline footle(void)               Dump of assembler code for function footle:
{                                              mov    $0x492782,%edi
  unsigned long  X, Y ;                        xor    %eax,%eax
                                               mov    $0x63,%esi
  X = 99 ;                                     rol    %rsi
  __asm__("\t rol  %0\n"                       rol    %rsi
          "\t rol  %0\n" : "+r"(X)             rol    %rsi    # no wasted mov, compiler picked %0=%1=%rsi
      ) ;                                      rol    %rsi
                                               jmpq   0x4010a0 <printf@plt>
  Y = X ;
  __asm__("\t rol  %0\n"
          "\t rol  %0\n" : "+r"(Y)
      ) ;

  printf("%lx\n", Y) ;
}

which, hopefully, demonstrates the compiler busily allocating values to registers, tracking which values it needs to hold on to, minimizing register/register moves, and generally being clever.

So the trick is to work with the compiler, understanding that the :OutputOperands:InputOperands:Clobbers is where you are describing what the assembler is doing.

*otherwise the compiler will become confused !* The compiler isn't "confused", it's blissfully unaware when the constraints lie to it about what the template does (including by omission). I like to describe it as "stepping on the compiler's toes" in the delicate dance between your asm and compiler-generated asm. — Peter Cordes, Feb 15 '20 at 19:17
*Also, sadly,* That's not a "sad" thing; it lets you tell the compiler when it can use the asm statement as a copy-and-modify. (So for example if X happened to be sharing a register with another value beforehand, like after `long X = y;` the compiler doesn't need to emit a `mov` outside your asm statement even if `y` is needed later.) It's a wrong assumption made by this code, but I wouldn't describe it as sad. — Peter Cordes, Feb 15 '20 at 19:22
BTW, this answer would be even better if you pointed out that CPUs since 186 (including all x86-64) allow an immediate operand for ROL, so there is zero point in using two 1-bit rol instructions. (There is a tiny difference in FLAGS setting between implicit-1 ROL vs. immediate ROL (https://www.felixcloutier.com/x86/rcl:rcr:rol:ror), but there's no `setcc` or GCC6 condition-code output constraint to observe that.) — Peter Cordes, Feb 15 '20 at 19:25
RORX isn't faster than `ROR r64, imm8`; both are a single-uop for port 0 or 6 with 1 cycle latency on Haswell and later. (https://www.uops.info/table.html). The only advantage is being non-destructive so you (or the compiler) can avoid a `mov`. (Well, and not modifying FLAGS could let the compiler arrange code around a cmov or setcc). `ror` by *implicit* 1 (the form being used here) is 2 uops on Sandybridge-family, though, because it has to set OF while leaving other flags in the SPAZO renaming group unaffected. But of course assemblers "optimize" `rol $1, %rax` to the implicit-1 form :/ — Peter Cordes, Feb 15 '20 at 20:16
BTW, we can tell that Intel's docs are subtly wrong: a masked count of 1 for the immediate form still won't update OF, unless it decodes differently in that case. I assume it will for the CL form, which is 3 uops. uops.info's tests of immediate-count rotates are all with `imm8` = 0 or 2, e.g. https://www.uops.info/html-tp/SKL/ROR_R32_I8-Measurements.html. And BTW, Zen runs all rotates as a single uop (except RORX with a memory source), so yes, only Intel is ever slow with rotates. — Peter Cordes, Feb 15 '20 at 20:22
@PeterCordes: you are right, of course... it's SHLX which is faster than the older SHL %cl. But I intended this to help with how to write inline asm in particular, not x86 or x86_64 assembler in general. — Chris Hall, Feb 16 '20 at 00:39
Ah, yes, SHLX is a big win for variable shifts. (Fortunately the extra 2 uops aren't on the count -> output critical path, only for FLAGS). Still, interesting that `rol $2, %rax` is actually 1/4 the uops of 2x `rol %rax`, not just half. rorx would be useful for that case if you don't want to fight with your assembler to get it to use the immediate (instead of implicit) encoding for 1. — Peter Cordes, Feb 16 '20 at 00:43
