
I'm investigating potential speedups for the rotate discussed in "Constant time rotate that does not violate the standards".

A rotate on x86/x64 takes the following operands. For simplicity, I'm going to discuss rotating a byte (so we don't get tangled in immediate-8 versus 16, 32 or 64):

  • The "value" can be in a register or in memory
  • The "count" can be in a register or an immediate

The processor expects the count to be in CL when using a register. Before performing the rotate, the processor masks the count, keeping only its lower 5 bits (6 bits for 64-bit operands).
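To make the masking concrete, here is a minimal portable sketch that models only the result of an 8-bit rotate left, not the flag effects. The name `rol8_model` is hypothetical and `std::uint8_t` stands in for `byte`; this is an illustration of the masking rule, not how the hardware is implemented.

```cpp
#include <cstdint>

// Hypothetical helper: models the *result* of an 8-bit ROL (flags are ignored).
inline std::uint8_t rol8_model(std::uint8_t x, unsigned int count)
{
    count &= 31u;   // only the low 5 bits of the count are kept
    count &= 7u;    // for an 8-bit operand, only count mod 8 affects the result
    if (count == 0)
        return x;   // a rotate by a multiple of 8 leaves the byte unchanged
    // x is promoted to int, so both shifts are well defined for count in 1..7
    return static_cast<std::uint8_t>((x << count) | (x >> (8u - count)));
}
```

Under this model a count of 9 behaves like a count of 1, and a count of 32 leaves the value unchanged.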

Below, the value is x, and the count is y.

template<> inline byte rotLeft<byte>(byte x, unsigned int y)
{
    __asm__ __volatile__("rolb %b1, %0" : "=mq" (x) : "cI" (y), "0" (x));
    return x;
}

Since x is both read and written, I think I should be using a + somewhere. But I can't get the assembler to accept it.

My question is, are the constraints represented correctly?


EDIT: based on Jester's feedback, the function was changed to:

template<> inline byte rotLeft<byte>(byte x, unsigned int y)
{
    __asm__ __volatile__("rolb %b1, %0" : "+mq" (x) : "cI" (y));
    return x;
}

jww
  • 2
    Since you specified `x` separately as input and output, you don't need the `+`. – Jester Jul 16 '15 at 22:49
  • Thanks Jester. I don't need a ***`+`*** - got it. Is the read portion of ***`x`*** OK? Or should it be ***`"mq0" (x)`***. Or should I omit it in favor of ***`"+mq" (x)`***? (The inline assembler syntax still confounds me at times). – jww Jul 16 '15 at 22:55
  • 1
    `"0"` is ok, means same place as output `0`. The `"+mq" (x)` makes more sense here I think. The splitting is useful if you can put the output into a separate place (either constraint or variable) but you don't do that here. – Jester Jul 16 '15 at 23:03
  • 1
    Sick hint : if you're _really_ **really** optimizing for size, you could make a template specialization for cases when the shift value is constant 1, and use the `ROL r/m8, 1` (`0xD0`) instruction. Saves one byte! – Daniel Kamil Kozar Jul 16 '15 at 23:08
  • 3
    @DanielKamilKozar no need, the `I` constraint already allows for a constant and the assembler will optimize away the `,1`. – Jester Jul 16 '15 at 23:17
  • 1
    I would not use the 'volatile' keyword here. If this template is invoked from someplace where the output is not used (think: #if), the volatile will force the rotate to be done anyway, even if the result is then discarded. – David Wohlferd Jul 17 '15 at 01:40
  • There is an idiom that avoids undefined behaviour, and that gets gcc/clang/icc to generate a single `rol` insn. See my answer at http://stackoverflow.com/a/31488147/224132. (Linking here for the benefit of anyone who finds this question looking for a rotate, rather than looking for gcc asm constraints.) – Peter Cordes Jul 18 '15 at 06:10

1 Answer


You should use the correctly sized type for operands rather than trying to force the register to the correct size using an operand modifier. In this case, this will also truncate the immediate operand to the correct size if it's too big. Also, as David Wohlferd said, you don't want to make the asm statement volatile, as this would prevent the optimizer from removing it if the result is unused.

template<> inline byte rotLeft<byte>(byte x, unsigned int y)
{
     asm ("rolb %1, %0" : "+mq" (x) : "cI" ((byte)y));
     return x;
}
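As a sketch of how this could be wrapped for builds that are not GCC-style x86, the version below guards the asm statement and falls back to a mask-based rotate elsewhere. The names `rotl8` and `rotl8_portable` are hypothetical, `std::uint8_t` stands in for `byte`, and the preprocessor test is an assumption about the target compilers rather than part of the original code.

```cpp
#include <cstdint>

// Portable fallback: mask-based rotate with no out-of-range shift counts.
inline std::uint8_t rotl8_portable(std::uint8_t x, unsigned int y)
{
    y &= 7u;  // only count mod 8 matters for a byte
    return static_cast<std::uint8_t>((x << y) | (x >> ((8u - y) & 7u)));
}

inline std::uint8_t rotl8(std::uint8_t x, unsigned int y)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    // "+mq": x is read and written, in memory or a byte-addressable register.
    // "cI": count in CL, or an immediate in 0..31 when known at compile time.
    // No volatile, so an unused result can be optimized away.
    __asm__ ("rolb %1, %0" : "+mq" (x) : "cI" (static_cast<std::uint8_t>(y)));
    return x;
#else
    return rotl8_portable(x, y);
#endif
}
```

Both paths should agree for any count, since the 5-bit masking done by the instruction and the `& 7u` in the fallback yield the same count modulo 8 for a byte operand.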
Ross Ridge
  • Thanks Ross. Using just the cast on `y`, I get a warning, *"asm operand 1 probably doesn't match constraint"*. Then I get an error, *"impossible constraint in asm"*. – jww Jul 17 '15 at 03:05
  • My bad, I see why this happened... There's a `rotLeftImm` that uses the constraint `"I"`. Sorry about that. – jww Jul 17 '15 at 03:33