I'm investigating potential speedups related to a constant-time rotate that does not violate the standards.
A rotate on x86/x64 takes the following operands. For simplicity, I'm going to discuss rotating a byte (so we don't get tangled in immediate-8 versus 16-, 32-, or 64-bit forms):
- The "value" can be in a register or in memory
- The "count" can be in a register or an immediate
The processor expects the count to be in `CL` when using a register. The processor performs the rotate after masking all but the lower 5 bits of the count.
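As an aside, the same masked-count behavior can be expressed in portable, UB-free C++ that GCC and Clang typically recognize and compile down to a single `rol`; a minimal sketch (the name `rotl8` is mine, not from the code below):

```cpp
#include <cstdint>

// Portable rotate-left for a byte. Masking the count with 7 keeps both
// shift amounts in range even when y is 0 or >= 8, so there is no
// undefined behavior; the (x << y) | (x >> ((8 - y) & 7)) shape is the
// idiom mainstream compilers pattern-match into a rotate instruction.
inline std::uint8_t rotl8(std::uint8_t x, unsigned int y)
{
    y &= 7;  // mirror the hardware's count masking, scaled to 8 bits
    return static_cast<std::uint8_t>((x << y) | (x >> ((8 - y) & 7)));
}
```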
Below, the value is `x`, and the count is `y`.
template<> inline byte rotLeft<byte>(byte x, unsigned int y)
{
__asm__ __volatile__("rolb %b1, %0" : "=mq" (x) : "cI" (y), "0" (x));
return x;
}
Since `x` is both read and written, I think I should be using a `+` somewhere, but I can't get the assembler to take it.
My question is, are the constraints represented correctly?
EDIT: Based on Jester's feedback, the function was changed to:
template<> inline byte rotLeft<byte>(byte x, unsigned int y)
{
__asm__ __volatile__("rolb %b1, %0" : "+mq" (x) : "cI" (y));
return x;
}
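For reference, here is how the edited version builds in isolation; the `byte` typedef and the primary template declaration are scaffolding I've assumed (they aren't shown above), and this of course requires GCC or Clang targeting x86/x64:

```cpp
typedef unsigned char byte;                        // assumed typedef
template<class T> T rotLeft(T x, unsigned int y);  // assumed primary template

// Edited specialization: "+mq" marks x as a read-write operand, so the
// separate "0" (same-as-output) input tie from the first version is no
// longer needed. "c" puts y in ECX (the byte modifier %b1 names CL),
// and "I" allows a 0..31 immediate count instead.
template<> inline byte rotLeft<byte>(byte x, unsigned int y)
{
    __asm__ __volatile__("rolb %b1, %0" : "+mq" (x) : "cI" (y));
    return x;
}
```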