
I'm having a heck of a time trying to come up with a constant-time rotate that does not violate the C/C++ standards.

The problem is the edge/corner cases, where operations are called out in algorithms and those algorithms cannot be changed. For example, the following is from Crypto++; running the test harness under GCC's UBSan (i.e., g++ -fsanitize=undefined) reports:

$ ./cryptest.exe v | grep runtime
misc.h:637:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:643:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:625:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:637:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:643:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:637:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'

And the code at misc.h:637:

template <class T> inline T rotlMod(T x, unsigned int y)
{
    y %= sizeof(T)*8;
    return T((x<<y) | (x>>(sizeof(T)*8-y)));
}

Intel's ICC was particularly ruthless: without the y %= sizeof(T)*8, it removed the entire function call. We fixed that a few years back, but left the other erratum in place due to the lack of a constant-time solution.

There's one pain point remaining. When y = 0, I get a condition where 32 - y = 32, and that sets up the undefined behavior. If I add a check like if(y == 0) ..., then the code fails to meet the constant-time requirement.
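To make the edge case concrete, here is the same function in isolation (a minimal sketch; the `Unsafe` suffix is mine, and the function is only well-defined for y in 1..31 with a 32-bit T):

```cpp
#include <cstdint>

// Same shape as rotlMod above. For y in [1, 31] both shift counts are
// in range; for y == 0 the right-shift count is sizeof(T)*8 == 32,
// which is undefined behaviour for a 32-bit type.
template <class T> inline T rotlModUnsafe(T x, unsigned int y)
{
    y %= sizeof(T)*8;
    return T((x<<y) | (x>>(sizeof(T)*8-y)));   // UB when y == 0
}
```

For example, rotlModUnsafe&lt;uint32_t&gt;(0x80000001u, 1) correctly wraps the top bit around to the bottom; only the y == 0 case is the problem.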

I've looked at a number of other implementations, from the Linux kernel to other cryptographic libraries. They all contain the same undefined behavior, so it appears to be a dead end.

How can I perform the rotate in nearly constant time with a minimum number of instructions?

EDIT: by near constant time, I mean avoiding a branch so that the same instructions are always executed. I'm not worried about CPU microcode timings. While branch prediction may be great on x86/x64, it may not perform as well on other platforms, like embedded.


None of these tricks would be required if GCC or Clang provided an intrinsic to perform the rotate in near constant time. I'd even settle for "perform the rotate" since they don't even have that.

jww
    Check your processor's assembly language. Most processors have rotate and shift operations that perform in constant time. Worst case, use these in inline assembly. – Thomas Matthews Jul 13 '15 at 15:50
    @Thomas - that's exactly where I have arrived. The one pain point there is, GCC does not have a pseudo instruction that says, *"Give me a register of your \[GCC\] choosing"* so I minimally disrupt the optimizer. And I also like to think folks smarter than me have a solution. – jww Jul 13 '15 at 15:59
  • Check the assembly language listing generated by the compiler. If the compiler is smart enough to recognize the pattern, it will use the assembly language rotate instruction. If you really need the speed, use assembly language. – Thomas Matthews Jul 13 '15 at 16:01
    Mightn't hurt to define "constant time rotate". You have substantially failed to "put yourself in our shoes" while writing this question. – Lightness Races in Orbit Jul 13 '15 at 16:04
  • What exactly do you mean by constant time? Constant time complexity, constant real time, constant number and type of executed instructions...? Can you afford slower execution speed to achieve constant execution time? – MikeMB Jul 13 '15 at 16:18
  • Lightness Races in Orbit and Mike - I added a rough definition of *near constant time*. – jww Jul 13 '15 at 16:24
    For the record, GCC does have a way of specifying "give me a register of your choosing" by using the `r` constraint for the value and then referring to the value by `%0` or similar. – Matti Virkkunen Jul 13 '15 at 16:46
  • For real portability `8` should be `CHAR_BIT` (defined in `<climits>`) – Jonathan Wakely Jul 13 '15 at 16:56
  • Imho, on embedded systems you usually don't need sophisticated branch prediction, because you either don't have speculative execution at all, or the penalty of a misprediction is at least very small, due to the short pipeline. Actually x86/x64 probably has the biggest problems with branches. – MikeMB Jul 13 '15 at 16:56
  • @Matti - thanks, I was not aware. You should consider providing an answer at [How do I ask the assembler to “give me a full size register”?](http://stackoverflow.com/q/27891936). – jww Jul 13 '15 at 16:57
  • @jww: That post is already doing exactly what I meant. What comes to the register size problem... anything with `asm` is platform dependent anyways so you will have to use some GCC magic if GCC doesn't give you the correct register size. You might have to specialize the template separately for different types too - for instance if you want a 64-bit rotate to work on a 32-bit platform you need different assembly. – Matti Virkkunen Jul 13 '15 at 17:05
  • @jww: For what I mean by GCC magic, check out for instance the "x86 Operand Modifiers" section at https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html – Matti Virkkunen Jul 13 '15 at 17:05
  • I just posted an answer that seems to fully solve the problem of getting the compiler to emit optimal code for a rotate. Code from http://blog.regehr.org/archives/1063. – Peter Cordes Jul 18 '15 at 05:40

4 Answers


I've linked to this answer for the full details from several other "rotate" questions, including this community wiki question, which should be kept up to date with best-practices.

I found a blog post about this issue, and it looks like it's finally a solved problem (with new-enough compiler versions).

John Regehr at the University of Utah recommends version "c" of his attempts at making a rotate function. I replaced his assert with a bitwise AND, and found that it still compiles to a single rotate insn.

#include <stdint.h>
#include <limits.h>
#include <assert.h>

typedef uint32_t rotwidth_t;  // parameterize for comparing compiler output with various sizes

rotwidth_t rotl (rotwidth_t x, unsigned int n)
{
  const unsigned int mask = (CHAR_BIT*sizeof(x)-1);  // e.g. 31

  assert( (n<=mask) && "rotate by type width or more");
  n &= mask;  // avoid undef behaviour with NDEBUG.  0 overhead for most types / compilers
  return (x<<n) | (x>>( (-n)&mask ));
}

rotwidth_t rot_const(rotwidth_t x)
{
  return rotl(x, 7);
}

This could be templated on x's type, but for real use it probably makes more sense to have the width in the function name (like rotl32). Usually when you're rotating, you know what width you want, and that matters more than what size variable you're currently storing the value in.
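As a sketch of that naming convention (the names rotl32/rotl64 are illustrative, not from any particular library):

```cpp
#include <cstdint>
#include <climits>

// Width-specific wrappers: the rotate width is in the name, independent
// of whatever variable the caller happens to hold the value in.
static inline uint32_t rotl32(uint32_t x, unsigned int n)
{
    const unsigned int mask = CHAR_BIT*sizeof(x) - 1;   // 31
    n &= mask;
    return (x << n) | (x >> (-n & mask));
}

static inline uint64_t rotl64(uint64_t x, unsigned int n)
{
    const unsigned int mask = CHAR_BIT*sizeof(x) - 1;   // 63
    n &= mask;
    return (x << n) | (x >> (-n & mask));
}
```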

Also make sure to only use this with unsigned types. Right-shift of a signed type does an arithmetic shift, shifting in copies of the sign bit. (Technically that's implementation-defined behaviour, but everything uses 2's complement now.)

Pabigot independently came up with the same idea before I did, and posted it on GitHub. His version has C++ static_assert checking to make it a compile-time error to use a rotate count outside the range for the type.

I tested mine with gcc.godbolt.org, with NDEBUG defined, for variable and compile-time-const rotate counts:

  • gcc: optimal code with gcc >= 4.9.0, non-branching neg+shifts+or with earlier.
    (compile-time const count: gcc 4.4.7 is fine)
  • clang: optimal code with clang >= 3.5.0, non-branching neg+shifts+or with earlier.
    (compile-time const rotate count: clang 3.0 is fine)
  • icc 13: optimal code.
    (compile-time const count with -march=native: generates slower shld $7, %edi, %edi. Fine without -march=native)

Even newer compiler versions can handle the commonly-given code from wikipedia (included in the godbolt sample) without generating a branch or cmov. John Regehr's version has the advantage of avoiding undefined behaviour when the rotate count is 0.

There are some caveats with 8 and 16 bit rotates, but compilers seem fine with 32 or 64, when n is uint32_t. See the comments in the code on the godbolt link for some notes from my testing various widths of uint*_t. Hopefully this idiom will be better-recognized by all compilers for more combinations of type widths in the future. Sometimes gcc will uselessly emit an AND insn on the rotate count, even though the x86 ISA defines the rotate insns with that exact AND as the first step.

"optimal" means as efficient as:

# gcc 4.9.2 rotl(unsigned int, unsigned int):
    movl    %edi, %eax
    movl    %esi, %ecx
    roll    %cl, %eax
    ret
# rot_const(unsigned int):
    movl    %edi, %eax
    roll    $7, %eax
    ret

When inlined, the compiler should be able to arrange for values to be in the right registers in the first place, resulting in just a single rotate.

With older compilers, you'll still get ideal code when the rotate count is a compile-time constant. Godbolt lets you test with ARM as a target, and it used a rotate there, too. With variable counts on older compilers, you get a bit of code bloat, but no branches or major performance problems, so this idiom should be safe to use in general.

BTW, I modified John Regehr's original to use CHAR_BIT*sizeof(x), and gcc / clang / icc emit optimal code for uint64_t as well. However, I did notice that changing x to uint64_t while the function return type is still uint32_t makes gcc compile it to shifts/or. So be careful to cast the result to 32 bits in a separate sequence point if you want the low 32 bits of a 64-bit rotate, i.e. assign the result to a 64-bit variable, then cast/return it. icc still generates a rotate insn, but gcc and clang don't, for

// generates slow code: cast separately.
uint32_t r = (uint32_t)( (x<<n) | (x>>( -n&(CHAR_BIT*sizeof(x)-1) )) );
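The workaround described above can be sketched like this (the function name is mine, purely illustrative):

```cpp
#include <cstdint>
#include <climits>

// Keep the rotate at full 64-bit width, then truncate afterwards.
// Assigning to a 64-bit variable first keeps the rotate idiom
// recognizable to the compiler; the narrowing cast happens separately.
static inline uint32_t low32_of_rotl64(uint64_t x, unsigned int n)
{
    const unsigned int mask = CHAR_BIT*sizeof(x) - 1;   // 63
    n &= mask;
    uint64_t rotated = (x << n) | (x >> (-n & mask));   // full 64-bit rotate
    return (uint32_t)rotated;                           // truncate afterwards
}
```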

If anyone can test this with MSVC, it would be useful to know what happens there.

Peter Cordes
  • Thank you very much. I know John from his (and Peng Li's) Integer Overflow Checker (long before Clang had sanitizers). I used to use it extensively when self test would fail after compiling with Intel's ICC. ICC was ruthless about removing this sort of undefined behavior. – jww Jul 18 '15 at 23:15
  • Although `x86` presently defines their shift operators as masking the size, that was a behavioral change in the 80386 [IMHO not a good one]. On earlier processors, shifting was done mod 256 (since the shift amount was specified using a byte register); on processors like the ARM where shifts of 32-255 bits are equivalent to (but faster than) a series of one-bit shifts, such semantics can be useful in graphics programming (e.g. `row1 |= dat >> y; row2 |= dat >> (y-32)`). In any case, on a vintage 8086 or 80286, the behavior of shift instructions uses the whole value of CL. – supercat Jul 20 '15 at 15:49
  • So you'd get zero from a shift count higher than register width? Intel's vector shifts do work this way. Masking (rather than saturating) the count makes sense for a rotate, but it does seem silly for shifts. – Peter Cordes Jul 20 '15 at 16:05
  • @PeterCordes: On the 8086, the shift count is specified in an 8-bit register. Consequently, even if Intel had included the hardware to saturate shifts of length (regsize) up to 255, shifts that are slightly larger than multiples of 256 would have their lengths masked. That having been said, there would have been considerable value in having shifts with lengths up to 255 behave in saturating fashion; the value of extending that to longer shifts would be much less. BTW, the rotate-through-carry instructions were much slower on the 80386 than the other shifts (extra cycles proportional to N)... – supercat Jul 23 '15 at 16:20
  • ...and I suspect Intel wanted to avoid having an instruction take too long if N was 255. On the other hand, the behavior of shifting a 32-bit register right through carry by 32 or 33 bits would be different from shifting right by any smaller amount (including zero or one), and so masking yields totally wrong semantics. Interestingly, the ARM allows shift-by-N and 32-bit rotate-by-N, but only allows 32+1-bit rotate through carry to be done one bit at a time. – supercat Jul 23 '15 at 16:28
  • Heh, it looks like they had to maintain backwards compat with 286 for `rcl/rcr`. If operand size is 8 or 16, the count is masked and then mod 9 or mod 17. Otherwise (32 or 64bit) it's just masked. I didn't realize there were so many weird semantics. `RCR r,cl` is 3x slower than `ROR r,cl` on Haswell, 17 times on Steamroller. (Partly because `ROR r,cl` is 1 cycle latency/ 2/cycle throughput on AMD, but `RCL` is 17 macro-ops, 7 cycle latency.) – Peter Cordes Jul 23 '15 at 17:11
  • @PeterCordes: In the 80386, rcl/rcr were performed iteratively, so rotating an 8-bit register by 31 would take 30 more cycles than rotating by 1. Did you time shifts of different variable amounts? Otherwise, can you see any reason the C standard shouldn't say that a shift precisely equal to the bit size of the operand must yield either the original value or the result of shifting by size-1 and then shifting one more, with the choice made in arbitrary fashion? There are few platforms where such a rule would impose *any* cost, and I can think of none where the cost would be meaningful. – supercat Jul 24 '15 at 19:37
  • I'm going by Agner Fog's table of instruction timings. (http://agner.org/optimize/) Modern CPUs have a big enough transistor budget that they can afford constant-time shifters. (e.g. https://en.wikipedia.org/wiki/Barrel_shifter). With pipelined execution units, it's actually a big problem to not know when the results will be ready. Handling cases where multiple results are ready in the same cycle from the same ALU takes extra transistors, and I think is usually done by buffering the results until the next cycle. (else you'd need 2x the register write ports). – Peter Cordes Jul 24 '15 at 21:36
  • Having the standard say that sounds like a sensible way to get fairly nice behaviour, but without requiring anything that would take extra instructions. It's annoying how much stuff the C standard leaves undefined. Some of it is to avoid assuming two's complement, unless that's finally changed. (I think I read that adding atomics to the standards finally resulted in standardizing some two's complement behaviour, but I forget the details.) – Peter Cordes Jul 24 '15 at 21:48
  • @PeterCordes: It's sufficiently rare for code to use RCR/RCL with any shift amount other than 1 that I don't think there's much need for such instructions to run in constant time for other shift values. Hardware to compute RCR with an arbitrary 8-bit shift count in constant time would be significantly more complicated than hardware which could support *all five* of the operations whose result is unaffected by carry-in, and I don't think it would be nearly useful enough to justify such cost. – supercat Aug 17 '15 at 20:05
  • @supercat: Intel and AMD special-case `RCR/L r, 1`. SnB: 2 cycles, with any other count running in 8 cycles. If variable-latency operations weren't such a problem for the out-of-order machinery, `RCL r, 2` could probably run in only one more cycle without more execution-unit complexity. This isn't the only case where Sandybridge has a higher-than-probably-needed latency for something, just to standardize possible latencies. As you say, special-casing other counts, or count-from-a-register, isn't going to be worth it. – Peter Cordes Aug 17 '15 at 20:13
  • @jww: I think this is a better answer than the one you accepted. It compiles to the same ideal code with a wider range of compilers. (i.e. gcc, not just clang.) I haven't tested MSVC with either. I've pointed a lot of other rotate questions at this answer. http://meta.stackoverflow.com/questions/302695/adding-links-to-a-best-practices-qa-to-all-questions-on-a-topic-c-c-bitwise, so it'd be nice if this was marked accepted. (unless it's terrible on MSVC or something, in which case it shouldn't be until we have a workaround.) – Peter Cordes Aug 17 '15 at 20:49
  • Peter - I agree; I think you are right. I've had more time to research it, and this is the ["portable rotate pattern"](http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57157). Intel has not sounded in yet (see [Circular rotate that does not violate C/C++ standard?](https://software.intel.com/en-us/forums/topic/580884) on the Intel forums). The problem I have is that the accepted answer does answer the question I asked. Let me think about what to do.... – jww Aug 17 '15 at 22:51
  • @jww: Yes, the accepted answer is *a* correct answer. But mine is IMO a *better* correct answer. Not all sufficient answers are equal. – Peter Cordes Aug 17 '15 at 23:06

You can add one additional modulo operation to prevent shifting by 32 bits, but I'm not convinced this is faster than using an if check in conjunction with branch predictors.

template <class T> inline T rotlMod(T x, unsigned int y)
{
    y %= sizeof(T)*8;
    return T((x<<y) | (x>>((sizeof(T)*8-y) % (sizeof(T)*8))));
}
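Spot-checking the y == 0 case of this version (reproduced self-contained so it can be compiled directly):

```cpp
#include <cstdint>

// Second modulo forces the right-shift count back to 0 when y == 0:
// (sizeof(T)*8 - 0) % (sizeof(T)*8) == 0, so there is never a shift
// by the full type width.
template <class T> inline T rotlMod(T x, unsigned int y)
{
    y %= sizeof(T)*8;
    return T((x<<y) | (x>>((sizeof(T)*8-y) % (sizeof(T)*8))));
}
```

With y == 0 the result is x | x == x, so the rotate-by-zero identity holds without a branch.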
Mark B
    This is where I believe a simple inline assembly instruction would be easier to read. :-) – Thomas Matthews Jul 13 '15 at 16:03
    Yeah but portability! – Lightness Races in Orbit Jul 13 '15 at 16:05
  • I'm not 100% sure if it's well-defined, but I think you can simplify that to `-y % (sizeof(T)*8)`. It seems to be better optimized by clang++ and g++ – dyp Jul 13 '15 at 16:18
  • *"... in conjunction with branch predictors..."* - I think that assumes the branch predictors work as {expected|desired}. That may not be the case on low end processors and boards. Plus, the guys advancing these attacks are very clever, so I want to avoid giving them a toehold. – jww Jul 13 '15 at 16:25
    Just for readability I'd store `sizeof(T)*8` in a local variable. – MikeMB Jul 13 '15 at 16:25
  • @MikeMB - I've got other code that stashes it away in a `static const`. – jww Jul 13 '15 at 16:27
  • @ThomasMatthews: What would make things easier to read would be to amend the C standard to specify that `x>>y`, when `y` is precisely equal to the bit size of `x`, must yield either `x>>(y-1)>>1` or `x`, chosen in arbitrary fashion. The vast majority of non-hyper-modern C platforms would naturally comply with such a standard; the only ones I know of where compliance would add cost are either those which use a computed jump into a routine that does repeated one-byte-shifts (for platforms without an n-bit shift instruction); the additional cost there would be slight compared with even the... – supercat Jul 20 '15 at 15:38
  • ...*best-case* time of the shift routine. It's possible the multi-word (e.g. `long long`) shift routines used by some compilers would need tweaking, but an implementation of `x>>y` compliant with the above requirement would be cheaper than an implementation of `x>>(y-1)>>1` which had to comply with C standard behavior when `y` was precisely equal to the bit size of `x`. – supercat Jul 20 '15 at 15:42
  • This generates nice asm on clang 3.5 and later, but not on gcc (even 5.2) See my answer on this question, or a shorter version on http://stackoverflow.com/questions/776508/best-practices-for-circular-shift-rotate-operations-in-c. It's always branchless, but it's a sequence of several instructions. Godbolt: https://goo.gl/wJBthc – Peter Cordes Aug 17 '15 at 20:47

Writing the expression as T((x<<y) | ((x>>(sizeof(T)*CHAR_BIT-y-1))>>1)) should yield defined behavior for all values of y below the bit size, assuming that T is an unsigned type with no padding. Unless the compiler has a good optimizer, the resulting code may not be as good as what your original expression would have produced. Having to put up with clunky, hard-to-read code which yields slower execution on many compilers is part of the price of progress, however, since a hyper-modern compiler which is given

if (y) do_something();
return T((x<<y) | (x>>(sizeof(T)*8-y)));

might improve the "efficiency" of the code by making the call to do_something unconditional.
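Put into the same template shape as the question's function, the double-shift expression looks like this (a sketch; after the modulo it is defined for every y, including 0):

```cpp
#include <cstdint>
#include <climits>

// Double-shift form: the first right shift count is at most width-1,
// and the extra >>1 finishes the job, so y == 0 never produces a
// shift by the full type width.
template <class T> inline T rotlMod(T x, unsigned int y)
{
    y %= sizeof(T)*CHAR_BIT;
    return T((x << y) | ((x >> (sizeof(T)*CHAR_BIT - y - 1)) >> 1));
}
```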

PS: I wonder if there are any real-world platforms where changing the definition of shift-right, so that x >> y with y precisely equal to the bit size of x must yield either 0 or x (with the choice made in an arbitrary, unspecified fashion), would require the platform to generate extra code, or would preclude genuinely useful optimizations in non-contrived scenarios?

supercat

An alternative to the extra modulo is to multiply by 0 or 1 (thanks to !!):

template <class T> T rotlMod(T x, unsigned int y)
{
    y %= sizeof(T) * 8;
    return T((x << y) | (x >> ((!!y) * (sizeof(T) * 8 - y))));
}
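For illustration, a self-contained version of this variant (values in the comments chosen arbitrarily):

```cpp
#include <cstdint>

// !!y is 0 when y == 0 and 1 otherwise, so the right-shift count is
// forced to 0 in the y == 0 case instead of sizeof(T)*8.
// e.g. rotating by 0 returns x unchanged: x | (x >> 0) == x.
template <class T> T rotlMod(T x, unsigned int y)
{
    y %= sizeof(T) * 8;
    return T((x << y) | (x >> ((!!y) * (sizeof(T) * 8 - y))));
}
```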
Jarod42
    Are you sure that the `!!` doesn't force the compiler to generate a branch? – Mark B Jul 20 '15 at 15:55
  • @MarkB: You're right, this unfortunately generates terrible asm even on gcc 5.2 and clang 3.7. https://goo.gl/wJBthc (godbolt). Also, templating a rotate function seems like a terrible idea unless you just use it as a building block for named wrappers, like `rotl32`. If you want 32bit rotate, you don't want to accidentally do a 64bit rotate because of the size of your temporary. – Peter Cordes Aug 17 '15 at 20:40