Is there an alternative to rotl32 in C?

I found this: "Near constant time rotate that does not violate the standards", but I am still trying to get an optimized one.

My code:

k0 = rotl32((k3 ^ k2 ^ k ^ k0), 1u);

3 Answers

I think this is the best portable option:

uint32_t rotl32(uint32_t var, uint32_t hops)
{
    return (var << hops) | (var >> (32 - hops));
}
chqrlie
dromtrund
  • This does not work for `hops == 0`. Fix it without a test: `return (var << hops) | (var >> ((32 - hops) & 31));` – chqrlie Feb 11 '17 at 16:39
  • What would make this fail? We're shifting right out of bounds, and the shifting feeds 0's in from the left. Works fine in gcc 4.8.5? – dromtrund Feb 11 '17 at 17:05
  • If `hops == 0`, `var >> (32 - hops)` has undefined behavior. On x86 processors, it would compute `var >> 0`, but on others it might compute something else, such as `var >> 32`, which could evaluate to 0. That would be OK too, but the C Standard does not guarantee that these are the only possible behaviors (unfortunately). In fact anything is possible, so this corner case must be avoided. – chqrlie Feb 11 '17 at 17:08

Your question has the opencl tag, so here is a kernel that uses OpenCL's built-in rotate:

__kernel void rotateGpu(__global unsigned int * a, __global unsigned int * b)
{
    int idx = get_global_id(0);
    unsigned int a0 = a[idx];
    for (int i = 0; i < 100; i++)
        a0 = rotate(a0, 1280u);
    b[idx] = rotate(a0, 1280u);
}

Performance of rotate on an R7-240 GPU, according to a benchmark:

On an array of 32 million 32-bit unsigned integers (each element is read into a0), the kernel takes 16 ms when each thread performs 100 rotations with a step length of 1280u, and 10 ms when each thread performs a single rotation, so latency is independent of step length. That is more than 200 Gflops (on integers, though), about 40% of the GPU's theoretical maximum. It may even be faster for integers than for floats (which would need normalization after the shift, I suppose).
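
The throughput figure can be checked with a line of arithmetic. The sketch below is only an illustration: the function name is made up, and it assumes "32 million" means 32e6 elements and that each thread performs 101 rotations (100 in the loop plus the final one that writes b[idx]).

```c
/* Rotations per second from the figures quoted in the benchmark.
   The function name is hypothetical; the inputs are the numbers
   stated in the text (element count, rotations per thread, time). */
static double rotations_per_second(double elements,
                                   double rotations_per_thread,
                                   double seconds)
{
    return elements * rotations_per_thread / seconds;
}
```

With 32e6 elements, 101 rotations per thread and 0.016 s, this comes to roughly 2.0e11 rotations per second, consistent with the "more than 200 G" figure.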

Example:

__kernel void rotateGpu(__global unsigned int * a, __global unsigned int * b)
{
    int idx = get_global_id(0);
    unsigned int a0 = a[idx];

    b[idx] = rotate(a0, 2u);
}

input:

        buf[0] = 80;
        buf[1] = 12;
        buf[2] = 14;
        buf[3] = 5 ;
        buf[4] = 70;

output:

320 
48 
56 
20 
280
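
For comparison, the same input can be run through a plain C rotate on the host. This is a minimal sketch, not part of the benchmark: rotl32_host and rotate_host are made-up names, and the count is reduced modulo 32 on the assumption that OpenCL's rotate follows the usual shift modulo rules for 32-bit values.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable rotate-left. The count is reduced modulo 32, which is
   assumed to match how OpenCL's rotate() treats 32-bit operands. */
static uint32_t rotl32_host(uint32_t v, uint32_t n)
{
    n &= 31;
    return (v << n) | (v >> ((32 - n) & 31));
}

/* Hypothetical host-side stand-in for the rotateGpu kernel:
   rotates each of the count elements of a by n bits into b. */
static void rotate_host(const uint32_t *a, uint32_t *b, size_t count, uint32_t n)
{
    for (size_t i = 0; i < count; i++)
        b[i] = rotl32_host(a[i], n);
}
```

Calling rotate_host on {80, 12, 14, 5, 70} with n = 2 gives 320, 48, 56, 20, 280, the same values as the kernel output above. Under the same modulo assumption, a count of 1280u reduces to 0 (1280 is a multiple of 32), so rotl32_host(x, 1280u) returns x unchanged.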
huseyin tugrul buyukisik

dromtrund posted a good portable solution:

uint32_t rotl32(uint32_t var, uint32_t hops) {
    return (var << hops) | (var >> (32 - hops));
}

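Unfortunately, this function has undefined behavior for hops == 0, because it then shifts var right by 32, the full width of the type. On x86 processors, only the low-order 5 bits of a 32-bit shift count are significant. This behavior can be forced portably by masking the count: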

uint32_t rotl32(uint32_t var, uint32_t hops) {
    return (var << hops) | (var >> ((32 - hops) & 31));
}

Both functions compile to optimal code with gcc 4.9 and up, clang 3.5 and up and icc 17, as can be verified with Godbolt's Compiler Explorer.
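
As a quick sanity check, the masked version can be compared against a reference that rotates one bit at a time. This is a sketch under stated assumptions: rotl32_ref and count_rotl32_mismatches are names invented here, and only hop counts 0 through 31 are tested, since `var << hops` is still undefined for hops of 32 or more.

```c
#include <stdint.h>

/* Masked rotate, as given in the answer above. */
static uint32_t rotl32(uint32_t var, uint32_t hops)
{
    return (var << hops) | (var >> ((32 - hops) & 31));
}

/* Hypothetical reference: rotate left one bit at a time. */
static uint32_t rotl32_ref(uint32_t var, uint32_t hops)
{
    while (hops-- > 0)
        var = (var << 1) | (var >> 31);
    return var;
}

/* Hypothetical helper: counts hop values in 0..31 for which
   rotl32 and the reference disagree on the given input. */
static int count_rotl32_mismatches(uint32_t var)
{
    int bad = 0;
    for (uint32_t hops = 0; hops < 32; hops++)
        if (rotl32(var, hops) != rotl32_ref(var, hops))
            bad++;
    return bad;
}
```

For example, count_rotl32_mismatches(0x80000001u) returns 0, and rotl32(0x80000001u, 1) evaluates to 0x00000003. Callers that cannot guarantee hops < 32 could add `hops &= 31;` at the top of rotl32.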

John Regehr has an interesting blog article on this very subject.

chqrlie