Is there an alternative to rotl32
in C?
I found "Near constant time rotate that does not violate the standards", but I am still trying to get an optimized one.
My code:
k0 = rotl32((k3 ^ k2 ^ k ^ k0), 1u);
I think this is the best portable option:
#include <stdint.h>

uint32_t rotl32(uint32_t var, uint32_t hops)
{
    /* Rotate left: bits shifted out on the left re-enter on the right. */
    return (var << hops) | (var >> (32 - hops));
}
Since your question has the opencl tag, you can use the built-in rotate function in a kernel:
__kernel void rotateGpu(__global unsigned int *a, __global unsigned int *b)
{
    int idx = get_global_id(0);
    unsigned int a0 = a[idx];
    for (int i = 0; i < 100; i++)
        a0 = rotate(a0, 1280u);
    b[idx] = rotate(a0, 1280u);
}
Performance of rotate on an R7-240 GPU, according to a benchmark:
with a 32-million-element array of 32-bit unsigned integers (such as a0), kernel execution takes 16 ms when each thread performs 100 rotations of 1280u (10 ms with a single rotation), so latency is independent of the rotation amount. That is more than 200 G integer operations per second, about 40% of the GPU's theoretical maximum. It may even be faster for integers than for floats (which, I suppose, would need normalization after a shift).
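As a rough check of the throughput figure, assuming it counts one integer operation per rotation and divides by the full 16 ms kernel time (which also includes the memory traffic):

$$\frac{32\times10^{6}\ \text{threads}\times 100\ \text{rotations}}{16\times10^{-3}\ \text{s}} = 2\times10^{11}\ \text{operations/s} = 200\ \text{G operations/s}$$

The kernel above actually performs 101 rotations per thread (100 in the loop plus one at the store), which nudges the figure slightly above 200 G.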
Example:
__kernel void rotateGpu(__global unsigned int *a, __global unsigned int *b)
{
    int idx = get_global_id(0);
    unsigned int a0 = a[idx];
    b[idx] = rotate(a0, 2u);
}
input:
buf[0] = 80;
buf[1] = 12;
buf[2] = 14;
buf[3] = 5;
buf[4] = 70;
output:
320
48
56
20
280
dromtrund posted a good portable solution:
uint32_t rotl32(uint32_t var, uint32_t hops) {
    return (var << hops) | (var >> (32 - hops));
}
Unfortunately, this function has undefined behavior for hops == 0, because var >> 32 shifts by the full width of the type.
On x86 processors, only the low-order 5 bits of a 32-bit shift count
are significant. This behavior can be forced this way:
uint32_t rotl32(uint32_t var, uint32_t hops) {
    return (var << hops) | (var >> ((32 - hops) & 31));
}
Both functions compile to optimal code with gcc 4.9 and later, clang 3.5 and later, and icc 17, as can be verified with Godbolt's Compiler Explorer.
John Regehr has an interesting blog article on this very subject.