Algorithm for bit expansion/duplication?

Question

Is there an efficient (fast) algorithm that will perform bit expansion/duplication?

For example, expand each bit in an 8-bit value by 3 (creating a 24-bit value):

1101 0101 => 11111100 01110001 11000111

The brute-force method that has been proposed is a lookup table. In the future, the expansion factor may need to be variable: the example above expands by 3, but other factors may be needed. That would require multiple lookup tables, which I'd like to avoid if possible.

If you're only dealing with 8-bit values, the lookup table is almost certainly going to be the best option. It uses very little space. Can you give more details on your use case and what operations you expect to be common? — templatetypedef, Jan 26 '12 at 18:27
Input is a constant serial bit stream. In the current requirement, each chunk of data arrives 8 bytes at a time, which then needs each bit expanded by 3 to be sent out as another bit stream. 64bits in 192bits out. A future requirement may involve adding "header" bits before each expanded 8 bit value and of course padding to a byte boundary. LUTs are quick but given how often this needs to run, any potential performance improvement would be appreciated. — jivany, Jan 26 '12 at 18:49
Many architectures have instructions that can greatly speed up this sort of computation. If you're not afraid of breaking cross-platform compatibility leveraging these instructions is almost certainly a win -- and if you're optimizing something this algorithmically "trivial" then turning to low-level optimization is key. — Kaganar, Jan 27 '12 at 15:27
@Kaganar Agreed. This is for a PPC embedded system and I've seen optimizations for other bit operations but this bit expansion problem doesn't seem to be common. I know someone smarter than me has figured this out. ;) — jivany, Jan 27 '12 at 18:42
What's the exact architecture? (And aye, embedded applications explains why you're being rabid about speed -- fixed budget on fixed hardware.) — Kaganar, Jan 27 '12 at 20:14
PPC440. Performance isn't an issue, yet... A LUT approach is going to be used for the first implementation and multiple LUTs for future changes when we need to expand to other number of bits. Now this is becoming an exercise to see if there is an algorithmic approach that can be used. — jivany, Jan 27 '12 at 20:36

score 8 · Accepted Answer · answered Jan 28 '12 at 08:54

This can be made faster than a lookup table if arithmetic turns out to be cheaper than memory access. That can happen when the calculations are vectorized (PPC AltiVec or Intel SSE) and/or when other parts of the program need every bit of cache memory, leaving little room for a table.

If the expansion factor is 3, only 7 instructions are needed (4 multiplications and 3 ANDs):

out = (((in * 0x101 & 0x0F00F) * 0x11 & 0x0C30C3) * 5 & 0x249249) * 7;
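To see what each stage does, here is a minimal sketch that splits the expression above into its four steps and traces the question's example input, 0xD5 (1101 0101). The function name `expand3` and the use of `uint32_t` are assumptions for illustration; the constants are the ones from the expression.

```c
#include <stdint.h>

/* Expand each bit of an 8-bit value into 3 adjacent bits (24-bit result).
   The trailing comments show the intermediate value for in = 0xD5. */
uint32_t expand3(uint8_t in)
{
    uint32_t x = in;
    x = (x * 0x101) & 0x00F00F;  /* nibbles at bits 0 and 12:     0x00D005 */
    x = (x * 0x11)  & 0x0C30C3;  /* bit pairs at bits 0,6,12,18:  0x0C1041 */
    x = (x * 5)     & 0x249249;  /* single bits at every 3rd bit: 0x241041 */
    return x * 7;                /* each bit becomes 111:         0xFC71C7 */
}
```

Each multiplication adds a shifted copy of the value to itself, and the following AND keeps the lower half of each group in place and the upper half at its new position. The final multiplication by 7 (binary 111) fills in the two empty bits above each isolated bit.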

Alternatively, with 10 instructions (3 shifts, 3 ORs, 3 ANDs, and 1 multiplication):

out = (in | in << 8) & 0x0F00F;
out = (out | out << 4) & 0x0C30C3;
out = (out | out << 2) & 0x249249;
out *= 7;
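For the stream case described in the comments (8 bytes in, 24 bytes out), the shift-and-OR version could be applied byte by byte, roughly as sketched below. The names `expand3_or` and `expand3_chunk` are hypothetical, and the output byte order (most significant byte of each 24-bit group first) is an assumption, since the question does not specify it.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift-and-OR expansion by 3, using the same constants as the snippet above. */
static uint32_t expand3_or(uint8_t in)
{
    uint32_t out = in;
    out = (out | out << 8) & 0x00F00F;
    out = (out | out << 4) & 0x0C30C3;
    out = (out | out << 2) & 0x249249;
    return out * 7;
}

/* Expand n input bytes into 3*n output bytes.
   Assumption: the most significant byte of each 24-bit group goes out first. */
void expand3_chunk(const uint8_t *src, size_t n, uint8_t *dst)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t v = expand3_or(src[i]);
        dst[3 * i]     = (uint8_t)(v >> 16);
        dst[3 * i + 1] = (uint8_t)(v >> 8);
        dst[3 * i + 2] = (uint8_t)v;
    }
}
```

Calling `expand3_chunk(src, 8, dst)` with a 24-byte `dst` covers one 64-bit chunk. Header bits or padding, mentioned as a possible future requirement, are not handled here.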

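For other expansion factors N >= 3 (with a 32-bit unsigned, the 8*N-bit result only fits for N <= 4; use a wider type beyond that):

unsigned mask = 0x0FF;
unsigned out = in;
for (unsigned scale = 4; scale != 0; scale /= 2)
{
  unsigned shift = scale * (N - 1);        /* distance the upper half of each group moves */
  mask &= ~(mask << scale);                /* keep the lower half of each group */
  mask |= mask << (scale * N);             /* add the upper half at its new position */
  out = out * ((1u << shift) + 1) & mask;  /* add a shifted copy, keep only the wanted bits */
}
out *= (1u << N) - 1;                      /* replicate each isolated bit N times */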

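Alternatively, with shift and OR, which works for expansion factors N >= 2. (The multiply form adds the shifted copy to the original; for N = 2 the two overlap, so carries corrupt the result.)

unsigned mask = 0x0FF;
unsigned out = in;
for (unsigned scale = 4; scale != 0; scale /= 2)
{
  unsigned shift = scale * (N - 1);   /* distance the upper half of each group moves */
  mask &= ~(mask << scale);           /* keep the lower half of each group */
  mask |= mask << (scale * N);        /* add the upper half at its new position */
  out = (out | out << shift) & mask;  /* OR in a shifted copy, keep only the wanted bits */
}
out *= (1u << N) - 1;                 /* replicate each isolated bit N times */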
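Wrapped into a function, the shift-and-OR loop might look like the sketch below. The name `expand_bits` is hypothetical, and widening everything to `uint64_t` is an assumption made here so that factors up to 8 fit; the loop body itself is unchanged from the snippet above.

```c
#include <stdint.h>

/* Expand each bit of `in` into n adjacent bits, for 2 <= n <= 8.
   The result occupies the low 8*n bits of the return value. */
uint64_t expand_bits(uint8_t in, unsigned n)
{
    uint64_t mask = 0xFF;
    uint64_t out = in;
    for (unsigned scale = 4; scale != 0; scale /= 2) {
        unsigned shift = scale * (n - 1);
        mask &= ~(mask << scale);
        mask |= mask << (scale * n);
        out = (out | out << shift) & mask;
    }
    /* Each input bit now sits alone at position n*i; fill in its copies. */
    return out * ((UINT64_C(1) << n) - 1);
}
```

For example, `expand_bits(0xD5, 3)` gives 0xFC71C7, matching the question, and `expand_bits(0xA5, 2)` gives 0xCC33. A factor of 1 is not supported by this loop, because the mask update assumes the groups move apart at every step.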

The shift and mask values depend only on N, so it is better to compute them once, before processing the bit stream.
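One way to do that precomputation, as a rough sketch: store the three shift/mask pairs and the final multiplier in a small struct, then reuse it for every byte. The type `expand_plan` and both function names are hypothetical, and the 64-bit types are an assumption to allow factors up to 8.

```c
#include <stdint.h>

/* Hypothetical container for the per-step constants; they depend only on n. */
typedef struct {
    unsigned shift[3];
    uint64_t mask[3];
    uint64_t fill;      /* (1 << n) - 1, replicates each isolated bit */
} expand_plan;

/* Compute the constants once for an expansion factor 2 <= n <= 8. */
void expand_plan_init(expand_plan *p, unsigned n)
{
    uint64_t mask = 0xFF;
    unsigned i = 0;
    for (unsigned scale = 4; scale != 0; scale /= 2, i++) {
        mask &= ~(mask << scale);
        mask |= mask << (scale * n);
        p->shift[i] = scale * (n - 1);
        p->mask[i] = mask;
    }
    p->fill = (UINT64_C(1) << n) - 1;
}

/* Per-byte work: three shift/OR/AND steps and one multiplication. */
uint64_t expand_with_plan(const expand_plan *p, uint8_t in)
{
    uint64_t out = in;
    for (unsigned i = 0; i < 3; i++)
        out = (out | out << p->shift[i]) & p->mask[i];
    return out * p->fill;
}
```

For n = 3 the plan holds shifts 8, 4, 2 and masks 0xF00F, 0xC30C3, 0x249249, the same constants as the hand-written version for a factor of 3.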

Fantastic response. My colleague and I came close to this whilst doing some handwaving and whiteboard brainstorming but this is much more efficient than our approach. I'll have to run some tests once we have the rest of the code implemented and see how it fares. — jivany, Jan 28 '12 at 14:23
Does anyone have a link to the math behind this? I've been searching around but have only managed to find magic without an explanation as to how this works. I see there is some pattern to the magic numbers but everything else is escaping me. — cory.todd, Apr 01 '16 at 01:53
nvm, I figured it out. Helps to write out the binary and then find the pattern. Still though, any links on the topic would be greatly appreciated. https://gist.github.com/corytodd/056ed01228f59fee9a13d00fc25b9a62 — cory.todd, Apr 01 '16 at 02:22
@cory.todd: inspired by two code snippets from ["Bit Twiddling Hacks"](http://graphics.stanford.edu/~seander/bithacks.html#Interleave64bitOps). — Evgeny Kluev, Apr 01 '16 at 03:44

score 1 · Answer 2 · comingstorm · 2012-01-27T02:04:40.970

You can do it one input bit at a time. Of course, it will be slower than a lookup table, but if you're writing for something like a tiny 8-bit microcontroller without enough room for a table, it should have the smallest possible ROM footprint.
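The answer's own code is not present in this copy, so the following is only a guess at what a bit-at-a-time version could look like. The name `expand_bitwise`, the most-significant-bit-first order, and the 64-bit result type are all assumptions; on a real 8-bit target the output would more likely be written out byte by byte.

```c
#include <stdint.h>

/* Expand each bit of `in` into n copies, most significant bit first.
   Assumes 1 <= n <= 8 so the result fits in 64 bits. */
uint64_t expand_bitwise(uint8_t in, unsigned n)
{
    uint64_t out = 0;
    for (int bit = 7; bit >= 0; bit--) {
        unsigned b = (in >> bit) & 1u;   /* current input bit */
        for (unsigned k = 0; k < n; k++)
            out = (out << 1) | b;        /* append one copy of it */
    }
    return out;
}
```

This needs no masks or tables and handles any factor the result type can hold, at the cost of 8*n loop iterations per input byte.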

edited Jan 27 '12 at 02:04

answered Jan 26 '12 at 21:37
