
A couple of years ago I read about a very lightweight text compression algorithm, and now I can't find a reference or remember its name.

It used the difference between each successive pair of characters. Since, for example, a lowercase letter predicts that the next character will also be a lowercase letter, the differences tend to be small. (It might have thrown out the low-order bits of the preceding character before subtracting; I cannot recall.) Instant complexity reduction. And it's Unicode friendly.
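To illustrate the differencing idea being described (just a sketch of the first step, not the algorithm being asked about): consecutive characters from the same script sit in a compact code-point range, so their differences cluster near zero.

```python
def char_deltas(text):
    """Signed differences between successive code points (prev starts at 0)."""
    prev = 0
    out = []
    for ch in text:
        out.append(ord(ch) - prev)
        prev = ord(ch)
    return out

# char_deltas("hello") -> [104, -3, 7, 0, 3]: after the first character,
# every delta stays small, which is what makes the stream easy to compress.
```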

Of course there were a few bells and whistles, and the details of producing a bitstream, but it was super lightweight and suitable for embedded systems. No hefty dictionary to store. I'm pretty sure that the summary I saw was on Wikipedia, but I cannot find anything.

I recall that it was invented at Google, but it was not Snappy.

Potatoswatter
  • Maybe [this one](http://www.unicode.org/notes/tn31/)? – Evgeny Kluev Mar 14 '14 at 17:48
  • @EvgenyKluev Nope, that's canonical Lempel-Ziv with a dictionary. Taking the difference of letters as a first step pretty much precludes dictionaries because it obliterates the symbols. – Potatoswatter Mar 14 '14 at 17:51
  • @Potatoswatter: The Unicode TN does use L-Z, but it *also* difference-encodes symbols ("That's why it's possible to store differences between two 2 byte characters as a single byte signed value of a range of [-64, 63]"); the two techniques are completely compatible (the dictionary contains a sequence of offsets, but that's just fine.) – rici Mar 14 '14 at 19:29
  • DPCM with a quantizer of one. – user515430 Mar 15 '14 at 01:58
  • @rici Oh, I see now, but that's sort of an afterthought in that algo. It does not achieve any compression of ASCII. To be fair, I said "as a first step", and that's a final step. :) – Potatoswatter Mar 15 '14 at 05:16
  • @Potatoswatter: It is the first step, but it really doesn't matter :) The algo in the UTN does not compress ASCII, but the difference step is the same idea and the same rationale (scripts are in compact ranges). You could just use Huffman encoding, even with a fixed codebook, for a lightweight compression algo. – rici Mar 15 '14 at 05:33
  • @rici Fixed-code Huffman is what I'd like to improve on. It's what's proposed for [HTTP 2 header compression](http://http2.github.io/http2-spec/compression.html#huffman.codes), but as you can see the best case isn't very good. – Potatoswatter Mar 15 '14 at 08:33
  • @Potatoswatter: From what I can see in that document, the Huffman coding is generally reducing byte-length by around 25%, although it doesn't do so well with the upper-case cookie text. It's possible that delta-encoding and then Huffman encoding the deltas (presumably again using a fixed table rather than the Huffman algorithm) would do a little better; it would depend a lot on the nature of the text. Remember that of all the non-contextual variable-length encodings of a given input stream with a fixed alphabet with known probabilities, the Huffman coding will be the shortest. – rici Mar 16 '14 at 04:50
  • @rici Yes, but a *little* bit of context goes a long way, which is what this question is about. – Potatoswatter Apr 18 '14 at 14:36

1 Answer


I think what you're on about is BOCU, Binary-Ordered Compression for Unicode or one of its predecessors/successors. In particular,

> The basic structure of BOCU is simple. In compressing a sequence of code points, you subtract the last code point from the current code point, producing a signed delta value that can range from -10FFFF to 10FFFF. The delta is then encoded in a series of bytes. Small differences are encoded in a small number of bytes; larger differences are encoded in a successively larger number of bytes.
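The scheme above can be sketched roughly as follows. Note this is *not* the actual BOCU-1 byte layout (which uses a carefully chosen base-point adjustment and lead-byte ranges to preserve binary order); it only demonstrates the core idea of delta encoding plus small-delta-gets-few-bytes, here using a zigzag mapping and a 7-bit continuation scheme for illustration.

```python
def encode_deltas(text):
    """Signed deltas between successive code points (previous starts at 0)."""
    prev = 0
    deltas = []
    for ch in text:
        deltas.append(ord(ch) - prev)
        prev = ord(ch)
    return deltas

def encode_varint(delta):
    """Encode one signed delta: small magnitudes in 1 byte, larger in more."""
    zz = (delta << 1) ^ (delta >> 31)   # zigzag: maps -1,1,-2,... to 1,2,3,...
    out = bytearray()
    while True:
        b = zz & 0x7F
        zz >>= 7
        if zz:
            out.append(b | 0x80)        # high bit set = more bytes follow
        else:
            out.append(b)
            return bytes(out)

def compress(text):
    """Delta-then-varint encode a string into a byte stream."""
    return b"".join(encode_varint(d) for d in encode_deltas(text))
```

For a run of same-script characters, every delta after the first fits in a single byte; only script changes (or the initial jump from 0) cost more, which is where the compression comes from.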

Andy Jones