Repetition-based, pattern-based data compression algorithm

Question

Suppose I have the following string:

ABCADCADCADABC

I want to compress it by finding repeating substrings. What's an algorithm that gives the optimal compression?

In the above example it should return

AB*1 CAD*3 ABC*1

For comparison, a greedy algorithm might return

ABC*1 ADC*2 AD*1 ABC*1
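To make the comparison concrete, here is a minimal sketch that expands the `UNIT*COUNT` notation and scores an encoding. The helper names `expand` and `stored_length` are hypothetical, and the scoring rule (count only the characters of each unit, ignore the repeat counts) is an assumption based on the asker's later comment about minimizing the resulting string length.

```python
def expand(encoding):
    """Turn 'AB*1 CAD*3' back into the string it represents."""
    parts = []
    for token in encoding.split():
        unit, count = token.split("*")
        parts.append(unit * int(count))
    return "".join(parts)


def stored_length(encoding):
    """Assumed cost: total characters in the units, ignoring the counts."""
    return sum(len(token.split("*")[0]) for token in encoding.split())


optimal = "AB*1 CAD*3 ABC*1"
greedy = "ABC*1 ADC*2 AD*1 ABC*1"
print(expand(optimal) == expand(greedy))  # both describe the same string
print(stored_length(optimal), stored_length(greedy))
```

Under that assumed cost, the first encoding stores 8 characters and the greedy one stores 11.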

score 3 · Answer 1 · answered Apr 19 '12 at 02:12


Depending on whether you prefer fast and simple or a high compression ratio, you could look into the Lempel-Ziv-Welch (LZW) or Lempel-Ziv-Markov chain (LZMA) algorithms. They both keep dictionaries of recurring strings.


smocking


Chris Cain · Answer 2 · 2012-04-19T16:34:34.303

This sounds like a job for suffix arrays/trees!

http://en.wikipedia.org/wiki/Suffix_array

You can use a suffix array built over your string to figure out which patterns repeat. For instance, we can build a suffix array over your example as follows (here $ sorts after every letter; you could instead have $ sort before every letter, and either way will work):

ABCADCADCADABC$
ABC$
ADABC$
ADCADABC$
ADCADCADABC$
BCADCADCADABC$
BC$
CADABC$
CADCADABC$
CADCADCADABC$
C$
DABC$
DCADABC$
DCADCADABC$
$

From this, we can more easily see the common patterns in the string. Using the information in this suffix array representation, we can see that CAD is repeated 3x in a local area, and we'd likely use this as our choice for compression. ADC and DCA and so on are not as attractive because they compress less of the string.
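As a rough illustration of the idea, the sketch below builds the suffix array naively (by sorting all suffixes, which is fine for short strings but not how an efficient implementation would do it) and then reads off repeated substrings as the common prefixes of neighboring suffixes. The function names are hypothetical. Note that this only lists candidates; it does not check whether the occurrences are adjacent or overlapping.

```python
def suffix_array(text, sentinel="$"):
    """Return suffix start positions in sorted order (naive construction)."""
    s = text + sentinel
    high = chr(0x10FFFF)  # stand-in so the sentinel sorts after every letter
    return sorted(range(len(s)), key=lambda i: s[i:].replace(sentinel, high))


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


text = "ABCADCADCADABC"
order = suffix_array(text)
suffixes = [(text + "$")[i:] for i in order]

# Any substring that occurs more than once is a shared prefix of two
# suffixes; the longest shared prefixes sit next to each other after sorting.
repeats = {a[:common_prefix_len(a, b)] for a, b in zip(suffixes, suffixes[1:])}
repeats.discard("")

for suffix in suffixes:
    print(suffix)
print(sorted(repeats, key=len, reverse=True))
```

The printed suffixes should match the listing above, and `CAD` appears among the repeated candidates along with longer overlapping ones such as `CADCAD`.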

http://en.wikipedia.org/wiki/Suffix_tree

Suffix trees are a more efficient way of doing the same task. Once you wrap your head around how to do something using suffix arrays, it's not too far of a jump to go on to suffix trees. In fact, these structures are used in popular compression algorithms, including LZW and BWT (bzip2).

Not sure how you'd make a perfect decomposition, based on the information from this array. Say for "banana" that would be: a, ana, anana, banana, na, nana. Note that "ana" is not suitable here because it overlaps with another "ana" so you'd have to exclude overlapped patterns also. Looks like a lot of work. — Joric, Sep 10 '18 at 07:20

score 2 · Answer 3 · answered Apr 19 '12 at 03:57

It may not be practically relevant, but for the particular question you ask there is a dynamic programming solution. If you have computed the optimum way to compress the prefixes of length 1, 2, 3, ..., n-1, then you can compute the optimum way to compress the prefix of length n by looking at the last k characters for each possible k and checking whether they consist of a shorter string repeated a whole number of times (any string counts as one repetition of itself). For each k, the cost is the cost of compressing the first n-k characters plus the cost of expressing the last k characters as a multiple of a string; take the cheapest option over all k.

So in your example you would finish up by noticing that ABC is a multiple of itself, and that if you express it as ABC*1 you can use the answer you had already worked out for the first 11 characters, AB*1 CAD*3, to produce AB*1 CAD*3 ABC*1.
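The recurrence can be sketched as follows. This is a straightforward, unoptimized version under two assumptions that the answer does not spell out: the cost of a run is the length of its repeating unit (the count is free), and ties are broken in favor of fewer runs. The function names are hypothetical.

```python
def smallest_unit(block):
    """Shortest string u such that block == u repeated a whole number of times."""
    n = len(block)
    for p in range(1, n + 1):
        if n % p == 0 and block[:p] * (n // p) == block:
            return block[:p]
    return block  # only reached for the empty string


def optimal_segmentation(s):
    """Return a list of (unit, count) runs with minimal total unit length."""
    n = len(s)
    # best[i] = (total unit length, number of runs, runs) for the prefix s[:i]
    best = [(0, 0, [])] + [None] * n
    for i in range(1, n + 1):
        for k in range(1, i + 1):  # the last run covers s[i-k:i]
            unit = smallest_unit(s[i - k:i])
            length, runs, parts = best[i - k]
            candidate = (length + len(unit), runs + 1,
                         parts + [(unit, k // len(unit))])
            if best[i] is None or candidate[:2] < best[i][:2]:
                best[i] = candidate
    return best[n][2]


runs = optimal_segmentation("ABCADCADCADABC")
print(" ".join(f"{unit}*{count}" for unit, count in runs))
```

For the example string this yields the runs AB*1, CAD*3, ABC*1. The tie-break matters: without it, splitting AB into A*1 B*1 would cost the same number of stored characters.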

score 1 · Answer 4 · answered Apr 19 '12 at 02:45


Better still would be:

ABCAD(6,3)(3,11)

where (n,d) is a length and distance back of a match. So (6,3) copies six bytes starting from three bytes back. While that may sound a little odd, by the time it gets three bytes in, the next three bytes it needs have been copied. So CADCAD is appended. The (3,11) causes ABC to be appended.
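A minimal decoder sketch for this notation shows why the overlapping copy works. The token representation (a Python list mixing literal strings with `(length, distance)` tuples) is an assumption made for illustration; it is not the actual deflate bit format.

```python
def lz77_decode(tokens):
    """Decode a list of literal strings and (length, distance) back-references."""
    out = []
    for token in tokens:
        if isinstance(token, str):
            out.extend(token)  # literal characters are copied as-is
        else:
            length, distance = token
            # Copy one character at a time so that a match may overlap
            # the characters it is in the middle of producing.
            for _ in range(length):
                out.append(out[-distance])
    return "".join(out)


print(lz77_decode(["ABCAD", (6, 3), (3, 11)]))  # ABCADCADCADABC
```

Copying character by character is the key design choice: with `(6, 3)` only three characters exist behind the cursor at first, and each appended character becomes available as a source for a later one.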

This is called LZ77 compression. It is what is implemented by zip, gzip, and zlib using the deflate compressed data format. That format not only references previous string matches, but also uses Huffman coding on the literals (e.g. ABCAD) as well as the lengths and distances.


Mark Adler


Ah, thanks Mark. Great point, though this is actually not applicable to my problem. I can't do backreferences like that. (This isn't really about compression -- I reformulated it that way to make for a cleaner question. Maybe that was a bad idea! It's really about finding an optimal segmenting of repeating strings.) – dreeves Apr 19 '12 at 04:26
I guess I was somehow thrown by "compression algorithm" and "optimal compression" in the question. What is the actual problem that needs this segmenting? It's usually much easier to answer a question if you know what the question is. – Mark Adler Apr 19 '12 at 06:18
Yeah, it's just that the actual problem is hard to explain! Obviously the reformulation is not so easy either. :) But I think it works to just disallow backreferences. So we're finding a segmentation of the string into substrings and then removing substrings that are identical to their predecessors. What segmentation minimizes the resulting string length? – dreeves Apr 19 '12 at 07:35
