4

Say I have a number of strings which are quite similar but not absolutely identical.

They can differ more or less, but the similarity can be seen by the naked eye.

All lengths are equal, each is 256 bytes. The total number of strings is less than 2^16.

What would be the best compression method for such case?

UPDATE (data format):

I can't share the data but I can describe it quite close to reality:

Imagine a notation (like the LOGO language) that is a sequence of commands for some device that moves and draws on a plane. Such as:

U12 - move up 12 steps
D64 - move down 64 steps
C78 - change drawing color to 78
P1  - pen down (start drawing)

and so on.

The whole vocabulary of this language doesn't exceed the size of the English alphabet.

The string then describes a whole picture: "U12C6P1L74D74R74U74P0....".
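For concreteness, here is a minimal sketch (assuming the one-letter-command-plus-decimal-argument format described above; the exact syntax is only illustrative) of how such a picture string could be split into command tokens:

```python
import re

# Hypothetical example string in the LOGO-like notation described above.
picture = "U12C6P1L74D74R74U74P0"

# Each command is assumed to be one uppercase letter followed by a decimal argument.
TOKEN = re.compile(r"([A-Z])(\d+)")

def tokenize(s):
    """Split a picture string into (command, argument) pairs."""
    return [(cmd, int(arg)) for cmd, arg in TOKEN.findall(s)]

print(tokenize(picture))
# [('U', 12), ('C', 6), ('P', 1), ('L', 74), ('D', 74), ('R', 74), ('U', 74), ('P', 0)]
```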

Now imagine a class of ten thousand children who were told to draw some very specific image with the help of this language, like the flag of their country. We will get 10K strings which are all different and all alike at the same time.

Our task is to compress the whole bunch of strings as well as possible.

My suspicion here is that there is a way to exploit this similarity and the common length of the strings, whereas Huffman coding, for example, won't use it explicitly.

lithuak
  • 5,409
  • 8
  • 38
  • 52
  • 1
    I'd probably look first if [Tries/Prefix Trees](http://en.wikipedia.org/wiki/Trie) or something similar can help. Then [Huffman Coding](http://en.wikipedia.org/wiki/Huffman_coding) and [LZ*](http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv). There may be better ways, but we need to know more about the data and how it's used. – Alexey Frunze Mar 11 '12 at 09:48
  • what do you mean by "quite similar"? it's hard to pick the right method without knowing more. – Karoly Horvath Mar 11 '12 at 10:08
  • @KarolyHorvath: I've updated the question with data description. – lithuak Mar 11 '12 at 10:13

3 Answers

1

Could you tell us what the data is? Maybe like a DNA sequence? Like

AGCTGTGCGAGAGAGAGCGGTGGG...

GGCTGTGCGAGCGAGAGCGGTGGG...

CGCTGTGAGAGNGAGAGCGGTGGG...

NGCTGTGCGAGAGAGAGCGGTGGG...

GGCTGTGCGAGTGAGAGCGGTGGG...

... ...

? Maybe, maybe not. Anyway, here are two levels, or two ways to think:

  1. Huffman coding: see Wikipedia (a rough sketch is given below, after this answer's text)

  2. Stringology: see http://books.google.com.hk/books/about/Jewels_of_stringology.html?id=9NdohJXtIyYC

I think it's easy to solve your problem but hard to choose the best way. You can design several methods and compare them, starting from http://en.wikipedia.org/wiki/Data_compression and other tools.
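As a rough illustration of option 1, here is a minimal Huffman-code construction in Python, standard library only; the sample string is made up and the function name is just for the example:

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix code: returns a dict mapping symbol -> bit string."""
    freq = Counter(text)
    if len(freq) == 1:                     # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

sample = "U12C6P1L74D74R74U74P0"           # hypothetical picture string
codes = huffman_code(sample)
bits = "".join(codes[ch] for ch in sample)
print(codes)
print(len(bits), "bits instead of", 8 * len(sample))
```

Note that this only exploits symbol frequencies, not the similarity between whole strings, which is exactly the concern raised in the question.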

Gentle Yang
  • 321
  • 1
  • 10
  • I've updated the question with data description. My concern with Huffman is that it will not exploit the string likeness explicitly. Correct me if I'm wrong. – lithuak Mar 11 '12 at 10:18
  • Yeah, got your update. As @Alex says, you can try some stringology structures like prefix trees / suffix trees / tries. – Gentle Yang Mar 11 '12 at 13:17
0

Since you have a fixed width of 256 bytes and it's a power of 2, I would try a Burrows-Wheeler transform or a move-to-front algorithm with that block size, or maybe double that size. Then you can apply a Huffman code. Maybe you could also try a Hilbert curve on the 256 bytes and then a BWT and MTF?
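A minimal sketch of the BWT + MTF idea in Python (a naive rotation sort is affordable for 256-byte blocks; a real implementation would also need to store the primary index, or use a sentinel, so the BWT can be inverted):

```python
def bwt(block: bytes) -> bytes:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    rotations = sorted(block[i:] + block[:i] for i in range(len(block)))
    return bytes(rot[-1] for rot in rotations)

def mtf(data: bytes) -> bytes:
    """Move-to-front: recently seen byte values map to small numbers."""
    alphabet = list(range(256))
    out = []
    for b in data:
        i = alphabet.index(b)
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))
    return bytes(out)

# Hypothetical 256-byte record, padded for the example.
block = b"U12C6P1L74D74R74U74P0".ljust(256, b" ")
transformed = mtf(bwt(block))
# 'transformed' is now dominated by small byte values (long runs of zeros),
# which an entropy coder such as Huffman can compress well.
```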

Gigamegs
  • 12,342
  • 7
  • 31
  • 71
0

"The total number of strings is less than 2^16." This is a small, bounded number, which makes your job very easy: Why don't you keep a lookup table (hash table) of all strings previously seen. You can then convert every line of 256 bytes into a two-byte index into this lookup table.

You then have a sequence of 16-bit integers. These integers will contain patterns like "after the pen went down, there is a 90% chance that the next command is to start drawing". If the data contains patterns like this, PPM is your choice. 7-zip has a high-quality PPM implementation; you can select it in the GUI or on the command line.
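A minimal sketch of the index-table step in Python (the file names and the exact 7-Zip switch shown in the comment are assumptions, not part of the answer):

```python
import struct

def index_encode(strings):
    """Map each 256-byte record to a 16-bit index into a table of unique records."""
    table = {}                        # record -> index of first occurrence
    indices = []
    for s in strings:
        if s not in table:
            table[s] = len(table)     # safe: fewer than 2^16 distinct strings
        indices.append(table[s])
    return list(table), indices

# Hypothetical input: a list of 256-byte records.
records = [b"U12C6P1L74D74R74U74P0".ljust(256, b" "),
           b"U12C7P1L74D74R74U74P0".ljust(256, b" ")]

table, indices = index_encode(records)
with open("indices.bin", "wb") as f:
    f.write(struct.pack("<%dH" % len(indices), *indices))

# The 16-bit index stream (plus the table itself, which is needed to decode)
# can then be fed to a PPM compressor, e.g. with 7-Zip:
#   7z a -m0=PPMd indices.7z indices.bin
```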

usr
  • 162,013
  • 33
  • 219
  • 345