
I am wondering if someone could point me to an algorithm that compresses Unicode text to 10-20 percent of its original size. I've read about the Lempel-Ziv compression algorithm, which reduces the size of text to about 60% of the original, but I've heard that there are algorithms that can reach this better level of performance.

Bahram

3 Answers


If you are considering only text compression, then the very first algorithm to look at is Huffman coding, an entropy-based encoding.

Huffman Coding
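
To make the idea concrete, here is a minimal Huffman-coding sketch in Python (my own illustration, not part of the original answer; the sample string is arbitrary):

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # Count symbol frequencies and seed a min-heap of leaf trees.
        freq = Counter(text)
        heap = [(f, i, (sym, None, None)) for i, (sym, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        counter = len(heap)
        # Repeatedly merge the two least frequent subtrees.
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, counter, (None, left, right)))
            counter += 1
        # Walk the tree: frequent symbols end up with shorter bit strings.
        codes = {}
        def walk(node, prefix):
            sym, left, right = node
            if sym is not None:
                codes[sym] = prefix or "0"   # degenerate single-symbol case
            else:
                walk(left, prefix + "0")
                walk(right, prefix + "1")
        walk(heap[0][2], "")
        return codes

    text = "this is an example of huffman coding"
    codes = huffman_codes(text)
    encoded = "".join(codes[c] for c in text)
    print(len(text) * 8, "bits as ASCII ->", len(encoded), "bits Huffman-coded")

A real compressor would also have to store the code table (or the frequency counts) alongside the bit stream so the decoder can rebuild the same tree.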

Then there is LZW compression, which uses dictionary encoding: previously seen sequences of letters are assigned codes, reducing the size of the file.

LZW compression
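
And a minimal LZW encoder sketch in Python, again only an illustration and assuming byte-oriented input; a real implementation would pack the codes into a variable-width bit stream:

    def lzw_compress(data):
        # Start with a dictionary of all single bytes (codes 0..255).
        dictionary = {bytes([i]): i for i in range(256)}
        next_code = 256
        w = b""
        output = []
        for byte in data:
            wc = w + bytes([byte])
            if wc in dictionary:
                # Keep extending the current phrase while it is known.
                w = wc
            else:
                output.append(dictionary[w])
                dictionary[wc] = next_code   # remember the new phrase
                next_code += 1
                w = bytes([byte])
        if w:
            output.append(dictionary[w])
        return output

    sample = "to be or not to be, that is the question".encode("utf-8")
    print(len(lzw_compress(sample)), "codes for", len(sample), "input bytes")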

I think the above two are sufficient for encoding text data efficiently and are easy to implement.

Note: Do not expect good compression on all files. If the data is random with no pattern, then no compression algorithm can give you any compression at all. The compression ratio depends on the symbols appearing in the file, not only on the algorithm used.
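
A quick way to see this, as a rough sketch using zlib (an LZ77-based coder) from the Python standard library; the sample data is arbitrary:

    import os
    import zlib

    repetitive = b"the quick brown fox jumps over the lazy dog " * 1000
    random_data = os.urandom(len(repetitive))   # no pattern to exploit

    for name, data in [("repetitive text", repetitive), ("random bytes", random_data)]:
        compressed = zlib.compress(data, 9)
        print(f"{name}: {len(data)} -> {len(compressed)} bytes "
              f"({100 * len(compressed) / len(data):.1f}%)")

The repetitive text shrinks dramatically, while the random bytes stay essentially the same size (or even grow slightly).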

Vikram Bhat
  • May I know by what percentage the original text file size could be reduced on average by using these compression algorithms? – FranXho Mar 09 '18 at 08:06
  • With Huffman coding it is reduced to about 0.5 of the original on average, while LZW can give about 1/5 on average, if used on written-language text. – Vikram Bhat Mar 11 '18 at 14:20

LZ-like coders are not particularly good for text compression. The best one for direct use with Unicode would be LZMA, though, as it has position-alignment options. (http://www.7-zip.org/sdk.html)
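
For illustration only (not from the original answer): Python's lzma module wraps the same LZMA SDK, and its lp/pb filter options are the position-alignment knobs mentioned above; here they are tuned for 2-byte-aligned UTF-16 input. The sample text is arbitrary.

    import lzma

    text = "Пример русского текста, повторяющийся много раз. " * 200
    utf16 = text.encode("utf-16-le")

    # Default settings vs. settings tuned for 2-byte alignment:
    # lp/pb control how many low position bits the literal/match models see.
    default = lzma.compress(utf16)
    aligned = lzma.compress(
        utf16,
        format=lzma.FORMAT_XZ,
        filters=[{"id": lzma.FILTER_LZMA2, "preset": 9, "lp": 1, "pb": 1}],
    )
    print("raw:", len(utf16), "default:", len(default), "aligned:", len(aligned))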

But for the best compression, I'd suggest converting Unicode text to a bytewise format, e.g. UTF-8, and then using an algorithm with known good results on text, e.g. BWT (http://libbsc.com) or PPMd (http://compression.ru/ds/ppmdj1.rar).
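
As a rough stand-in for those external tools, here is a hedged sketch using only the Python standard library: bz2 is a BWT-based coder (far simpler than libbsc), and PPMd has no standard-library equivalent, so it is omitted. The sample text is arbitrary.

    import bz2
    import lzma

    text = "Пример русского текста для сжатия. " * 500

    raw = text.encode("utf-8")          # bytewise representation first
    print("raw utf-8 :", len(raw))
    print("lzma      :", len(lzma.compress(raw)))
    print("bz2 (BWT) :", len(bz2.compress(raw, 9)))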

Also, some preprocessing can be applied to improve the results of text compression (see http://xwrt.sourceforge.net/). And there are compressors with an even better ratio than the suggested ones (mostly PAQ derivatives), but they're also much slower.

Here I tested various representations of a Russian translation of Witten's "Modeling for text compression":

                         size     7z   rar4  paq8px69
modeling_win1251.txt   156091  50227  42906     36254
modeling_utf16.txt     312184  52523  50311     38497
modeling_utf8.txt      238883  53793  44231     37681
modeling_bocu.txt      165313  53073  44624     38768
modeling_scsu.txt      156261  50499  42984     36485

It shows that a longer input doesn't necessarily mean better overall compression, and that SCSU, although useful, isn't really the best representation of Unicode text (the win1251 codepage is a representation too).
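
The experiment above used external tools, but a rough way to reproduce its flavor with standard-library codecs (cp1251, UTF-8, and UTF-16 are available; SCSU and BOCU-1 are not, so they are left out) looks like this; the sample string is arbitrary:

    import lzma

    text = "Моделирование для сжатия текста. " * 500

    for codec in ("cp1251", "utf-8", "utf-16-le"):
        raw = text.encode(codec)
        print(f"{codec:10s} raw: {len(raw):7d}  lzma: {len(lzma.compress(raw)):6d}")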

Shelwien
  • Actually, convert to the bytewise representation described by http://www.unicode.org/faq/compression.html, rather than UTF-8. That document also recommends Burrows-Wheeler Compression for larger Unicode texts. – Jim Mischel Nov 19 '13 at 14:54

PAQ is the new reigning champion of text compression. There are a few different flavors, and information about them can be found here.

There are three flavors that I recommend:

  • ZPAQ - A forward-looking container format for PAQ algorithms (created to make future PAQ development easier)
  • PAQ8PX/PAQ8KX - The most powerful; works with EXE and WAV files as well.
  • PAQ8PF - Faster (both compression and decompression) and mostly intended for TXT files

You have to build them yourself from source; fortunately, someone made a GUI, FrontPAQ, that packages the two best binaries into one.

Once you have a functional binary, it is simple to use; the documentation can be found here.

Note: I am aware this is a very old question, but I wish to include relevant modern information. I came looking for an answer to the same question, and have found a newer, more powerful answer.

ZaxLofful