39

I was wondering if anyone has a list of data compression algorithms. I know basically nothing about data compression, and I was hoping to learn more about the different algorithms and see which ones are the newest and have not yet been implemented on many ASICs.

I'm hoping to implement a data compression ASIC which is independent of the type of data coming in (audio, video, images, etc.).

If my question is too open-ended, please let me know and I'll revise. Thank you.

Veridian
  • Hmmmm, there are a lot of compression algorithms; what are you looking for in terms of the "best", such as speed, entirely lossless operation, or the highest compression ratio? In terms of which have ASICs designed for them, that is more of a research question. I am sure most if not all of the mainstream compression algorithms have some sort of ASIC implementation. – Nomad101 May 09 '13 at 19:18
  • http://www.ccs.neu.edu/home/jnl22/oldsite/cshonor/jeff.html – taocp May 09 '13 at 19:19
  • @taocp broken link – nz_21 May 17 '20 at 22:30

5 Answers

43

There are a ton of compression algorithms out there. What you need here is a lossless compression algorithm. A lossless compression algorithm compresses data such that it can be decompressed to recover exactly what was there before compression. The opposite is a lossy compression algorithm, which permanently discards some of the information in exchange for a smaller output. PNG images use lossless compression, while JPEG images can, and often do, use lossy compression.

Some of the most widely known compression algorithms include:

ZIP archives use DEFLATE, a combination of LZ77 and Huffman coding, to give fast compression and decompression times and reasonably good compression ratios.
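
If you want a feel for how well that combination does, Python's standard-library zlib module implements DEFLATE, so a quick experiment (a sketch, not production code) looks like this:

import zlib

data = b"abcabcabcabc" * 100            # highly repetitive input
compressed = zlib.compress(data, 9)     # level 9 favors ratio over speed
restored = zlib.decompress(compressed)

assert restored == data                 # lossless: the round trip is exact
print(len(data), "->", len(compressed), "bytes")

On repetitive input like this the output is a small fraction of the input size; on random or already-compressed data, DEFLATE can even expand it slightly.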

LZ77 is essentially a generalized form of RLE (run-length encoding), and it will often yield much better results.
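
To make that comparison concrete, here is a toy run-length encoder of my own, purely for illustration. RLE can only collapse runs of a single repeated byte, while LZ77 can refer back to any recently seen sequence of bytes:

def rle_encode(data: bytes) -> list:
    # Collapse each run of a repeated byte into a (count, byte) pair.
    runs = []
    for b in data:
        if runs and runs[-1][1] == b:
            runs[-1] = (runs[-1][0] + 1, b)
        else:
            runs.append((1, b))
    return runs

print(rle_encode(b"aaaaaaaabbbbbcccdd"))
# [(8, 97), (5, 98), (3, 99), (2, 100)]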

Huffman coding lets the most frequently repeated bytes be represented with the fewest bits. Imagine a text file that looked like this:

aaaaaaaabbbbbcccdd

One possible prefix-code table for this file is the following (a strictly optimal Huffman coder would actually assign d the 3-bit code 111 here, but this table keeps the walkthrough simple):

Bits Character
   0         a
  10         b
 110         c
1110         d

So the file would be compressed to this:

00000000 10101010 10110110 11011101 11000000
                                       ^^^^^
                              Padding bits required

18 bytes go down to 5. Of course, the table must be included in the file. This algorithm works better with more data :P
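
For the curious, here is a minimal sketch (assuming nothing beyond the standard library) of how such a table is typically built: count symbol frequencies, then repeatedly merge the two least frequent subtrees with a priority queue. On this example file it produces the optimal 3-bit code for d mentioned above:

import heapq
from collections import Counter

def huffman_table(data: bytes) -> dict:
    freq = Counter(data)
    # Heap entries: (frequency, tiebreaker, tree); a tree is either a
    # leaf (a byte value) or a (left, right) pair of subtrees.
    heap = [(n, i, sym) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        n2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next_id, (left, right)))
        next_id += 1
    # Walk the finished tree: left edges emit 0, right edges emit 1.
    table = {}
    def assign(tree, prefix):
        if isinstance(tree, tuple):
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:
            table[tree] = prefix or "0"      # lone-symbol edge case
    assign(heap[0][2], "")
    return table

codes = huffman_table(b"aaaaaaaabbbbbcccdd")
print({chr(s): c for s, c in codes.items()})
# Code lengths come out as a:1, b:2, c:3, d:3.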

Alex Allain has a nice article on the Huffman Compression Algorithm in case the Wiki doesn't suffice.

Feel free to ask for more information. This topic is pretty darn wide.

user123
  • I'm just asking out of curiosity: are there any compression algorithms that can recognize patterns in the data? For example: `ababab`. – Novak May 09 '13 at 20:04
  • That's a slightly more complex version of RLE, or to be more precise, LZ77 :P (By that, I mean that LZ77 handles it, but it usually won't rewrite a piece of data unless doing so will shrink the file) – user123 May 09 '13 at 20:05
  • @Magtheridon96, wow, thank you very much. Do you know of any resources showing performance benchmarks for these algorithms on different platforms? For instance, how fast someone could get Huffman running, and whether it was a software or hardware implementation? I'm looking to implement a hardware data compression unit (if I see that it makes sense) that would provide a considerable improvement over a software implementation. – Veridian May 13 '13 at 15:46
  • @Magtheridon96, Do I need to know statistical information about the data coming in ahead of time? I'm planning on just dealing with binary data. – Veridian May 13 '13 at 15:54
  • I don't know of any sources that will show you how well these work in terms of performance, but I do know that they may not be effective every single time unless you have several kilobytes of data. Huffman will perform wonderfully if you have a ton of aligned bytes of the same value. LZ77 will work well if you have lots of byte sequences that are identical clustered within ~32KB. So basically, when you have repeating data, this DEFLATE combination will serve you quite well :) – user123 May 13 '13 at 18:23
  • Thank you. What if instead of aaaaaaaabbbbbcccdd, you had the characters out of order, like this: aaaaddddaaaabbbbbaaaaccccbbbbcccdd? – Veridian Jul 17 '14 at 16:05
  • @starbox Then it would be `00001110 11101110 11100000 10101010 10000011 01101101 10101010 10110110 11011101 11000000` with the last 5 `0`'s being padding bits. The order doesn't matter at all. – user123 Jul 18 '14 at 14:15
  • What is your opinion on gzip? – Veridian Oct 16 '14 at 19:04
  • @starbox gzip and zip use the same algorithm (DEFLATE, a combination of LZ77 and Huffman), but gzip compresses a single stream, so when applied to a tar archive it can exploit redundancy across multiple files. ZIP compresses each file individually before adding it to the archive. In general, gzip has the potential to give you better compression ratios. – user123 Oct 18 '14 at 08:20
  • I know I'm digging an old topic, but I'm puzzled about something and I haven't seen anyone else questioning this. I understand the need to add extra padding bits, but when you are "decompressing", how do you know they are just padding bits and not 5 "a"s? We could end up with `aaaaaaaabbbbbcccddaaaaa`, right? Am I missing something here? – Fábio Duque Silva Mar 02 '20 at 01:13
  • @FábioDuqueSilva Well, I don't know how implementations do it, but I know that it's possible with O(1) space because you can add an extra integer to keep track of the padding bits at the very end (i.e., along with the table, you store an extra integer: 5) – user123 Mar 04 '20 at 12:27
5

My paper A Survey Of Architectural Approaches for Data Compression in Cache and Main Memory Systems (permalink here) reviews many compression algorithms as well as techniques for using them in modern processors. It covers both research-grade and commercial-grade compression algorithms/techniques, so you may find one which has not yet been implemented in an ASIC.

user984260
5

Here are some lossless algorithms (the original data can be recovered perfectly using these):

  • Huffman coding
  • LZ78 (and its LZW variant)
  • LZ77
  • Arithmetic coding
  • Sequitur
  • Prediction by partial matching (PPM)

Many well-known formats like PNG or GIF use variants or combinations of these.

On the other hand, there are lossy algorithms too (they compromise accuracy to compress your data, which often works well for perceptual data such as audio and images). State-of-the-art lossy techniques combine ideas from differential coding, quantization, and the DCT, among others.
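
As a toy illustration of the first two ideas (the function names here are made up for this sketch): neighboring samples of a smooth signal differ only slightly, and quantizing those differences discards precision in exchange for small, highly compressible symbols. Updating prev with the reconstructed value keeps the quantization error from accumulating:

def delta_quantize(samples, step=4):
    # Differential coding + uniform quantization (this is the lossy step).
    codes, prev = [], 0
    for s in samples:
        q = round((s - prev) / step)   # information is thrown away here
        codes.append(q)
        prev += q * step               # track what the decoder will reconstruct
    return codes

def delta_dequantize(codes, step=4):
    samples, prev = [], 0
    for q in codes:
        prev += q * step
        samples.append(prev)
    return samples

signal = [100, 104, 109, 112, 115, 113, 110]
print(delta_dequantize(delta_quantize(signal)))
# [100, 104, 108, 112, 116, 112, 112]: close to, but not exactly, the input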

To learn more about data compression, I recommend https://www.elsevier.com/books/introduction-to-data-compression/sayood/978-0-12-809474-7. It is a very accessible introductory text. The 3rd edition is available as a PDF online.

sma
4

There are an awful lot of data compression algorithms around. If you're looking for something encyclopedic, I recommend the Handbook of Data Compression by Salomon et al., which is about as comprehensive as you're likely to get (and has good sections on the principles and practice of data compression as well).

My best guess is that ASIC-based compression is usually implemented for a particular application, or as a specialized element of a SoC, rather than as a stand-alone compression chip. I also doubt that looking for a "latest and greatest" compression format is the way to go here; I would expect standardization, maturity, and fitness for a particular purpose to be more important.

comingstorm
1

LZW, or the Lempel–Ziv–Welch algorithm, is a great lossless one. Pseudocode here: http://oldwww.rasip.fer.hr/research/compress/algorithms/fund/lz/lzw.html
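
In case that link goes stale, the core of the encoder fits in a few lines. This is a toy sketch: real implementations emit variable-width codes and bound the dictionary size, both of which are skipped here:

def lzw_compress(data: bytes) -> list:
    # Start the dictionary with every single-byte string.
    dictionary = {bytes([i]): i for i in range(256)}
    w, out = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc                             # keep extending the current match
        else:
            out.append(dictionary[w])          # emit the longest known match
            dictionary[wc] = len(dictionary)   # learn the new string
            w = bytes([byte])
    if w:
        out.append(dictionary[w])
    return out

print(lzw_compress(b"abababab"))
# [97, 98, 256, 258, 98]: repeated patterns collapse into single codes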

schilippe