10

I have googled, wikied and read the RFC of ZIP, but can't find any info about the exact algorithm which is used in ZIP.

I have found info about ZIP == TAR + GZIP

But, I'm confused by this info.

Since GZIP uses the LZW algorithm, as far as I remember, and TAR uses LZMA, I can't see how ZIP == TAR + GZIP could hold (LZMA + LZW?).

Could you help me with finding the algorithm of ZIP? I want to implement it.

Oreo
  • 1
    ZIP can use any of several algorithms. There's a spec lying around on the web somewhere... – Hot Licks Apr 18 '12 at 17:22
  • 2
    Ah, [here it is](http://www.pkware.com/documents/casestudies/APPNOTE.TXT): Deflate, Deflate64, Implode, BZIP2, LZMA, or PPMd+. – Hot Licks Apr 18 '12 at 17:24

2 Answers

14

Zip provides capabilities roughly equivalent to the combination of tar with gzip.

tar just collects a number of files together into a single file, preserving information about the original files (e.g., paths, dates). Contrary to the statement in the question, it does no compression by itself.

gzip just takes a single file and compresses it.

Zip does both of those -- i.e., it stores a number of constituent files in an archive (again, preserving things like paths, dates, etc.), and compresses them. Unlike tar + gzip, it compresses each file individually, and leaves the "directory" information about the constituent files uncompressed. This makes it easy to work with individual files in the archive (insert, delete, decompress, etc.), but it also means the overall compression ratio is usually not as good, because redundancy shared between files cannot be exploited.
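To make the per-file behaviour concrete, here is a minimal sketch using Python's standard `zipfile` module (my choice for illustration, not something the question asked for). The member names and contents are made up. It lists the archive directory without decompressing anything, then extracts a single member.

```python
import io
import zipfile

# Hypothetical member names and contents, chosen only for illustration.
files = {
    "docs/readme.txt": b"hello zip " * 200,
    "data/values.csv": b"1,2,3\n" * 500,
}

# Build the archive in memory; each member is compressed on its own.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, payload in files.items():
        zf.writestr(name, payload)

with zipfile.ZipFile(buf) as zf:
    # The directory is readable without touching the compressed data.
    names = zf.namelist()
    for info in zf.infolist():
        print(info.filename, info.file_size, "->", info.compress_size)
    # Only this one member gets decompressed.
    single = zf.read("data/values.csv")
```

With a `.tar.gz`, by contrast, reaching a member generally means decompressing the stream up to that point.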

Rather than re-implementing zip's compression algorithm, you're almost certainly better off downloading the code (extremely portable, very liberal license) from the zlib web site. The zlib web site also has a fairly reasonable explanation of the algorithms. If you really insist on doing this yourself, you probably also want to look at RFC 1950 (the zlib format), RFC 1951 (deflate itself), and RFC 1952 (the gzip format).
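As a rough illustration of how those three RFCs relate, the sketch below uses Python's `zlib` binding (an assumption on my part; the C library exposes the same choice through `deflateInit2`). The `deflate` helper is a hypothetical name. The same payload is compressed three times, and only the wrapper around the deflate data changes.

```python
import gzip
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 100

def deflate(payload: bytes, wbits: int) -> bytes:
    """Compress with deflate; wbits selects the container format."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, wbits)
    return compressor.compress(payload) + compressor.flush()

raw_stream = deflate(data, -15)   # bare deflate stream (RFC 1951)
zlib_stream = deflate(data, 15)   # zlib wrapper (RFC 1950)
gzip_stream = deflate(data, 31)   # gzip wrapper (RFC 1952)

# Each variant decodes back to the same bytes.
from_raw = zlib.decompress(raw_stream, -15)
from_zlib = zlib.decompress(zlib_stream)
from_gzip = gzip.decompress(gzip_stream)

for label, stream in [("raw", raw_stream), ("zlib", zlib_stream), ("gzip", gzip_stream)]:
    print(label, len(stream))
```

Zip entries compressed with deflate hold the bare form; the zip format supplies its own headers and checksum.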

Jerry Coffin
  • 1
    That's also what [Wikipedia](http://en.wikipedia.org/wiki/Tar_(file_format)#Naming_of_compressed_tar_files) says. – fb55 Apr 18 '12 at 17:24
  • 1
    Note that zlib only implements the compression/decompression, not the archiving mechanism. – Hot Licks Apr 18 '12 at 17:29
  • 1
    @HotLicks: Right -- if you want code for the archiving part, that's at the [Info-zip web site](http://www.info-zip.org). – Jerry Coffin Apr 18 '12 at 17:30
5

"zip" in this context is a file format that permits several different compression methods. They include deflate, deflate64, bzip2, lzma, wavpack, and ppmd. In practice, however, you will almost always see only deflate used in zip files, for compatibility.
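A small sketch of this, assuming Python's standard `zipfile` module, which supports only a subset of the methods listed (stored, deflate, bzip2, lzma). The member names are hypothetical. Each member is written with a different method, and the method recorded in the archive is read back.

```python
import io
import zipfile

payload = b"abcabcabc " * 300

# Hypothetical member names mapped to the compression method used for each.
methods = {
    "stored.txt": zipfile.ZIP_STORED,
    "deflated.txt": zipfile.ZIP_DEFLATED,
    "bzip2.txt": zipfile.ZIP_BZIP2,
    "lzma.txt": zipfile.ZIP_LZMA,
}

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name, method in methods.items():
        zf.writestr(name, payload, compress_type=method)

with zipfile.ZipFile(buf) as zf:
    # The method is stored per member, so one archive can mix them.
    recorded = {info.filename: info.compress_type for info in zf.infolist()}
    contents = {name: zf.read(name) for name in methods}
    for info in zf.infolist():
        print(info.filename, "method", info.compress_type, "size", info.compress_size)
```

An archive written this way may not open in older tools, which is the compatibility reason for sticking with deflate.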

deflate is also the compression method used in gzip and by zlib, as well as by the png image format.

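deflate is an LZ77 compressor (combined with Huffman coding), not LZ78. LZW, which the question mentions, belongs to the LZ78 family and is not what gzip uses.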

tar is an archiver, not a compressor. It produces the .tar file format. The .tar file is usually compressed (conveniently by the tar program itself calling external programs), which adds a suffix, e.g. .tar.gz for gzip compression. tar options include -z for gzip (.gz), -j for bzip2 (.bz2), and -J for xz (.xz, an LZMA-based format).
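The split between archiving and compressing can be seen with a short sketch, here assuming Python's standard `tarfile` module rather than the tar command line. The `make_tar` helper and the member name are hypothetical. The same member is written once as a plain tar and once as a gzip-compressed tar.

```python
import io
import tarfile

payload = b"some repetitive text\n" * 2000

def make_tar(mode: str) -> bytes:
    """Write one member into an in-memory tar using the given mode."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tf:
        info = tarfile.TarInfo(name="notes.txt")  # hypothetical member name
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

plain = make_tar("w")       # archive only, like a .tar file
gzipped = make_tar("w:gz")  # archive, then gzip, like a .tar.gz file

# Read the member back out of the compressed archive.
with tarfile.open(fileobj=io.BytesIO(gzipped), mode="r:gz") as tf:
    restored = tf.extractfile("notes.txt").read()

print(len(payload), len(plain), len(gzipped))
```

The plain archive comes out larger than the input because tar adds headers and padding; the size reduction comes only from the gzip step.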

You do not need to implement the algorithm for deflate. It has been done for you. You can use zlib in your code, which has a very liberal license.

Mark Adler