69

Whenever I see source packages or binaries compressed with gzip, I wonder if there are still reasons to favor gz over xz (time travel to 2000 excluded). The savings of the LZMA compression algorithm are substantial, and decompression isn't orders of magnitude slower than gzip.

soc
  • For what it's worth: decompression is **significantly** faster for `tar.gz` vs. `tar.xz`. Decompressing the xz utils themselves takes ~0.083s for `tar.gz` and ~0.280s for `tar.xz` (pure user time) on my PC. Compression times are also *significantly* worse than gz (and even bzip2!). And with the tendency towards high-bandwidth connections, those tend to rise in priority compared to pure compression ratio. – Joachim Sauer Jun 27 '11 at 13:11
  • But xz compression ratio is so much nicer. If you want speed though, lzo is the choice. That said, some Linux distros use only xz -2 to compress e.g. RPMs, as they have determined -9 *really* is not worth their time. – jørgensen Dec 24 '11 at 03:23
  • See also http://unix.stackexchange.com/q/108100/105116 – Andy Hayden Mar 03 '15 at 07:19
  • Detailed benchmark: http://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO – Vadzim Oct 28 '16 at 16:08

9 Answers

69

"Lowest Common Denominator". The extra space saved is rarely worth the loss of interoperability. Most embedded Linux systems have gzip, but not xz. Many old system as well. Gnu Tar which is the industry standard supports flags -z to process through gzip, and -j to process through bzip2, but some old systems don't support the -J flag for xz, meaning it requires 2-step operation (and a lot of extra diskspace for uncompressed .tar unless you use the syntax of |tar xf - - which many people don't know about.) Also, uncompressing the full filesystem of some 10MB from tar.gz on embedded ARM takes some 2 minutes and isn't really a problem. No clue about xz but bzip2 takes around 10-15 minutes. Definitely not worth the bandwidth saved.

SF.
  • I can decompress xz archives with `tar xvf archive.tar.xz` just fine here. – Artefact2 Jun 27 '11 at 13:30
  • Tar has supported the xz format [since March 2009](http://www.gnu.org/software/tar/#TOCreleases). Of course it'd take another 6-12 months for the new version to filter into Linux distros etc., or longer on enterprise Linuxes. – Rup Jun 27 '11 at 13:52
  • @Artefact2 - wait, so how do I differentiate between plain .tar and .tar.xz if no filename is available? Say, with limited local diskspace sending archive on the fly `tar cvf - dir | ssh user@remotehost "cat >archive.tar.xz"` ? – SF. Jun 27 '11 at 14:07
  • @SF: while your conclusion *might* have been valid in June 2011, I would hardly call bzip2 the "modern alternative" presently. `xz` is the "modern alternative" for cases where you would previously have compressed with `bzip2`. I believe even `xz` at a very low preset, `-1` to `-3`, performs faster and compresses better than `bzip2`. Check around the net for published stats, if one is so inclined. – J. M. Becker Jan 21 '13 at 18:26
  • Modern versions of tar support the uppercase "J" as xz flag: http://linux.die.net/man/1/tar – Diego May 14 '13 at 14:38
  • xz is much faster than bzip2 in decompression: http://lists.fedoraproject.org/pipermail/infrastructure/2010-August/009389.html – Diego May 14 '13 at 14:40
  • @Diego: +1, +1, Thanks for posting that URL! I knew hard stats were floating around online. – J. M. Becker Jul 15 '13 at 17:49
  • As a matter of policy, I never use new technology when the old technology is as good or almost as good. The new technology needs to be *significantly* better than the old for my application before I'm willing to break backward compatibility. – Edward Falk Jul 26 '13 at 00:08
  • @TechZilla indeed xz is faster **and** better than bzip2 at lower presets: http://pokecraft.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO . At the default preset `xz` is "only" ~2x slower than `bzip2`. `xz -2` is faster (**both** compression and decompression) and better than `bzip2 -5`. The one front on which `xz -2` loses badly is memory usage during compression (decompression memory is only slightly higher than bzip2's). So unless portability is a concern, `xz -2` is a winner over `bzip2`. – Hendy Irawan Dec 26 '13 at 18:41
  • @HendyIrawan: Can I force tar to use `xz -2` instead of plain, or do I need to perform compression as a 2-step operation in that case? – SF. Dec 26 '13 at 20:08
  • The Linux Kernel Archives has now dropped bz2 for xz completely: https://www.kernel.org/happy-new-year-and-good-bye-bzip2.html but gz continues to be available! – aalaap Dec 28 '13 at 05:22
  • @SF: Yup, you just gotta set an environment variable. For example, to make `tar -J` compress with xz at preset 2, add this export to your user's `.bashrc` file: `export XZ_OPT='-2'` – J. M. Becker Jan 25 '14 at 01:02
  • "unless you use the syntax of `|tar xf -` - which many people don't know about". do you mean many unix/linux users don't know about _pipes_...? if so, woe betide us all. aaanyway, why type the pointless construct `f -` when specifying no in/outfile already does the same thing – underscore_d Oct 24 '15 at 14:15
  • @underscore_d: well, I'm pretty sure some 10 years ago, when I was learning that, 'f -' was obligatory; otherwise tar would attempt to access the default tape device and fail on not finding one. – SF. Oct 24 '15 at 21:44
  • One machine I use at work doesn't support `-j` in `tar`, but it has `-I` for `bzip2`. – M.M Sep 13 '16 at 10:32
  • tar has the option to specify arbitrary compression programs with `--use-compress-program=prog`, so I'm not sure I agree with the argument of tar doesn't support certain compression types. It's possible this is a relatively new option to tar though. – airfishey Mar 10 '17 at 15:08
  • @airfishey: tar provided by busybox, a standard on most embedded devices, usually has maybe 10 options total. And I'm fairly sure `--use-compress-program=` is not one of them. – SF. Mar 10 '17 at 16:18
  • @SF. I totally agree that busybox utilities are pretty stripped down, with a minimal set of options. I didn't originally realize that you were specifically talking about embedded Linux in your answer. Your answer is valid in that context. I would just state that most graphical distributions probably have the `--use-compress-program` option for tar. Indeed GNU tar, which you also mention in your answer, has the `--use-compress-program` option. Thanks for bringing up the busybox point. – airfishey Mar 10 '17 at 17:21
  • @airfishey: "lowest common denominator" - *all* the possible clients. From most advanced server farms to thumb-drive sized USB dongles that happen to run Linux. – SF. Mar 10 '17 at 17:51
  • @J.M.Becker That's the problem with stackexchange. Old answers stay on the top. Forever. What may have been true in 2011 is completely irrelevant to time travelers like me from 2019. – SurpriseDog Jul 13 '19 at 00:42
67

The ultimate answer is accessibility, with a secondary answer of purpose. Reasons why XZ is not necessarily as suitable as Gzip:

  • Embedded and legacy systems are far more likely to lack sufficient available memory to decompress LZMA/LZMA2 archives such as XZ. As an example, if XZ can shave 400 KiB (vs. Gzip) off of a package destined for an OpenWrt router, what good is the minor space saving if the router has 16 MiB of RAM? A similar situation appears with very old computer systems. One might scoff at the thought of downloading and compiling the latest version of Bash on an ancient SPARCstation LX with 32 MB of RAM, but it happens.

  • Such systems usually have slow processors, and the increase in decompression time can be severe. Three seconds extra to decompress on your Core i5 can be painfully long on a 200 MHz ARM core or a 50 MHz microSPARC. Gzip compression is extremely fast on such processors when compared to all better compression methods such as XZ or even Bzip2.

  • Gzip is pretty much universally supported by every UNIX-like system (and nearly every non-UNIX-like system too) created in the past two decades. XZ availability is far more limited. Compression is useless without the ability to decompress it.

  • Higher compression takes a lot of time. If compression time is more important than compression ratio, Gzip beats XZ. Honestly, lzop is much faster than Gzip and still compresses okay, so applications that need the fastest compression possible and don't require Gzip's ubiquity should look at that instead. I routinely shuffle folders quickly across a trusted LAN connection with commands such as `tar -c * | lzop -1 | socat -u - tcp-connect:192.168.0.101:4444` (a sketch of the receiving end follows this list), and Gzip could be used similarly over a much slower link (i.e. doing the same thing I just described through an SSH tunnel over the Internet).
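
For completeness, a hedged sketch of what the receiving end of that lzop/socat pipeline might look like (the port and target directory are just examples; `lzop -dc` decompresses stdin to stdout the same way gzip does):

    # Listen on port 4444, decompress the stream, unpack into the current dir
    socat -u tcp-listen:4444,reuseaddr - | lzop -dc | tar -xf -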

Now, on the flip side, there are situations where XZ compression is vastly superior:

  • Sending data over slow links. The Linux 3.7 kernel source code is 34 MiB smaller in XZ format than in Gzip format. If you have a super fast connection, choosing XZ could mean saving one minute of download time; on a cheap DSL connection or a 3G cellular connection, it could shave an hour or more off the download time.

  • Shrinking backup archives. Compressing the source code for Apache's httpd-2.4.2 with `gzip -9` vs. `xz -9e` yields an XZ archive that is 62.7% the size of the Gzip archive. If the same compressibility exists in a data set you currently store as 100 GiB worth of .tar.gz archives, converting to .tar.xz archives would cut a whopping 37.3 GiB off of the backup set. Copying this entire backup data set to a USB 2.0 hard drive (maxing out around 30 MiB/sec transfers) as Gzipped data would take 55 minutes, but XZ compression would make the backup take 20 minutes less. Assuming you'll be working with these backups on a modern desktop system with plenty of CPU power and the one-time-only compression speed isn't a serious problem, using XZ compression generally makes more sense. Why shuffle around extra data if you don't need to?

  • Distributing large amounts of data that might be highly compressible. As previously mentioned, Linux 3.7 source code is 67 MiB for .tar.xz and 101 MiB for .tar.gz; the uncompressed source code is about 542 MiB and is almost entirely text. Source code (and text in general) is typically highly compressible because of the amount of redundancy in the contents, but compressors like Gzip, whose dictionary is only 32 KiB, can't take advantage of redundancy beyond that dictionary size.

Ultimately, it all falls back to a four-way tradeoff: compressed size, compression/decompression speed, copying/transmission speed (reading the data from disk/network), and availability of the compressor/decompressor. The selection is highly dependent on the question "what are you planning to do with this data?"

Also check out this related post from which I learned some of the things I repeat here.

Jody Bruchon
  • People should tread carefully, if compressing with `xz`, before setting the `-9` preset. This is an excerpt from the `xz` man page: "The differences between the presets are more significant than with gzip(1) and bzip2(1). The selected compression settings determine the memory requirements of the decompressor, thus using a too high preset level might make it painful to decompress the file on an old system with little RAM. Specifically, it's not a good idea to blindly use -9 for everything like it often is with gzip(1) and bzip2(1)." – J. M. Becker Jan 21 '13 at 18:21
  • @TechZilla is definitely correct; the XZ algorithm is very memory-intensive, and one must take into account the decompression target where with other algorithms it might not be a concern. Reading the man pages, one can see a chart listing the dictionary size of each numbered profile; using any number higher than the one whose dictionary size minimally exceeds your total data size gains nothing at all, while significantly increasing decompressor memory usage. For most smaller data sets, `xz -2e` seems to work well. – Jody Bruchon Feb 02 '13 at 18:46
  • I like how in 2012, a super fast connection would spend 1 minute to download 34 MiB. Ah, the good old days of slow internet. – Jørgen R Oct 12 '15 at 13:34
  • in 2019 a super fast connection spends under 1 second to download 34MiB, if the server can keep up, that is. :) – Steven Lu Dec 17 '19 at 22:17
13

I did my own benchmark on a 1.1 GB Linux installation vmdk image:

rar    =260MB   comp= 85s   decomp= 5s
7z(p7z)=269MB   comp= 98s   decomp=15s
tar.xz =288MB   comp=400s   decomp=30s
tar.bz2=382MB   comp= 91s   decomp=70s
tar.gz =421MB   comp=181s   decomp= 5s

All compression levels at max; CPU: Intel i7-3740QM; memory: 32 GB at 1600 MHz; source and destination on a RAM disk.

I generally use rar or 7z for archiving normal files like documents, and for archiving system files I use .tar.gz or .tar.xz via file-roller, or tar with the -z or -J options along with --preserve, to compress natively with tar and preserve permissions (alternatively .tar.7z or .tar.rar can be used).
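
A minimal sketch of that native tar usage (paths are hypothetical; note that `--preserve` is deprecated in newer GNU tar, where the extract-time spelling is `-p` / `--preserve-permissions`):

    # Create an xz-compressed tar; permission bits are always stored on create
    tar -cJf backup.tar.xz /etc
    # Extract it later, restoring permissions exactly (root needed for foreign owners)
    tar -xpJf backup.tar.xz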

Update: since tar only preserves normal permissions and not ACLs anyway, plain .7z plus backing up and restoring the permissions and ACLs manually via getfacl and setfacl can be used instead. This seems to be the best option for both file archiving and system backup, because it fully preserves permissions and ACLs and offers checksums, integrity testing, and encryption; the only downside is that p7zip is not available everywhere.
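
For what it's worth, that manual ACL round-trip might look roughly like this (paths are hypothetical; `getfacl -R` emits a dump that `setfacl --restore` can replay later):

    cd /srv
    getfacl -R data > data.acl      # dump permissions and ACLs
    7z a data.7z data               # archive the tree itself
    # ... later, after extracting data.7z back into /srv ...
    setfacl --restore=data.acl      # replay the saved ACLs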

  • Wow, RAR is impressive. I do wonder if this is not a fair comparison, since the disc image may already be gzipped? (That's just my guess?). Seems to disagree with http://superuser.com/a/205984/210528 :s Or has RAR been improving?? – Andy Hayden Mar 03 '15 at 07:26
  • The input file was a virtual hard disk image without compression. However, I agree that the compression ratio is very variable and depends on many factors; there are many cases where 7z (LZMA) or xz (LZMA2) will achieve a better compression ratio than rar, but at the same compression ratio rar is usually faster. – Microsoft Linux TM Mar 03 '15 at 08:36
  • If you use pigz (a multi-threaded version of gzip), it is remarkably faster than the old gzip/rar/7z. In my test (about 4.8 GB of source code files), without pigz: real 1m50.754s, user 1m50.360s, sys 0m3.716s; with pigz: real 0m23.514s, user 2m55.220s, sys 0m3.964s. The resulting file sizes differ very little. – zw963 Jan 24 '16 at 07:33
  • "tar only preserve normal permissions and not ACLs anyway" `man tar` -> `--acls --xattrs --selinux` # surely, one of these will help. – Kent Fredric Feb 13 '20 at 00:16
13

From the author of the lzip compression utility:

Xz has a complex format, partially specialized in the compression of executables and designed to be extended by proprietary formats. Of the four compressors tested here, xz is the only one alien to the Unix concept of "doing one thing and doing it well". It is the least appropriate for data sharing, and not appropriate at all for long-term archiving.

In general, the more complex the format, the less probable that it can be decoded in the future. But the xz format, just like its infamous predecessor lzma-alone, is especially badly designed. Xz copies almost all the defects of gzip and then adds some more, like the fragile variable-length integers. Just one bit flip in bit 7 of any byte of a variable-length integer and the whole xz stream comes tumbling down like a house of cards. Using xz for anything other than compressing short-lived executables is not advisable.

Don't get me wrong. I am very grateful to Igor Pavlov for inventing/discovering LZMA, but xz is the third attempt by his followers to take advantage of the popularity of 7zip and replace gzip and bzip2 with inappropriate or badly designed formats. In particular, it is shameful that support for lzma-alone was implemented in both GNU and Linux.

http://www.nongnu.org/lzip/lzip_benchmark.html

Tobi Nary
Harri Järvi
8

Honestly, I only just got to know the .xz format from some training material, so I used its git repo to do a test. The repo is git://git.free-electrons.com/training-materials.git, and I also compiled the three training slides. The total directory size is 91 MB, with a mixture of text and binary data.

Here is my quick result. Maybe people still favor tar.gz simply because it's much faster to compress? I personally even use plain tar when there isn't much to be gained from compression.

[02:49:32]wujj@WuJJ-PC-Linux /tmp $ time tar czf test.tgz training-materials/

real    0m3.371s
user    0m3.208s
sys     0m0.128s
[02:49:46]wujj@WuJJ-PC-Linux /tmp $ time tar cJf test.txz training-materials/

real    0m34.557s
user    0m33.930s
sys     0m0.372s
[02:50:31]wujj@WuJJ-PC-Linux /tmp $ time tar cf test.tar training-materials/

real    0m0.117s
user    0m0.020s
sys     0m0.092s
[02:51:03]wujj@WuJJ-PC-Linux /tmp $ ll test*
-rw-rw-r-- 1 wujj wujj 91944960 2012-07-09 02:51 test.tar
-rw-rw-r-- 1 wujj wujj 69042586 2012-07-09 02:49 test.tgz
-rw-rw-r-- 1 wujj wujj 60609224 2012-07-09 02:50 test.txz
[02:56:03]wujj@WuJJ-PC-Linux /tmp $ time tar xzf test.tgz

real    0m0.719s
user    0m0.536s
sys     0m0.144s
[02:56:24]wujj@WuJJ-PC-Linux /tmp $ time tar xf test.tar

real    0m0.189s
user    0m0.004s
sys     0m0.108s
[02:56:33]wujj@WuJJ-PC-Linux /tmp $ time tar xJf test.txz

real    0m3.116s
user    0m2.612s
sys     0m0.184s
wujj123456
3

gz is supported everywhere and good for portability.

xz is newer and not as widely or well supported. It is more complex than gzip, with more compression options.

This is not the only reason people might not always use xz. Compressing with xz can take a very long time, so even when it produces superior results it might not be chosen. Another weakness is that it can use a lot of memory, especially for compression. The harder you try to compress, the longer it takes, with steeply diminishing returns.
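
For what it's worth, xz can be told to cap its memory appetite; a minimal sketch (the limits and file name are arbitrary examples):

    # Cap compression memory; xz scales its settings down to fit the limit
    xz --memlimit-compress=256MiB -9 bigfile
    # Cap decompression memory too (xz errors out if the file needs more)
    xz --memlimit-decompress=64MiB -d bigfile.xz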

However, at compression level 1 for large binary items, in my experience xz can often produce much smaller results in less time than zlib at level 9. This can sometimes be a very significant difference: in the same time zlib takes, xz can make a file that is half the size of zlib's file.
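
A quick way to check that claim on your own data (the file name is a placeholder; `-c` makes both tools write to stdout, leaving the original untouched):

    # Compare gzip's best preset against xz's fastest on the same input
    time gzip -9c big.bin > big.bin.gz
    time xz -1c big.bin > big.bin.xz
    ls -l big.bin big.bin.gz big.bin.xz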

bzip2 is in a similar situation, but xz has clearer advantages and a wide band of settings where it performs significantly better all round.

jgmjgm
3

For the same reason people on Windows use zip files instead of 7zip, and some still use rar instead of other formats... or mp3 is used for music instead of AAC+, and so on.

Each format has its benefits, and people tend to stick with the solution they learned when they began using a computer. Add to this backward compatibility, fast bandwidth, and GBs or TBs of hard drive space, and the benefits of greater compression won't seem that relevant.

woliveirajr
  • But internet links don't grow as fast as hard disks... – jørgensen Dec 24 '11 at 03:24
  • Yea, I've heard that line at work before also... "but the storage is so cheap, yadda yadda"... "who cares if we piss away oodles of space". But even if you believe that, and don't value things like a quick backup, bandwidth is indeed a completely different animal, especially if you're serving a huge tarball to countless downloaders. – J. M. Becker Jan 25 '14 at 01:06
1

Also, one important point in gzip's favor is that it is interoperable with rsync/zsync, which can be a huge bandwidth benefit in some cases. LZMA/bzip2/xz don't support rsync and probably won't anytime soon. One of the characteristics of LZMA is that it uses a quite large window; to make it rsync/zsync friendly we would probably need to reduce that window, which would degrade its compression performance.
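
For what it's worth, the rsync-friendly mode alluded to above is exposed as a flag on many gzip builds (upstream gzip since 1.7, and distro-patched binaries before that); a minimal sketch with hypothetical names:

    # --rsyncable periodically resets the compressor so rsync's rolling
    # checksum can match unchanged blocks; output is only slightly larger
    tar -c mydir | gzip --rsyncable > mydir.tar.gz
    rsync -av mydir.tar.gz user@backuphost:/backups/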

Ondrej Bozek
1

Yeah, the thought I had is that the original question could be reposed these days as "why is tar.gz more common than tar.lz?" (lz seems to compress slightly better than xz, and xz is said to be a poor choice for archiving, though it does offer some nice features like random access). I suppose the answer is momentum: people are used to using it, there's good library support, etc. The introduction of lz may also mean that xz grows less quickly now, FWIW...

That being said, lz appears to decompress more slowly than xz, and there are new things on the horizon like Brotli, so it's unclear what will happen in terms of popularity... but I have seen a few .lz files in the wild, FWIW...

rogerdpack