
Why I prefer tar to zip

I love having choices when it comes to computing, and especially in the world of open source we’re spoilt when it comes to archiving files. There’s TAR, ZIP, GZIP, BZIP2, XZ, 7Z, AR, ZOO, and more. Of all compression formats, it seems that ZIP has gained ubiquity. It’s the one you can use to archive and extract data on nearly every system, including Linux, UNIX, FreeDOS, Android, Windows, macOS, and more. The problem is, ZIP isn’t the best tool for the job of archival. Here’s why I use TAR instead of ZIP whenever possible.

Each archiving format has an associated command, such as tar, zip, gzip and gunzip, xz, and so on. In terms of compression, they all tend to be basically the same at this point. You might save a few kilobytes or megabytes with one compression algorithm given a specific combination of file types, but it’s fair to say that they all result in broadly similar results. Where they differ is in what each command makes available, and what each file format retains.

The tar and zip command showdown

At first glance, tar and zip are similar in capability.

By default, the tar command generates an archive that’s not compressed. It’s just a single file object that contains smaller file objects within it. The resulting object is basically the same size as the sum of its parts:

$ tar --create --file archive.tar pic.jpg file.txt
$ ls -lG
-rw-r--r-- 1 tux 46049280 Jan  7 10:55 archive.tar
-rw-r--r-- 1 tux 45965374 Jan  7 10:55 file.txt
-rw-r--r-- 1 tux    77673 Jan  7 08:34 pic.jpg

You can use the -0 option to simulate this with the zip command:

$ zip -0 archive.zip pic.jpg file.txt
  adding: pic.jpg (stored 0%)
  adding: file.txt (stored 0%)
$ ls -lG
-rw-r--r-- 1 tux 46049280 Jan  7 10:55 archive.tar
-rw-r--r-- 1 tux 46043355 Jan  7 10:57 archive.zip
-rw-r--r-- 1 tux 45965374 Jan  7 10:55 file.txt
-rw-r--r-- 1 tux    77673 Jan  7 08:34 pic.jpg

The most common use case of each command, however, definitely includes compression.

Level of compression

The balance in choosing either an algorithm (in the case of tar) or a compression level (in the case of zip) is between compression speed and archive size. In theory, the slower you let the command compress, the smaller the resulting archive. The faster the compression, the bigger the archive.

Both commands strive to provide you with some control over this.
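As a quick illustration of that tradeoff with a single algorithm (gzip here, standing in for zip's -0 to -9 levels; the sample data is made up, and exact sizes will vary):

```shell
# Generate a highly compressible sample file (hypothetical data).
yes 'the same line over and over' | head -n 5000 > sample.txt

gzip -1 -c sample.txt > fast.gz    # fastest compression, larger result
gzip -9 -c sample.txt > small.gz   # slowest compression, smaller result

ls -l fast.gz small.gz
```

On repetitive data like this, the -9 archive comes out no larger than the -1 archive, at the cost of extra CPU time.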

By default (without the -0 option), the zip command also compresses the archive it has created. You can adjust the amount of compression with an option ranging from -0 to -9. The default level is -6.

To add compression to the tar command, you can either use a separate command entirely to compress the resulting TAR file, or you can use one of several options to choose which compression algorithm gets applied to the TAR file during its creation. Here’s an incomplete list:

  • -z or --gzip: Filters the archive through gzip
  • -j or --bzip2: Filters the archive through bzip2
  • -J or --xz: Filters the archive through xz
  • --lzip: Filters the archive through lzip
  • -Z or --compress: Filters the archive through compress
  • --zstd: Filters the archive through zstd
  • --no-auto-compress: Prevents tar from using the archive suffix to determine the compression program so you can specify one (or not) yourself

Decoupling the process of archiving from compression makes sense to me. While the zip command is stuck with basically the same old algorithm year after year, a TAR archive can be compressed using whatever compression algorithm you think is best. In some cases, you might make that determination based on the type of data you’re compressing, or you might be limited to the capabilities of your target system, or you might just want to test a hot new compression algorithm.
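A minimal sketch of that decoupled approach, with hypothetical file names (gzip shown here, but any compressor that works on a single file would do):

```shell
# Create a sample file to archive (hypothetical content).
printf 'some sample text\n' > file.txt

# Step 1: archive only, no compression.
tar --create --file archive.tar file.txt

# Step 2: compress the finished archive with a separate, swappable tool.
gzip archive.tar

ls archive.tar.gz
```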

Here’s what the zip command does with a 44 MB text file and a JPEG file, at maximum compression:

$ zip -9 archive.zip file.txt pic.jpg 
  adding: file.txt (deflated 90%)
  adding: pic.jpg (deflated 14%)
$ ls -lG
-rw-r--r-- 1 tux 4.4M Jan  7 11:17 archive.zip
-rw-r--r-- 1 tux  44M Jan  7 10:55 file.txt
-rw-r--r-- 1 tux  76K Jan  7 08:34 pic.jpg

A compressed archive of 4.4 MB down from a little more than 44 MB isn’t bad.

Similarly, the tar command with the --gzip option produces a 4.5 MB archive. However, filtering tar through --xz makes a significant improvement:

$ tar --create --xz --file archive.tar.xz file.txt pic.jpg 
$ ls -lG
-rw-r--r-- 1 tux users 3.3M Jan  7 11:17 archive.tar.xz
-rw-r--r-- 1 tux users  44M Jan  7 10:55 file.txt
-rw-r--r-- 1 tux users  76K Jan  7 08:34 pic.jpg

At 3.3 MB, it seems that a newer compression algorithm has outperformed ZIP, at least in this particular test. I’m the first to admit that compression tests are subject to many variables, so it’s not globally significant that XZ has done better than ZIP in this one example. With some experimentation, I could probably devise a test that gets better results from ZIP. However, this example does demonstrate that it’s useful to have an archive tool that is modular enough to allow for the development of new algorithms.

Output manipulation

When you extract data from a TAR or ZIP archive, you can choose to either extract specific files or to extract everything all at once. I believe it’s most common to extract everything, because that’s the default behaviour on major desktops like GNOME and macOS. With both the tar and unzip commands, even when you choose to extract everything all at once, you still have a choice of where to put the files you’ve extracted.

By default, both the tar and unzip commands extract all files into the current directory. If the archive itself contains a directory, then that directory serves as a “container” for the extracted files. Otherwise, the files appear in your current directory. This can get messy, but it’s a common enough problem that Linux and UNIX users call it a “tarbomb” because it sometimes feels like an archive has exploded and left file shrapnel in its wake.

However, a tarbomb (or zipbomb) isn’t inherently bad. It’s a valid use case when you want to essentially overlay updated or additional files into an existing file system. For example, suppose you have a website consisting of several PHP files across several directories. You can take a copy of the site to your development machine to make updates, and then create an archive of the files you’ve updated. Extract the archive on your web server, and each new version of any file is extracted exactly where it originated from because both tar and unzip retain the filesystem’s structure. I use this feature when doing dot-release updates of several different content management systems, and it makes maintenance pleasantly simple.
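A toy version of that overlay workflow, using a made-up site layout (the directory names and file contents here are purely illustrative):

```shell
# Simulate a deployed "site" and a local working copy with one updated file.
mkdir -p site/includes copy/includes
printf 'old\n' > site/includes/config.php
printf 'new\n' > copy/includes/config.php

# Archive only the changed file, keeping its relative path intact.
tar --create --file updates.tar -C copy includes/config.php

# Extracting at the site root overlays the new version exactly in place.
tar --extract --file updates.tar -C site

cat site/includes/config.php
```

This prints `new`: the updated file has landed on top of the old one, in its original location.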

Both the unzip and tar commands provide an option to change directory before extraction so you can store an archive in one directory but send extracted files to a different location.

Use the --directory option with the tar command:

$ mkdir mytar
$ tar --extract --file archive.tar.xz --directory ./mytar
$ ls ./mytar
file.txt   pic.jpg

Use the -d option with unzip:

$ mkdir myzip
$ unzip archive.zip -d ./myzip
$ ls ./myzip
file.txt   pic.jpg

One feature unzip doesn’t have is the ability to drop leading directories from the archive before extraction. For example, suppose you want to extract files directly into myzip, but you’ve been given an archive containing a leading directory called chaff:

$ unzip archive+chaff.zip -d ./myzip
$ ls ./myzip
chaff
$ ls ./myzip/chaff
file.txt   pic.jpg

You don’t want chaff, but there’s no option in unzip to skip it.

Frustratingly, the unzip command essentially encourages this anti-pattern. In order to avoid delivering a zipbomb to someone, you thoughtfully nest your files in a useless folder. But by nesting everything in a useless folder, you’ve also prevented your user from extracting only the files required.

The tar command solves this problem elegantly. You can protect your users from a tarbomb by nesting your files in a useless directory, because tar allows any user to skip over any number of leading directories:

$ tar --extract --strip-components=1 \
  --file archive+chaff.tar.xz --directory ./mytar
$ ls ./mytar
file.txt  pic.jpg
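Here’s a self-contained recreation of that scenario you can run end to end, building a throwaway archive with its own leading chaff directory:

```shell
# Build an archive whose contents all sit under a leading "chaff" directory.
mkdir -p chaff out
printf 'hello\n' > chaff/file.txt
tar --create --file wrapped.tar chaff

# --strip-components=1 drops the leading directory on extraction.
tar --extract --strip-components=1 --file wrapped.tar --directory out

ls out
```

The extracted tree contains file.txt directly, with no chaff directory in sight.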

Permission and ownership

The ZIP file format doesn’t preserve file ownership. The TAR file format does.

You might not notice this when using ZIP or TAR archives just on your own personal systems. Once a file is extracted, you own the file. However, using tar as a superuser or with the --same-owner option extracts each file with the same ownership it had when archived, assuming the same user and group is available on the system. There’s no option for that with the unzip command because the ZIP file format doesn’t track ownership.
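A trivial sketch of the ownership round trip (run here as the files’ own owner; restoring files to *other* users’ ownership requires root):

```shell
# Archive a file, then extract it while asking tar to restore recorded ownership.
printf 'data\n' > owned.txt
tar --create --file owned.tar owned.txt

mkdir -p restore
tar --extract --same-owner --file owned.tar --directory restore

ls -l restore/owned.txt
```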

The zip command can preserve file permissions, but again tar offers a lot more flexibility. The --same-permissions, --no-same-permissions, and --mode options let you control the permissions assigned to archived files.
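For instance, GNU tar’s --mode option can force a mode on files as they’re archived, regardless of their on-disk permissions (the file name here is hypothetical):

```shell
# Create a file with restrictive permissions.
printf 'secret\n' > notes.txt
chmod 600 notes.txt

# Record it in the archive as world-readable and read-only instead.
tar --create --mode='a-w,a+r' --file readonly.tar notes.txt

mkdir -p extracted
tar --extract --file readonly.tar --directory extracted
ls -l extracted/notes.txt
```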

Better archiving with tar

It’s easy to use either ZIP or TAR interchangeably, because for most general-purpose activities their default behaviour is similar and suitable. However, if you’re using archives for mission-critical work involving disparate systems and a diverse set of people, TAR is the technically superior choice. Whether TAR is the “correct” choice depends entirely on your target audience, because there’s no doubt that ZIP has greater support. But all things being equal, TAR is the archive format and tar is the archive command I prefer.
