15.7. Save Space: tar and compress a Directory Tree
In the Unix filesystem, files are stored in blocks. Each nonempty file, no matter how small, takes up at least one block. A directory tree full of little files can fill up a lot of partly empty blocks. A big file is more efficient because it fills all (except possibly the last) of its blocks completely.
The tar (Section 39.2) command can read lots of little files and put them into one big file. Later, when you need one of the little files, you can extract it from the tar archive. Seems like a good space-saving idea, doesn't it? But tar, which was really designed for magnetic tape archives, adds "garbage" characters at the end of each file to make it an even size. So, a big tar archive uses about as many blocks as the separate little files do.
Okay, then why am I writing this article? Because the gzip (Section 15.6) utility can solve the problem. It squeezes files down -- compressing them to get rid of repeated characters. Compressing a tar archive typically saves 50% or more. The bzip2 (Section 15.6) utility can save even more.
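If you'd like to see the effect on your own system, here is a rough sketch that builds a throwaway tree of tiny files and compares it with its tar and gzipped tar archives. The directory and filenames are invented for the demonstration, and the numbers you get will depend on your filesystem's block size and your version of tar.

```shell
#!/bin/sh
# Sketch: compare a tree of small files with its tar and tar.gz archives.
# All names below (tree, file1.txt, ...) are invented for this demonstration.
work=$(mktemp -d)
mkdir "$work/tree"

# Create 200 tiny files; each one still occupies at least one disk block.
i=1
while [ "$i" -le 200 ]; do
    echo "small file number $i" > "$work/tree/file$i.txt"
    i=$((i + 1))
done

# Archive the tree, then compress a copy of the archive.
(cd "$work/tree" && tar cf - .) > "$work/tree.tar"
gzip -c "$work/tree.tar" > "$work/tree.tar.gz"

# du reports the space allocated on disk; wc -c counts bytes of content.
tree_kb=$(du -sk "$work/tree" | awk '{print $1}')
tar_bytes=$(wc -c < "$work/tree.tar" | tr -d ' ')
gz_bytes=$(wc -c < "$work/tree.tar.gz" | tr -d ' ')

echo "tree on disk:    $tree_kb KB"
echo "tar archive:     $tar_bytes bytes"
echo "gzipped archive: $gz_bytes bytes"

rm -rf "$work"
```

The uncompressed archive should come out much larger than the few kilobytes of text in the files, and the gzipped archive much smaller than the uncompressed one.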
WARNING: If your compressed archive is corrupted somehow -- say, a disk block goes bad -- you could lose access to all of the files. That's because neither tar nor compression utilities recover well from missing data blocks. If you're archiving an important directory, be sure you have good backup copies of the archive.
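One inexpensive precaution is to test the compressed file from time to time. The sketch below uses gzip -t, which reads the whole file and checks it without writing anything, and simulates lost data by cutting a copy of the archive in half. The filenames are invented, and truncation is only a stand-in for real disk damage.

```shell
#!/bin/sh
# Sketch: detect a damaged compressed archive before you depend on it.
# File names are invented for this demonstration.
work=$(mktemp -d)
mkdir "$work/project"
echo "important data" > "$work/project/notes.txt"
(cd "$work/project" && tar cf - .) | gzip > "$work/project.tar.gz"

# gzip -t checks the compressed data and exits nonzero if it is damaged.
if gzip -t "$work/project.tar.gz"; then
    good_status=ok
else
    good_status=damaged
fi

# Simulate lost data by keeping only the first half of the archive.
size=$(wc -c < "$work/project.tar.gz" | tr -d ' ')
dd if="$work/project.tar.gz" of="$work/damaged.tar.gz" \
    bs=1 count=$((size / 2)) 2> /dev/null

if gzip -t "$work/damaged.tar.gz" 2> /dev/null; then
    bad_status=ok
else
    bad_status=damaged
fi

echo "intact archive:    $good_status"
echo "truncated archive: $bad_status"
rm -rf "$work"
```

A test like this tells you that something is wrong; it doesn't get your files back, so it is no substitute for the backup copies.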
Making a compressed archive of a directory and all of its subdirectories is easy: tar copies the whole tree when you give it the top directory name. Just be sure to save the archive in some directory that won't be copied -- so tar won't try to archive its own archive! I usually put the archive in the parent directory. For example, to archive my directory named project, I'd use the following commands. The .tar.gz extension isn't required; it's just a convention, and another common convention is .tgz. I've added the gzip --best option for more compression -- but it can be a lot slower, so use it only if you need to squeeze out every last byte. bzip2 is another way to save bytes, so I'll show versions with both gzip and bzip2. No matter what command you use, watch carefully for errors:
% cd project
% tar clf - . | gzip --best > ../project.tar.gz
% gzcat ../project.tar.gz | tar tvf -        Quick verification
% tar clf - . | bzip2 --best > ../project.tar.bz2
% bzcat ../project.tar.bz2 | tar tvf -       Quick verification
% cd ..
% rm -r project
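If you do this often, you may want a script that refuses to delete anything until the archive checks out. Here is a minimal sketch; archive_tree is a hypothetical helper of my own naming, not a standard command, and the entry-count comparison is only a rough check that assumes an ordinary tree of files and directories. It uses gzip -dc, which does the same job as gzcat, because gzcat isn't installed everywhere.

```shell
#!/bin/sh
# Sketch: archive a directory into its parent and verify the archive before
# anything is deleted. archive_tree is a hypothetical helper, not a standard
# command.
archive_tree() {
    dir=$1
    archive="$(dirname "$dir")/$(basename "$dir").tar.gz"

    # Create the archive next to (not inside) the directory being archived.
    # The pipeline's exit status is gzip's, so the checks below are what
    # catch an incomplete archive.
    (cd "$dir" && tar cf - .) | gzip --best > "$archive" || return 1

    # Check the compressed data, then compare entry counts. tar lists "./"
    # and find lists ".", so the two counts match for an ordinary tree.
    gzip -t "$archive" || return 1
    archived=$(gzip -dc "$archive" | tar tf - | wc -l | tr -d ' ')
    on_disk=$(cd "$dir" && find . | wc -l | tr -d ' ')
    [ "$archived" -eq "$on_disk" ] || return 1

    echo "$archive"
}

# Demonstration on a throwaway tree with invented filenames.
work=$(mktemp -d)
mkdir -p "$work/project/io"
echo 'int main(void) { return 0; }' > "$work/project/xcms.c"
echo '/* input */' > "$work/project/io/input.c"

if result=$(archive_tree "$work/project"); then
    echo "verified: $result"
    rm -r "$work/project"      # remove the original only after verification
else
    echo "archive failed verification; original left in place" >&2
fi
```

Matching counts don't prove that every byte was stored correctly, so you should still read any error messages tar prints.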
Go to http://examples.oreilly.com/upt3 for more information on: tar
If you have GNU tar or another version with the z option, it will run gzip for you. This method doesn't use the gzip --best option, though -- so you may want to use the previous method to squeeze out all you can. Some GNU tar versions have an I option to run bzip2; later GNU tar releases use j for bzip2 instead. Watch out for other tar versions that use -I as an "include file" operator -- check your manpage or tar --help. If you want to be sure that you don't have a problem like this, use the long options (--gzip and --bzip2) because they're guaranteed not to conflict with something else; if your tar doesn't support the particular compression you've asked for, it will fail cleanly rather than do something you don't expect.
Using the short flags to get compression from GNU tar, you'd write the previous tar command lines as follows:
tar czlf ../project.tar.gz .
tar cIlf ../project.tar.bz2 .
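For comparison, here is a sketch of the same idea with the long options, which avoid the question of what a single letter means to your tar. It assumes a tar that accepts --create, --list, --gzip, --bzip2, and --file, as GNU tar does; the directory and filenames are invented, and I've left out the l option to keep the example short.

```shell
#!/bin/sh
# Sketch: create and list archives with long option names.
# Assumes a tar that accepts --create, --list, --gzip, --bzip2 and --file,
# as GNU tar does; the directory names are invented for this demonstration.
work=$(mktemp -d)
mkdir "$work/project"
echo "hello" > "$work/project/README"
cd "$work/project" || exit 1

# Write the archive into the parent directory so tar doesn't archive it.
tar --create --gzip --file=../project.tar.gz .
gz_listing=$(tar --list --gzip --file=../project.tar.gz)
echo "$gz_listing"

# Only try bzip2 compression if the bzip2 program is installed.
if command -v bzip2 > /dev/null 2>&1; then
    tar --create --bzip2 --file=../project.tar.bz2 .
    tar --list --bzip2 --file=../project.tar.bz2
fi
```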
In any case, the tar l (lowercase letter L) option will print messages if any of the files you're archiving have other hard links (Section 10.4). If a lot of your files have other links, archiving the directory may not save much disk space -- the other links will keep those files on the disk, even after your rm -r command.
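You can also look for multiply linked files before you archive anything. A small sketch, using the find -links test on a throwaway directory with invented filenames:

```shell
#!/bin/sh
# Sketch: list files that have more than one hard link before archiving.
# File names are invented for this demonstration.
work=$(mktemp -d)
mkdir "$work/project"
echo "shared" > "$work/project/shared.txt"
echo "alone" > "$work/project/alone.txt"

# Give shared.txt a second name outside the directory being archived.
ln "$work/project/shared.txt" "$work/elsewhere.txt"

# -links +1 matches files whose link count is greater than one.
multi=$(cd "$work/project" && find . -type f -links +1 -print)
echo "files with other links:"
echo "$multi"

rm -rf "$work"
```

Any file this lists will stay on the disk under its other name after you remove the directory, so removing it there frees no space.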
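Any time you want a list of the files in the archive, use tar t or tar tv, and pipe the output through less (Section 12.3):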
% gzcat project.tar.gz | tar tvf - | less
rw-r--r--239/100    485 Oct  5 19:03 1991 ./Imakefile
rw-rw-r--239/100   4703 Oct  5 21:17 1991 ./scalefonts.c
rw-rw-r--239/100   3358 Oct  5 21:55 1991 ./xcms.c
rw-rw-r--239/100  12385 Oct  5 22:07 1991 ./io/input.c
rw-rw-r--239/100   7048 Oct  5 21:59 1991 ./io/output.c
   ...
% bzcat project.tar.bz2 | tar tvf - | less
   ...
% tar tzvf project.tar.gz | less
   ...
% tar tIvf project.tar.bz2 | less
   ...
To extract all the files from the archive, use one of the following sets of commands:
% mkdir project
% cd project
% gzcat ../project.tar.gz | tar xf -

% mkdir project
% cd project
% bzcat ../project.tar.bz2 | tar xf -

% mkdir project
% cd project
% tar xzf ../project.tar.gz

% mkdir project
% cd project
% tar xIf ../project.tar.bz2
Of course, you don't have to extract the files into a directory named project. You can read the archive file from other directories, move it to other computers, and so on.
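As a sketch of that flexibility, the commands below restore an archive into a directory with a different name and give tar's input an absolute pathname, so the extraction works no matter where the archive lives. The names project, restored, and file.txt are invented, and gzip -dc stands in for gzcat.

```shell
#!/bin/sh
# Sketch: extract an archive into a directory with a different name.
# Directory and file names are invented for this demonstration.
work=$(mktemp -d)
mkdir "$work/project"
echo "data" > "$work/project/file.txt"
(cd "$work/project" && tar cf - .) | gzip > "$work/project.tar.gz"

# The archive stores names relative to "." so it can be unpacked anywhere.
mkdir "$work/restored"
(cd "$work/restored" && gzip -dc "$work/project.tar.gz" | tar xf -)

restored=$(cat "$work/restored/file.txt")
echo "restored contents: $restored"
rm -rf "$work"
```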
You can also extract just a few files or directories from the archive. Be sure to use the exact name shown by the previous tar t command. For instance, to restore the old subdirectory named project/io (and everything that was in it), you'd use one of the previous tar command lines with the filename at the end. For instance:
% mkdir project
% cd project
% gzcat ../project.tar.gz | tar xf - ./io
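To see why the exact name matters, here is a sketch that lists the member names first and then extracts just one subdirectory. Because the archive was made from the current directory (.), every name starts with ./ and that prefix has to be typed too. The filenames are invented, and gzip -dc is used in place of gzcat.

```shell
#!/bin/sh
# Sketch: extract one subdirectory from an archive by its exact member name.
# Directory and file names are invented for this demonstration.
work=$(mktemp -d)
mkdir -p "$work/project/io" "$work/restore"
echo "in" > "$work/project/io/input.c"
echo "x" > "$work/project/xcms.c"
(cd "$work/project" && tar cf - .) | gzip > "$work/project.tar.gz"

cd "$work/restore" || exit 1

# List the member names; they begin with "./" because of how the
# archive was created.
gzip -dc ../project.tar.gz | tar tf -

# Extract only the io subdirectory, using the name as it was listed.
gzip -dc ../project.tar.gz | tar xf - ./io

# Show what was restored: only files under ./io should appear.
extracted=$(find . -type f | sort)
echo "$extracted"
```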
Copyright © 2003 O'Reilly & Associates. All rights reserved.