15.7. Save Space: tar and compress a Directory Tree
In the Unix
filesystem, files are stored in blocks. Each nonempty file, no matter
how small, takes up at least one block.[47] A directory tree full of little files can
fill up a lot of partly empty blocks. A big file is more efficient
because it fills all (except possibly the last) of its blocks
completely.
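To see the effect on your own system, you can compare the length of a file with the space allocated for it. This is only a minimal sketch, assuming a POSIX shell with mktemp available; the directory and file names are made up for the demonstration. The du figure depends on your filesystem's block size, and some filesystems store tiny files inline and may report 0.

```shell
#!/bin/sh
# Hypothetical demo: one byte of data versus the space allocated for it.
demo=$(mktemp -d)
printf 'x' > "$demo/tiny"

bytes=$(wc -c < "$demo/tiny" | tr -d ' ')
kbytes=$(du -k "$demo/tiny" | cut -f1)   # du -k reports 1024-byte units

echo "data in file:    $bytes byte(s)"
echo "space allocated: $kbytes KB"
rm -r "$demo"
```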
The tar
(Section 39.2) command can read lots of little files
and put them into one big file. Later, when you need one of the
little files, you can extract it from the tar
archive. Seems like a good space-saving idea,
doesn't it? But tar, which was
really designed for magnetic tape
archives, pads each file with
"garbage" characters so that it fills a whole number of
512-byte blocks, and adds a header block for each file as well. So, a big tar
archive uses about as many blocks as the separate little files do.
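The padding is easy to measure. The following sketch, which assumes a POSIX shell, mktemp, and any reasonably standard tar, builds ten 1-byte files and compares the total data with the size of the uncompressed archive. The names little and little.tar are invented for the example, and the exact archive size will vary with your tar's blocking factor.

```shell
#!/bin/sh
# Hypothetical demo: ten 1-byte files versus one tar archive of them.
demo=$(mktemp -d)
mkdir "$demo/little"
for i in 1 2 3 4 5 6 7 8 9 10; do
    printf 'x' > "$demo/little/file$i"
done

# Archive from inside the directory, writing the archive outside it.
(cd "$demo/little" && tar cf ../little.tar .)

data_bytes=$(cat "$demo"/little/file* | wc -c | tr -d ' ')
tar_bytes=$(wc -c < "$demo/little.tar" | tr -d ' ')

echo "data in the files: $data_bytes bytes"
echo "tar archive size:  $tar_bytes bytes"
rm -r "$demo"
```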
Okay, then why am I writing this article? Because the gzip (Section 15.6) utility
can solve the problem. It squeezes files down -- compressing them
to get rid of repeated characters. Compressing a
tar archive typically saves 50% or more. The
bzip2 (Section 15.6)
utility can save even more.
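You can check the savings yourself with a sketch like the one below. It assumes a POSIX shell, mktemp, tar, and gzip; bzip2 is used only if it is installed. The sample files are deliberately repetitive, so they shrink far more than typical data would, and the numbers shouldn't be taken as representative.

```shell
#!/bin/sh
# Hypothetical demo: how much a repetitive tar archive shrinks.
demo=$(mktemp -d)
mkdir "$demo/src"
for i in 1 2 3 4 5; do
    # 200 identical lines per file: highly repetitive, so it compresses well.
    j=0
    while [ "$j" -lt 200 ]; do
        echo "the quick brown fox jumps over the lazy dog"
        j=$((j + 1))
    done > "$demo/src/file$i.txt"
done

(cd "$demo/src" && tar cf ../src.tar .)
gzip -c "$demo/src.tar" > "$demo/src.tar.gz"

tar_bytes=$(wc -c < "$demo/src.tar" | tr -d ' ')
gz_bytes=$(wc -c < "$demo/src.tar.gz" | tr -d ' ')
echo "uncompressed: $tar_bytes bytes"
echo "gzip:         $gz_bytes bytes"

# bzip2 is optional here; skip it if it isn't installed.
if command -v bzip2 >/dev/null 2>&1; then
    bzip2 -c "$demo/src.tar" > "$demo/src.tar.bz2"
    echo "bzip2:        $(wc -c < "$demo/src.tar.bz2" | tr -d ' ') bytes"
fi
rm -r "$demo"
```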
WARNING:
If your compressed archive is corrupted somehow -- say, a disk
block goes bad -- you could lose access to
all of the files. That's
because neither tar nor compression utilities
recover well from missing data blocks. If you're
archiving an important directory, be sure you have good backup copies
of the archive.
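One inexpensive precaution is to test the compressed file before you rely on it. The sketch below assumes a POSIX shell, mktemp, tar, dd, and gzip; it builds a small archive, fakes damage by cutting the file short, and shows that gzip -t accepts the intact copy and rejects the damaged one. All of the file names are invented for the demonstration.

```shell
#!/bin/sh
# Hypothetical demo: gzip -t checks an archive's integrity without unpacking.
demo=$(mktemp -d)
mkdir "$demo/project"
echo "important data" > "$demo/project/notes.txt"
(cd "$demo/project" && tar cf - . | gzip > ../project.tar.gz)

# Simulate damage by keeping only the first 20 bytes of the archive.
dd if="$demo/project.tar.gz" of="$demo/damaged.tar.gz" bs=20 count=1 2>/dev/null

if gzip -t "$demo/project.tar.gz" 2>/dev/null; then good=ok; else good=bad; fi
if gzip -t "$demo/damaged.tar.gz" 2>/dev/null; then hurt=ok; else hurt=bad; fi

echo "intact archive:  $good"
echo "damaged archive: $hurt"
rm -r "$demo"
```

bzip2 has a -t option that does the same job for .bz2 files. A passing test only tells you the compressed stream is intact today; it is no substitute for the backup copies.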
Making a compressed archive of a directory and all of its
subdirectories is easy: tar copies the whole tree
when you give it the top directory name. Just be sure to save the
archive in some directory that won't be
copied -- so tar won't try to
archive its own archive! I usually put the archive in the parent
directory. For example, to archive my directory named
project, I'd use the following
commands. The .tar.gz extension
isn't required; it's just a convention, and another
common convention is .tgz. I've
added the gzip --best option for
more compression -- but it can be a lot slower, so use it only if
you need to squeeze out every last byte. bzip2 is
another way to save bytes, so I'll show versions
with both gzip and bzip2. No
matter which command you use, watch carefully for errors:
See also: .. (Section 1.16), -r (Section 14.16)
% cd project
% tar clf - . | gzip --best > ../project.tar.gz
% gzcat ../project.tar.gz | tar tvf -          Quick verification
% tar clf - . | bzip2 --best > ../project.tar.bz2
% bzcat ../project.tar.bz2 | tar tvf -         Quick verification
% cd ..
% rm -r project
Go to http://examples.oreilly.com/upt3 for more information on: tar
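If you'd rather not trust your eyes to catch errors, the archive, verify, and remove steps can be tied together so that the directory is deleted only when the archive looks complete. This is a sketch under several assumptions: a POSIX shell, mktemp, gzip, and a tar that records the top-level ./ entry (GNU tar does). The sample project tree is invented. It leaves out tar's l option because that letter doesn't mean the same thing in every tar, and it uses gzip -dc rather than gzcat, which not every system has.

```shell
#!/bin/sh
# Hypothetical demo: archive a directory, verify, and only then remove it.
demo=$(mktemp -d)
mkdir -p "$demo/project/io"
echo "main" > "$demo/project/Imakefile"
echo "input" > "$demo/project/io/input.c"

cd "$demo/project" || exit 1
tar cf - . | gzip --best > ../project.tar.gz

# Compare the number of entries on disk with the number in the archive.
on_disk=$(find . | wc -l | tr -d ' ')
in_archive=$(gzip -dc ../project.tar.gz | tar tf - | wc -l | tr -d ' ')

cd ..
if [ "$on_disk" -eq "$in_archive" ]; then
    rm -r project
    echo "archived $in_archive entries; removed project"
else
    echo "entry counts differ ($on_disk vs $in_archive); keeping project" >&2
fi
```

Counting lines is a rough check: it would be fooled by filenames that contain newlines, and it says nothing about file contents. It does catch the common case where tar or the compressor stopped partway through.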
If you have GNU tar or another version with the
z option, it will run gzip for
you. This method doesn't use the
gzip --best option,
though -- so you may want to use the previous method to squeeze
out all you can. Some
GNU
tars have an
I option to run bzip2; current GNU tar releases use
j for bzip2 instead and treat -I as a way to name a
compression program. Watch out
for other tar versions that use
-I as an "include
file" operator -- check your manpage or
tar --help. If you want to be sure that you
don't have a problem like this, use the long options
(--gzip and --bzip2)
because they're guaranteed not to conflict with
something else; if your tar
doesn't support the particular compression
you've asked for, it will fail cleanly rather than
do something you don't expect.
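Here is what the long-option form might look like, as a sketch that assumes GNU tar (or another tar that accepts the same long names), a POSIX shell, and mktemp. The project directory and README file are made up for the demonstration. Substituting --bzip2 for --gzip should work the same way if your tar supports it and bzip2 is installed.

```shell
#!/bin/sh
# Hypothetical demo: spell out tar's options so nothing is ambiguous.
demo=$(mktemp -d)
mkdir "$demo/project"
echo "hello" > "$demo/project/README"

cd "$demo/project" || exit 1
tar --create --gzip --file=../project.tar.gz .

# List the archive the same way; tar runs the decompression itself.
listing=$(tar --list --gzip --file=../project.tar.gz)
echo "$listing"
```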
Using the short flags to get compression from GNU
tar, you'd write the previous
tar command lines as follows:
tar czlf ../project.tar.gz .
tar cIlf ../project.tar.bz2 .
In any case, the
tar l (lowercase letter L) option will print
messages if any of the files you're archiving have
other hard links (Section 10.4). If a lot of your files have other links,
archiving the directory may not save much disk space -- the other
links will keep those files on the disk, even after your rm
-r command.
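If your tar doesn't have a link-checking option, find can give you the same warning before you archive. The sketch below assumes a POSIX shell, mktemp, and a find that supports the standard -links test; the file names are invented. It creates one file with a second link outside the tree and one without, then lists only the multiply linked file.

```shell
#!/bin/sh
# Hypothetical demo: list files that have more than one hard link.
demo=$(mktemp -d)
mkdir "$demo/project"
echo "shared" > "$demo/project/shared.c"
echo "alone"  > "$demo/project/alone.c"
ln "$demo/project/shared.c" "$demo/elsewhere.c"   # second link, outside the tree

cd "$demo/project" || exit 1
# -links +1 matches files whose link count is greater than one.
linked=$(find . -type f -links +1 -print)
echo "$linked"
```

Any file this prints will still occupy disk space after you remove the directory, because its other link keeps the data alive.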
Any time you want a list of the files in the archive, use
tar t or
tar tv:
See also: less (Section 12.3)
% gzcat project.tar.gz | tar tvf - | less
rw-r--r--239/100    485 Oct  5 19:03 1991 ./Imakefile
rw-rw-r--239/100   4703 Oct  5 21:17 1991 ./scalefonts.c
rw-rw-r--239/100   3358 Oct  5 21:55 1991 ./xcms.c
rw-rw-r--239/100  12385 Oct  5 22:07 1991 ./io/input.c
rw-rw-r--239/100   7048 Oct  5 21:59 1991 ./io/output.c
...
% bzcat project.tar.bz2 | tar tvf - | less
...
% tar tzvf project.tar.gz | less
...
% tar tIvf project.tar.bz2 | less
...
To extract all the files from the
archive, type one of these tar command lines:
% mkdir project
% cd project
% gzcat ../project.tar.gz | tar xf -

% mkdir project
% cd project
% bzcat ../project.tar.bz2 | tar xf -

% mkdir project
% cd project
% tar xzf ../project.tar.gz

% mkdir project
% cd project
% tar xIf ../project.tar.bz2
Of course, you don't have to extract the files into
a directory named project. You can read the
archive file from other directories, move it to other computers, and
so on.
You can also extract just a few files
or directories from the archive. Be sure to use the exact name shown
by the previous tar t command. For instance, to
restore the old subdirectory named project/io
(and everything that was in it), you'd use one of
the previous tar command lines with the name
at the end:
% mkdir project
% cd project
% gzcat ../project.tar.gz | tar xf - ./io
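To try a partial restore without touching real data, you could run something like the following sketch. It assumes a POSIX shell, mktemp, tar, and gzip, and it builds a throwaway project tree whose file names are borrowed from the listing earlier in this article. It prints the member names as tar stored them, then restores only ./io into a separate directory named restored (a name chosen just for this example).

```shell
#!/bin/sh
# Hypothetical demo: restore a single subdirectory from a compressed archive.
demo=$(mktemp -d)
mkdir -p "$demo/project/io"
echo "top" > "$demo/project/Imakefile"
echo "in"  > "$demo/project/io/input.c"
(cd "$demo/project" && tar cf - . | gzip > ../project.tar.gz)

# Show the member names exactly as tar stored them.
gzip -dc "$demo/project.tar.gz" | tar tf -

# Restore only ./io into a fresh directory, using the name from the listing.
mkdir "$demo/restored"
cd "$demo/restored" || exit 1
gzip -dc ../project.tar.gz | tar xf - ./io
ls -R
```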
-- JP
Copyright © 2003 O'Reilly & Associates. All rights reserved.