
Perl for System Administration

2.3. Walking or Traversing the Filesystem

By now you are probably itching to get to some practical applications of Perl. We'll begin by examining the process of "walking the filesystem," one of the most common system administration tasks associated with filesystems. Typically this entails searching an entire set of directory trees and taking action based on the files or directories we find. Each OS provides a tool for this task: under Unix it is the find command; under NT and Windows 2000 it is Find Files or Folders or Search For Files or Folders; and in MacOS it is Find File or Sherlock. All of these commands are useful for searching, but by themselves they lack the power to perform arbitrary and complex operations as they encounter their search targets. We're going to see how Perl allows us to write more sophisticated file-walking code, beginning with the very basics and ratcheting up the complexity as we go.

To get started, let's take a common scenario that provides a clear problem for us to solve. In this scenario, we're a Unix system administrator with overflowing user filesystems and an empty budget. (Unix is being picked on first, but the other operating systems will get their turns in a moment.)

We can't add more disk space without any money, so we've got to make better use of our existing resources. Our first step is to remove all the files on our filesystems that can be eliminated. Under Unix, good candidates for elimination are the core files left around by programs that have died nasty deaths. Most users either do not notice that these files are created, or just ignore them in their directory, leaving large amounts of disk space claimed for no reason. We need something to search through a filesystem and delete these varmints.

To walk a filesystem, we start by reading the contents of a single directory and work our way down from there. Let's ease into the process and begin with code that examines the contents of the current directory and reports on either a core file or another directory to be searched.

We start by opening the directory using roughly the same syntax used for opening a file. If the open fails, we exit the program and print the error message set by the opendir( ) call ($!):

opendir(DIR,".") or die "Can't open the current directory: $!\n";

This provides us with a directory handle, DIR in this case, which we can pass to readdir( ) to get a list of all of the files and directories in the current directory. If readdir( ) can't read that directory, our code prints an error message (which hopefully explains why it failed) and the program exits:

# read file/directory names in that directory into @names 
@names = readdir(DIR) or die "Unable to read current dir:$!\n";

We then close the open directory handle:

closedir(DIR);

Now we can work with those names:

foreach $name (@names) {
   next if ($name eq ".");   # skip the current directory entry
   next if ($name eq "..");  # skip the parent directory entry

   if (-d $name){            # is this a directory?
      print "found a directory: $name\n";
      next;                  # can skip to the next name in the foreach loop
   }
   if ($name eq "core") {    # is this a file named "core"?
      print "found one!\n";
   }
}

Now we have some very simple code that scans a single directory. This isn't even "crawling" a filesystem, never mind walking it. To walk the filesystem we'll have to enter all of the directories we find as we scan and look at their contents as well. If these subdirectories have subdirectories, we'll need to check them out also.

Whenever you have a hierarchy of containers and an operation that gets performed the exact same way on every container and subcontainer in that hierarchy, this calls out for a recursive solution (at least to computer science majors). As long as the hierarchy is not too deep and doesn't loop back upon itself (i.e., all containers hold only their immediate children and do not reference some other part of the hierarchy), recursive solutions tend to make the most sense. This is the case with our example; we're going to be scanning a directory, all of its subdirectories, all of their subdirectories, and so on.

If you've never seen recursive code (i.e., code that calls itself), you may find it a bit strange at first. Recursive code is a bit like the process of painting a set of matryoshka nesting Russian dolls. These are the dolls that each contain another, smaller doll of the exact same shape, which contains another doll, and so on, until you get to a very small doll in the center.

A recipe for painting these dolls might go something like this:

  1. Examine the doll in front of you. Does it contain a smaller doll? If so, remove the contents and set aside the outer doll.

  2. Repeat step 1 with the contents you just removed until you reach the center.

  3. Paint the center doll. When it is dry, put it back in its container and repeat step 3 with the next container.

The process is the same every step of the way. If the thing in your hand has sub-things, put off dealing with what you have in hand and deal with the sub-things first. If the thing you have in your hand doesn't have sub-things, do something with it, and then return to the last thing you put off.

In coding terms, this typically consists of a subroutine that deals with containers. It first looks to see if the current container has subcontainers. If it does, it calls itself to deal with the subcontainers. If it doesn't, it performs some operation and returns to whatever code called it. If you haven't seen code that calls itself, I recommend sitting down with paper and a pencil and tracing the program flow until you are convinced it actually works.
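If tracing on paper helps, here is a toy recursive subroutine (not from the book's filesystem example) that follows exactly the "deal with sub-things first" recipe above: it sums the numbers in a set of nested containers, represented here as array references.

```perl
# A toy recursive subroutine to trace on paper: it sums all the numbers
# in a nested set of "containers" (array references), however deep.
sub SumContainers {
    my ($container) = @_;
    my $total = 0;
    foreach my $thing (@$container) {
        if (ref($thing) eq 'ARRAY') {   # a sub-container? deal with it first
            $total += SumContainers($thing);
        }
        else {                          # a plain item: act on it
            $total += $thing;
        }
    }
    return $total;
}

print SumContainers([1, [2, 3], [4, [5]]]), "\n";  # prints 15
```

Each call puts off finishing its own container until every sub-container it finds has been fully handled, just like the dolls.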

Let's see some recursive code. To make our code recursive, we first encapsulate the operation of scanning a directory and acting upon its contents in a subroutine called ScanDirectory( ). ScanDirectory( ) takes a single argument, the directory it is supposed to scan. It figures out its current directory, enters the requested directory, and scans it. When it has completed this scan, it returns to the directory it was called from. Here's the new code:

#!/usr/bin/perl -s

# note the use of -s for switch processing. Under NT/2000, you will need to
# call this script explicitly with -s (i.e., perl -s script) if you do not
# have perl file associations in place. 
# -s is also considered 'retro'; many programmers prefer to load
# a separate module (from the Getopt:: family) for switch parsing.

use Cwd; # module for finding the current working directory

# This subroutine takes the name of a directory and recursively scans 
# down the filesystem from that point looking for files named "core"
sub ScanDirectory{
    my ($workdir) = shift; 

    my ($startdir) = &cwd; # keep track of where we began

    chdir($workdir) or die "Unable to enter dir $workdir:$!\n";
    opendir(DIR, ".") or die "Unable to open $workdir:$!\n";
    my @names = readdir(DIR) or die "Unable to read $workdir:$!\n";
    closedir(DIR);

    foreach my $name (@names){
        next if ($name eq "."); 
        next if ($name eq "..");

        if (-d $name){                  # is this a directory?
            &ScanDirectory($name);
            next;
        }
        if ($name eq "core") {          # is this a file named "core"?
            # if -r specified on command line, actually delete the file
            if (defined $r){
                unlink($name) or die "Unable to delete $name:$!\n";
            }
            else {
                print "found one in $workdir!\n"; 
            }
        }
    }
    chdir($startdir) or
        die "Unable to change to dir $startdir:$!\n";
}

&ScanDirectory(".");     

The most important change from the previous example is our code's behavior when it finds a subdirectory in the directory it has been requested to scan. If it finds a directory, instead of printing "found a directory!" as our previous sample did, it recursively calls itself to examine that directory first. Once that entire subdirectory has been scanned (i.e., the call to ScanDirectory( ) returns), it returns to looking at the rest of the contents of the current directory.
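As an aside, the standard File::Find module (bundled with Perl) can perform this same walk for us, handling the recursion and the directory bookkeeping itself. The following is only a sketch of the idea; our hand-rolled ScanDirectory( ) shows what such a module is doing under the hood:

```perl
# A sketch of the core-hunting walk using the standard File::Find
# module, which handles the recursion and chdir bookkeeping itself.
use File::Find;

# find() calls our anonymous subroutine once for every file and
# directory below the starting point; inside it, $_ holds the current
# name and $File::Find::dir the directory being scanned.
find(sub {
    print "found one in $File::Find::dir!\n" if $_ eq 'core' && -f $_;
}, '.');
```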

To make our code fully functional as a core file-destroyer, we've also added file deletion functionality to it. Pay attention to how that code is written: it will only delete files if the script is started with a certain command-line switch, -r (for remove).

We're using Perl's built-in -s switch for automatic option parsing as part of the invocation line (#!/usr/bin/perl -s). This is the simplest way to parse command-line options; for more sophistication, we'd probably use something from the Getopt module family. If a command-line switch is present (e.g., -r) then a global scalar variable with the same name (e.g., $r) is set when the script is run. In our code, if Perl is not invoked with -r, we revert to the past behavior of just announcing that we found a core file.
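For comparison, here is a sketch of the same switch handling written with the standard Getopt::Std module instead of perl -s; the messages printed are placeholders for illustration, but the getopts( ) call itself is the module's real interface:

```perl
# Parse a bare -r switch with the standard Getopt::Std module;
# after getopts(), $opt{r} is true if the script was invoked with -r.
use Getopt::Std;

my %opt;
getopts('r', \%opt);    # recognize -r; anything else is rejected

if ($opt{r}) {
    print "remove mode: core files will be deleted\n";
}
else {
    print "report mode: core files will only be listed\n";
}
```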

WARNING

When you write automatic tools, make destructive actions harder to perform. Take heed: Perl, like most powerful languages, allows you to nuke your filesystem without breaking a sweat.

Now, lest the NT/2000-focused readers think the previous example doesn't apply to them, let me point out that this code could be useful for them as well. A single-line change from:

if ($name eq "core") {

to:

if ($name eq "MSCREATE.DIR") {

will create a program that deletes the annoying hidden zero-length files certain Microsoft program installers leave behind.

With this code under our belt, let's return to the quandary that started this chapter. After my laptop kissed the floor, I found myself in desperate need of a way to determine which files could be read off the disk and which files were damaged.

Here's the actual code I used:

use Cwd; # module for finding the current working directory
$|=1;    # turn off I/O buffering

sub ScanDirectory {
    my ($workdir) = shift; 

    my($startdir) = &cwd; # keep track of where we began

    chdir($workdir) or die "Unable to enter dir $workdir:$!\n";

    opendir(DIR, ".") or die "Unable to open $workdir:$!\n";
    my @names = readdir(DIR);
    closedir(DIR);
 
    foreach my $name (@names){
        next if ($name eq "."); 
        next if ($name eq "..");
	
        if (-d $name){                     # is this a directory?
            &ScanDirectory($name);
            next;
        }
        unless (&CheckFile($name)){           
            print &cwd."/".$name."\n"; # print the bad filename
        }
    }
    chdir($startdir) or die "Unable to change to dir $startdir:$!\n";
}

sub CheckFile{
    my($name) = shift;
    
    print STDERR "Scanning ". &cwd."/".$name."\n";
    # attempt to read the directory entry for this file
    my @stat = stat($name);
    if (!$stat[4] && !$stat[5] && !$stat[6] && !$stat[7] && !$stat[8]){
        return 0;
    }
    # attempt to open this file
    unless (open(T,"$name")){
        return 0;
    }	
    # read the file one byte at a time
    my $buf;
    for (my $i=0; $i < $stat[7]; $i++){
        my $r = sysread(T, $buf, 1);
        if ($r != 1) {
            close(T);
            return 0;
        }
    }
    close(T);
    return 1;
}
 
&ScanDirectory(".");

The difference between this code and our last example is the addition of a subroutine to check each file encountered. For every file, we use the stat function to see if we can read that file's directory information (e.g., its size). If we can't, we know the file is damaged. If we can read the directory information, we attempt to open the file. And for a final test, we attempt to read every single byte of the file. This doesn't guarantee that the file hasn't been damaged (the contents could have been modified) but it does at least show that the file is readable.
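The magic indices 4 through 8 used in CheckFile( ) come straight from stat( )'s 13-element return list; as a quick illustration (stat-ing this very script), the fields our damage check tests are the uid, gid, rdev, size, and atime:

```perl
# stat() returns a 13-element list; elements 4 through 8 are the
# fields CheckFile() examines: uid, gid, rdev, size, and atime.
my @stat = stat($0);                 # stat this very script
my ($uid, $gid, $rdev, $size, $atime) = @stat[4..8];
print "size of $0 is $size bytes\n";
```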

You may wonder why we use an esoteric function like sysread( ) to read the file instead of < > or read( ), Perl's usual file-reading operator and function. sysread( ) gives us the ability to read the file byte by byte without any of the usual buffering. If a file is damaged at location X, we don't want to waste time waiting for the standard library routines to attempt to read the bytes at locations X+1, X+2, X+3, and so on as part of their usual pre-fetch. We want the code to stop trying to read the file immediately. In general, you really want file reads to fetch whole chunks at a time for performance's sake, but here that's undesirable; it would mean the laptop spending prolonged periods of time making awful noises every time it found a damaged file.
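To see the unbuffered behavior in isolation, here is a minimal stand-alone sketch (not part of the original script) that sysread( )s a file one byte at a time, bailing out the instant a byte cannot be read:

```perl
# Read a file one byte at a time with sysread(), with no pre-fetch;
# the first unreadable byte stops the loop immediately.
open(my $fh, '<', $0) or die "Can't open $0: $!\n";
my ($count, $buf) = (0, '');
while (1) {
    my $r = sysread($fh, $buf, 1);   # exactly one byte, unbuffered
    die "read error at byte $count: $!\n" unless defined $r;
    last if $r == 0;                 # end of file reached
    $count++;
}
close($fh);
print "$count readable bytes in $0\n";
```

On a healthy file the count simply matches the file's size; on a damaged disk, the die tells you exactly how far the read got.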

Now that you've seen the code I used, let me offer some closure to this story. After the script you just saw ran all night long (literally), it found 95 bad files out of 16,000 total. Fortunately, none of those files were files from the book you are now reading; I backed up the good files and moved them. Perl saved the day.




Copyright © 2001 O'Reilly & Associates. All rights reserved.










