Walking the Filesystem Using the File::Find Module (Perl for System Administration)

2.4. Walking the Filesystem Using the File::Find Module

Now that we've seen the basics of filesystem walking, here's a faster and spiffier way to do it. Perl comes with a module called File::Find that allows Perl to emulate the Unix find command. The easiest way to begin using this module is to use the find2perl command to generate prototypical Perl code for you.

TIP

find2perl is not always easy to use on non-Unix Perl ports. For example, MacOS users either will need Macintosh Programmer's Workshop (MPW) to run it, or should modify the code to take @ARGV from a dialog box. Here's a code snippet from Chris Nandor, co-author of MacPerl: Power and Ease, to do this:

@ARGV = @ARGV ? @ARGV : split "\s", MacPerl::Ask("Arguments?");

All ports do have the File::Findmodule that find2perl and find.pl use, so this should not be a real problem. We'll show you how to call it directly later in this chapter.

For instance, let's say you need some code to search the /home directory for files named beesknees. The command line that uses the Unix find command is:

% find /home -name beesknees -print

Feed the same options to find2perl:

% find2perl /home -name beesknees -print

and it produces:

#!/usr/bin/perl
    eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
        if $running_under_some_shell;

require "find.pl";

# Traverse desired filesystems

&find('/home');

exit;

sub wanted {
    /^beesknees$/ && print("$name\n");
}

The find2perl-generated code is fairly straightforward. It loads in the necessary find.pl library with a require statement, then calls the subroutine &find( ) with the name of the starting directory. We'll discuss the purpose of the &wanted( ) subroutine in a moment, since that's where all of the interesting modifications we're about to explore will live.

Before we begin our modifications of this code, it's important to note a few things that may not be obvious just by looking at the sample output above:

The folks who have worked on the File::Find module have gone to considerable trouble to make this module portable across platforms. File::Find's internal routines work behind the scenes so the same Perl code for filesystem walking works for Unix, MacOS, NT, VMS, and so on.
Though the code find2perl generates looks like Perl Version 4 code on the surface (for example, it uses require to load a .pl file), find.pl actually sets up some Perl Version 5 aliases for the user. In general, it is useful to look under the hood whenever you use a module in your code. If you need to find the Perl source for a module already installed on your system, running either perl -V or the following code will show you the standard library directories for your installation:

%perl -e 'print join("\n",@INC,"")'

Let's talk about the &wanted( ) subroutine that we will modify for our own purposes. The&wanted( ) subroutine gets called with the current file or directory name by &find( ) (&File::Find::find( ) to be precise) once for every file or directory encountered during its filesystem walk. It's up to the code in &wanted( ) to select the "interesting" files or directories and operate on them accordingly. In the sample output above, it first checks to see if the file or directory name matches the string beesknees. If it matches, the && operator causes Perl to execute the print statement to print the name of the file that was found.

We'll have to address two practical concerns when we create our own &wanted( ) subroutines. Since &wanted( ) is called once per file or directory name, it is important to make the code in this subroutine short and sweet. The sooner we can exit the &wanted( ) subroutine, the faster the find routine can proceed with the next file or directory, and the speedier the overall program will run. It is also important to keep in mind the behind-the-scenes portability concerns we mentioned a moment ago. It would be a shame to have a portable &find( ) call an OS-specific &wanted( ) subroutine unless this is unavoidable. Looking at the source code for the File::Find module may offer some hints on how to avoid this situation.

For our first use of File::Find, let's rewrite our previous core-destroyer example and then extend it a bit. First we type:

% find2perl -name core -print

which gives us:

require "find.pl";

# Traverse desired filesystems

&find('.');

exit;

sub wanted {
    /^core$/ && print("$name\n");
}

Then we add -s to the Perl invocation line and modify the &wanted( ) subroutine:

sub wanted {
   /^core$/ && print("$name\n") && defined $r && unlink($name);
}

This gives us the desired deletion functionality when the user invokes the program with -r. Here's a tweak that adds another measure of protection to our potentially destructive code:

sub wanted {
   /^core$/ && -s $name && print("$name\n") && 
               defined $r && unlink($name);
}

It checks any file called core to see if it is a non-zero length file before printing the name or contemplating deletion. Sophisticated users sometimes create a link to /dev/null named core in their home directory to prevent inadvertent core dumps from being stored in that directory. The -s test makes sure we don't delete links or zero-length files by mistake. If we wanted to be even more diligent, we should probably make two additional checks:

Open and examine the file to confirm that it is an actual core file, either from within Perl or by calling the Unix file command. Determining whether a file is an authentic core dump file can be tricky when you have filesystems remotely mounted over a network by machines of different architectures, all with different core file formats.
Look at the modification date of the file. If someone is actively debugging a program that has core dumped, she may not be happy if you delete the core file out from under her.

Let's take a break from the Unix world for a bit and look at Mac- and NT/2000-specific examples. Earlier in this chapter I mentioned that every file in a MacOS HFS filesystem has two attributes, creator and type, that allow the OS to determine which application created it and what kind of file it is. These attributes are stored as four-character strings. For instance, a text document created by SimpleText would be listed with creator ttxt and type TEXT. From Perl (MacPerl only) we can get at this information through the MacPerl::GetFileInfo( ) function. The syntax is:

$type = MacPerl::GetFileInfo(filename);

or:

($creator,$type) = MacPerl::GetFileInfo(filename);

To find all of the text files in a MacOS filesystem, we can do the following:

use File::Find;

&File::Find::find(\&wanted,"Macintosh HD:");

sub wanted{
  -f $_ && MacPerl::GetFileInfo($_) eq "TEXT" && 
           print "$Find::File::name\n";
}

You might notice it looks a little different from our previous examples. However, it is functionally equivalent. We're just calling the File::Find routines directly without the find.pl shim. We're also using the variable $name defined in the File::Find namespace to print the absolute path of the file, rather than just printing the filename itself. Table 2-2 shows the complete list of variables defined by File::Find as it walks a filesystem.

Table 2.2. File::Find Variables

Variable Name	Meaning
$_	Current filename
$File::Find::dir	Current directory name
$File::Find::name	Full path of current filename (i.e., `$File::Find::dir/$_)`

Here's a similar NT/2000-specific example:

use File::Find;
use Win32::File;

&File::Find::find(\&wanted,"\\");

sub wanted{
  -f $_ && 
    # attr will be populated by Win32::File::GetAttributes function
    (Win32::File::GetAttributes($_,$attr)) &&      
    ($attr & HIDDEN) && 
     print "$File::Find::name\n";
}

This example searches the entire filesystem of the current drive for hidden files (i.e., those with the HIDDEN attribute set). This example works on both NTFS and FAT filesystems.

Here's an NTFS-specific example that will look for all files that have Full Access enabled for the special group Everyone and print their names:

use File::Find;
use Win32::FileSecurity;

# determine the DACL mask for Full Access
$fullmask = Win32::FileSecurity::MakeMask(FULL);

&find(\&wanted,"\\");

sub wanted {
    # Win32::FileSecurity::Get does not like the paging file, skip it
    next if ($_ eq "pagefile.sys"); 
    (-f $_) && 
        Win32::FileSecurity::Get($_, \%users) && 
        (defined $users{"Everyone"}) &&           
        ($users{"Everyone"} == $fullmask) &&
	     print "$File::Find::name\n";
}

In the above code, we query the Access Control List for all files (except for the Windows NT paging file). We then check if that list includes an entry for the group Everyone. If it does, we compare the Everyone entry to the value for Full Access (computed by MakeMask( )), printing the absolute path of the file when we find a match.

Here is another real life example of how useful even simple code can be. I recently attempted to defragment the (newly rebuilt) NT partition on my laptop when the software reported Metadata Corruption Error. Perusing the web site of the vendor who makes the defragmentation software, I encountered a tech support note that suggested, "This situation can be caused by a long filename which contains more characters than is legal under Windows NT." It then suggested locating this file by copying each folder to a new location, comparing the number of files in the copy to the original, and if the copied folder has fewer files, then identifying which file in the original folder did not get copied to the new location.

This seemed like a ridiculous suggestion given the number of folders on my NT partition and the amount of time it would take. Instead, I whipped up the following in about a minute using the methods we've been discussing:

require "find.pl";

# Traverse desired filesystems

&find('.');
print "max:$max\n";

exit;

sub wanted {
    return unless -f $_;
    if (length($_) > $maxlength){
        $max = $name;
        $maxlength = length($_);
    }
    if (length($name) > 200) { print $name,"\n";}
}

This printed out the name of all the files with names larger than 200 characters, followed by the name of the largest file found. Job done, thanks to Perl.

When Not to Use the File::Find Module

When is the File::Find method we've been discussing not appropriate? Four situations come to mind:

If the filesystem you are traversing does not follow the normal semantics, you can't use it. For instance, in the bouncing laptop scenario which began the chapter, the Linux NTFS filesystem driver I was using had the strange property of not listing "." or ".." in empty directories. This broke File::Find badly.
If you need to change the names of the directories in the filesystem you are traversing while you are traversing it, File::Find gets very unhappy and behaves in an unpredictable way.
If you need your (Unix-based) code to chase symbolic links to directories, File::Find will skip them.
If you need to walk a non-native filesystem mounted on your machine (for example, an NFS mount of a Unix filesystem on a Win32 machine), File::Find will attempt to use the native operating systems's filesystem semantics.

It is unlikely that you'll encounter these situations, but if you do, see the first filesystem-walking section of this chapter for information on how to traverse filesystems by hand.

Let's return to Unix to close this section with a moderately complex example. One idea that seems to get short shrift in many systems administration contexts, but can yield tremendous benefit in the end, is the notion of empowering the user. If your users can fix their own problems with tools you provide, everybody wins.

Much of this chapter is devoted to dealing with problems that arise from filesystems being filled. Often this occurs because users do not know enough about their environment, or because it is too cumbersome to perform any basic disk space management. Many a support request starts with "I'm out of disk space in my home directory and I don't know why." Here's a bare-bones version of a script called needspace that can help users with this problem. A user simply types needspace and the script attempts to locate items in the user's home directory that could be deleted. It looks for two kinds of files: known backup files and those that can be recreated automatically. Let's dive into the code :

use File::Find;
use File::Basename;

# array of fname extensions and the extensions they can be derived from
% derivations = (".dvi" => ".tex",
                ".aux" => ".tex",
                ".toc" => ".tex",
                ".o"   => ".c",
	            );

We start by loading the libraries we need: our friend File::Find and another useful library called File::Basename. File::Basename will come in handy for parsing pathnames. We then initialize a hash table with known derivations; for instance, we know that running the command TeX or LaTeX on the file happy.tex can generate the file happy.dvi, and that happy.o could possibly be created by running a C compiler on happy.c. The word "possibly" is used because sometimes multiple source files are needed to generate a single derived file. We can only make simple guesses based on file extensions. Generalized dependency analysis is a complex problem we won't attempt to touch here.

Next we locate the user's home directory by finding the user ID of the person running the script ($<) and feeding it to getpwuid( ). getpwuid( ) returns password information in list form (more on this in the next chapter), from which an array index ([7]) selects the home directory element. There are shell-specific ways to retrieve this information (e.g., querying the $HOME environment variable), but the code as written is more portable.

Once we have the home directory, we enter it and begin scanning using a &find( ) call just like the ones we've seen before:

$homedir=(getpwuid($<))[7]; # find the user's home directory

chdir($homedir) or 
  die "Unable to change to your homedir $homedir:$!\n";

$|=1; # print to STDOUT in an unbuffered way
print "Scanning";
find(\&wanted, "."); # chew through dirs, &wanted does the work

Here's the &wanted( ) subroutine we call. It starts by looking for core files and emacs backup and autosave files. We assume these files can be deleted without checking for their source file (perhaps not a safe assumption). If one of these files is found, its size and location is stored in a hash whose key is the path to the file and whose value is the size of that file.

The remaining checks for derivable files are very similar. They call a routine &BaseFileExists( ) to check if a particular file can be derived from another file in that directory. If this routine returns true, we store filename and size info for later retrieval:

sub wanted {
    # print a dot for every dir so the user knows we're doing something
    print "." if (-d $_); 

    # we're only checking files 
    return unless (-f $_);  

    # check for core files, store them in the %core table, then return
    $_ eq "core" && ($core{$File::Find::name} = (stat(_))[7]) && return;

    # check for emacs backup and autosave files
    (/^#.*#$/ || /~$/) && 
      ($emacs{$File::Find::name}=(stat(_))[7]) && 
      return;
    
    # check for derivable tex files
     (/\.dvi$/ || /\.aux$/ || /\.toc$/) &&
       &BaseFileExists($File::Find::name) && 
       ($tex{$File::Find::name} = (stat(_))[7]) && 
       return;

    # check for derivable .o files
     /\.o$/ &&
       &BaseFileExists($File::Find::name) && 
       ($doto{$File::Find::name} = (stat(_))[7]) && 
       return;
}

Here's the routine which checks if a particular file can be derived from another "base" file in the same directory (i.e., does happy.o exist if we find happy.c):

sub BaseFileExists {
    my($name,$path,$suffix) =
     &File::Basename::fileparse($_[0],'\..*');
    
    # if we don't know how to derive this type of file
    return 0 unless (defined $derivations{$suffix}); 

    # easy, we've seen the base file before
    return 1 if (defined $baseseen{$path.$name.$derivations{$suffix}});

    # if file (or file link points to) exists and has non-zero size
    return 1 if (-s $name.$derivations{$suffix} && 
                  ++$baseseen{$path.$name.$derivations{$suffix}});
}    

print "done.\n";

Here's how this code works:

&File::Basename::fileparse( ) is used to separate the path into a filename, its leading path, and its suffix (e.g., resume.dvi, /home/cindy/docs/, .dvi ).
This file's suffix is checked to determine if it is one we recognize as being derivable. If not, we return (false in a scalar context).
We check if we've already seen a "base file" for this particular file, and if so return true. In some situations (TeX/LaTeX in particular), a single base file can yield many derived files. This check speeds things up considerably because it saves us a trip to the filesystem.
If we haven't seen a base file for this file before, we check to see if one exists and that it is non-zero length. If so, we cache the base file information and return 1 (true in a scalar context).

All that's left for us to do now is to print out the information we gathered as we walked the filesystem:

foreach my $path (keys %core){
    print "Found a core file taking up ".&BytesToMeg($core{$path}).
          "MB in ".&File::Basename::dirname($path).".\n";
}

if (keys %emacs){
    print "The following are most likely emacs backup files:\n";

    foreach my $path (keys %emacs){
        $tempsize += $emacs{$path};
        $path =~ s/^$homedir/~/;     # change the path for prettier output
        print "$path ($emacs{$path} bytes)\n";
    }
    print "\nThese files take up ".&BytesToMeg($tempsize)."MB total.\n";
    $tempsize=0;
}

if (keys %tex){
    print "The following are most likely files that can be recreated by                      
           running La/TeX:\n";
    foreach my $path (keys %tex){
        $tempsize += $tex{$path};
        $path =~ s/^$homedir/~/;     # change the path for prettier output
        print "$path ($tex{$path} bytes)\n";
    }
    print "\nThese files take up ".&BytesToMeg($tempsize)."MB total.\n";
    $tempsize=0;
}

if (keys %doto){
    print "The following are most likely files that can be recreated by
           recompiling source:\n";
    foreach my $path (keys %doto){
        $tempsize += $doto{$path};
        $path =~ s/^$homedir/~/;     # change the path for prettier output
        print "$path ($doto{$path} bytes)\n";
    }
    print "\nThese files take up ".&BytesToMeg($tempsize)."MB total.\n";
    $tempsize=0;
}

sub BytesToMeg{ # convert bytes to X.XXMB
    return sprintf("%.2f",($_[0]/1024000));
}

Before we close this section, it should be noted that the previous example could be extended in many ways. The sky's really the limit on this sort of program. Here are a few ideas:

Use a more complex data structure to hold derivations and found files. The above code was written to be easy to read without much knowledge of Perl data structures. It repeats itself in several places and is harder to extend than it needs to be. Ideally, you'd like a way to represent all of the derivations without having special case hashes (e.g., %tex) in the code.
Search for web browser cache directories (a common source of missing disk space).
Offer to delete files that are found. The operator unlink( ) and the subroutine rmpath from the File::Path module would be used to perform the deletion step.
Perform more analysis on files instead of making guesses based on filenames.