home | O'Reilly's CD bookshelfs | FreeBSD | Linux | Cisco | Cisco Exam  


Chapter 4  TOC  Chapter 6

Chapter 5. Larger System Examples II

5.1 "The Greps of Wrath"

This chapter continues our exploration of systems programming case studies. Here, the focus is on Python scripts that perform more advanced kinds of file and directory processing. The examples in this chapter do system-level tasks such as converting files, comparing and copying directories, and searching files and directories for strings -- a task idiomatically known as "grepping."

Most of the tools these scripts employ were introduced in Chapter 2. Here, the goal is to show these tools in action, in the context of more useful and realistic programs. As in the prior chapter, learning about Python programming techniques such as OOP and encapsulation is also a hidden subgoal of most of the examples presented here.

5.2 Fixing DOS Line Ends

When I wrote the first edition of this book, I shipped two copies of every example file on the CD-ROM (view CD-ROM content online at http://examples.oreilly.com/python2) -- one with Unix line-end markers, and one with DOS markers. The idea was that this would make it easy to view and edit the files on either platform. Readers would simply copy the examples directory tree designed for their platform onto their hard drive, and ignore the other one.

If you read Chapter 2, you know the issue here: DOS (and by proxy, Windows) marks line ends in text files with the two characters \r\n (carriage-return, line-feed), but Unix uses just a single \n. Most modern text editors don't care -- they happily display text files encoded in either format. Some tools are less forgiving, though. I still occasionally see odd \r characters when viewing DOS files on Unix, or an entire file in a single line when looking at Unix files on DOS (the Notepad accessory does this on Windows, for example).

Because this is only an occasional annoyance, and because it's easy to forget to keep two distinct example trees in sync, I adopted a different policy for this second edition: we're shipping a single copy of the examples (in DOS format), along with a portable converter tool for changing to and from other line-end formats.

The main obstacle, of course, is how to go about providing a portable and easy to use converter -- one that runs "out of the box" on almost every computer, without changes or recompiles. Some Unix platforms have commands like fromdos and dos2unix, but they are not universally available even on Unix. DOS batch files and csh scripts could do the job on Windows and Unix, respectively, but neither solution works on both platforms.

Fortunately, Python does. The scripts presented in Examples Example 5-1, Example 5-3, and Example 5-4 convert end-of-line markers between DOS and Unix formats; they convert a single file, a directory of files, and a directory tree of files. In this section, we briefly look at each of the three scripts, and contrast some of the system tools they apply. Each reuses the prior's code, and becomes progressively more powerful in the process.

The last of these three scripts, Example 5-4, is the portable converter tool I was looking for; it converts line ends in the entire examples tree, in a single step. Because it is pure Python, it also works on both DOS and Unix unchanged; as long as Python is installed, it is the only line converter you may ever need to remember.

5.2.1 Converting Line Ends in One File

These three scripts were developed in stages on purpose, so I could first focus on getting line-feed conversions right, before worrying about directories and tree walking logic. With that scheme in mind, Example 5-1 addresses just the task of converting lines in a single text file.

Example 5-1. PP2E\PyTools\fixeoln_one.py
###################################################################
# Use: "python fixeoln_one.py [tounix|todos] filename".
# Convert end-of-lines in the single text file whose name is passed
# in on the command line, to the target format (tounix or todos).  
# The _one, _dir, and _all converters reuse the convert function 
# here.  convertEndlines changes end-lines only if necessary:
# lines that are already in the target format are left unchanged,
# so it's okay to convert a file > once with any of the 3 fixeoln 
# scripts.  Notes: must use binary file open modes for this to 
# work on Windows, else default text mode automatically deletes 
# the \r on reads, and adds an extra \r for each \n on writes;
# Mac format not supported; PyTools\dumpfile.py shows raw bytes;
###################################################################
 
import os
listonly = 0   # 1=show file to be changed, don't rewrite
 
def convertEndlines(format, fname):                      # convert one file
    if not os.path.isfile(fname):                        # todos:  \n   => \r\n 
        print 'Not a text file', fname                   # tounix: \r\n => \n
        return                                           # skip directory names
 
    newlines = []
    changed  = 0 
    for line in open(fname, 'rb').readlines(  ):           # use binary i/o modes
        if format == 'todos':                            # else \r lost on Win
            if line[-1:] == '\n' and line[-2:-1] != '\r':
                line = line[:-1] + '\r\n'
                changed = 1
        elif format == 'tounix':                         # avoids IndexError
            if line[-2:] == '\r\n':                      # slices are scaled
                line = line[:-2] + '\n'
                changed = 1
        newlines.append(line)
 
    if changed:
        try:                                             # might be read-only
            print 'Changing', fname
            if not listonly: open(fname, 'wb').writelines(newlines) 
        except IOError, why:
            print 'Error writing to file %s: skipped (%s)' % (fname, why)
 
if __name__ == '__main__':
    import sys
    errmsg = 'Required arguments missing: ["todos"|"tounix"] filename'
    assert (len(sys.argv) == 3 and sys.argv[1] in ['todos', 'tounix']), errmsg
    convertEndlines(sys.argv[1], sys.argv[2])
    print 'Converted', sys.argv[2]

This script is fairly straightforward as system utilities go; it relies primarily on the built-in file object's methods. Given a target format flag and filename, it loads the file into a lines list using the readlines method, converts input lines to the target format if needed, and writes the result back to the file with the writelines method if any lines were changed:

C:\temp\examples>python %X%\PyTools\fixeoln_one.py tounix PyDemos.pyw
Changing PyDemos.pyw
Converted PyDemos.pyw
 
C:\temp\examples>python %X%\PyTools\fixeoln_one.py todos PyDemos.pyw
Changing PyDemos.pyw
Converted PyDemos.pyw
 
C:\temp\examples>fc PyDemos.pyw %X%\PyDemos.pyw
Comparing files PyDemos.pyw and C:\PP2ndEd\examples\PP2E\PyDemos.pyw
FC: no differences encountered
 
C:\temp\examples>python %X%\PyTools\fixeoln_one.py todos PyDemos.pyw
Converted PyDemos.pyw
 
C:\temp\examples>python %X%\PyTools\fixeoln_one.py toother nonesuch.txt
Traceback (innermost last):
  File "C:\PP2ndEd\examples\PP2E\PyTools\fixeoln_one.py", line 45, in ?
    assert (len(sys.argv) == 3 and sys.argv[1] in ['todos', 'tounix']), errmsg
AssertionError: Required arguments missing: ["todos"|"tounix"] filename

Here, the first command converts the file to Unix line-end format (tounix), and the second and fourth convert to the DOS convention -- all regardless of the platform on which this script is run. To make typical usage easier, converted text is written back to the file in place, instead of to a newly created output file. Notice that this script's filename has a "_" in it, not a "-"; because it is meant to be both run as a script and imported as a library, its filename must translate to a legal Python variable name in importers (fixeoln-one.py won't work for both roles).

In all the examples in this chapter that change files in directory trees, the C:\temp\examples and C:\temp\cpexamples directories used in testing are full copies of the real PP2E examples root directory. I don't always show the copy commands used to create these test directories along the way (at least not until we've written our own in Python).

5.2.1.1 Slinging bytes and verifying results

The fc DOS file-compare command in the preceding interaction confirms the conversions, but to better verify the results of this Python script, I wrote another, shown in Example 5-2.

Example 5-2. PP2E\PyTools\dumpfile.py
import sys
bytes = open(sys.argv[1], 'rb').read(  )
print '-'*40
print repr(bytes)
 
print '-'*40
while bytes:
    bytes, chunk = bytes[4:], bytes[:4]          # show 4-bytes per line
    for c in chunk: print oct(ord(c)), '\t',     # show octal of binary value
    print 
 
print '-'*40
for line in open(sys.argv[1], 'rb').readlines(  ):
    print repr(line)

To give a clear picture of a file's contents, this script opens a file in binary mode (to suppress automatic line-feed conversions), prints its raw contents (bytes) all at once, displays the octal numeric ASCII codes of it contents four bytes per line, and shows its raw lines. Let's use this to trace conversions. First of all, use a simple text file to make wading through bytes a bit more humane:

C:\temp>type test.txt
a
b
c
 
C:\temp>python %X%\PyTools\dumpfile.py test.txt
----------------------------------------
'a\015\012b\015\012c\015\012'
----------------------------------------
0141    015     012     0142
015     012     0143    015
012
----------------------------------------
'a\015\012'
'b\015\012'
'c\015\012'

The test.txt file here is in DOS line-end format -- the escape sequence \015\012 displayed by the dumpfile script is simply the DOS \r\n line-end marker in octal character-code escapes format. Now, converting to Unix format changes all the DOS \r\n markers to a single \n (\012) as advertised:

C:\temp>python %X%\PyTools\fixeoln_one.py tounix test.txt
Changing test.txt
Converted test.txt
 
C:\temp>python %X%\PyTools\dumpfile.py test.txt
----------------------------------------
'a\012b\012c\012'
----------------------------------------
0141    012     0142    012
0143    012
----------------------------------------
'a\012'
'b\012'
'c\012'

And converting back to DOS restores the original file format:

C:\temp>python %X%\PyTools\fixeoln_one.py todos test.txt
Changing test.txt
Converted test.txt
 
C:\temp>python %X%\PyTools\dumpfile.py test.txt
----------------------------------------
'a\015\012b\015\012c\015\012'
----------------------------------------
0141    015     012     0142
015     012     0143    015
012
----------------------------------------
'a\015\012'
'b\015\012'
'c\015\012'
 
C:\temp>python %X%\PyTools\fixeoln_one.py todos test.txt    # makes no changes
Converted test.txt
5.2.1.2 Nonintrusive conversions

Notice that no "Changing" message is emitted for the last command just run, because no changes were actually made to the file (it was already in DOS format). Because this program is smart enough to avoid converting a line that is already in the target format, it is safe to rerun on a file even if you can't recall what format the file already uses. More naive conversion logic might be simpler, but may not be repeatable. For instance, a string.replace call can be used to expand a Unix \n to a DOS \r\n (\015\012), but only once:

>>> import string
>>> lines = 'aaa\nbbb\nccc\n'
>>> lines = string.replace(lines, '\n', '\r\n')         # okay: \r added
>>> lines
'aaa\015\012bbb\015\012ccc\015\012'
>>> lines = string.replace(lines, '\n', '\r\n')         # bad: double \r
>>> lines
'aaa\015\015\012bbb\015\015\012ccc\015\015\012'

Such logic could easily trash a file if applied to it twice.[1] To really understand how the script gets around this problem, though, we need to take a closer look at its use of slices and binary file modes.

5.2.1.3 Slicing strings out-of-bounds

This script relies on subtle aspects of string slicing behavior to inspect parts of each line without size checks. For instance:

·         The expression line[-2:] returns the last two characters at the end of the line (or one or zero characters, if the line isn't at least two characters long).

·         A slice like line[-2:-1] returns the second to last character (or an empty string, if the line is too small to have a second to last character).

·         The operation line[:-2] returns all characters except the last two at the end (or an empty string, if there are fewer than three characters).

Because out-of-bounds slices scale slice limits to be in-bounds, the script doesn't need to add explicit tests to guarantee that the line is big enough to have end-line characters at the end. For example:

>>> 'aaaXY'[-2:], 'XY'[-2:], 'Y'[-2:], ''[-2:]
('XY', 'XY', 'Y', '')
 
>>> 'aaaXY'[-2:-1], 'XY'[-2:-1], 'Y'[-2:-1], ''[-2:-1]
('X', 'X', '', '')
 
>>> 'aaaXY'[:-2], 'aaaY'[:-1], 'XY'[:-2], 'Y'[:-1]
('aaa', 'aaa', '', '')

If you imagine characters like \r and \n instead of the X and Y here, you'll understand how the script exploits slice scaling to good effect.

5.2.1.4 Binary file mode revisited

Because this script aims to be portable to Windows, it also takes care to open files in binary mode, even though they contain text data. As we've seen, when files are opened in text mode on Windows, \r is stripped from \r\n markers on input, and \r is added before \n markers on output. This automatic conversion allows scripts to represent the end-of-line marker as \n on all platforms. Here, though, it would also mean that the script would never see the \r it's looking for to detect a DOS-encoded line -- the \r would be dropped before it ever reached the script:

>>> open('temp.txt', 'w').writelines(['aaa\n', 'bbb\n'])
>>> open('temp.txt', 'rb').read(  )
'aaa\015\012bbb\015\012'
>>> open('temp.txt', 'r').read(  )
'aaa\012bbb\012'

Without binary open mode, this can lead to fairly subtle and incorrect behavior on Windows. For example, if files are opened in text mode, converting in "todos" mode on Windows would actually produce double \r characters: the script might convert the stripped \n to \r\n, which is then expanded on output to \r\r\n !

>>> open('temp.txt', 'w').writelines(['aaa\r\n', 'bbb\r\n'])
>>> open('temp.txt', 'rb').read(  )
'aaa\015\015\012bbb\015\015\012'

With binary mode, the script inputs a full \r\n, so no conversion is performed. Binary mode is also required for output on Windows, to suppress the insertion of \r characters; without it, the "tounix" conversion would fail on that platform.[2]

If all that is too subtle to bear, just remember to use the "b" in file open mode strings if your scripts might be run on Windows, and you mean to process either true binary data or text data as it is actually stored in the file.

Macintosh Line Conversions

As coded, the convertEndlines function does not support Macintosh single \r line terminators at all. It neither converts to Macintosh terminators from DOS and Unix format (\r\n and \n to \r), nor converts from Macintosh terminators to DOS or Unix format (\r to \r\n or \n). Files in Mac format pass untouched through both the "todos" and "tounix" conversions in this script (study the code to see why). I don't use a Mac, but some readers may.

Since adding Mac support would make this code more complex, and since I don't like publishing code in books unless it's been well tested, I'll leave such an extension as an exercise for the Mac Python users in the audience. But for implementation hints, see file PP2E\PyTools\fixeoln_one_mac.py on the CD (see http://examples.oreilly.com/python2). When run on Windows, it does to-Mac conversions:

C:\temp>python %X%\PyTools\fixeoln_one_mac.py tomac test.txt
Changing test.txt
Converted test.txt
 
C:\temp>python %X%\PyTools\dumpfile.py test.txt
----------------------------------------
'a\015b\015c\015'
----------------------------------------
0141    015     0142    015
0143    015
----------------------------------------
'a\015b\015c\015'

but fails to convert files already in Mac format to Unix or DOS, because the file readlines method does not treat a bare \r as a line break on that platform. The last output line is a single file line, as far as Windows is concerned; converting back to DOS just adds a single \n at its end.

5.2.2 Converting Line Ends in One Directory

Armed with a fully debugged single file converter, it's an easy step to add support for converting all files in a single directory. Simply call the single file converter on every filename returned by a directory listing tool. The script in Example 5-3 uses the glob module we met in Chapter 2Chapter 2 to grab a list of files to convert.

Example 5-3. PP2E\PyTools\fixeoln_dir.py
#########################################################
# Use: "python fixeoln_dir.py [tounix|todos] patterns?".
# convert end-lines in all the text files in the current
# directory (only: does not recurse to subdirectories). 
# Reuses converter in the single-file _one version.
#########################################################
 
import sys, glob
from fixeoln_one import convertEndlines
listonly = 0
patts = ['*.py', '*.pyw', '*.txt', '*.cgi', '*.html',    # text file names
         '*.c',  '*.cxx', '*.h',   '*.i',   '*.out',     # in this package
         'README*', 'makefile*', 'output*', '*.note']
 
if __name__ == '__main__':
    errmsg = 'Required first argument missing: "todos" or "tounix"'
    assert (len(sys.argv) >= 2 and sys.argv[1] in ['todos', 'tounix']), errmsg
 
    if len(sys.argv) > 2:                 # glob anyhow: '*' not applied on dos
        patts = sys.argv[2:]              # though not really needed on linux
    filelists = map(glob.glob, patts)     # name matches in this dir only 
 
    count = 0
    for list in filelists:
        for fname in list:
            if listonly:
                print count+1, '=>', fname
            else:
                convertEndlines(sys.argv[1], fname)
            count = count + 1
 
    print 'Visited %d files' % count

This module defines a list, patts, containing filename patterns that match all the kinds of text files that appear in the book examples tree; each pattern is passed to the built-in glob.glob call by map, to be separately expanded into a list of matching files. That's why there are nested for loops near the end -- the outer loop steps through each glob result list, and the inner steps through each name within each list. Try the map call interactively if this doesn't make sense:

>>> import glob
>>> map(glob.glob, ['*.py', '*.html'])
[['helloshell.py'], ['about-pp.html', 'about-pp2e.html', 'about-ppr2e.html']]

This script requires a convert mode flag on the command line, and assumes that it is run in the directory where files to be converted live; cd to the directory to be converted before running this script (or change it to accept a directory name argument too):

C:\temp\examples>python %X%\PyTools\fixeoln_dir.py tounix 
Changing Launcher.py
Changing Launch_PyGadgets.py
Changing LaunchBrowser.py
 ...lines deleted...
Changing PyDemos.pyw
Changing PyGadgets_bar.pyw
Changing README-PP2E.txt
Visited 21 files
 
C:\temp\examples>python %X%\PyTools\fixeoln_dir.py todos 
Changing Launcher.py
Changing Launch_PyGadgets.py
Changing LaunchBrowser.py
 ...lines deleted...
Changing PyDemos.pyw
Changing PyGadgets_bar.pyw
Changing README-PP2E.txt
Visited 21 files
 
C:\temp\examples>python %X%\PyTools\fixeoln_dir.py todos     # makes no changes
Visited 21 files
 
C:\temp\examples>fc PyDemos.pyw %X%\PyDemos.pyw 
Comparing files PyDemos.pyw and C:\PP2ndEd\examples\PP2E\PyDemos.pyw
FC: no differences encountered

Notice that the third command generated no "Changing" messages again. Because the convertEndlines function of the single-file module is reused here to perform the actual updates, this script inherits that function's repeatability : it's okay to rerun this script on the same directory any number of times. Only lines that require conversion will be converted. This script also accepts an optional list of filename patterns on the command line, to override the default patts list of files to be changed:

C:\temp\examples>python %X%\PyTools\fixeoln_dir.py tounix *.pyw *.csh
Changing echoEnvironment.pyw
Changing Launch_PyDemos.pyw
Changing Launch_PyGadgets_bar.pyw
Changing PyDemos.pyw
Changing PyGadgets_bar.pyw
Changing cleanall.csh
Changing makeall.csh
Changing package.csh
Changing setup-pp.csh
Changing setup-pp-embed.csh
Changing xferall.linux.csh
Visited 11 files
 
C:\temp\examples>python %X%\PyTools\fixeoln_dir.py tounix *.pyw *.csh
Visited 11 files

Also notice that the single-file script's convertEndlines function performs an initial os.path.isfile test to make sure the passed-in filename represents a file, not a directory; when we start globbing with patterns to collect files to convert, it's not impossible that a pattern's expansion might include the name of a directory along with the desired files.

Unix and Linux users: Unix-like shells automatically glob (i.e., expand) filename pattern operators like * in command lines before they ever reach your script. You generally need to quote such patterns to pass them in to scripts verbatim (e.g., "*.py").The fixeoln_dir script will still work if you don't—its glob.glob calls will simply find a single matching filename for each already-globbed name, and so have no effect:

>>>glob.glob('PyDemos.pyw')
['PyDemos.pyw']

Patterns are not pre-globbed in the DOS shell, though, so the glob.glob calls here are still a good idea in scripts that aspire to be as portable as this one.

5.2.3 Converting Line Ends in an Entire Tree

Finally, Example 5-4 applies what we've already learned to an entire directory tree. It simply runs the file-converter function to every filename produced by tree-walking logic. In fact, this script really just orchestrates calls to the original and already debugged convertEndlines function.

Example 5-4. PP2E\PyTools\fixeoln_all.py
#########################################################
# Use: "python fixeoln_all.py [tounix|todos] patterns?".
# find and convert end-of-lines in all text files at and
# below the directory where this script is run (the dir 
# you are in when you type 'python'). If needed, tries to 
# use the Python find.py library module, else reads the 
# output of a unix-style find command; uses a default 
# filename patterns list if patterns argument is absent.
# This script only changes files that need to be changed, 
# so it's safe to run brute-force from a root-level dir.
#########################################################
 
import os, sys, string
debug    = 0
pyfind   = 0      # force py find
listonly = 0      # 1=show find results only
 
def findFiles(patts, debug=debug, pyfind=pyfind):
    try:
        if sys.platform[:3] == 'win' or pyfind:
            print 'Using Python find'
            try:
                import find                        # use python-code find.py
            except ImportError:                    # use mine if deprecated!
                from PP2E.PyTools import find      # may get from my dir anyhow
            matches = map(find.find, patts)        # startdir default = '.'
        else:
            print 'Using find executable'
            matches = []
            for patt in patts:
                findcmd = 'find . -name "%s" -print' % patt  # run find command
                lines = os.popen(findcmd).readlines(  )        # remove endlines
                matches.append(map(string.strip, lines))     # lambda x: x[:-1]
    except:
        assert 0, 'Sorry - cannot find files'
    if debug: print matches
    return matches
 
if __name__ == '__main__':
    from fixeoln_dir import patts
    from fixeoln_one import convertEndlines
 
    errmsg = 'Required first argument missing: "todos" or "tounix"'
    assert (len(sys.argv) >= 2 and sys.argv[1] in ['todos', 'tounix']), errmsg
 
    if len(sys.argv) > 2:                  # quote in unix shell 
        patts = sys.argv[2:]               # else tries to expand
    matches = findFiles(patts)
 
    count = 0
    for matchlist in matches:                 # a list of lists
        for fname in matchlist:               # one per pattern
            if listonly:
                print count+1, '=>', fname 
            else:  
                convertEndlines(sys.argv[1], fname)
            count = count + 1
    print 'Visited %d files' % count

On Windows, the script uses the portable find.find built-in tool we met in Chapter 2 (either Python's or the hand-rolled equivalent)[3] to generate a list of all matching file and directory names in the tree; on other platforms, it resorts to spawning a less portable and probably slower find shell command just for illustration purposes.

Once the file pathname lists are compiled, this script simply converts each found file in turn using the single-file converter module's tools. Here is the collection of scripts at work converting the book examples tree on Windows; notice that this script also processes the current working directory (CWD; cd to the directory to be converted before typing the command line), and that Python treats forward and backward slashes the same in the program filename:

C:\temp\examples>python %X%/PyTools/fixeoln_all.py tounix 
Using Python find
Changing .\LaunchBrowser.py
Changing .\Launch_PyGadgets.py
Changing .\Launcher.py
Changing .\Other\cgimail.py
 ...lots of lines deleted...
Changing .\EmbExt\Exports\ClassAndMod\output.prog1
Changing .\EmbExt\Exports\output.prog1
Changing .\EmbExt\Regist\output
Visited 1051 files
 
C:\temp\examples>python %X%/PyTools/fixeoln_all.py todos 
Using Python find
Changing .\LaunchBrowser.py
Changing .\Launch_PyGadgets.py
Changing .\Launcher.py
Changing .\Other\cgimail.py
 ...lots of lines deleted...
Changing .\EmbExt\Exports\ClassAndMod\output.prog1
Changing .\EmbExt\Exports\output.prog1
Changing .\EmbExt\Regist\output
Visited 1051 files
 
C:\temp\examples>python %X%/PyTools/fixeoln_all.py todos 
Using Python find
Not a text file .\Embed\Inventory\Output
Not a text file .\Embed\Inventory\WithDbase\Output
Visited 1051 files

The first two commands convert over 1000 files, and usually take some eight seconds of real-world time to finish on my 650 MHz Windows 98 machine; the third takes only six seconds, because no files have to be updated (and fewer messages have to be scrolled on the screen). Don't take these figures too seriously, though; they can vary by system load, and much of this time is probably spent scrolling the script's output to the screen.

5.2.3.1 The view from the top

This script and its ancestors are shipped on the book's CD, as that portable converter tool I was looking for. To convert all examples files in the tree to Unix line-terminator format, simply copy the entire PP2E examples tree to some "examples" directory on your hard drive, and type these two commands in a shell:

cd examples/PP2E
python PyTools/fixeoln_all.py tounix

Of course, this assumes Python is already installed (see the CD's README file for details; see http://examples.oreilly.com/python2), but will work on almost every platform in use today.[4] To convert back to DOS, just replace "tounix" with "todos" and rerun. I ship this tool with a training CD for Python classes I teach too; to convert those files, we simply type:

cd Html\Examples
python ..\..\Tools\fixeoln_all.py tounix

Once you get accustomed to the command lines, you can use this in all sorts of contexts. Finally, to make the conversion easier for beginners to run, the top-level examples directory includes tounix.py and todos.py scripts that can be simply double-clicked in a file explorer GUI; Example 5-5 shows the "tounix" converter.

Example 5-5. PP2E\tounix.py
#!/usr/local/bin/python
######################################################################
# Run me to convert all text files to UNIX/Linux line-feed format.
# You only need to do this if you see odd '\r' characters at the end
# of lines in text files in this distribution, when they are viewed 
# with your text editor (e.g., vi).  This script converts all files 
# at and below the examples root, and only converts files that have  
# not already been converted (it's okay to run this multiple times).
#
# Since this is a Python script which runs another Python script, 
# you must install Python first to run this program; then from your
# system command-line (e.g., a xterm window), cd to the directory 
# where this script lives, and then type "python tounix.py".  You 
# may also be able to simply click on this file's icon in your file
# system explorer, if it knows what '.py' file are.
###################################################################### 
 
import os
prompt = """
This program converts all text files in the book
examples distribution to UNIX line-feed format.
Are you sure you want to do this (y=yes)? """
 
answer = raw_input(prompt) 
if answer not in ['y', 'Y', 'yes']:
    print 'Cancelled'
else:
    os.system('python PyTools/fixeoln_all.py tounix')

This script addresses the end user's perception of usability, but other factors impact programmer usability -- just as important to systems that will be read or changed by others. For example, the file, directory, and tree converters are coded in separate script files, but there is no law against combining them into a single program that relies on a command-line arguments pattern to know which of the three modes to run. The first argument could be a mode flag, tested by such a program:

if   mode == '-one':
    ...
elif mode == '-dir':
    ...
elif mode == '-all:
    ...

That seems more confusing than separate files per mode, though; it's usually much easier to botch a complex command line than to type a specific program file's name. It will also make for a confusing mix of global names, and one very big piece of code at the bottom of the file. As always, simpler is usually better.

5.3 Fixing DOS Filenames

The heart of the prior script was findFiles, a function than knows how to portably collect matching file and directory names in an entire tree, given a list of filename patterns. It doesn't do much more than the built-in find.find call, but can be augmented for our own purposes. Because this logic was bundled up in a function, though, it automatically becomes a reusable tool.

For example, the next script imports and applies findFiles, to collect all file names in a directory tree, by using the filename pattern * (it matches everything). I use this script to fix a legacy problem in the book's examples tree. The names of some files created under MS-DOS were made all uppercase; for example, spam.py became SPAM.PY somewhere along the way. Because case is significant both in Python and on some platforms, an import statement like "import spam" will sometimes fail for uppercase filenames.

To repair the damage everywhere in the thousand-file examples tree, I wrote and ran Example 5-6. It works like this: For every filename in the tree, it checks to see if the name is all uppercase, and asks the console user whether the file should be renamed with the os.rename call. To make this easy, it also comes up with a reasonable default for most new names -- the old one in all-lowercase form.

Example 5-6. PP2E\PyTools\fixnames_all.py
#########################################################
# Use: "python ..\..\PyTools\fixnames_all.py".
# find all files with all upper-case names at and below
# the current directory ('.'); for each, ask the user for
# a new name to rename the file to; used to catch old 
# uppercase file names created on MS-DOS (case matters on
# some platforms, when importing Python module files);
# caveats: this may fail on case-sensitive machines if 
# directory names are converted before their contents--the
# original dir name in the paths returned by find may no 
# longer exist; the allUpper heuristic also fails for 
# odd filenames that are all non-alphabetic (ex: '.');
#########################################################
 
import os, string
listonly = 0
 
def allUpper(name):
    for char in name:
        if char in string.lowercase:    # any lowercase letter disqualifies
            return 0                    # else all upper, digit, or special 
    return 1 
 
def convertOne(fname):
    fpath, oldfname = os.path.split(fname)
    if allUpper(oldfname):
        prompt = 'Convert dir=%s file=%s? (y|Y)' % (fpath, oldfname)
        if raw_input(prompt) in ['Y', 'y']:
            default  = string.lower(oldfname)
            newfname = raw_input('Type new file name (enter=%s): ' % default)
            newfname = newfname or default
            newfpath = os.path.join(fpath, newfname)
            os.rename(fname, newfpath)
            print 'Renamed: ', fname
            print 'to:      ', str(newfpath)
            raw_input('Press enter to continue')
            return 1
    return 0
 
if __name__ == '__main__':
    patts = "*"                              # inspect all file names
    from fixeoln_all import findFiles        # reuse finder function
    matches = findFiles(patts)
 
    ccount = vcount = 0
    for matchlist in matches:                # list of lists, one per pattern
        for fname in matchlist:              # fnames are full directory paths
            print vcount+1, '=>', fname      # includes names of directories 
            if not listonly:  
                ccount = ccount + convertOne(fname)
            vcount = vcount + 1
    print 'Converted %d files, visited %d' % (ccount, vcount)

As before, the findFiles function returns a list of simple filename lists, representing the expansion of all patterns passed in (here, just one result list, for the wildcard pattern * ).[5] For each file and directory name in the result, this script's convertOne function prompts for name changes; an os.path.split and an os.path.join call combination portably tacks the new filename onto the old directory name. Here is a renaming session in progress on Windows:

C:\temp\examples>python %X%\PyTools\fixnames_all.py 
Using Python find
1 => .\.cshrc
2 => .\LaunchBrowser.out.txt
3 => .\LaunchBrowser.py
...
 ...more deleted...
...
218 => .\Ai
219 => .\Ai\ExpertSystem
220 => .\Ai\ExpertSystem\TODO
Convert dir=.\Ai\ExpertSystem file=TODO? (y|Y)n 
221 => .\Ai\ExpertSystem\__init__.py
222 => .\Ai\ExpertSystem\holmes
223 => .\Ai\ExpertSystem\holmes\README.1ST
Convert dir=.\Ai\ExpertSystem\holmes file=README.1ST? (y|Y)y 
Type new file name (enter=readme.1st):
Renamed:  .\Ai\ExpertSystem\holmes\README.1st
to:       .\Ai\ExpertSystem\holmes\readme.1st
Press enter to continue
224 => .\Ai\ExpertSystem\holmes\README.2ND
Convert dir=.\Ai\ExpertSystem\holmes file=README.2ND? (y|Y)y 
Type new file name (enter=readme.2nd): readme-more 
Renamed:  .\Ai\ExpertSystem\holmes\README.2nd
to:       .\Ai\ExpertSystem\holmes\readme-more
Press enter to continue
...
 ...more deleted...
...
1471 => .\todos.py
1472 => .\tounix.py
1473 => .\xferall.linux.csh
Converted 2 files, visited 1473

This script could simply convert every all-uppercase name to an all-lowercase equivalent automatically, but that's potentially dangerous (some names might require mixed-case). Instead, it asks for input during the traversal, and shows the results of each renaming operation along the way.

5.3.1 Rewriting with os.path.walk

Notice, though, that the pattern-matching power of the find.find call goes completely unused in this script. Because it always must visit every file in the tree, the os.path.walk interface we studied in Chapter 2 would work just as well, and avoids any initial pause while a filename list is being collected (that pause is negligible here, but may be significant for larger trees). Example 5-7 is an equivalent version of this script that does its tree traversal with the walk callbacks-based model.

Example 5-7. PP2E\PyTools\fixnames_all2.py
###############################################################
# Use: "python ..\..\PyTools\fixnames_all2.py".
# same, but use the os.path.walk interface, not find.find;
# to make this work like the simple find version, puts of
# visiting directories until just before visiting their
# contents (find.find lists dir names before their contents);
# renaming dirs here can fail on case-sensitive platforms 
# too--walk keeps extending paths containing old dir names;
###############################################################
 
import os
listonly = 0
from fixnames_all import convertOne
 
def visitname(fname):
    global ccount, vcount
    print vcount+1, '=>', fname
    if not listonly:
        ccount = ccount + convertOne(fname)
    vcount = vcount + 1
 
def visitor(myData, directoryName, filesInDirectory):  # called for each dir 
    visitname(directoryName)                           # do dir we're in now,
    for fname in filesInDirectory:                     # and non-dir files here
        fpath = os.path.join(directoryName, fname)     # fnames have no dirpath
        if not os.path.isdir(fpath):
            visitname(fpath)
     
ccount = vcount = 0
os.path.walk('.', visitor, None)
print 'Converted %d files, visited %d' % (ccount, vcount)

This version does the same job, but visits one extra file (the topmost root directory), and may visit directories in a different order (os.listdir results are unordered). Both versions run in under a dozen seconds for the example directory tree on my computer.[6] We'll revisit this script, as well as the fixeoln line-end fixer, in the context of a general tree-walker class hierarchy later in this chapter.

5.4 Searching Directory Trees

Engineers love to change things. As I was writing this book, I found it almost irresistible to move and rename directories, variables, and shared modules in the book examples tree, whenever I thought I'd stumbled on to a more coherent structure. That was fine early on, but as the tree became more intertwined, this became a maintenance nightmare. Things like program directory paths and module names were hardcoded all over the place -- in package import statements, program startup calls, text notes, configuration files, and more.

One way to repair these references, of course, is to edit every file in the directory by hand, searching each for information that has changed. That's so tedious as to be utterly impossible in this book's examples tree, though; as I wrote these words, the example tree contained 118 directories and 1342 files! (To count for yourself, run a command-line python PyTools/visitor.py 1 in the PP2E examples root directory.) Clearly, I needed a way to automate updates after changes.

5.4.1 Greps and Globs in Shells and Python

There is a standard way to search files for strings on Unix and Linux systems: the command-line program grep and its relatives list all lines in one or more files containing a string or string pattern.[7] Given that Unix shells expand (i.e., "glob") filename patterns automatically, a command such as grep popen *.py will search a single directory's Python files for string "popen". Here's such a command in action on Windows (I installed a commercial Unix-like fgrep program on my Windows 98 laptop because I missed it too much there):

C:\...\PP2E\System\Filetools>fgrep popen *.py
diffall.py:# - we could also os.popen a diff (unix) or fc (dos)
dirdiff.py:# - use os.popen('ls...') or glob.glob + os.path.split
dirdiff6.py:    files1 = os.popen('ls %s' % dir1).readlines(  )
dirdiff6.py:    files2 = os.popen('ls %s' % dir2).readlines(  )
testdirdiff.py:    expected = expected + os.popen(test % 'dirdiff').read(  )
testdirdiff.py:        output = output + os.popen(test % script).read(  )

DOS has a command for searching files too -- find, not to be confused with the Unix find directory walker command:

C:\...\PP2E\System\Filetools>find /N "popen" testdirdiff.py
 
---------- testdirdiff.py
[8]    expected = expected + os.popen(test % 'dirdiff').read(  )
[15]        output = output + os.popen(test % script).read(  )

You can do the same within a Python script, by either running the previously mentioned shell command with os.system or os.popen, or combining the grep and glob built-in modules. We met the glob module in Chapter 2; it expands a filename pattern into a list of matching filename strings (much like a Unix shell). The standard library also includes a grep module, which acts like a Unix grep command: grep.grep prints lines containing a pattern string among a set of files. When used with glob, the effect is much like the fgrep command:

>>> from grep import grep
>>> from glob import glob
>>> grep('popen', glob('*.py'))
diffall.py:  16: # - we could also os.popen a diff (unix) or fc (dos)
dirdiff.py:  12: # - use os.popen('ls...') or glob.glob + os.path.split
dirdiff6.py:  19:     files1 = os.popen('ls %s' % dir1).readlines(  )
dirdiff6.py:  20:     files2 = os.popen('ls %s' % dir2).readlines(  )
testdirdiff.py:   8:     expected = expected + os.popen(test % 'dirdiff')...
testdirdiff.py:  15:         output = output + os.popen(test % script).read(  )
 
>>> import glob, grep
>>> grep.grep('system', glob.glob('*.py'))
dirdiff.py:  16: # - on unix systems we could do something similar by
regtest.py:  18:         os.system('%s < %s > %s.out 2>&1' % (program, ...
regtest.py:  23:         os.system('%s < %s > %s.out 2>&1' % (program, ...
regtest.py:  24:         os.system('diff %s.out %s.out.bkp > %s.diffs' ...

The grep module is written in pure Python code (no shell commands are run), is completely portable, and accepts both simple strings and general regular expression patterns as the search key (regular expressions appear later in this text). Unfortunately, it is also limited in two major ways:

·         It simply prints matching lines instead of returning them in a list for later processing. We could intercept and split its output by redirecting sys.stdin to an object temporarily (Chapter 2 showed how), but that's fairly inconvenient.[8]

·         More crucial here, the grep/glob combination still inspects only a single directory ; as we also saw in Chapter 2, we need to do more to search all files in an entire directory tree.

On Unix systems, we can work around the second of these limitations by running a grep shell command from within a find shell command. For instance, the following Unix command line:

find . -name "*.py" -print -exec fgrep popen {} \;

would pinpoint lines and files at and below the current directory that mention "popen". If you happen to have a Unix-like find command on every machine you will ever use, this is one way to process directories.

5.4.1.1 Cleaning up bytecode files

I used to run the script in Example 5-8 on some of my machines to remove all .pyc bytecode files in the examples tree before packaging or upgrading Pythons (it's not impossible that old binary bytecode files are not forward-compatible with newer Python releases).

Example 5-8. PP2E\PyTools\cleanpyc.py
###########################################################
# find and delete all "*.pyc" bytecode files at and below
# the directory where this script is run; this assumes a 
# Unix-like find command, and so is very non-portable; we
# could instead use the Python find module, or just walk 
# the directry trees with portable Python code; the find
# -exec option can apply a Python script to each file too;
###########################################################
 
import os, sys
 
if sys.platform[:3] == 'win':
    findcmd = r'c:\stuff\bin.mks\find . -name "*.pyc" -print'
else:
    findcmd = 'find . -name "*.pyc" -print'
print findcmd
 
count = 0
for file in os.popen(findcmd).readlines(  ):        # for all file names
    count = count + 1                             # have \n at the end
    print str(file[:-1])
    os.remove(file[:-1])
 
print 'Removed %d .pyc files' % count

This script uses os.popen to collect the output of a commercial package's find program installed on one of my Windows computers, or else the standard find tool on the Linux side. It's also completely nonportable to Windows machines that don't have the commercial find program installed, and that includes other computers in my house, and most of the world at large.

Python scripts can reuse underlying shell tools with os.popen, but by so doing they lose much of the portability advantage of the Python language. The Unix find command is both not universally available, and is a complex tool by itself (in fact, too complex to cover in this book; see a Unix manpage for more details). As we saw in Chapter 2, spawning a shell command also incurs a performance hit, because it must start a new independent program on your computer.

To avoid some of the portability and performance costs of spawning an underlying find command, I eventually recoded this script to use the find utilities we met and wrote Chapter 2. The new script is shown in Example 5-9.

Example 5-9. PP2E\PyTools\cleanpyc-py.py
###########################################################
# find and delete all "*.pyc" bytecode files at and below
# the directory where this script is run; this uses a 
# Python find call, and so is portable to most machines;
# run this to delete .pyc's from an old Python release;
# cd to the directory you want to clean before running;
###########################################################
 
import os, sys, find              # here, gets PyTools find
 
count = 0
for file in find.find("*.pyc"):   # for all file names
    count = count + 1
    print file
    os.remove(file)
 
print 'Removed %d .pyc files' % count

This works portably, and avoids external program startup costs. But find is really just a tree-searcher that doesn't let you hook into the tree search -- if you need to do something unique while traversing a directory tree, you may be better off using a more manual approach. Moreover, find must collect all names before it returns; in very large directory trees, this may introduce significant performance and memory penalties. It's not an issue for my trees, but your trees may vary.

5.4.2 A Python Tree Searcher

To help ease the task of performing global searches on all platforms I might ever use, I coded a Python script to do most of the work for me. Example 5-10 employs standard Python tools we met in the preceding chapters:

·         os.path.walk to visit files in a directory

·         sting.find to search for a string in a text read from a file

·         os.path.splitext to skip over files with binary-type extensions

·         os.path.join to portably combine a directory path and filename

·         os.path.isdir to skip paths that refer to directories, not files

Because it's pure Python code, though, it can be run the same way on both Linux and Windows. In fact, it should work on any computer where Python has been installed. Moreover, because it uses direct system calls, it will likely be faster than using op.popen to spawn a find command that spawns many grep commands.

Example 5-10. PP2E\PyTools\search_all.py
#########################################################
# Use: "python ..\..\PyTools\search_all.py string".
# search all files at and below current directory
# for a string; uses the os.path.walk interface,
# rather than doing a find to collect names first;
#########################################################
 
import os, sys, string
listonly = 0
skipexts = ['.gif', '.exe', '.pyc', '.o', '.a']        # ignore binary files
 
def visitfile(fname, searchKey):                       # for each non-dir file
    global fcount, vcount                              # search for string
    print vcount+1, '=>', fname                        # skip protected files
    try:
        if not listonly:
            if os.path.splitext(fname)[1] in skipexts:
                print 'Skipping', fname
            elif string.find(open(fname).read(  ), searchKey) != -1:
                raw_input('%s has %s' % (fname, searchKey))
                fcount = fcount + 1
    except: pass
    vcount = vcount + 1
 
def visitor(myData, directoryName, filesInDirectory):  # called for each dir 
    for fname in filesInDirectory:                     # do non-dir files here
        fpath = os.path.join(directoryName, fname)     # fnames have no dirpath
        if not os.path.isdir(fpath):                   # myData is searchKey
            visitfile(fpath, myData)
     
def searcher(startdir, searchkey):
    global fcount, vcount
    fcount = vcount = 0
    os.path.walk(startdir, visitor, searchkey)
 
if __name__ == '__main__':
    searcher('.', sys.argv[1])
    print 'Found in %d files, visited %d' % (fcount, vcount)

This file also uses the sys.argv command-line list and the __name__ trick for running in two modes. When run standalone, the search key is passed on the command line; when imported, clients call this module's searcher function directly. For example, to search (grep) for all appearances of directory name "Part2" in the examples tree (an old directory that really did go away!), run a command line like this in a DOS or Unix shell:

C:\...\PP2E>python PyTools\search_all.py Part2 
1 => .\autoexec.bat
2 => .\cleanall.csh
3 => .\echoEnvironment.pyw
4 => .\Launcher.py
.\Launcher.py has Part2 
5 => .\Launcher.pyc
Skipping .\Launcher.pyc
6 => .\Launch_PyGadgets.py
7 => .\Launch_PyDemos.pyw
8 => .\LaunchBrowser.out.txt
.\LaunchBrowser.out.txt has Part2 
9 => .\LaunchBrowser.py
.\LaunchBrowser.py has Part2 
...
 ...more lines deleted
...
1339 => .\old_Part2\Basics\unpack2b.py
1340 => .\old_Part2\Basics\unpack3.py
1341 => .\old_Part2\Basics\__init__.py
Found in 74 files, visited 1341

The script lists each file it checks as it goes, tells you which files it is skipping (names that end in extensions listed in variable skipexts that imply binary data), and pauses for an Enter key press each time it announces a file containing the search string (bold lines). A solution based on find could not pause this way; although trivial in this example, find doesn't return until the entire tree traversal is finished. The search_all script works the same when imported instead of run, but there is no final statistics output line (fcount and vcount live in the module, and so would have to be imported to be inspected here):

>>> from PP2E.PyTools.search_all import searcher 
>>> searcher('.', '-exec')           # find files with string '-exec'
1 => .\autoexec.bat
2 => .\cleanall.csh
3 => .\echoEnvironment.pyw
4 => .\Launcher.py
5 => .\Launcher.pyc
Skipping .\Launcher.pyc
6 => .\Launch_PyGadgets.py
7 => .\Launch_PyDemos.pyw
8 => .\LaunchBrowser.out.txt
9 => .\LaunchBrowser.py
10 => .\Launch_PyGadgets_bar.pyw
11 => .\makeall.csh
12 => .\package.csh
.\package.csh has -exec 
 ...more lines deleted...

However launched, this script tracks down all references to a string in an entire directory tree -- a name of a changed book examples file, object, or directory, for instance.[9]

5.5 Visitor: Walking Trees Generically

Armed with the portable search_all script from Example 5-10, I was able to better pinpoint files to be edited, every time I changed the book examples tree structure. At least initially, I ran search_all to pick out suspicious files in one window, and edited each along the way by hand in another window.

Pretty soon, though, this became tedious too. Manually typing filenames into editor commands is no fun, especially when the number of files to edit is large. The search for "Part2" shown earlier returned 74 files, for instance. Since there are at least occasionally better things to do than manually start 74 editor sessions, I looked for a way to automatically run an editor on each suspicious file.

Unfortunately, search_all simply prints results to the screen. Although that text could be intercepted and parsed, a more direct approach that spawns edit sessions during the search may be easier, but may require major changes to the tree search script as currently coded. At this point, two thoughts came to mind.

First, I knew it would be easier in the long-run to be able to add features to a general directory searcher as external components, not by changing the original script. Because editing files was just one possible extension (what about automating text replacements too?), a more generic, customizable, and reusable search component seemed the way to go.

Second, after writing a few directory walking utilities, it became clear that I was rewriting the same sort of code over and over again. Traversals could be even further simplified by wrapping common details for easier reuse. The os.path.walk tool helps, but its use tends to foster redundant operations (e.g., directory name joins), and its function-object-based interface doesn't quite lend itself to customization the way a class can.

Of course, both goals point to using an OO framework for traversals and searching. Example 5-11 is one concrete realization of these goals. It exports a general FileVisitor class that mostly just wraps os.path.walk for easier use and extension, as well as a generic SearchVisitor class that generalizes the notion of directory searches. By itself, SearchVisitor simply does what search_all did, but it also opens up the search process to customization -- bits of its behavior can be modified by overloading its methods in subclasses. Moreover, its core search logic can be reused everywhere we need to search; simply define a subclass that adds search-specific extensions.

Example 5-11. PP2E\PyTools\visitor.py
#############################################################
# Test: "python ..\..\PyTools\visitor.py testmask [string]".
# Uses OOP, classes, and subclasses to wrap some of the 
# details of using os.path.walk to walk and search; testmask 
# is an integer bitmask with 1 bit per available selftest;
# see also: visitor_edit/replace/find/fix*/.py subclasses,
# and the fixsitename.py client script in Internet\Cgi-Web;
#############################################################
 
import os, sys, string
listonly = 0
 
class FileVisitor:
    """
    visits all non-directory files below startDir;
    override visitfile to provide a file handler
    """
    def __init__(self, data=None, listonly=0):
        self.context  = data
        self.fcount   = 0
        self.dcount   = 0
        self.listonly = listonly
    def run(self, startDir=os.curdir):                  # default start='.'
        os.path.walk(startDir, self.visitor, None)    
    def visitor(self, data, dirName, filesInDir):       # called for each dir 
        self.visitdir(dirName)                          # do this dir first
        for fname in filesInDir:                        # do non-dir files 
            fpath = os.path.join(dirName, fname)        # fnames have no path
            if not os.path.isdir(fpath):
                self.visitfile(fpath)
    def visitdir(self, dirpath):                        # called for each dir
        self.dcount = self.dcount + 1                   # override or extend me
        print dirpath, '...'
    def visitfile(self, filepath):                      # called for each file
        self.fcount = self.fcount + 1                   # override or extend me
        print self.fcount, '=>', filepath               # default: print name
 
class SearchVisitor(FileVisitor):
    """ 
    search files at and below startDir for a string
    """
    skipexts = ['.gif', '.exe', '.pyc', '.o', '.a']     # skip binary files
    def __init__(self, key, listonly=0):
        FileVisitor.__init__(self, key, listonly)
        self.scount = 0
    def visitfile(self, fname):                         # test for a match
        FileVisitor.visitfile(self, fname)
        if not self.listonly:
            if os.path.splitext(fname)[1] in self.skipexts:
                print 'Skipping', fname
            else:
                text = open(fname).read(  )
                if string.find(text, self.context) != -1:
                    self.visitmatch(fname, text)
                    self.scount = self.scount + 1
    def visitmatch(self, fname, text):                     # process a match
        raw_input('%s has %s' % (fname, self.context))     # override me lower
 
 
# self-test logic
dolist   = 1
dosearch = 2    # 3=do list and search
donext   = 4    # when next test added
 
def selftest(testmask):
    if testmask & dolist:
       visitor = FileVisitor(  )
       visitor.run('.')
       print 'Visited %d files and %d dirs' % (visitor.fcount, visitor.dcount)
 
    if testmask & dosearch:   
       visitor = SearchVisitor(sys.argv[2], listonly)
       visitor.run('.')
       print 'Found in %d files, visited %d' % (visitor.scount, visitor.fcount)
 
if __name__ == '__main__':
    selftest(int(sys.argv[1]))    # e.g., 5 = dolist | dorename

This module primarily serves to export classes for external use, but it does something useful when run standalone too. If you invoke it as a script with a single argument "1", it makes and runs a FileVisitor object, and prints an exhaustive listing of every file and directory at and below the place you are at when the script is invoked (i.e., ".", the current working directory):

C:\temp>python %X%\PyTools\visitor.py 1 
. ...
1 => .\autoexec.bat
2 => .\cleanall.csh
3 => .\echoEnvironment.pyw
4 => .\Launcher.py
5 => .\Launcher.pyc
6 => .\Launch_PyGadgets.py
7 => .\Launch_PyDemos.pyw
 ...more deleted...
479 => .\Gui\Clock\plotterGui.py
480 => .\Gui\Clock\plotterText.py
481 => .\Gui\Clock\plotterText1.py
482 => .\Gui\Clock\__init__.py
.\Gui\gifs ...
483 => .\Gui\gifs\frank.gif
484 => .\Gui\gifs\frank.note
485 => .\Gui\gifs\gilligan.gif
486 => .\Gui\gifs\gilligan.note
 ...more deleted...
1352 => .\PyTools\visitor_fixnames.py
1353 => .\PyTools\visitor_find_quiet2.py
1354 => .\PyTools\visitor_find.pyc
1355 => .\PyTools\visitor_find_quiet1.py
1356 => .\PyTools\fixeoln_one.doc.txt
Visited 1356 files and 119 dirs

If you instead invoke this script with a "2" as its first argument, it makes and runs a SearchVisitor object, using the second argument as the search key. This form is equivalent to running the search_all.py script we met earlier; it pauses for an Enter key press after each matching file is reported (lines in bold font here):

C:\temp\examples>python %X%\PyTools\visitor.py 2 Part3 
. ...
1 => .\autoexec.bat
2 => .\cleanall.csh
.\cleanall.csh has Part3 
3 => .\echoEnvironment.pyw
4 => .\Launcher.py
.\Launcher.py has Part3 
5 => .\Launcher.pyc
Skipping .\Launcher.pyc
6 => .\Launch_PyGadgets.py
7 => .\Launch_PyDemos.pyw
8 => .\LaunchBrowser.out.txt
9 => .\LaunchBrowser.py
10 => .\Launch_PyGadgets_bar.pyw
11 => .\makeall.csh
.\makeall.csh has Part3 
...
 ...more deleted
...
1353 => .\PyTools\visitor_find_quiet2.py
1354 => .\PyTools\visitor_find.pyc
Skipping .\PyTools\visitor_find.pyc
1355 => .\PyTools\visitor_find_quiet1.py
1356 => .\PyTools\fixeoln_one.doc.txt
Found in 49 files, visited 1356

Technically, passing this script a first argument "3" runs both a FileVisitor and a SearchVisitor (two separate traversals are performed). The first argument is really used as a bitmask to select one or more supported self-tests -- if a test's bit is on in the binary value of the argument, the test will be run. Because 3 is 011 in binary, it selects both a search (010) and a listing (001). In a more user-friendly system we might want to be more symbolic about that (e.g., check for "-search" and "-list" arguments), but bitmasks work just as well for this script's scope.

Text Editor War and Peace

In case you don't know, the vi setting used in the visitor_edit.py script is a Unix text editor; it's available for Windows too, but is not standard there. If you run this script, you'll probably want to change its editor setting on your machine. For instance, "emacs" should work on Linux, and "edit" or "notepad" should work on all Windows boxes.

These days, I tend to use an editor I coded in Python (PyEdit), so I'll leave the editor wars to more politically-minded readers. In fact, changing the script to assign editor either of these ways:

    editor = r'python Gui\TextEditor\textEditor.pyw'
    editor = r'start  Gui\TextEditor\textEditor.pyw'

will open the matched file in a pure and portable Python text editor GUI -- one coded in Python with the Tkinter interface, which runs on all major GUI platforms, and which we'll meet in Chapter 9. If you read about the start command in Chapter 3, you know that the first editor setting pauses the traversal while the editor runs, but the second does not (you'll get as many PyEdit windows as there are matched files).

This may fail, however, for very long file directory names (remember, os.system has a length limit unlike os.spawnv). Moreover, the path to the textEditor.pyw program may vary depending on where you are when you run visitor_edit.py (i.e., the CWD). There are ways around this latter problem:

·         Prefixing the script's path string with the value of the PP2EHOME shell variable, fetched with os.environ; with the standard book setup scripts, PP2EHOME gives the absolute root directory, from which the editor script's path can be found.

·         Prefixing the path with sys.path[0] and a '../' to exploit the fact that the first import directory is always the script's home directory (see Section 2.7 in Chapter 2).

·         Windows shortcuts or Unix links to the editor script from the CWD.

·         Searching for the script naively with Launcher.findFirst or guessLocation, described near the end of Chapter 4.

But these are all beyond the scope of a sidebar on text editor politics.

5.5.1 Editing Files in Directory Trees

Now, after genericizing tree traversals and searches, it's an easy step to add automatic file editing in a brand-new, separate component. Example 5-12 defines a new EditVisitor class that simply customizes the visitmatch method of the SearchVisitor class, to open a text editor on the matched file. Yes, this is the complete program -- it needs to do something special only when visiting matched files, and so need provide only that behavior; the rest of the traversal and search logic is unchanged and inherited.

Example 5-12. PP2E\PyTools\visitor_edit.py
###############################################################
# Use: "python PyTools\visitor_edit.py string".
# add auto-editor start up to SearchVisitor in an external
# component (subclass), not in-place changes; this version 
# automatically pops up an editor on each file containing the
# string as it traverses; you can also use editor='edit' or 
# 'notepad' on windows; 'vi' and 'edit' run in console window;
# editor=r'python Gui\TextEditor\textEditor.pyw' may work too;
# caveat: we might be able to make this smarter by sending
# a search command to go to the first match in some editors; 
###############################################################
 
import os, sys, string
from visitor import SearchVisitor
listonly = 0
 
class EditVisitor(SearchVisitor):
    """ 
    edit files at and below startDir having string
    """
    editor = 'vi'  # ymmv
    def visitmatch(self, fname, text):
        os.system('%s %s' % (self.editor, fname))
 
if __name__  == '__main__':
    visitor = EditVisitor(sys.argv[1], listonly)
    visitor.run('.')
    print 'Edited %d files, visited %d' % (visitor.scount, visitor.fcount)

When we make and run an EditVisitor, a text editor is started with the os.system command-line spawn call, which usually blocks its caller until the spawned program finishes. On my machines, each time this script finds a matched file during the traversal, it starts up the vi text editor within the console window where the script was started; exiting the editor resumes the tree walk.

Let's find and edit some files. When run as a script, we pass this program the search string as a command argument (here, the string "-exec" is the search key, not an option flag). The root directory is always passed to the run method as ".", the current run directory. Traversal status messages show up in the console as before, but each matched file now automatically pops up in a text editor along the way. Here, the editor is started eight times:

C:\...\PP2E>python PyTools\visitor_edit.py -exec 
1 => .\autoexec.bat
2 => .\cleanall.csh
3 => .\echoEnvironment.pyw
4 => .\Launcher.py
5 => .\Launcher.pyc
Skipping .\Launcher.pyc
 ...more deleted...
1340 => .\old_Part2\Basics\unpack2.py
1341 => .\old_Part2\Basics\unpack2b.py
1342 => .\old_Part2\Basics\unpack3.py
1343 => .\old_Part2\Basics\__init__.py
Edited 8 files, visited 1343

This, finally, is the exact tool I was looking for to simplify global book examples tree maintenance. After major changes to things like shared modules and file and directory names, I run this script on the examples root directory with an appropriate search string, and edit any files it pops up as needed. I still need to change files by hand in the editor, but that's often safer than blind global replacements.

5.5.2 Global Replacements in Directory Trees

But since I brought it up: given a general tree traversal class, it's easy to code a global search-and-replace subclass too. The FileVisitor subclass in Example 5-13, ReplaceVisitor, customizes the visitfile method to globally replace any appearances of one string with another, in all text files at and below a root directory. It also collects the names of all files that were changed in a list, just in case you wish to go through and verify the automatic edits applied (a text editor could be automatically popped up on each changed file, for instance).

Example 5-13. PP2E\PyTools\visitor_replace.py
################################################################
# Use: "python PyTools\visitor_replace.py fromStr toStr".
# does global search-and-replace in all files in a directory
# tree--replaces fromStr with toStr in all text files; this
# is powerful but dangerous!! visitor_edit.py runs an editor
# for you to verify and make changes, and so is much safer;
# use CollectVisitor to simply collect a list of matched files;
################################################################
 
import os, sys, string
from visitor import SearchVisitor
listonly = 0
 
class ReplaceVisitor(SearchVisitor):
    """ 
    change fromStr to toStr in files at and below startDir;
    files changed available in obj.changed list after a run
    """
    def __init__(self, fromStr, toStr, listonly=0):
        self.changed = []
        self.toStr   = toStr
        SearchVisitor.__init__(self, fromStr, listonly)
    def visitmatch(self, fname, text):
        fromStr, toStr = self.context, self.toStr
        text = string.replace(text, fromStr, toStr)
        open(fname, 'w').write(text)
        self.changed.append(fname)
 
if __name__  == '__main__':
    if raw_input('Are you sure?') == 'y':
        visitor = ReplaceVisitor(sys.argv[1], sys.argv[2], listonly)
        visitor.run(startDir='.')
        print 'Visited %d files'  % visitor.fcount
        print 'Changed %d files:' % len(visitor.changed)
        for fname in visitor.changed: print fname 

To run this script over a directory tree, go to the directory to be changed and run the following sort of command line, with "from" and "to" strings. On my current machine, doing this on a 1354-file tree and changing 75 files along the way takes roughly six seconds of real clock time when the system isn't particularly busy:

C:\temp\examples>python %X%/PyTools/visitor_replace.py Part2 SPAM2 
Are you sure?
. ...
1 => .\autoexec.bat
2 => .\cleanall.csh
3 => .\echoEnvironment.pyw
4 => .\Launcher.py
5 => .\Launcher.pyc
Skipping .\Launcher.pyc
6 => .\Launch_PyGadgets.py
 ...more deleted...
1351 => .\PyTools\visitor_find_quiet2.py
1352 => .\PyTools\visitor_find.pyc
Skipping .\PyTools\visitor_find.pyc
1353 => .\PyTools\visitor_find_quiet1.py
1354 => .\PyTools\fixeoln_one.doc.txt
Visited 1354 files
Changed 75 files:
.\Launcher.py
.\LaunchBrowser.out.txt
.\LaunchBrowser.py
.\PyDemos.pyw
.\PyGadgets.py
.\README-PP2E.txt
 ...more deleted...
.\PyTools\search_all.out.txt
.\PyTools\visitor.out.txt
.\PyTools\visitor_edit.py
 
[to delete, use an empty toStr]
C:\temp\examples>python %X%/PyTools/visitor_replace.py SPAM "" 

This is both wildly powerful and dangerous. If the string to be replaced is something that can show up in places you didn't anticipate, you might just ruin an entire tree of files by running the ReplaceVisitor object defined here. On the other hand, if the string is something very specific, this object can obviate the need to automatically edit suspicious files. For instance, we will use this approach to automatically change web site addresses in HTML files in Chapter 12; the addresses are likely too specific to show up in other places by chance.

5.5.3 Collecting Matched Files in Trees

The scripts so far search and replace in directory trees, using the same traversal code base (module visitor). Suppose, though, that you just want to get a Python list of files in a directory containing a string. You could run a search and parse the output messages for "found" messages. Much simpler, simply knock off another SearchVisitor subclass to collect the list along the way, as in Example 5-14.

Example 5-14. PP2E\PyTools\visitor_collect.py
#################################################################
# Use: "python PyTools\visitor_collect.py searchstring".
# CollectVisitor simply collects a list of matched files, for
# display or later processing (e.g., replacement, auto-editing);
#################################################################
 
import os, sys, string
from visitor import SearchVisitor
 
class CollectVisitor(SearchVisitor):
    """
    collect names of files containing a string;
    run this and then fetch its obj.matches list
    """
    def __init__(self, searchstr, listonly=0):
        self.matches = []
        SearchVisitor.__init__(self, searchstr, listonly) 
    def visitmatch(self, fname, text):
        self.matches.append(fname)
 
if __name__  == '__main__':
    visitor = CollectVisitor(sys.argv[1])
    visitor.run(startDir='.')
    print 'Found these files:'
    for fname in visitor.matches: print fname 

CollectVisitor is just tree search again, with a new kind of specialization -- collecting files, instead of printing messages. This class is useful from other scripts that mean to collect a matched files list for later processing; it can be run by itself as a script too:

C:\...\PP2E>python PyTools\visitor_collect.py -exec 
...
 ...more deleted...
...
1342 => .\old_Part2\Basics\unpack2b.py
1343 => .\old_Part2\Basics\unpack3.py
1344 => .\old_Part2\Basics\__init__.py
Found these files:
.\package.csh
.\README-PP2E.txt
.\readme-old-pp1E.txt
.\PyTools\cleanpyc.py
.\PyTools\fixeoln_all.py
.\System\Processes\output.txt
.\Internet\Cgi-Web\fixcgi.py
5.5.3.1 Suppressing status messages

Here, the items in the collected list are displayed at the end -- all the files containing the string "-exec". Notice, though, that traversal status messages are still printed along the way (in fact, I deleted about 1600 lines of such messages here!). In a tool meant to be called from another script, that may be an undesirable side effect; the calling script's output may be more important than the traversal's.

We could add mode flags to SearchVisitor to turn off status messages, but that makes it more complex. Instead, the following two files show how we might go about collecting matched filenames without letting any traversal messages show up in the console, all without changing the original code base. The first, shown in Example 5-15, simply takes over and copies the search logic, without print statements. It's a bit redundant with SearchVisitor, but only in a few lines of mimicked code.

Example 5-15. PP2E\PyTools\visitor_collect_quiet1.py
##############################################################
# Like visitor_collect, but avoid traversal status messages
##############################################################
 
import os, sys, string
from visitor import FileVisitor, SearchVisitor
 
class CollectVisitor(FileVisitor):
    """
    collect names of files containing a string, silently;
    """
    skipexts = SearchVisitor.skipexts
    def __init__(self, searchStr):
        self.matches = []
        self.context = searchStr
    def visitdir(self, dname): pass
    def visitfile(self, fname):
        if (os.path.splitext(fname)[1] not in self.skipexts and
            string.find(open(fname).read(  ), self.context) != -1):
            self.matches.append(fname)
 
if __name__  == '__main__':
    visitor = CollectVisitor(sys.argv[1])
    visitor.run(startDir='.')
    print 'Found these files:'
    for fname in visitor.matches: print fname 

When this class is run, only the contents of the matched filenames list show up at the end; no status messages appear during the traversal. Because of that, this form may be more useful as a general-purpose tool used by other scripts:

C:\...\PP2E>python PyTools\visitor_collect_quiet1.py -exec
Found these files:
.\package.csh
.\README-PP2E.txt
.\readme-old-pp1E.txt
.\PyTools\cleanpyc.py
.\PyTools\fixeoln_all.py
.\System\Processes\output.txt
.\Internet\Cgi-Web\fixcgi.py

A more interesting and less redundant way to suppress printed text during a traversal is to apply the stream redirection tricks we met in Chapter 2. Example 5-16 sets sys.stdin to a NullOut object that throws away all printed text for the duration of the traversal (its write method does nothing).

The only real complication with this scheme is that there is no good place to insert a restoration of sys.stdout at the end of the traversal; instead, we code the restore in the __del__ destructor method, and require clients to delete the visitor to resume printing as usual. An explicitly called method would work just as well, if you prefer less magical interfaces.

Example 5-16. PP2E\PyTools\visitor_collect_quiet2.py
##############################################################
# Like visitor_collect, but avoid traversal status messages
##############################################################
 
import os, sys, string
from visitor import SearchVisitor
 
class NullOut:
    def write(self, line): pass
 
class CollectVisitor(SearchVisitor):
    """
    collect names of files containing a string, silently
    """
    def __init__(self, searchstr, listonly=0):
        self.matches = []
        self.saveout, sys.stdout = sys.stdout, NullOut(  )
        SearchVisitor.__init__(self, searchstr, listonly) 
    def __del__(self):
        sys.stdout = self.saveout
    def visitmatch(self, fname, text):
        self.matches.append(fname)
 
if __name__  == '__main__':
    visitor = CollectVisitor(sys.argv[1])
    visitor.run(startDir='.')
    matches = visitor.matches
    del visitor
    print 'Found these files:'
    for fname in matches: print fname 

When this script is run, output is identical to the prior run -- just the matched filenames at the end. Perhaps better still, why not code and debug just one verbose CollectVisitor utility class, and require clients to wrap calls to its run method in the redirect.redirect function we wrote back in Example 2-10 ?

>>> from PP2E.PyTools.visitor_collect import CollectVisitor
>>> from PP2E.System.Streams.redirect import redirect
>>> walker = CollectVisitor('-exec')                   # object to find '-exec'
>>> output = redirect(walker.run, ('.',), '')          # function, args, input
>>> for line in walker.matches: print line             # print items in list
...
.\package.csh
.\README-PP2E.txt
.\readme-old-pp1E.txt
.\PyTools\cleanpyc.py
.\PyTools\fixeoln_all.py
.\System\Processes\output.txt
.\Internet\Cgi-Web\fixcgi.py

The redirect call employed here resets standard input and output streams to file-like objects for the duration of any function call; because of that, it's a more general way to suppress output than recoding every outputter. Here, it has the effect of intercepting (and hence suppressing) printed messages during a walker.run('.') traversal. They really are printed, but show up in the string result of the redirect call, not on the screen:

>>> output[:60]
'. ...\0121 => .\\autoexec.bat\0122 => .\\cleanall.csh\0123 => .\\echoEnv'
 
>>> import string
>>> len(output), len(string.split(output, '\n'))           # bytes, lines
(67609, 1592)
 
>>> walker.matches
['.\\package.csh', '.\\README-PP2E.txt', '.\\readme-old-pp1E.txt', 
'.\\PyTools\\cleanpyc.py', '.\\PyTools\\fixeoln_all.py',
'.\\System\\Processes\\output.txt', 
'.\\Internet\\Cgi-Web\\fixcgi.py']

Because redirect saves printed text in a string, it may be less appropriate than the two quiet CollectVisitor variants for functions that generate much output. Here, for example, 67,609 bytes of output was queued up in an in-memory string (see the len call results); such a buffer may or may not be significant in some applications.

In more general terms, redirecting sys.stdout to dummy objects as done here is a simple way to turn off outputs (and is the equivalent to the Unix notion of redirecting output to file /dev/null -- a file that discards everything sent to it). For instance, we'll pull this trick out of the bag again in the context of server-side Internet scripting, to prevent utility status messages from showing up in generated web page output streams.[10]

5.5.4 Recoding Fixers with Visitors

Be warned: once you've written and debugged a class that knows how to do something useful like walking directory trees, it's easy for it to spread throughout your system utility libraries. Of course, that's the whole point of code reuse. For instance, very soon after writing the visitor classes presented in the prior sections, I recoded both the fixnames_all.py and fixeoln_all.py directory walker scripts listed earlier in Examples Example 5-6 and Example 5-4, respectively, to use visitor instead of proprietary tree-walk logic (they both originally used find.find). Example 5-17 combines the original convertLines function (to fix end-of-lines in a single file) with visitor's tree walker class, to yield an alternative implementation of the line-end converter for directory trees.

Example 5-17. PP2E\PyTools\visitor_fixeoln.py
##############################################################
# Use: "python visitor_fixeoln.py todos|tounix".
# recode fixeoln_all.py as a visitor subclass: this version
# uses os.path.walk, not find.find to collext all names first;
# limited but fast: if os.path.splitext(fname)[1] in patts:
##############################################################
 
import visitor, sys, fnmatch, os
from fixeoln_dir import patts
from fixeoln_one import convertEndlines
 
class EolnFixer(visitor.FileVisitor):
    def visitfile(self, fullname):                        # match on basename
        basename = os.path.basename(fullname)             # to make result same
        for patt in patts:                                # else visits fewer 
            if fnmatch.fnmatch(basename, patt):
                convertEndlines(self.context, fullname)
                self.fcount = self.fcount + 1             # could break here
                                                          # but results differ
if __name__ == '__main__':
    walker = EolnFixer(sys.argv[1])
    walker.run(  )
    print 'Files matched (converted or not):', walker.fcount

As we saw in Chapter 2, the built-in fnmatch module performs Unix shell-like filename matching; this script uses it to match names to the previous version's filename patterns (simply looking for filename extensions after a "." is simpler, but not as general):

C:\temp\examples>python %X%/PyTools/visitor_fixeoln.py tounix 
. ...
Changing .\echoEnvironment.pyw
Changing .\Launcher.py
Changing .\Launch_PyGadgets.py
Changing .\Launch_PyDemos.pyw
 ...more deleted...
Changing .\PyTools\visitor_find.py
Changing .\PyTools\visitor_fixnames.py
Changing .\PyTools\visitor_find_quiet2.py
Changing .\PyTools\visitor_find_quiet1.py
Changing .\PyTools\fixeoln_one.doc.txt
Files matched (converted or not): 1065
 
C:\temp\examples>python %X%/PyTools/visitor_fixeoln.py tounix 
 ...more deleted...
.\Extend\Swig\Shadow ...
.\ ...
.\EmbExt\Exports ...
.\EmbExt\Exports\ClassAndMod ...
.\EmbExt\Regist ...
.\PyTools ...
Files matched (converted or not): 1065

If you run this script and the original fixeoln_all.py on the book examples tree, you'll notice that this version visits two fewer matched files. This simply reflects the fact that fixeoln_all also collects and skips over two directory names for its patterns in the find.find result (both called "Output"). In all other ways, this version works the same way even when it could do better -- adding a break statement after the convertEndlines call here avoids visiting files that appear redundantly in the original's find results lists.

The first command here takes roughly six seconds on my computer, and the second takes about four (there are no files to be converted). That's faster than the eight- and six-second figures for the original find.find-based version of this script, but they differ in amount of output, and benchmarks are usually much more subtle than you imagine. Most of the real clock time is likely spent scrolling text in the console, not doing any real directory processing. Since both are plenty fast for their intended purposes, finer-grained performance figures are left as exercises.

The script in Example 5-18 combines the original convertOne function (to rename a single file or directory) with the visitor's tree walker class, to create a directory tree-wide fix for uppercase filenames. Notice that we redefine both file and directory visitation methods here, as we need to rename both.

Example 5-18. PP2E\PyTools\visitor_fixnames.py
###############################################################
# recode fixnames_all.py name case fixer with the Visitor class
# note: "from fixnames_all import convertOne" doesn't help at 
# top-level of the fixnames class, since it is assumed to be a
# method and called with extra self argument (an exception);
###############################################################
 
from visitor import FileVisitor
 
class FixnamesVisitor(FileVisitor):
    """
    check filenames at and below startDir for uppercase
    """
    import fixnames_all
    def __init__(self, listonly=0):
        FileVisitor.__init__(self, listonly=listonly)
        self.ccount = 0
    def rename(self, pathname):
        if not self.listonly:
            convertflag = self.fixnames_all.convertOne(pathname)
            self.ccount = self.ccount + convertflag
    def visitdir(self, dirname):
        FileVisitor.visitdir(self, dirname)
        self.rename(dirname)
    def visitfile(self, filename):
        FileVisitor.visitfile(self, filename)
        self.rename(filename)
 
if __name__ == '__main__': 
    walker = FixnamesVisitor(  )
    walker.run(  )
    allnames = walker.fcount + walker.dcount
    print 'Converted %d files, visited %d' % (walker.ccount, allnames)

This version is run like the original find.find based version, fixnames_all, but visits one more name (the top-level root directory), and there is no initial delay while filenames are collected on a list -- we're using os.path.walk again, not find.find. It's also close to the original os.path.walk version of this script, but is based on a class hierarchy, not direct function callbacks:

C:\temp\examples>python %X%/PyTools/visitor_fixnames.py 
 ...more deleted...
303 => .\__init__.py
304 => .\__init__.pyc
305 => .\Ai\ExpertSystem\holmes.tar
306 => .\Ai\ExpertSystem\TODO
Convert dir=.\Ai\ExpertSystem file=TODO? (y|Y) 
307 => .\Ai\ExpertSystem\__init__.py
308 => .\Ai\ExpertSystem\holmes\cnv
309 => .\Ai\ExpertSystem\holmes\README.1ST
Convert dir=.\Ai\ExpertSystem\holmes file=README.1ST? (y|Y) 
 ...more deleted...
1353 => .\PyTools\visitor_find.pyc
1354 => .\PyTools\visitor_find_quiet1.py
1355 => .\PyTools\fixeoln_one.doc.txt
Converted 1 files, visited 1474

Both of these fixer scripts work roughly the same as the originals, but because the directory walking logic lives in just one file (visitor.py), it only needs to be debugged once. Moreover, improvements in that file will automatically be inherited by every directory-processing tool derived from its classes. Even when coding system-level scripts, reuse and reduced redundancy pay off in the end.

5.5.5 Fixing File Permissions in Trees

Just in case the preceding visitor-client sections weren't quite enough to convince you of the power of code reuse, another piece of evidence surfaced very late in this book project. It turns out that copying files off a CD using Windows drag-and-drop makes them read-only in the copy. That's less than ideal for the book examples directory on the enclosed CD (see http://examples.oreilly.com/python2) -- you must copy the directory tree onto your hard drive to be able to experiment with program changes (naturally, files on CD can't be changed in place). But if you copy with drag-and-drop, you may wind up with a tree of over 1000 read-only files.

Since drag-and-drop is perhaps the most common way to copy off a CD on Windows, I needed a portable and easy-to-use way to undo the read-only setting. Asking readers to make these all writable by hand would be impolite to say the least. Writing a full-blown install system seemed like overkill. Providing different fixes for different platforms doubles or triples the complexity of the task.

Much better, the Python script in Example 5-19 can be run in the root of the copied examples directory to repair the damage of a read-only drag-and-drop operation. It specializes the traversal implemented by the FileVisitor class again -- this time to run an os.chmod call on every file and directory visited along the way.

Example 5-19. PP2E\PyTools\fixreadonly-all.py
#!/usr/bin/env python
###############################################################
# Use: python PyTools\fixreadonly-all.py 
# run this script in the top-level examples directory after
# copying all examples off the book's CD-ROM, to make all 
# files writeable again--by default, copying files off the 
# CD with Windows drag-and-drop (at least) creates them as 
# read-only on your hard drive; this script traverses entire 
# dir tree at and below the dir it is run in (all subdirs);
###############################################################
 
import os, string
from PP2E.PyTools.visitor import FileVisitor          # os.path.walk wrapper
listonly = 0
 
class FixReadOnly(FileVisitor):
    def __init__(self, listonly=0):    
        FileVisitor.__init__(self, listonly=listonly) 
    def visitDir(self, dname):
        FileVisitor.visitfile(self, fname)
        if self.listonly:
            return
        os.chmod(dname, 0777)
    def visitfile(self, fname):                     
        FileVisitor.visitfile(self, fname)
        if self.listonly:
            return
        os.chmod(fname, 0777)
 
if __name__ == '__main__':
    # don't run auto if clicked
    go = raw_input('This script makes all files writeable; continue?') 
    if go != 'y':
        raw_input('Canceled - hit enter key')
    else:
        walker = FixReadOnly(listonly)
        walker.run(  )
        print 'Visited %d files and %d dirs' % (walker.fcount, walker.dcount)

As we saw in Chapter 2, the built-in os.chmod call changes the permission settings on an external file (here, to 0777 -- global read, write, and execute permissions). Because os.chmod and the FileVisitor's operations are portable, this same script will work to set permissions in an entire tree on both Windows and Unix-like platforms. Notice that it asks whether you really want to proceed when it first starts up, just in case someone accidentally clicks the file's name in an explorer GUI. Also note that Python must be installed before this script can be run to make files writable; that seems a fair assumption to make of users about to change Python scripts.

C:\temp\examples>python PyTools\fixreadonly-all.py 
This script makes all files writeable; continue?
. ...
1 => .\autoexec.bat
2 => .\cleanall.csh
3 => .\echoEnvironment.pyw
 ...more deleted...
1352 => .\PyTools\visitor_find.pyc
1353 => .\PyTools\visitor_find_quiet1.py
1354 => .\PyTools\fixeoln_one.doc.txt
Visited 1354 files and 119 dirs

5.6 Copying Directory Trees

The next three sections conclude this chapter by exploring a handful of additional utilities for processing directories (a.k.a. "folders") on your computer with Python. They present directory copy, deletion, and comparison scripts that demonstrate system tools at work. All of these were born of necessity, are generally portable among all Python platforms, and illustrate Python development concepts along the way.

Some of these scripts do something too unique for the visitor module's classes we've been applying in early sections of this chapter, and so require more custom solutions (e.g., we can't remove directories we intend to walk through). Most have platform-specific equivalents too (e.g., drag-and-drop copies), but the Python utilities shown here are portable, easily customized, callable from other scripts, and surprisingly fast.

5.6.1 A Python Tree Copy Script

My CD writer sometimes does weird things. In fact, copies of files with odd names can be totally botched on the CD, even though other files show up in one piece. That's not necessarily a show-stopper -- if just a few files are trashed in a big CD backup copy, I can always copy the offending files to floppies one at a time. Unfortunately, Windows drag-and-drop copies don't play nicely with such a CD: the copy operation stops and exits the moment the first bad file is encountered. You only get as many files as were copied up to the error, but no more.

There may be some magical Windows setting to work around this feature, but I gave up hunting for one as soon as I realized that it would be easier to code a copier in Python. The cpall.py script in Example 5-20 is one way to do it. With this script, I control what happens when bad files are found -- skipping over them with Python exception handlers, for instance. Moreover, this tool works with the same interface and effect on other platforms. It seems to me, at least, that a few minutes spent writing a portable and reusable Python script to meet a need is a better investment than looking for solutions that work on only one platform (if at all).

Example 5-20. PP2E\System\Filetools\cpall.py
#########################################################
# Usage: "python cpall.py dir1 dir2".
# Recursive copy of a directory tree. Works like a 
# unix "cp -r dirFrom/* dirTo" command, and assumes 
# that dirFrom and dirTo are both directories.  Was
# written to get around fatal error messages under 
# Windows drag-and-drop copies (the first bad file 
# ends the entire copy operation immediately), but 
# also allows you to customize copy operations.
# May need more on Unix--skip links, fifos, etc.  
#########################################################
 
import os, sys
verbose = 0
dcount = fcount = 0
maxfileload = 100000
blksize = 1024 * 8
 
def cpfile(pathFrom, pathTo, maxfileload=maxfileload):
    """
    copy file pathFrom to pathTo, byte for byte
    """
    if os.path.getsize(pathFrom) <= maxfileload:
        bytesFrom = open(pathFrom, 'rb').read(  )   # read small file all at once
        open(pathTo, 'wb').write(bytesFrom)       # need b mode on Windows
    else:
        fileFrom = open(pathFrom, 'rb')           # read big files in chunks
        fileTo   = open(pathTo,   'wb')           # need b mode here too 
        while 1:
            bytesFrom = fileFrom.read(blksize)    # get one block, less at end
            if not bytesFrom: break               # empty after last chunk
            fileTo.write(bytesFrom)
 
def cpall(dirFrom, dirTo):
    """
    copy contents of dirFrom and below to dirTo
    """
    global dcount, fcount
    for file in os.listdir(dirFrom):                      # for files/dirs here
        pathFrom = os.path.join(dirFrom, file)
        pathTo   = os.path.join(dirTo,   file)            # extend both paths
        if not os.path.isdir(pathFrom):                   # copy simple files
            try:
                if verbose > 1: print 'copying', pathFrom, 'to', pathTo
                cpfile(pathFrom, pathTo)
                fcount = fcount+1
            except:
                print 'Error copying', pathFrom, to, pathTo, '--skipped'
                print sys.exc_type, sys.exc_value
        else:
            if verbose: print 'copying dir', pathFrom, 'to', pathTo
            try:
                os.mkdir(pathTo)                          # make new subdir
                cpall(pathFrom, pathTo)                   # recur into subdirs
                dcount = dcount+1
            except:
                print 'Error creating', pathTo, '--skipped'
                print sys.exc_type, sys.exc_value
 
def getargs(  ):
    try:
        dirFrom, dirTo = sys.argv[1:]
    except:
        print 'Use: cpall.py dirFrom dirTo'
    else:
        if not os.path.isdir(dirFrom):
            print 'Error: dirFrom is not a directory'
        elif not os.path.exists(dirTo):
            os.mkdir(dirTo)
            print 'Note: dirTo was created'
            return (dirFrom, dirTo)
        else:
            print 'Warning: dirTo already exists'
            if dirFrom == dirTo or (hasattr(os.path, 'samefile') and
                                    os.path.samefile(dirFrom, dirTo)):
                print 'Error: dirFrom same as dirTo'
            else:
                return (dirFrom, dirTo)
 
if __name__ == '__main__':
    import time
    dirstuple = getargs(  )
    if dirstuple: 
        print 'Copying...'
        start = time.time(  )
        apply(cpall, dirstuple)
        print 'Copied', fcount, 'files,', dcount, 'directories',
        print 'in', time.time(  ) - start, 'seconds'

This script implements its own recursive tree traversal logic, and keeps track of both the "from" and "to" directory paths as it goes. At every level, it copies over simple files, creates directories in the "to" path, and recurs into subdirectories with "from" and "to" paths extended by one level. There are other ways to code this task (e.g., other cpall variants on the book's CD change the working directory along the way with os.chdir calls), but extending paths on descent works well in practice.

Notice this script's reusable cpfile function -- just in case there are multigigabyte files in the tree to be copied, it uses a file's size to decide whether it should be read all at once or in chunks (remember, the file read method without arguments really loads the while file into an in-memory string). Also note that this script creates the "to" directory if needed, but assumes it is empty when a copy starts up; be sure to remove the target directory before copying a new tree to its name (more on this in the next section).

Here is a big book examples tree copy in action on Windows; pass in the name of the "from" and "to" directories to kick off the process, and run a rm shell command (or similar platform-specific tool) to delete the target directory first:

C:\temp>rm -rf cpexamples
 
C:\temp>python %X%\system\filetools\cpall.py examples cpexamples
Note: dirTo was created
Copying...
Copied 1356 files, 118 directories in 2.41999995708 seconds
 
C:\temp>fc /B examples\System\Filetools\cpall.py
              cpexamples\System\Filetools\cpall.py
Comparing files examples\System\Filetools\cpall.py and 
cpexamples\System\Filetools\cpall.py
FC: no differences encountered

This run copied a tree of 1356 files and 118 directories in 2.4 seconds on my 650 MHz Windows 98 laptop (the built-in time.time call can be used to query the system time in seconds). It runs a bit slower if programs like MS Word are open on the machine, and may run arbitrarily faster or slower for you. Still, this is at least as fast as the best drag-and-drop I've timed on Windows.

So how does this script work around bad files on a CD backup? The secret is that it catches and ignores file exceptions, and keeps walking. To copy all the files that are good on a CD, I simply run a command line like this:

C:\temp>python %X%\system\filetools\cpall_visitor.py 
                            g:\PP2ndEd\examples\PP2E cpexamples

Because the CD is addressed as "G:" on my Windows machine, this is the command-line equivalent of drag-and-drop copying from an item in the CD's top-level folder, except that the Python script will recover from errors on the CD and get the rest. In general, cpall can be passed any absolute directory path on your machine -- even ones that mean devices like CDs. To make this go on Linux, try a root directory like /dev/cdrom to address your CD drive.

5.6.2 Recoding Copies with a Visitor-Based Class

When I first wrote the cpall script just discussed, I couldn't see a way that the visitor class hierarchy we met earlier would help -- two directories needed to be traversed in parallel (the original and the copy), and visitor is based on climbing one tree with os.path.walk. There seemed no easy way to keep track of where the script is at in the copy directory.

The trick I eventually stumbled onto is to not keep track at all. Instead, the script in Example 5-21 simply replacesthe "from" directory path string with the "to" directory path string, at the front of all directory and pathnames passed-in from os.path.walk. The results of the string replacements are the paths that the original files and directories are to be copied to.

Example 5-21. PP2E\System\Filetools\cpall_visitor.py
###########################################################
# Use: "python cpall_visitor.py fromDir toDir"
# cpall, but with the visitor classes and os.path.walk;
# the trick is to do string replacement of fromDir with
# toDir at the front of all the names walk passes in;
# assumes that the toDir does not exist initially;
###########################################################
 
import os
from PP2E.PyTools.visitor import FileVisitor
from cpall import cpfile, getargs
verbose = 1
 
class CpallVisitor(FileVisitor):
    def __init__(self, fromDir, toDir):
        self.fromDirLen = len(fromDir) + 1
        self.toDir      = toDir
        FileVisitor.__init__(self)
    def visitdir(self, dirpath):
        toPath = os.path.join(self.toDir, dirpath[self.fromDirLen:])
        if verbose: print 'd', dirpath, '=>', toPath
        os.mkdir(toPath)
        self.dcount = self.dcount + 1
    def visitfile(self, filepath):
        toPath = os.path.join(self.toDir, filepath[self.fromDirLen:])
        if verbose: print 'f', filepath, '=>', toPath
        cpfile(filepath, toPath)
        self.fcount = self.fcount + 1
 
if __name__ == '__main__':
    import sys, time
    fromDir, toDir = sys.argv[1:3]
    if len(sys.argv) > 3: verbose = 0 
    print 'Copying...'
    start = time.time(  )
    walker = CpallVisitor(fromDir, toDir)
    walker.run(startDir=fromDir)
    print 'Copied', walker.fcount, 'files,', walker.dcount, 'directories',
    print 'in', time.time(  ) - start, 'seconds'

This version accomplishes roughly the same goal as the original, but has made a few assumptions to keep code simple -- the "to" directory is assumed to not exist initially, and exceptions are not ignored along the way. Here it is copying the book examples tree again on Windows:

C:\temp>rm -rf cpexamples
 
C:\temp>python %X%\system\filetools\cpall_visitor.py 
                                           examples cpexamples -quiet
Copying...
Copied 1356 files, 119 directories in 2.09000003338 seconds
 
C:\temp>fc /B examples\System\Filetools\cpall.py 
              cpexamples\System\Filetools\cpall.py
Comparing files examples\System\Filetools\cpall.py and 
cpexamples\System\Filetools\cpall.py
FC: no differences encountered

Despite the extra string slicing going on, this version runs just as fast as the original. For tracing purposes, this version also prints all the "from" and "to" copy paths during the traversal, unless you pass in a third argument on the command line, or set the script's verbose variable to 0:

C:\temp>python %X%\system\filetools\cpall_visitor.py examples cpexamples 
Copying...
d examples => cpexamples\
f examples\autoexec.bat => cpexamples\autoexec.bat
f examples\cleanall.csh => cpexamples\cleanall.csh
 ...more deleted...
d examples\System => cpexamples\System
f examples\System\System.txt => cpexamples\System\System.txt
f examples\System\more.py => cpexamples\System\more.py
f examples\System\reader.py => cpexamples\System\reader.py
 ...more deleted...
Copied 1356 files, 119 directories in 2.31000006199 seconds

5.7 Deleting Directory Trees

Both of the copy scripts in the last section work as planned, but they aren't very forgiving of existing directory trees. That is, they implicitly assume that the "to" target directory is either empty or doesn't exist at all, and fail badly if that isn't the case. Presumably, you will first somehow delete the target directory on your machine. For my purposes, that was a reasonable assumption to make.

The copiers could be changed to work with existing "to" directories too (e.g., ignore os.mkdir exceptions), but I prefer to start from scratch when copying trees; you never know what old garbage might be laying around in the "to" directory. So when testing the copies above, I was careful to run a rm -rf cpexamples command line to recursively delete the entire cpexamples directory tree before copying another tree to that name.

Unfortunately, the rm command used to clear the target directory is really a Unix utility that I installed on my PC from a commercial package; it probably won't work on your computer. There are other platform-specific ways to delete directory trees (e.g., deleting a folder's icon in a Windows explorer GUI), but why not do it once in Python for every platform? Example 5-22 deletes every file and directory at and below a passed-in directory's name. Because its logic is packaged as a function, it is also an importable utility that can be run from other scripts. Because it is pure Python code, it is a cross-platform solution for tree removal.

Example 5-22. PP2E\System\Filetools\rmall.py
#!/usr/bin/python
################################################################
# Use: "python rmall.py directoryPath directoryPath..."
# recursive directory tree deletion: removes all files and 
# directories at and below directoryPaths; recurs into subdirs
# and removes parent dir last, because os.rmdir requires that 
# directory is empty; like a Unix "rm -rf directoryPath" 
################################################################ 
 
import sys, os
fcount = dcount = 0
 
def rmall(dirPath):                             # delete dirPath and below
    global fcount, dcount
    namesHere = os.listdir(dirPath)
    for name in namesHere:                      # remove all contents first
        path = os.path.join(dirPath, name)
        if not os.path.isdir(path):             # remove simple files
            os.remove(path)
            fcount = fcount + 1
        else:                                   # recur to remove subdirs
            rmall(path)
    os.rmdir(dirPath)                           # remove now-empty dirPath
    dcount = dcount + 1
 
if __name__ == '__main__':
    import time
    start = time.time(  )
    for dname in sys.argv[1:]: rmall(dname)
    tottime = time.time(  ) - start
    print 'Removed %d files and %d dirs in %s secs' % (fcount, dcount, tottime)

The great thing about coding this sort of tool in Python is that it can be run with the same command-line interface on any machine where Python is installed. If you don't have a rm -rf type command available on your Windows, Unix, or Macintosh computer, simply run the Python rmall script instead:

C:\temp>python %X%\System\Filetools\cpall.py examples cpexamples
Note: dirTo was created
Copying...
Copied 1379 files, 121 directories in 2.68999993801 seconds
 
C:\temp>python %X%\System\Filetools\rmall.py cpexamples
Removed 1379 files and 122 dirs in 0.549999952316 secs
 
C:\temp>ls cpexamples
ls: File or directory "cpexamples" is not found

Here, the script traverses and deletes a tree of 1379 files and 122 directories in about half a second -- substantially impressive for a noncompiled programming language, and roughly equivalent to the commercial rm -rf program I purchased and installed on my PC.

One subtlety here: this script must be careful to delete the contents of a directory before deleting the directory itself -- the os.rmdir call mandates that directories must be empty when deleted (and throws an exception if they are not). Because of that, the recursive calls on subdirectories need to happen before the os.mkdir call. Computer scientists would recognize this as a postorder, depth-first tree traversal, since we process parent directories after their children. This also makes any traversals based on os.path.walk out of the question: we need to return to a parent directory to delete it after visiting its descendents.

To illustrate, let's run interactive os.remove and os.rmdir calls on a cpexample directory containing files or nested directories:

>>> os.path.isdir('cpexamples')
1
>>> os.remove('cpexamples')
Traceback (innermost last):
  File "<stdin>", line 1, in ?
OSError: [Errno 2] No such file or directory: 'cpexamples'
>>> os.rmdir('cpexamples')
Traceback (innermost last):
  File "<stdin>", line 1, in ?
OSError: [Errno 13] Permission denied: 'cpexamples'

Both calls always fail if the directory is not empty. But now, delete the contents of cpexamples in another window and try again:

>>> os.path.isdir('cpexamples')
1
>>> os.remove('cpexamples')
Traceback (innermost last):
  File "<stdin>", line 1, in ?
OSError: [Errno 2] No such file or directory: 'cpexamples'
>>> os.rmdir('cpexamples')
>>> os.path.exists('cpexamples')
0

The os.remove still fails -- it's only meant for deleting nondirectory items -- but os.rmdir now works because the directory is empty. The upshot of this is that a tree deletion traversal must generally remove directories "on the way out."

5.7.1 Recoding Deletions for Generality

As coded, the rmall script only processes directory names and fails if given names of simple files, but it's trivial to generalize the script to eliminate that restriction. The recoding in Example 5-23 accepts an arbitrary command-line list of file and directory names, deletes simple files, and recursively deletes directories.

Example 5-23. PP2E\System\Filetools\rmall2.py
#!/usr/bin/python
################################################################
# Use: "python rmall2.py fileOrDirPath fileOrDirPath..."
# like rmall.py, alternative coding, files okay on cmd line
################################################################ 
 
import sys, os
fcount = dcount = 0
 
def rmone(pathName):
    global fcount, dcount
    if not os.path.isdir(pathName):               # remove simple files
        os.remove(pathName)
        fcount = fcount + 1
    else:                                         # recur to remove contents
        for name in os.listdir(pathName):
            rmone(os.path.join(pathName, name))
        os.rmdir(pathName)                        # remove now-empty dirPath
        dcount = dcount + 1
 
if __name__ == '__main__':
    import time
    start = time.time(  )
    for name in sys.argv[1:]: rmone(name)
    tottime = time.time(  ) - start
    print 'Removed %d files and %d dirs in %s secs' % (fcount, dcount, tottime)

This shorter version runs the same, and just as fast, as the original:

C:\temp>python %X%\System\Filetools\cpall.py examples cpexamples
Note: dirTo was created
Copying...
Copied 1379 files, 121 directories in 2.52999997139 seconds
 
C:\temp>python %X%\System\Filetools\rmall2.py cpexamples
Removed 1379 files and 122 dirs in 0.550000071526 secs
 
C:\temp>ls cpexamples
ls: File or directory "cpexamples" is not found

but can also be used to delete simple files:

C:\temp>python %X%\System\Filetools\rmall2.py spam.txt eggs.txt
Removed 2 files and 0 dirs in 0.0600000619888 secs
 
C:\temp>python %X%\System\Filetools\rmall2.py spam.txt eggs.txt cpexamples
Removed 1381 files and 122 dirs in 0.630000042915 secs

As usual, there is more than one way to do it in Python (though you'll have to try harder to find many spurious ways). Notice that these scripts trap no exceptions; in programs designed to blindly delete an entire directory tree, exceptions are all likely to denote truly bad things. We could get more fancy, and support filename patterns by using the built-in fnmatch module along the way too, but this was beyond the scope of these script's goals (for pointers on matching, see Example Example 5-17, and also find.py in Chapter 2).

5.8 Comparing Directory Trees

Engineers can be a paranoid sort (but you didn't hear that from me). At least I am. It comes from decades of seeing things go terribly wrong, I suppose. When I create a CD backup of my hard drive, for instance, there's still something a bit too magical about the process to trust the CD writer program to do the right thing. Maybe I should, but it's tough to have a lot of faith in tools that occasionally trash files, and seem to crash my Windows 98 machine every third Tuesday of the month. When push comes to shove, it's nice to be able to verify that data copied to a backup CD is the same as the original -- or at least spot deviations from the original -- as soon as possible. If a backup is ever needed, it will be really needed.

Because data CDs are accessible as simple directory trees, we are once again in the realm of tree walkers -- to verify a backup CD, we simply need to walk its top-level directory. We've already written a generic walker class (the visitor module), but it won't help us here directly: we need to walk two directories in parallel and inspect common files along the way. Moreover, walking either one of the two directories won't allow us to spot files and directories that only exist in the other. Something more custom seems in order here.

5.8.1 Finding Directory Differences

Before we start coding, the first thing we need to clarify is what it means to compare two directory trees. If both trees have exactly the same branch structure and depth, this problem reduces to comparing corresponding files in each tree. In general, though, the trees can have arbitrarily different shapes, depth, and so on.

More generally, the contents of a directory in one tree may have more or fewer entries than the corresponding directory in the other tree. If those differing contents are filenames, there is no corresponding file to compare; if they are directory names, there is no corresponding branch to descend through. In fact, the only way to detect files and directories that appear in one tree but not the other is to detect differences in each level's directory.

In other words, a tree comparison algorithm will also have to perform directory comparisons along the way. Because this is a nested, and simpler operation, let's start by coding a single-directory comparison of filenames in Example 5-24.

Example 5-24. PP2E\System\Filetools\dirdiff.py
#!/bin/env python
########################################################
# use: python dirdiff.py dir1-path dir2-path
# compare two directories to find files that exist 
# in one but not the other;  this version uses the
# os.listdir function and list difference;  note 
# that this script only checks filename, not file
# contents--see diffall.py for an extension that
# does the latter by comparing .read(  ) results;
########################################################
 
import os, sys
 
def reportdiffs(unique1, unique2, dir1, dir2):
    if not (unique1 or unique2):
        print 'Directory lists are identical'
    else:
        if unique1:
            print 'Files unique to', dir1 
            for file in unique1: 
                print '...', file        
        if unique2:
            print 'Files unique to', dir2 
            for file in unique2: 
                print '...', file        
 
def unique(seq1, seq2):
    uniques = []                      # return items in seq1 only
    for item in seq1:
        if item not in seq2:
            uniques.append(item)
    return uniques
 
def comparedirs(dir1, dir2):
    print 'Comparing', dir1, 'to', dir2
    files1  = os.listdir(dir1)
    files2  = os.listdir(dir2)
    unique1 = unique(files1, files2)
    unique2 = unique(files2, files1)
    reportdiffs(unique1, unique2, dir1, dir2)
    return not (unique1 or unique2)               # true if no diffs
 
def getargs(  ):
    try:
        dir1, dir2 = sys.argv[1:]                 # 2 command-line args
    except:
        print 'Usage: dirdiff.py dir1 dir2'
        sys.exit(1)
    else:
        return (dir1, dir2)
 
if __name__ == '__main__':
    dir1, dir2 = getargs(  )
    comparedirs(dir1, dir2)

Given listings of names in two directories, this script simply picks out unique names in the first, unique names in the second, and reports any unique names found as differences (that is, files in one directory but not the other). Its comparedirs function returns a true result if no differences were found -- useful for detecting differences in callers.

Let's run this script on a few directories; differences are detected and reported as names unique in either passed-in directory pathname. Notice that this is only a structural comparison that just checks names in listings, not file contents (we'll add the latter in a moment):

C:\temp>python %X%\system\filetools\dirdiff.py examples cpexamples
Comparing examples to cpexamples
Directory lists are identical
 
C:\temp>python %X%\system\filetools\dirdiff.py 
                                  examples\PyTools cpexamples\PyTools
Comparing examples\PyTools to cpexamples\PyTools
Files unique to examples\PyTools
... visitor.py
 
C:\temp>python %X%\system\filetools\dirdiff.py 
                                  examples\System\Filetools
                                  cpexamples\System\Filetools
Comparing examples\System\Filetools to cpexamples\System\Filetools
Files unique to examples\System\Filetools
... dirdiff2.py
Files unique to cpexamples\System\Filetools
... cpall.py

The unique function is the heart of this script: it performs a simple list difference operation. Here's how it works apart from the rest of this script's code:

>>> L1 = [1, 3, 5, 7, 9]
>>> L2 = [2, 3, 6, 8, 9]
>>> from dirdiff import unique
>>> unique(L1, L2)                  # items in L1 but not in L2
[1, 5, 7]
>>> unique(L2, L1)                  # items in L2 but not in L1
[2, 6, 8]

These two lists have objects 3 and 9 in common; the rest appear only in one of the two. When applied to directories, unique items represent tree differences, and common items are names of files or subdirectories that merit further comparisons or traversal. There are other ways to check this code; see the dirdiff variants in the book's CD for a few.

5.8.2 Finding Tree Differences

Now all we need is a tree walker that applies dirdiff at each level to pick out unique files and directories, explicitly compares the contents of files in common, and descends through directories in common. Example 5-25 fits the bill.

Example 5-25. PP2E\System\Filetools\diffall.py
#########################################################
# Usage: "python diffall.py dir1 dir2".
# recursive tree comparison--report files that exist 
# in only dir1 or dir2, report files of same name in 
# dir1 and dir2 with differing contents, and do the 
# same for all subdirectories of the same names in 
# and below dir1 and dir2; summary of diffs appears 
# at end of output (but search redirected output for 
# "DIFF" and "unique" strings for further details);
#########################################################
 
import os, dirdiff
 
def intersect(seq1, seq2):
    commons = []               # items in seq1 and seq2
    for item in seq1:
        if item in seq2:
            commons.append(item)
    return commons
 
def comparedirs(dir1, dir2, diffs, verbose=0):
    # compare filename lists
    print '-'*20
    if not dirdiff.comparedirs(dir1, dir2):
        diffs.append('unique files at %s - %s' % (dir1, dir2))
 
    print 'Comparing contents'
    files1 = os.listdir(dir1)
    files2 = os.listdir(dir2)
    common = intersect(files1, files2)
 
    # compare contents of files in common
    for file in common:
        path1 = os.path.join(dir1, file)
        path2 = os.path.join(dir2, file)
        if os.path.isfile(path1) and os.path.isfile(path2):
            bytes1 = open(path1, 'rb').read(  )
            bytes2 = open(path2, 'rb').read(  )
            if bytes1 == bytes2:
                if verbose: print file, 'matches'
            else:
                diffs.append('files differ at %s - %s' % (path1, path2))
                print file, 'DIFFERS'
 
    # recur to compare directories in common
    for file in common:
        path1 = os.path.join(dir1, file)
        path2 = os.path.join(dir2, file)
        if os.path.isdir(path1) and os.path.isdir(path2):
            comparedirs(path1, path2, diffs, verbose)
 
if __name__ == '__main__':
    dir1, dir2 = dirdiff.getargs(  )
    mydiffs = []
    comparedirs(dir1, dir2, mydiffs)        # changes mydiffs in-place
    print '='*40                            # walk, report diffs list
    if not mydiffs:
        print 'No diffs found.'
    else:
        print 'Diffs found:', len(mydiffs)
        for diff in mydiffs: print '-', diff

At each directory in the tree, this script simply runs the dirdiff tool to detect unique names, and then compares names in common by intersecting directory lists. Since we've already studied the tree-walking tools this script employs, let's jump right into a few example runs. When run on identical trees, status messages scroll during the traversal, and a "No diffs found" message appears at the end:

C:\temp>python %X%\system\filetools\diffall.py examples cpexamples 
--------------------
Comparing examples to cpexamples
Directory lists are identical
Comparing contents
--------------------
Comparing examples\old_Part2 to cpexamples\old_Part2
Directory lists are identical
Comparing contents
--------------------
 ...more lines deleted...
--------------------
Comparing examples\EmbExt\Regist to cpexamples\EmbExt\Regist
Directory lists are identical
Comparing contents
--------------------
Comparing examples\PyTools to cpexamples\PyTools
Directory lists are identical
Comparing contents
========================================
No diffs found.

To show how differences are reported, we need to generate a few. Let's run the global search-and-replace script we met earlier, to change a few files scattered about one of the trees (see -- I told you global replacement could trash your files!):

C:\temp\examples>python %X%\PyTools\visitor_replace.py -exec SPAM
Are you sure?y
...
1355 => .\PyTools\visitor_find_quiet1.py
1356 => .\PyTools\fixeoln_one.doc.txt
Visited 1356 files
Changed 8 files:
.\package.csh
.\README-PP2E.txt
.\readme-old-pp1E.txt
.\temp
.\remp
.\Internet\Cgi-Web\fixcgi.py
.\System\Processes\output.txt
.\PyTools\cleanpyc.py

While we're at it, let's remove a few common files so directory uniqueness differences show up on the scope too; the following three removal commands will make two directories differ (the last two commands impact the same directory in different trees):

C:\temp>rm cpexamples\PyTools\visitor.py
C:\temp>rm cpexamples\System\Filetools\dirdiff2.py
C:\temp>rm examples\System\Filetools\cpall.py

Now, rerun the comparison walker to pick out differences, and pipe its output report to a file for easy inspection. The following lists just the parts of the output report that identify differences. In typical use, I inspect the summary at the bottom of the report first, and then search for strings "DIFF" and "unique" in the report's text if I need more information about the differences summarized:

C:\temp>python %X%\system\filetools\diffall.py examples cpexamples > diffs
C:\temp>type diffs
--------------------
Comparing examples to cpexamples
Directory lists are identical
Comparing contents
package.csh DIFFERS
README-PP2E.txt DIFFERS
readme-old-pp1E.txt DIFFERS
temp DIFFERS
remp DIFFERS
--------------------
Comparing examples\old_Part2 to cpexamples\old_Part2
Directory lists are identical
Comparing contents
--------------------
...
--------------------
Comparing examples\Internet\Cgi-Web to cpexamples\Internet\Cgi-Web
Directory lists are identical
Comparing contents
fixcgi.py DIFFERS
--------------------
...
--------------------
Comparing examples\System\Filetools to cpexamples\System\Filetools
Files unique to examples\System\Filetools
... dirdiff2.py
Files unique to cpexamples\System\Filetools
... cpall.py
Comparing contents
--------------------
...
--------------------
Comparing examples\System\Processes to cpexamples\System\Processes
Directory lists are identical
Comparing contents
output.txt DIFFERS
--------------------
...
--------------------
Comparing examples\PyTools to cpexamples\PyTools
Files unique to examples\PyTools
... visitor.py
Comparing contents
cleanpyc.py DIFFERS
========================================
Diffs found: 10
- files differ at examples\package.csh - cpexamples\package.csh
- files differ at examples\README-PP2E.txt - cpexamples\README-PP2E.txt
- files differ at examples\readme-old-pp1E.txt - cpexamples\readme-old-pp1E.txt
- files differ at examples\temp - cpexamples\temp
- files differ at examples\remp - cpexamples\remp
- files differ at examples\Internet\Cgi-Web\fixcgi.py - 
                       cpexamples\Internet\Cgi-Web\fixcgi.py
- unique files at examples\System\Filetools - 
                       cpexamples\System\Filetools
- files differ at examples\System\Processes\output.txt - 
                       cpexamples\System\Processes\output.txt
- unique files at examples\PyTools - cpexamples\PyTools
- files differ at examples\PyTools\cleanpyc.py - cpexamples\PyTools\cleanpyc.py

I added line breaks and tabs in a few of these output lines to make them fit on this page, but the report is simple to understand. Ten differences were found -- the eight files we changed (trashed) with the replacement script, and the two directories we threw out of sync with the three rm remove commands.

5.8.2.1 Verifying CD backups

So how does this script placate CD backup paranoia? To double-check my CD writer's work, I run a command like the following. I can also use a command like this to find out what has been changed since the last backup. Again, since the CD is "G:" on my machine when plugged in, I provide a path rooted there; use a root such as /dev/cdrom on Linux:

C:\temp>python %X%\system\filetools\diffall.py  
                    examples g:\PP2ndEd\examples\PP2E > exdiffs091500 
 
C:\temp>more exdiffs091500 
--------------------
Comparing examples to g:\PP2ndEd\examples\PP2E
Files unique to examples
... .cshrc
Comparing contents
tounix.py DIFFERS
--------------------
Comparing examples\old_Part2 to g:\PP2ndEd\examples\PP2E\old_Part2
Directory lists are identical
Comparing contents
--------------------
 ...more
visitor_fixeoln.py DIFFERS
visitor_fixnames.py DIFFERS
========================================
Diffs found: 41
- unique files at examples - g:\PP2ndEd\examples\PP2E
- files differ at examples\tounix.py - g:\PP2ndEd\examples\PP2E\tounix.py
 ...more

The CD spins, the script compares, and a summary of 41 differences appears at the end of the report (in this case, representing changes to the examples tree since the latest backup CD was burned). For an example of a full difference report, see file exdiffs091500 on the book's CD. More typically, this is what turns up for most of my example backups -- files with a leading "." are not copied to the CD:

C:\temp>python %X%\System\Filetools\diffall.py 
                    examples g:\PP2ndEd\examples\PP2E
...
--------------------
Comparing examples\Config to g:\PP2ndEd\examples\PP2E\Config
Files unique to examples\Config
... .cshrc
Comparing contents
========================================
Diffs found: 1
- unique files at examples\Config - g:\PP2ndEd\examples\PP2E\Config

And to really be sure, I run the following global comparison command against the true book directory, to verify the entire book tree backup on CD:

C:\>python %X%\System\Filetools\diffall.py PP2ndEd G:\PP2ndEd 
--------------------
Comparing PP2ndEd to G:\PP2ndEd
Files unique to G:\PP2ndEd
... examples.tar.gz
Comparing contents
README.txt DIFFERS
--------------------
 ...more
--------------------
Comparing PP2ndEd\examples\PP2E\Config to G:\PP2ndEd\examples\PP2E\Config
Files unique to PP2ndEd\examples\PP2E\Config
... .cshrc
Comparing contents
--------------------
 ...more
--------------------
Comparing PP2ndEd\chapters to G:\PP2ndEd\chapters
Directory lists are identical
Comparing contents
ch01-intro.doc DIFFERS
ch04-os3.doc DIFFERS
ch05-gui1.doc DIFFERS
ch06-gui2.doc DIFFERS
--------------------
 ...more
========================================
Diffs found: 11
- unique files at PP2ndEd - G:\PP2ndEd
- files differ at PP2ndEd\README.txt - G:\PP2ndEd\README.txt
 ...more

This particular run indicates that I've changed a "readme" file, four chapter files, and a bunch more since the last backup; if run immediately after making a backup, only the .cshrc unique file shows up on diffall radar. This global comparison can take a few minutes -- it performs byte-for-byte comparisons of all chapter files and screen shots, the examples tree, an image of the book's CD, and more, but it's an accurate and complete verification. Given that this book tree contained roughly 119M of data in 7300 files and 570 directories the last time I checked, a more manual verification procedure without Python's help would be utterly impossible.

Finally, it's worth noting that this script still only detects differences in the tree, but does not give any further details about individual file differences. In fact, it simply loads and compares the binary contents of corresponding files with a single string comparison -- it's a simple yes/no result.[11] If and when I need more details about how two reported files actually differ, I either edit the files, or run the file-comparison command on the host platform (e.g., fc on Windows/DOS, diff or cmp on Unix and Linux). That's not a portable solution for this last step; but for my purposes, just finding the differences in a 1300-file tree was much more critical than reporting which lines differ in files flagged in the report.

Of course, since we can always run shell commands in Python, this last step could be automated by spawning a diff or fc command with os.popen as differences are encountered (or after the traversal, by scanning the report summary). Because Python excels at processing files and strings, though, it's possible to go one step further and code a Python equivalent of the fc and diff commands. Since this is beyond both this script's scope and this chapter's size limits, that will have to await the attention of a curious reader.

[1] In fact, see the files old_todos.py, old_tounix.py, and old_toboth.py in the PyTools directory on the examples CD (see http://examples.oreilly.com/python2) for a complete earlier implementation built around string.replace. It was repeatable for to-Unix changes, but not for to-DOS conversion (only the latter may add characters). The fixeoln scripts here were developed as a replacement, after I got burned by running to-DOS conversions twice. [back]

[2] But wait -- it gets worse. Because of the auto-deletion and insertion of \r characters in Windows text mode, we might simply read and write files in text mode to perform the "todos" line conversion when run on Windows; the file interface will automatically add the \r on output if it's missing. However, this fails for other usage modes -- "tounix" conversions on Windows (only binary writes can omit the \r), and "todos" when running on Unix (no \r is inserted). Magic is not always our friend. [back]

[3] Recall that the home directory of a running script is always added to the front of sys.path to give the script import visibility to other files in the script's directory. Because of that, this script would normally load the PP2E\PyTools\find.py module anyhow (not the one in the Python library), by just saying import find; it need not specify the full package path in the import. The try handler and full path import are useful here only if this script is moved to a different source directory. Since I move files a lot, I tend to code with self-inflicted worst-case scenarios in mind. [back]

[4] Except Macs, perhaps -- see Macintosh Line Conversions earlier in this chapter. To convert to Mac format, try replacing the script's import of fixeoln_one to load fixeoln_one_mac. [back]

[5] Interestingly, using string '*' for the patterns list works the same as using list ['*'] here, only because a single-character string is a sequence that contains itself; compare the results of map(find.find, '*') with map(find.find, ['*']) interactively to verify. [back]

[6] Very subtle thing: both versions of this script might fail on platforms where case matters, if they rename directoriesalong the way. If a directory is renamed before the contents of that directory have been visited (e.g., a directory SPAM renamed to spam), then later reference to the directory's contents using the old name (e.g., SPAM/filename) will no longer be valid on case-sensitive platforms. This can happen in the find.find version, because directories can and do show up in the result list before their contents. It's also a potential with the os.path.walk version, because the prior directory path (with original directory names) keeps being extended at each level of the tree. I only use this script on Windows (DOS), so I haven't been bitten by this in practice. Workarounds -- ordering find result lists, walking trees in a bottom-up fashion, making two distinct passes for files and directories, queuing up directory names on a list to be renamed later, or simply not renaming directories at all -- are all complex enough to be delegated to the realm of reader experiments. As a rule of thumb, changing a tree's names or structure while it is being walked is a risky venture. [back]

[7] In fact, the act of searching files often goes by the colloquial name "grepping" among developers who have spent any substantial time in the Unix ghetto. [back]

[8] Due to its limitations, the grep module has been tagged as "deprecated" as of Python 1.6, and may disappear completely in future releases. It was never intended to become a widely reusable tool. Use other tree-walking techniques in this book to search for strings in files, directories, and trees. Of the original Unix-like grep, glob, and find modules in Python's library, only glob remains nondeprecated today (but see also the custom find implementation presented in Chapter 4 ). [back]

[9] See the coverage of regular expressions in Chapter 18. The search_all script here searches for a simple string in each file with string.find, but it would be trivial to extend it to search for a regular expression pattern match instead (roughly, just replace string.find with a call to a regular expression object's search method). Of course, such a mutation will be much more trivial after we've learned how to do it. [back]

[10] For the impatient: see commonhtml.runsilent in the PyMailCgi system presented in Chapter 13. It's a variation on redirect.redirect that discards output as it is printed (instead of retaining it in a string), returns the return value of the function called (not the output string), and lets exceptions pass via a try/finally statement (instead of catching and reporting them with a try/except). It's still redirection at work, though. [back]

[11] We might try to do a bit better here, by opening text files in text mode to ignore line-terminator differences, but it's not clear that such differences should be blindly ignored (what if the caller wants to know if line-end markers have been changed?). We could also be smarter by avoiding the load and compare steps for files that differ in size, and read files in small chunks, instead of all at once, to minimize memory requirements for huge files (see earlier examples such as the cpall script for hints). For my comparisons, such optimizations are unnecessary. [back]

Chapter 4  TOC  Chapter 6