


Chapter 2. System Tools

2.1 "The os.path to Knowledge"

This chapter begins our look at ways to apply Python to real programming tasks. In this and the following chapters, we'll see how to use Python to write system tools, graphical user interfaces, database applications, Internet scripts, web sites, and more. Along the way we'll also study larger Python programming concepts in action: code reuse, maintainability, object-oriented programming, and so on.

In this first part of the book, we begin our Python programming tour by exploring the systems application domain -- scripts that deal with files, programs, and the environment surrounding a program in general. Although the examples in this domain focus on particular kinds of tasks, the techniques they employ will prove to be useful in later parts of the book as well. In other words, you should begin your journey here, unless you already are a Python systems programming wizard.

2.2 Why Python Here?

Python's system interfaces span application domains, but for the next four chapters, most of our examples fall into the category of system tools -- programs sometimes called command-line utilities, shell scripts, or some permutation of such words. Regardless of their title, you are probably familiar with this sort of script already; they accomplish tasks like processing files in a directory, launching test scripts, and so on. Such programs historically have been written in nonportable and syntactically obscure shell languages such as DOS batch files, csh, and awk.

Even in this relatively simple domain, though, some of Python's better attributes shine brightly. For instance, Python's ease of use and extensive built-in library make it simple (and even fun) to use advanced system tools such as threads, signals, forks, sockets, and their kin; such tools are much less accessible under the obscure syntax of shell languages and the slow development cycles of compiled languages. Python's support for concepts like code clarity and object-oriented programming also helps us write shell tools that can be read, maintained, and reused. When using Python, there is no need to start every new script from scratch.

Moreover, we'll find that Python not only includes all the interfaces we need to write system tools, it also fosters script portability. By employing Python's standard library, most system scripts written in Python are automatically portable to all major platforms. A Python directory-processing script written in Windows, for instance, can usually also be run in Linux without changing its source code at all -- simply copy over the source code. If used well, Python is the only system scripting tool you need to know.

"Batteries Included"

This chapter and those that follow deal with both the Python language and its standard library. Although Python itself provides an easy-to-use scripting language, much of the action in real Python development involves the vast library of programming tools (some 200 modules at last count) that ship with the Python package. In fact, the standard libraries are so powerful that it is not uncommon to hear Python described by the term "batteries included" -- a phrase generally credited to Frank Stajano, meaning that most of what you need for real day-to-day work is already there for the importing.

As we'll see, the standard libraries form much of the challenge in Python programming. Once you've mastered the core language, you'll find that most of your time is spent applying the built-in functions and modules that come with the system. On the other hand, libraries are where most of the fun happens. In practice, programs become most interesting when they start using services external to the language interpreter: networks, files, GUIs, databases, and so on. All of these are supported in the Python standard library, a collection of precoded modules written in Python and C that are installed with the Python interpreter.

Beyond the Python standard library, there is an additional collection of third-party packages for Python that must be fetched and installed separately. At this writing, most of these third-party extensions can be found via searches and links at http://www.python.org, and at the "Starship" and "Vaults of Parnassus" Python sites (also reachable from links at http://www.python.org). If you have to do something special with Python, chances are good that you can find a free and open source module that will help. Most of the tools we'll employ in this text are a standard part of Python, but I'll be careful to point out things that must be installed separately.

 

 

2.3 System Scripting Overview

The next two sections will take a quick tour through sys and os, before this chapter moves on to larger system programming concepts. As I'm not going to demonstrate every item in every built-in module, the first thing I want to do is show you how to get more details on your own. Officially, this task also serves as an excuse for introducing a few core system scripting concepts -- along the way, we'll code a first script to format documentation.

2.3.1 Python System Modules

Most system-level interfaces in Python are shipped in just two modules: sys and os. That's somewhat oversimplified; other standard modules belong to this domain too (e.g., glob, socket, thread, time, fcntl), and some built-in functions are really system interfaces as well (e.g., open). But sys and os together form the core of Python's system tools arsenal.

In principle at least, sys exports components related to the Python interpreter itself (e.g., the module search path), and os contains variables and functions that map to the operating system on which Python is run. In practice, this distinction may not always seem clear-cut (e.g., the standard input and output streams show up in sys, but they are at least arguably tied to operating system paradigms). The good news is that you'll soon use the tools in these modules so often that their locations will be permanently stamped on your memory.[1]

The os module also attempts to provide a portable programming interface to the underlying operating system -- its functions may be implemented differently on different platforms, but they look the same everywhere to Python scripts. In addition, the os module exports a nested submodule, os.path, that provides a portable interface to file and directory processing tools.

2.3.2 Module Documentation Sources

As you can probably deduce from the preceding paragraphs, learning to write system scripts in Python is mostly a matter of learning about Python's system modules. Luckily, there are a variety of information sources to make this task easier -- from module attributes to published references and books.

For instance, if you want to know everything that a built-in module exports, you can either read its library manual entry, study its source code (Python is open source software, after all), or fetch its attribute list and documentation string interactively. Let's import sys and see what it's got:

C:\...\PP2E\System> python
>>> import sys
>>> dir(sys)
['__doc__', '__name__', '__stderr__', '__stdin__', '__stdout__', 'argv',
'builtin_module_names', 'copyright', 'dllhandle', 'exc_info', 'exc_type',
'exec_prefix', 'executable', 'exit', 'getrefcount', 'hexversion', 'maxint',
'modules', 'path', 'platform', 'prefix', 'ps1', 'ps2', 'setcheckinterval',
'setprofile', 'settrace', 'stderr', 'stdin', 'stdout', 'version', 'winver']

The dir function simply returns a list containing the string names of all the attributes in any object with attributes; it's a handy memory-jogger for modules at the interactive prompt. For example, we know there is something called sys.version, because the name version came back in the dir result. If that's not enough, we can always consult the __doc__ string of built-in modules:

>>> sys.__doc__ 
...
 ...lots of text deleted here...
...
count for an object (plus one :-)\012setcheckinterval( ) -- control how often 
the interpreter checks for events\012setprofile( ) -- set the global profiling
function\012settrace( ) -- set the global debug tracing function\012"
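As a minimal sketch of the same kind of inspection, assuming a current Python 3 interpreter rather than the 1.5.2 release used in this chapter's sessions, the dir result can be filtered down to a module's public names. The helper name public_names is hypothetical, introduced here only for illustration:

```python
import sys

def public_names(module):
    """Return the attribute names of a module that do not start with an underscore."""
    return [name for name in dir(module) if not name.startswith('_')]

names = public_names(sys)
print(names[:5])                       # first few exported names
print(sys.__doc__.splitlines()[0])     # first line of the module's documentation string
```

The exact list of names varies per Python release and platform, just as the dir(sys) listing above does.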

2.3.3 Paging Documentation Strings

The __doc__ built-in attribute usually contains a string of documentation, but may look a bit weird when printed -- it's one long string with embedded line-feed characters that print as \012, not a nice list of lines. To format these strings for more humane display, I usually use a utility script like the one in Example 2-1.

Example 2-1. PP2E\System\more.py
#########################################################
# split and interactively page a string or file of text;
#########################################################
 
import string
 
def more(text, numlines=15):
    lines = string.split(text, '\n')
    while lines:
        chunk = lines[:numlines]
        lines = lines[numlines:]
        for line in chunk: print line
        if lines and raw_input('More?') not in ['y', 'Y']: break

if __name__ == '__main__':
    import sys                             # when run, not imported
    more(open(sys.argv[1]).read( ), 10)    # page contents of file on cmdline

The meat of this file is its more function, and if you know any Python at all, it should be fairly straightforward -- it simply splits up a string around end-of-line characters, and then slices off and displays a few lines at a time (15 by default) to avoid scrolling off the screen. A slice expression lines[:15] gets the first 15 items in a list, and lines[15:] gets the rest; to show a different number of lines each time, pass a number to the numlines argument (e.g., the last line in Example 2-1 passes 10 to the numlines argument of the more function).

The string.split built-in call this script employs returns a list of sub-strings (e.g., ["line", "line",...]). As we'll see later in this chapter, the end-of-line character is always \n (which is \012 in octal escape form) within a Python script, no matter what platform it is run upon. (If you don't already know why this matters, DOS \r characters are dropped when read.)
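The slicing logic at the heart of more can be isolated from the prompting. The following is only a sketch in Python 3 syntax, not the book's script; the function name pages is hypothetical, and it returns the chunks instead of printing them so that the splitting is easy to inspect:

```python
def pages(text, numlines=15):
    """Split text into lines and return a list of chunks of at most numlines lines."""
    lines = text.split('\n')
    chunks = []
    while lines:
        chunks.append(lines[:numlines])    # the first numlines items
        lines = lines[numlines:]           # everything after them
    return chunks

for chunk in pages('a\nb\nc\nd\ne', 2):
    print(chunk)
```

Each pass through the loop takes one slice off the front of the list and keeps the remainder, so the loop ends when the remainder is empty.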

2.3.4 Introducing the string Module

Now, this is a simple Python program, but it already brings up three important topics that merit quick detours here: it uses the string module, reads from a file, and is set up to be run or imported. The Python string module isn't a system-related tool per se, but it sees action in most Python programs. In fact, it is going to show up throughout this chapter and those that follow, so here is a quick review of some of its more useful exports. The string module includes calls for searching and replacing:

>>> import string
>>> string.find('xxxSPAMxxx', 'SPAM') # return first offset
3
>>> string.replace('xxaaxxaa', 'aa', 'SPAM') # global replacement
'xxSPAMxxSPAM'
 
>>> string.strip('\t Ni\n') # remove whitespace
'Ni'

The string.find call returns the offset of the first occurrence of a substring, and string.replace does global search and replacement. With this module, substrings are just strings; in Chapter 18, we'll also see modules that allow regular expression patterns to show up in searches and replacements. The string module also provides constants and functions useful for things like case conversions:

>>> string.lowercase # case constants, converters
'abcdefghijklmnopqrstuvwxyz'
 
>>> string.lower('SHRUBBERRY')
'shrubberry'

There are also tools for splitting up strings around a substring delimiter and putting them back together with a substring between. We'll explore these tools later in this book, but as an introduction, here they are at work:

>>> string.split('aaa+bbb+ccc', '+') # split into substrings list
['aaa', 'bbb', 'ccc']
>>> string.split('a b\nc\nd') # default delimiter: whitespace
['a', 'b', 'c', 'd']
 
>>> string.join(['aaa', 'bbb', 'ccc'], 'NI') # join substrings list
'aaaNIbbbNIccc'
>>> string.join(['A', 'dead', 'parrot']) # default delimiter: space
'A dead parrot'

These calls turn out to be surprisingly powerful. For example, a line of data columns separated by tabs can be parsed into its columns with a single split call; the more.py script uses it to split a string into a list of line strings. In fact, we can emulate the string.replace call with a split/join combination:

>>> string.join(string.split('xxaaxxaa', 'aa'), 'SPAM') # replace the hard way
'xxSPAMxxSPAM'

For future reference, also keep in mind that Python doesn't automatically convert strings to numbers, or vice versa; if you want to use one like the other, you must say so, with manual conversions:

>>> string.atoi("42"), int("42"), eval("42") # string to int conversions
(42, 42, 42)
 
>>> str(42), `42`, ("%d" % 42) # int to string conversions
('42', '42', '42')
 
>>> "42" + str(1), int("42") + 1  # concatenation, addition
('421', 43)

In the last command here, the first expression triggers string concatenation (since both sides are strings) and the second invokes integer addition (because both objects are numbers). Python doesn't assume you meant one or the other and convert automatically; as a rule of thumb, Python tries to avoid magic when possible. String tools will be covered in more detail later in this book (in fact, they get a full chapter in Part IV), but be sure to also see the library manual for additional string module tools.
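To see the "no magic" rule from the other side, here is a small sketch assuming Python 3, where int and str remain the conversion tools (string.atoi and the backquote form shown above no longer exist there). The variable names are illustrative only:

```python
count_text = "42"

total = int(count_text) + 1        # explicit string-to-int, then integer addition
label = count_text + str(1)        # explicit int-to-string, then concatenation

try:
    count_text + 1                 # mixing the two types without converting
except TypeError as exc:
    error = str(exc)               # Python refuses instead of guessing

print(total, label)
print(error)
```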

As of Python 1.6, string objects have grown methods corresponding to functions in the string module. For instance, given a name X assigned to a string object, X.split( ) now does the same work as string.split(X). In Example 2-1, that means that these two lines would be equivalent:

lines = string.split(text, '\n')
lines = text.split('\n')

but the latter form doesn't require an import statement. The string module will still be around for the foreseeable future and beyond, but string methods are likely to be the next wave in the Python text-processing world.
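For comparison, the string module calls shown earlier in this section can be written as string methods. This is a sketch assuming Python 3; note that join is called on the delimiter string and takes the list as its argument, the reverse of the string.join argument order:

```python
text = 'xxaaxxaa'

print('xxxSPAMxxx'.find('SPAM'))            # offset of first occurrence
print(text.replace('aa', 'SPAM'))           # global replacement
print('\t Ni\n'.strip())                    # remove surrounding whitespace
print('aaa+bbb+ccc'.split('+'))             # split around a delimiter
print('NI'.join(['aaa', 'bbb', 'ccc']))     # join with a delimiter between items
print('SPAM'.join(text.split('aa')))        # replace the hard way
```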

2.3.5 File Operation Basics

The more.py script also opens the external file whose name is listed on the command line with the built-in open function, and reads its text into memory all at once with the file object read method. Since file objects returned by open are part of the core Python language itself, I assume that you have at least a passing familiarity with them at this point in the text. But just in case you've flipped into this chapter early on in your Pythonhood, the calls:

open('file').read( ) # read entire file into string 
open('file').read(N) # read next N bytes into string 
open('file').readlines( ) # read entire file into line strings list
open('file').readline( ) # read next line, through '\n'

load a file's contents into a string, load a fixed size set of bytes into a string, load a file's contents into a list of line strings, and load the next line in the file into a string, respectively. As we'll see in a moment, these calls can also be applied to shell commands in Python. File objects also have write methods for sending strings to the associated file. File-related topics are covered in depth later in this chapter.
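The four read calls can be tried side by side on a scratch file. This sketch assumes Python 3 and uses the standard tempfile module to create the file; the file contents and variable names are made up for illustration:

```python
import os
import tempfile

# Create a small scratch file so the sketch is self-contained.
handle, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(handle, 'w') as f:
    f.write('spam\neggs\nham\n')

with open(path) as f:
    whole = f.read()             # entire file as one string
with open(path) as f:
    first_two = f.read(2)        # next 2 characters
with open(path) as f:
    all_lines = f.readlines()    # list of line strings, each ending in '\n'
with open(path) as f:
    first_line = f.readline()    # next line, through '\n'

os.remove(path)
print(repr(whole), repr(first_two), all_lines, repr(first_line))
```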

2.3.6 Using Programs Two Ways

The last few lines in the more.py file also introduce one of the first big concepts in shell tool programming. They instrument the file to be used two ways: as script or library. Every Python module has a built-in __name__ variable that is set by Python to the string __main__ only when the file is run as a program, not when imported as a library. Because of that, the more function in this file is executed automatically by the last line in the file when this script is run as a top-level program, not when it is imported elsewhere. This simple trick turns out to be one key to writing reusable script code: by coding program logic as functions instead of top-level code, it can also be imported and reused in other scripts.

The upshot is that we can either run more.py by itself, or import and call its more function elsewhere. When running the file as a top-level program, we list the name of a file to be read and paged on the command line: as we'll describe fully later in this chapter, words typed in the command used to start a program show up in the built-in sys.argv list in Python. For example, here is the script file in action paging itself (be sure to type this command line in your PP2E\System directory, or it won't find the input file; I'll explain why later):

C:\...\PP2E\System>python more.py more.py
#########################################################
# split and interactively page a string or file of text;
#########################################################
 
import string
 
def more(text, numlines=15):
    lines = string.split(text, '\n')
    while lines:
        chunk = lines[:numlines]
More?y
        lines = lines[numlines:]
        for line in chunk: print line
        if lines and raw_input('More?') not in ['y', 'Y']: break

if __name__ == '__main__':
    import sys                             # when run, not imported
    more(open(sys.argv[1]).read( ), 10)    # page contents of file on cmdline

When the more.py file is imported, we pass an explicit string to its more function, and this is exactly the sort of utility we need for documentation text. Running this utility on the sys module's documentation string gives us a bit more information about what's available to scripts, in human-readable form:

>>> from more import more
>>> more(sys.__doc__)
This module provides access to some objects used or maintained by the
interpreter and to functions that interact strongly with the interpreter.
 
Dynamic objects:
 
argv -- command line arguments; argv[0] is the script pathname if known
path -- module search path; path[0] is the script directory, else ''
modules -- dictionary of loaded modules
exitfunc -- you may set this to a function to be called when Python exits
 
stdin -- standard input file object; used by raw_input( ) and input( )
stdout -- standard output file object; used by the print statement
stderr -- standard error object; used for error messages
 By assigning another file object (or an object that behaves like a file)
 to one of these, it is possible to redirect all of the interpreter's I/O.
More?

Pressing "y" (and the Enter key) here makes the function display the next few lines of documentation, and then prompt again unless you've run past the end of the lines list. Try this on your own machine to see what the rest of the module's documentation string looks like.
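The run-or-import distinction can also be observed from a single script. The sketch below assumes Python 3; it writes a throwaway module (the names demo_tool and greet are hypothetical), runs it once as a top-level program, and then loads it once as a library, so that the two values of __name__ can be compared:

```python
import importlib.util
import os
import subprocess
import sys
import tempfile

source = '''
def greet():
    return "hello from " + __name__

if __name__ == "__main__":
    print(greet())
'''

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'demo_tool.py')
    with open(path, 'w') as f:
        f.write(source)

    # Way 1: run the file as a top-level program; its self-test code fires.
    result = subprocess.run([sys.executable, path], capture_output=True, text=True)
    run_output = result.stdout.strip()

    # Way 2: load the same file as a library; only the def statement runs.
    spec = importlib.util.spec_from_file_location('demo_tool', path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    import_output = module.greet()

print(run_output)
print(import_output)
```

In ordinary use a plain import statement does the job of the importlib calls; they appear here only because the scratch file lives outside the module search path.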

2.3.7 Python Library Manuals

If that still isn't enough detail, your next step is to read the Python library manual's entry for sys to get the full story. All of Python's standard manuals ship as HTML pages, so you should be able to read them in any web browser you have on your computer. They are available on this book's CD (view CD-ROM content online at http://examples.oreilly.com/python2), and are installed with Python on Windows, but here are a few simple pointers:

· On Windows, click the Start button, pick Programs, select the Python entry there, and then choose the manuals item. The manuals should magically appear on your display within a browser like Internet Explorer.

· On Linux, you may be able to click on the manuals' entries in a file explorer, or start your browser from a shell command line and navigate to the library manual's HTML files on your machine.

· If you can't find the manuals on your computer, you can always read them online. Go to Python's web site, http://www.python.org, and follow the documentation links.

However you get started, be sure to pick the "Library" manual for things like sys; Python's standard manual set also includes a short tutorial, language reference, extending references, and more.

2.3.8 Commercially Published References

At the risk of sounding like a marketing droid, I should mention that you can also purchase the Python manual set, printed and bound; see the book information page at http://www.python.org for details and links. Commercially published Python reference books are also available today, including Python Essential Reference (New Riders Publishing) and Python Pocket Reference (O'Reilly). The former is more complete and comes with examples, but the latter serves as a convenient memory-jogger once you've taken a library tour or two.[2] Also watch for O'Reilly's upcoming book Python Standard Library.

2.4 The sys Module

On to module details. As mentioned earlier, the sys and os modules form the core of much of Python's system-related toolset. Let's now take a quick, interactive tour through some of the tools in these two modules, before applying them in bigger examples.

2.4.1 Platforms and Versions

Like most modules, sys includes both informational names and functions that take action. For instance, its attributes give us the name of the underlying operating system on which our code is running, the largest possible integer on this machine, and the version number of the Python interpreter running our code:

C:\...\PP2E\System>python
>>> import sys
>>> sys.platform, sys.maxint, sys.version
('win32', 2147483647, '1.5.2 (#0, Apr 13 1999, 10:51:12) [MSC 32 bit (Intel)]')
>>>
>>> if sys.platform[:3] == 'win': print 'hello windows'
...
hello windows

If you have code that must act differently on different machines, simply test the sys.platform string as done here; although most of Python is cross-platform, nonportable tools are usually wrapped in if tests like the one here. For instance, we'll see later that program launch and low-level console interaction tools vary per platform today -- simply test sys.platform to pick the right tool for the machine your script is running on.
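A platform test is often wrapped in a small function so the rest of a script need not care. The following sketch assumes Python 3; the function name clear_screen_command and the choice of the cls and clear shell commands are illustrative assumptions, not tools covered in this chapter:

```python
import sys

def clear_screen_command(platform=sys.platform):
    """Pick a console-clearing shell command for the given platform string."""
    if platform.startswith('win'):      # same test as platform[:3] == 'win'
        return 'cls'
    return 'clear'

print(clear_screen_command())
```

Taking the platform string as a parameter with a default lets the function be exercised for any platform name without changing machines.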

2.4.2 The Module Search Path

The sys module also lets us inspect the module search path both interactively and within a Python program. sys.path is a list of strings representing the true search path in a running Python interpreter. When a module is imported, Python scans this list from left to right, searching for the module's file in each directory named in the list. Because of that, this is the place to look to verify that your search path is really set as intended.[3]

The sys.path list is simply initialized from your PYTHONPATH setting plus system defaults, when the interpreter is first started up. In fact, you'll notice quite a few directories that are not on your PYTHONPATH if you inspect sys.path interactively -- it also includes an indicator for the script's home directory (an empty string -- something I'll explain in more detail after we meet os.getcwd), and a set of standard library directories that may vary per installation:

>>> sys.path 
['', 'C:\\PP2ndEd\\examples',  ...plus standard paths deleted... ] 

Surprisingly, sys.path can actually be changed by a program too -- a script can use list operations like append, del, and the like to configure the search path at runtime. Python always uses the current sys.path setting to import, no matter what you've changed it to be:

>>> sys.path.append(r'C:\mydir') 
>>> sys.path 
['', 'C:\\PP2ndEd\\examples',  ...more deleted... , 'C:\\mydir']

Changing sys.path directly like this is an alternative to setting your PYTHONPATH shell variable, but not a very good one -- changes to sys.path are retained only until the Python process ends, and must be remade every time you start a new Python program or session.
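Here is a sketch of a runtime search-path change with a visible effect, assuming Python 3. It creates a scratch directory holding a throwaway module (the name pathdemo_mod is hypothetical), appends the directory to sys.path so the import can succeed, and then removes it again:

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Write a throwaway module into a directory that is not on the search path.
    with open(os.path.join(tmpdir, 'pathdemo_mod.py'), 'w') as f:
        f.write('ANSWER = 42\n')

    sys.path.append(tmpdir)                 # extend the search path at runtime
    try:
        importlib.invalidate_caches()       # make sure the new file is noticed
        pathdemo_mod = importlib.import_module('pathdemo_mod')
        answer = pathdemo_mod.ANSWER
    finally:
        sys.path.remove(tmpdir)             # undo the change for this process

print(answer)
```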

Windows Directory Paths

Because backslashes normally introduce escape code sequences in Python strings, Windows users should be sure to either double up on backslashes when using them in DOS directory path strings (e.g., in "C:\\dir", \\ is an escape sequence that really means \), or use raw string constants to retain backslashes literally (e.g., r"C:\dir").

If you inspect directory paths on Windows (as in the sys.path interaction listing), Python prints double \\ to mean a single \. Technically, you can get away with a single \ in a string if it is followed by a character Python does not recognize as the rest of an escape sequence, but doubles and raw strings are usually easier than memorizing escape code tables.

Also note that most Python library calls accept either forward ( / ) or backward ( \ ) slashes as directory path separators, regardless of the underlying platform. That is, / usually works on Windows too, and aids in making scripts portable to Unix. Tools in the os and os.path modules, described later in this chapter, further aid in script path portability.
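The escape rules described in this sidebar are easy to check directly. A short sketch, with made-up variable names:

```python
doubled = "C:\\dir"       # \\ is an escape sequence for one backslash
raw = r"C:\dir"           # a raw string keeps the backslash literally
risky = "C:\temp"         # \t is read as a tab character, not backslash + t

print(doubled == raw)     # both spell the same string
print(len(doubled))       # the doubled backslash counts as one character
print('\t' in risky)      # the path now contains a tab
```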

2.4.3 The Loaded Modules Table

The sys module also contains hooks into the interpreter; sys.modules, for example, is a dictionary containing one name:module entry for every module imported in your Python session or program (really, in the calling Python process):

>>> sys.modules
{'os.path': <module 'ntpath' from 'C:\Program Files\Python\Lib\ntpath.pyc'>,...
 
>>> sys.modules.keys( )
['os.path', 'os', 'exceptions', '__main__', 'ntpath', 'strop', 'nt', 'sys', 
'__builtin__', 'site', 'signal', 'UserDict', 'string', 'stat']
>>>
>>> sys
<module 'sys' (built-in)>
>>> sys.modules['sys']
<module 'sys' (built-in)>

We might use such a hook to write programs that display or otherwise process all the modules loaded by a program (just iterate over the keys list of sys.modules). sys also exports tools for getting an object's reference count used by Python's garbage collector (getrefcount), checking which modules are built in to this Python (builtin_module_names), and more.
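As a minimal sketch of such a program, assuming Python 3 (where keys returns a view that can be sorted like any other sequence of names):

```python
import sys

loaded = sorted(sys.modules.keys())          # names of every module loaded so far
print(len(loaded), 'modules loaded')
print(loaded[:5])                            # a sample of the names
print(sys.modules['sys'] is sys)             # the table maps names to module objects
print('sys' in sys.builtin_module_names)     # sys is compiled into the interpreter
```

The count and the names printed depend on the interpreter and on what has been imported before this code runs.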

2.4.4 Exception Details

Some of the sys module's attributes allow us to fetch all the information related to the most recently raised Python exception. This is handy if we want to process exceptions in a more generic fashion. For instance, the sys.exc_info function returns the latest exception's type, value, and traceback object:

>>> try:
...     raise IndexError
... except:
...     print sys.exc_info( )
...
(<class exceptions.IndexError at 7698d0>, <exceptions.IndexError instance at
797140>, <traceback object at 7971a0>)

We might use such information to format our own error message to display in a GUI pop-up window or HTML web page (recall that by default, uncaught exceptions terminate programs with a Python error display). Portability note -- the most recent exception type, value, and traceback objects are also available via other names:

>>> try:
...     raise TypeError, "Bad Thing"
... except:
...     print sys.exc_type, sys.exc_value
...
exceptions.TypeError Bad Thing

But these names represent a single, global exception, and are not specific to a particular thread (threads are covered in the next chapter). If you mean to raise and catch exceptions in multiple threads, exc_info( ) provides thread-specific exception details.
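To illustrate formatting our own error message, here is a sketch assuming Python 3, where sys.exc_info is still available but the sys.exc_type and sys.exc_value names no longer exist. The helper name describe_current_exception is hypothetical, and it uses the standard traceback module to pull a line number out of the traceback object:

```python
import sys
import traceback

def describe_current_exception():
    """Format the exception currently being handled as a one-line message."""
    exc_type, exc_value, exc_tb = sys.exc_info()
    location = traceback.extract_tb(exc_tb)[-1]     # innermost traceback entry
    return '%s: %s (line %d)' % (exc_type.__name__, exc_value, location.lineno)

try:
    raise TypeError('Bad Thing')
except TypeError:
    message = describe_current_exception()

print(message)
```

A string like this could then be handed to a GUI pop-up or written into an HTML page instead of letting the default error display appear.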

2.4.5 Other sys Module Exports

The sys module exports additional tools we will meet in the context of larger topics and examples later in this chapter and book. For instance:

· Command-line arguments show up as a list of strings called sys.argv

· Standard streams are available as stdin, stdout, and stderr

· Program exit can be forced with sys.exit calls

Since these all lead us to bigger topics, though, we cover them in sections of their own later in this and the next chapters.

2.5 The os Module

As mentioned, os contains all the usual operating-system calls you may have used in your C programs and shell scripts. Its calls deal with directories, processes, shell variables, and the like. Technically, this module provides POSIX tools -- a portable standard for operating-system calls -- along with platform-independent directory processing tools as nested module os.path. Operationally, os serves as a largely portable interface to your computer's system calls: scripts written with os and os.path can usually be run on any platform unchanged.

In fact, if you read the os module's source code, you'll notice that it really just imports whatever platform-specific system module you have on your computer (e.g., nt, mac, posix). See the file os.py in the Python source library directory -- it simply runs a from ... import * statement to copy all names out of a platform-specific module. By always importing os instead of platform-specific modules, though, your scripts are mostly immune to platform implementation differences.

2.5.1 The Big os Lists

Let's take a quick look at the basic interfaces in os. If you inspect this module's attributes interactively, you get a huge list of names that will vary per Python release, will likely vary per platform, and isn't incredibly useful until you've learned what each name means:

>>> import os
>>> dir(os)
['F_OK', 'O_APPEND', 'O_BINARY', 'O_CREAT', 'O_EXCL', 'O_RDONLY', 'O_RDWR', 
'O_TEXT', 'O_TRUNC', 'O_WRONLY', 'P_DETACH', 'P_NOWAIT', 'P_NOWAITO',
'P_OVERLAY', 'P_WAIT', 'R_OK', 'UserDict', 'W_OK', 'X_OK', '_Environ',
'__builtins__', '__doc__', '__file__', '__name__', '_execvpe', '_exit',
'_notfound', 'access', 'altsep', 'chdir', 'chmod', 'close', 'curdir', 
'defpath', 'dup', 'dup2', 'environ', 'error', 'execl', 'execle', 'execlp',
'execlpe', 'execv', 'execve', 'execvp', 'execvpe', 'fdopen', 'fstat', 'getcwd',
'getpid', 'i', 'linesep', 'listdir', 'lseek', 'lstat', 'makedirs', 'mkdir',
'name', 'open', 'pardir', 'path', 'pathsep', 'pipe', 'popen', 'putenv', 'read',
'remove', 'removedirs', 'rename', 'renames', 'rmdir', 'sep', 'spawnv', 
'spawnve', 'stat', 'strerror', 'string', 'sys', 'system', 'times', 'umask', 
'unlink', 'utime', 'write']

Besides all of these, the nested os.path module exports even more tools, most of which are related to processing file and directory names portably:

>>> dir(os.path)
['__builtins__', '__doc__', '__file__', '__name__', 'abspath', 'basename', 
'commonprefix', 'dirname', 'exists', 'expanduser', 'expandvars', 'getatime',
'getmtime', 'getsize', 'isabs', 'isdir', 'isfile', 'islink', 'ismount', 'join',
'normcase', 'normpath', 'os', 'split', 'splitdrive', 'splitext', 'splitunc',
'stat', 'string', 'varchars', 'walk']

2.5.2 Administrative Tools

Just in case those massive listings aren't quite enough to go on, let's experiment with some of the simpler os tools interactively. Like sys, the os module comes with a collection of informational and administrative tools:

>>> os.getpid( )
-510737
>>> os.getcwd( )
'C:\\PP2ndEd\\examples\\PP2E\\System'
 
>>> os.chdir(r'c:\temp')
>>> os.getcwd( )
'c:\\temp'

As shown here, the os.getpid function gives the calling process's process ID (a unique system-defined identifier for a running program), and os.getcwd returns the current working directory. The current working directory is where files opened by your script are assumed to live, unless their names include explicit directory paths. That's why I told you earlier to run the following command in the directory where more.py lives:

C:\...\PP2E\System>python more.py more.py

The input filename argument here is given without an explicit directory path (though you could add one to page files in another directory). If you need to run in a different working directory, call the os.chdir function to change to a new directory; your code will run relative to the new directory for the rest of the program (or until the next os.chdir call). This chapter has more to say about the notion of a current working directory, and its relation to module imports, when it explores script execution context later.
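The effect of the current working directory on relative filenames can be sketched as follows, assuming Python 3. The scratch directory and the file name note.txt are made up for the example, and the original directory is restored at the end:

```python
import os
import tempfile

start_dir = os.getcwd()
with tempfile.TemporaryDirectory() as tmpdir:
    with open(os.path.join(tmpdir, 'note.txt'), 'w') as f:
        f.write('found me')

    os.chdir(tmpdir)                     # relative names now resolve here
    try:
        with open('note.txt') as f:      # no directory path needed
            content = f.read()
    finally:
        os.chdir(start_dir)              # go back before the directory is deleted

print(content)
```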

2.5.3 Portability Constants

The os module also exports a set of names designed to make cross-platform programming simpler. The set includes platform-specific settings for path and directory separator characters, parent and current directory indicators, and the characters used to terminate lines on the underlying computer:[4]

>>> os.pathsep, os.sep, os.pardir, os.curdir, os.linesep
(';', '\\', '..', '.', '\015\012')

The name os.sep is whatever character is used to separate directory components on the platform on which Python is running; it is automatically preset to "\" on Windows, "/" for POSIX machines, and ":" on the Mac. Similarly, os.pathsep provides the character that separates directories on directory lists -- ":" for POSIX and ";" for DOS and Windows. By using such attributes when composing and decomposing system-related strings, we make our scripts fully portable. For instance, a call of the form string.split(dirpath,os.sep) will correctly split platform-specific directory names into components, even though dirpath may look like "dir\dir" on Windows, "dir/dir" on Linux, and "dir:dir" on Macintosh.
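A round trip through these constants looks the same on every platform, even though the intermediate strings differ. A sketch in Python 3 method-call form, with made-up directory names:

```python
import os

parts = ['home', 'lutz', 'temp']
dirpath = os.sep.join(parts)             # joined with this platform's separator
print(dirpath.split(os.sep))             # back to the component list

search_dirs = os.pathsep.join(['first', 'second'])
print(search_dirs.split(os.pathsep))     # a directory list, split portably
```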

2.5.4 Basic os.path Tools

The nested module os.path provides a large set of directory-related tools of its own. For example, it includes portable functions for tasks such as checking a file's type (isdir, isfile, and others), testing file existence (exists), and fetching the size of a file by name (getsize):

>>> os.path.isdir(r'C:\temp'), os.path.isfile(r'C:\temp')
(1, 0)
>>> os.path.isdir(r'C:\config.sys'), os.path.isfile(r'C:\config.sys')
(0, 1)
>>> os.path.isdir('nonesuch'), os.path.isfile('nonesuch')
(0, 0)
 
>>> os.path.exists(r'c:\temp\data.txt')
0
>>> os.path.getsize(r'C:\autoexec.bat')
260

The os.path.isdir and os.path.isfile calls tell us whether a filename is a directory or a simple file; both return 0 (false) if the named file does not exist. We also get calls for splitting and joining directory path strings, which automatically use the directory name conventions on the platform on which Python is running:

>>> os.path.split(r'C:\temp\data.txt')
('C:\\temp', 'data.txt')
>>> os.path.join(r'C:\temp', 'output.txt')
'C:\\temp\\output.txt'
 
>>> name = r'C:\temp\data.txt'  # Windows paths
>>> os.path.basename(name), os.path.dirname(name)
('data.txt', 'C:\\temp')
 
>>> name = '/home/lutz/temp/data.txt'  # Unix-style paths
>>> os.path.basename(name), os.path.dirname(name)
('data.txt', '/home/lutz/temp')
 
>>> os.path.splitext(r'C:\PP2ndEd\examples\PP2E\PyDemos.pyw')
('C:\\PP2ndEd\\examples\\PP2E\\PyDemos', '.pyw')

The os.path.split call separates a filename from its directory path, and os.path.join puts them back together -- all in entirely portable fashion, using the path conventions of the machine on which they are called. The basename and dirname calls here simply return the second and first items returned by a split as a convenience, and splitext splits off the file extension (after the last "."). This module also has an abspath call that portably returns the absolute full directory pathname of a file; it accounts for adding the current directory, ".." parents, and more:

>>> os.getcwd( )
'C:\\PP2ndEd\\cdrom\\WindowsExt'
>>> os.path.abspath('temp')  # expand to full path name
'C:\\PP2ndEd\\cdrom\\WindowsExt\\temp'
>>> os.path.abspath(r'..\examples')  # relative paths expanded
'C:\\PP2ndEd\\examples'
>>> os.path.abspath(r'C:\PP2ndEd\chapters') # absolute paths unchanged
'C:\\PP2ndEd\\chapters'
>>> os.path.abspath(r'C:\temp\spam.txt')  # ditto for file names
'C:\\temp\\spam.txt'
>>> os.path.abspath('')  # empty string means the cwd
'C:\\PP2ndEd\\cdrom\\WindowsExt'

Because filenames are relative to the current working directory when they aren't fully specified paths, the os.path.abspath function helps if you want to show users what directory is truly being used to store a file. On Windows, for example, when GUI-based programs are launched by clicking on file explorer icons and desktop shortcuts, the execution directory of the program is the clicked file's home directory, but that is not always obvious to the person doing the clicking; printing a file's abspath can help.
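The listings above were run on Windows. As a platform-neutral sketch of the same os.path calls, assuming a current Python 3 interpreter (where the test functions return True and False rather than 1 and 0), something like the following could be run anywhere; the temp and data.txt names are placeholders and need not exist.

```python
import os

name = os.path.join('temp', 'data.txt')          # built with this platform's separator
head, tail = os.path.split(name)                 # ('temp', 'data.txt')
root, ext = os.path.splitext(tail)               # ('data', '.txt')

full = os.path.abspath(name)                     # expanded against the CWD
print(head, tail, root, ext)
print(full)
print(os.path.isabs(name), os.path.isabs(full))  # relative versus absolute
print(os.path.exists(name), os.path.isdir(os.curdir))
```

Note that abspath does not check whether the file exists; it only computes the name that the relative path would map to from the current working directory.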

2.5.5 Running Shell Commands from Scripts

The os module is also the place where we run shell commands from within Python scripts. This concept is intertwined with others we won't cover until later in this chapter, but since this is a key concept employed throughout this part of the book, let's take a quick first look at the basics here. Two os functions allow scripts to run any command line that you can type in a console window:

os.system

Run a shell command from a Python script

os.popen

Run a shell command and connect to its input or output streams

2.5.5.1 What's a shell command?

To understand the scope of these calls, we need to first define a few terms. In this text the term shell means the system that reads and runs command-line strings on your computer, and shell command means a command-line string that you would normally enter at your computer's shell prompt.

For example, on Windows, you can start an MS-DOS console window and type DOS commands there -- things like dir to get a directory listing, type to view a file, names of programs you wish to start, and so on. DOS is the system shell, and commands like dir and type are shell commands. On Linux, you can start a new shell session by opening an xterm window and typing shell commands there too -- ls to list directories, cat to view files, and so on. There are a variety of shells available on Unix (e.g., csh, ksh), but they all read and run command lines. Here are two shell commands typed and run in an MS-DOS console box on Windows:

C:\temp>dir /B    ...type a shell command-line
about-pp.html  ...its output shows up here
python1.5.tar.gz  ...DOS is the shell on Windows
about-pp2e.html
about-ppr2e.html
newdir
 
C:\temp>type helloshell.py 
# a Python program
print 'The Meaning of Life'

2.5.5.2 Running shell commands

None of this is directly related to Python, of course (despite the fact that Python command-line scripts are sometimes confusingly called "shell tools"). But because the os module's system and popen calls let Python scripts run any sort of command that the underlying system shell understands, our scripts can make use of every command-line tool available on the computer, whether it's coded in Python or not. For example, here is some Python code that runs the two DOS shell commands typed at the shell prompt shown previously:

C:\temp>python
>>> import os
>>> os.system('dir /B')
about-pp.html
python1.5.tar.gz
about-pp2e.html
about-ppr2e.html
newdir
0
 
>>> os.system('type helloshell.py')
# a Python program
print 'The Meaning of Life'
0

The "0"s at the end here are just the return values of the system call itself. The system call can be used to run any command line that we could type at the shell's prompt (here, C:\temp>). The command's output normally shows up in the Python session's or program's standard output stream.

2.5.5.3 Communicating with shell commands

But what if we want to grab a command's output within a script? The os.system call simply runs a shell command line, but os.popen also connects to the standard input or output streams of the command -- we get back a file-like object connected to the command's output by default (if we pass a "w" mode flag to popen, we connect to the command's input stream instead). By using this object to read the output of a command spawned with popen, we can intercept the text that would normally appear in the console window where a command line is typed:

>>> open('helloshell.py').read( )
"# a Python program\012print 'The Meaning of Life'\012"
 
>>> text = os.popen('type helloshell.py').read( )
>>> text
"# a Python program\012print 'The Meaning of Life'\012"
 
>>> listing = os.popen('dir /B').readlines( )
>>> listing
['about-pp.html\012', 'python1.5.tar.gz\012', 'helloshell.py\012', 
'about-pp2e.html\012', 'about-ppr2e.html\012', 'newdir\012']

Here, we first fetch a file's content the usual way (using Python files), then as the output of a shell type command. Reading the output of a dir command lets us get a listing of files in a directory which we can then process in a loop (we'll meet other ways to obtain such a list later in this chapter). So far, we've run basic DOS commands; because these calls can run any command line that we can type at a shell prompt, they can also be used to launch other Python scripts:

>>> os.system('python helloshell.py') # run a Python program
The Meaning of Life
0
>>> output = os.popen('python helloshell.py').read( )
>>> output
'The Meaning of Life\012'

In all of these examples, the command-line strings sent to system and popen are hardcoded, but there's no reason Python programs could not construct such strings at runtime using normal string operations (+, %, etc.). Given that commands can be dynamically built and run this way, system and popen turn Python scripts into flexible and portable tools for launching and orchestrating other programs. For example, a Python test "driver" script can be used to run programs coded in any language (e.g., C++, Java, Python) and analyze their outputs. We'll explore such a script in Section 4.4 in Chapter 4.
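Here is one way such runtime command construction might look, as a sketch for a current Python 3 interpreter. The run_python_expr helper is hypothetical; it uses sys.executable to name the running interpreter so that the example does not depend on a python command being on the search path.

```python
import os
import sys

def run_python_expr(expr):
    """Build a command line at runtime and capture the command's output."""
    # quote the interpreter path in case it contains spaces
    cmdline = '"%s" -c "print(%s)"' % (sys.executable, expr)
    pipe = os.popen(cmdline)         # file-like object tied to the command's output
    output = pipe.read()
    pipe.close()
    return output

results = [run_python_expr(expr) for expr in ('6 * 7', '2 ** 10')]
print(results)
```

Because the expression text is pasted directly into a shell command line, a helper like this should only ever be given strings the script itself controls.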

2.5.5.4 Shell command limitations

You should keep in mind two limitations of system and popen. First, although these two functions themselves are fairly portable, their use is really only as portable as the commands that they run. The preceding examples that run DOS dir and type shell commands, for instance, work only on Windows, and would have to be changed to run ls and cat commands on Unix-like platforms. As I wrote this, the popen call on Windows worked for command-line programs only; it failed when called from a program running on Windows with any sort of user interface (e.g., under the IDLE Python development GUI). This has been improved in the Python 2.0 release -- popen now works much better on Windows -- but this fix naturally works only on machines with the latest version of Python installed.

Second, it is important to remember that running Python files as programs this way is very different, and generally much slower, than importing program files and calling functions they define. When os.system and os.popen are called, they must start a brand-new independent program running on your operating system (on Unix-like platforms, they run the command in a newly forked process). When importing a program file as a module, the Python interpreter simply loads and runs the file's code in the same process, to generate a module object. No other program is spawned along the way.[5]

There are good reasons to build systems as separate programs too, and we'll later explore things like command-line arguments and streams that allow programs to pass information back and forth. But for most purposes, imported modules are a faster and more direct way to compose systems.

If you plan to use these calls in earnest, you should also know that the os.system call normally blocks (that is, pauses) its caller until the spawned command line exits. On Linux and Unix-like platforms, the spawned command can generally be made to run independently and in parallel with the caller, by adding an & shell background operator at the end of the command line:

os.system("python program.py arg arg &")

On Windows, spawning with a DOS start command will usually launch the command in parallel too:

os.system("start program.py arg arg")

The os.popen call generally does not block its caller -- by definition, the caller must be able to read or write the file object returned -- but callers may still occasionally become blocked under both Windows and Linux if the pipe object is closed (e.g., when garbage is collected) before the spawned program exits, or the pipe is read exhaustively (e.g., with its read( ) method). As we will see in the next chapter, the Unix os.fork/exec and Windows os.spawnv calls can also be used to run parallel programs without blocking.

Because the os.system and os.popen calls also fall under the category of program launchers, stream redirectors, and cross-process communication devices, they will show up again in later parts of this and the following chapters, so we'll defer further details for the time being.

2.5.6 Other os Module Exports

Since most other os module tools are even more difficult to appreciate outside the context of larger application topics, we'll postpone a deeper look until later sections. But to let you sample the flavor of this module, here is a quick preview for reference. Among the os module's other weapons are these:

os.environ

Fetch and set shell environment variables

os.fork

Spawn a new child process on Unix

os.pipe

Communicate between programs

os.execlp

Start new programs

os.spawnv

Start new programs on Windows

os.open

Open a low-level descriptor-based file

os.mkdir

Create a new directory

os.mkfifo

Create a new named pipe

os.stat

Fetch low-level file information

os.remove

Delete a file by its pathname

os.path.walk

Apply a function to files in an entire directory tree

And so on. One caution up front: the os module provides a set of file open, read, and write calls, but these all deal with low-level file access and are entirely distinct from Python's built-in stdio file objects that we create with the built-in open function. You should normally use the built-in open function (not the os module) for all but very special file-processing needs.
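To give a feel for a few of these exports, and for the difference between descriptor-based files and the built-in open function, here is a small sketch for a current Python 3 interpreter. The directory and file names are made up, and everything happens inside a scratch directory that the code removes again.

```python
import os
import tempfile

scratch = tempfile.mkdtemp()                      # private scratch directory
newdir = os.path.join(scratch, 'newdir')
os.mkdir(newdir)                                  # create a new directory

path = os.path.join(newdir, 'spam.txt')

# low-level, descriptor-based access through the os module
fd = os.open(path, os.O_WRONLY | os.O_CREAT)
os.write(fd, b'low-level bytes\n')                # os.write takes bytes, not str
os.close(fd)

# the usual way: a file object from the built-in open function
with open(path) as f:
    text = f.read()

size = os.stat(path).st_size                      # low-level file information
os.remove(path)                                   # delete the file by pathname
os.rmdir(newdir)                                  # remove the now-empty directories
os.rmdir(scratch)
print(repr(text), size)
```

The descriptor calls work with integer file handles and raw bytes, which is why the built-in open function is usually the more convenient choice.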

Throughout this chapter, we will apply sys and os tools such as these to implement common system-level tasks, but this book doesn't have space to provide an exhaustive list of the contents of the modules we meet along the way. If you have not already done so, you should become acquainted with the contents of modules like os and sys by consulting the Python library manual. For now, let's move on to explore additional system tools, in the context of broader system programming concepts.

2.6 Script Execution Context

Python scripts don't run in a vacuum. Depending on platforms and startup procedures, Python programs may have all sorts of enclosing context -- information automatically passed-in to the program by the operating system when the program starts up. For instance, scripts have access to the following sorts of system-level inputs and interfaces:

Current working directory

os.getcwd gives access to the directory from which a script is started, and many file tools use its value implicitly.

Command-line arguments

sys.argv gives access to words typed on the command line used to start the program that serve as script inputs.

Shell variables

os.environ provides an interface to names assigned in the enclosing shell (or a parent program) and passed in to the script.

Standard streams

sys.stdin, stdout, and stderr export the three input/output streams that are at the heart of command-line shell tools.

Such tools can serve as inputs to scripts, configuration parameters, and so on. In the next few sections, we will explore these context tools -- both their Python interfaces and their typical roles.

2.7 Current Working Directory

The notion of the current working directory (CWD) turns out to be a key concept in some scripts' execution: it's always the implicit place where files processed by the script are assumed to reside, unless their names have absolute directory paths. As we saw earlier, os.getcwd lets a script fetch the CWD name explicitly, and os.chdir allows a script to move to a new CWD.

Keep in mind, though, that filenames without full pathnames map to the CWD, and have nothing to do with your PYTHONPATH setting. Technically, the CWD is always where a script is launched from, not the directory containing the script file. Conversely, imports always first search the directory containing the script, not the CWD (unless the script happens to also be located in the CWD). Since this distinction is subtle and tends to trip up beginners, let's explore it in more detail.

2.7.1 CWD, Files, and Import Paths

When you run a Python script by typing a shell command line like python dir1\dir2\file.py, the CWD is the directory you were in when you typed this command, not dir1\dir2. On the other hand, Python automatically adds the identity of the script's home directory to the front of the module search path, such that file.py can always import other files in dir1\dir2, no matter where it is run from. To illustrate, let's write a simple script to echo both its CWD and module search path:

C:\PP2ndEd\examples\PP2E\System>type whereami.py
import os, sys
print 'my os.getcwd =>', os.getcwd( ) # show my cwd execution dir
print 'my sys.path =>', sys.path[:6] # show first 6 import paths
raw_input( ) # wait for keypress if clicked

Now, running this script in the directory in which it resides sets the CWD as expected, and adds an empty string ('') to the front of the module search path, to designate the CWD (we met the sys.path module search path earlier):

C:\PP2ndEd\examples\PP2E\System>set PYTHONPATH=C:\PP2ndEd\examples
C:\PP2ndEd\examples\PP2E\System>python whereami.py
my os.getcwd => C:\PP2ndEd\examples\PP2E\System
my sys.path => ['', 'C:\\PP2ndEd\\examples', 'C:\\Program Files\\Python
\\Lib\\plat-win', 'C:\\Program Files\\Python\\Lib', 'C:\\Program Files\\
Python\\DLLs', 'C:\\Program Files\\Python\\Lib\\lib-tk']

But if we run this script from other places, the CWD moves with us (it's the directory where we type commands), and Python adds a directory to the front of the module search path that allows the script to still see files in its own home directory. For instance, when running from one level up (".."), the "System" name added to the front of sys.path will be the first directory Python searches for imports within whereami.py ; it points imports back to the directory containing the script run. Filenames without complete paths, though, will be mapped to the CWD (C:\PP2ndEd\examples\PP2E ), not the System subdirectory nested there:

C:\PP2ndEd\examples\PP2E\System>cd .. 
C:\PP2ndEd\examples\PP2E>python System\whereami.py 
my os.getcwd => C:\PP2ndEd\examples\PP2E
my sys.path => ['System', 'C:\\PP2ndEd\\examples', ... rest same... ]
 
C:\PP2ndEd\examples\PP2E>cd .. 
C:\PP2ndEd\examples>python PP2E\System\whereami.py 
my os.getcwd => C:\PP2ndEd\examples
my sys.path => ['PP2E\\System', 'C:\\PP2ndEd\\examples', ... rest same...  ]
 
C:\PP2ndEd\examples>cd PP2E\System\App 
C:\PP2ndEd\examples\PP2E\System\App>python ..\whereami.py 
my os.getcwd => C:\PP2ndEd\examples\PP2E\System\App
my sys.path => ['..', 'C:\\PP2ndEd\\examples', ... rest same...  ]

The net effect is that filenames without directory paths in a script will be mapped to the place where the command was typed (os.getcwd), but imports still have access to the directory of the script being run (via the front of sys.path). Finally, when a file is launched by clicking its icon, the CWD is just the directory that contains the clicked file. The following output, for example, appears in a new DOS console box, when whereami.py is double-clicked in Windows explorer:

my os.getcwd => C:\PP2ndEd\examples\PP2E\System
my sys.path => ['C:\\PP2NDED\\EXAMPLES\\PP2E\\SYSTEM', 'C:\\PP2ndEd\\examples',
'C:\\Program Files\\Python\\Lib\\plat-win', 'C:\\Program Files\\Python\\Lib',
'C:\\Program Files\\Python\\DLLs']

In this case, both the CWD used for filenames and the first import search directory are the directory containing the script file. This all usually works out just as you expect, but there are two pitfalls to avoid:

· Filenames might need to include complete directory paths if scripts cannot be sure from where they will be run.

· Command-line scripts cannot use the CWD to gain import visibility to files not in their own directories; instead, use PYTHONPATH settings and package import paths to access modules in other directories.

For example, files in this book can always import other files in their own home directories without package path imports, regardless of how they are run (import filehere) but must go through the PP2E package root to find files anywhere else in the examples tree (from PP2E.dir1.dir2 import filethere) even if they are run from the directory containing the desired external module. As usual for modules, the PP2E\dir1\dir2 directory name could also be added to PYTHONPATH to make filethere visible everywhere without package path imports (though adding more directories to PYTHONPATH increases the likelihood of name clashes). In either case, though, imports are always resolved to the script's home directory or other Python search path settings, not the CWD.
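For the first pitfall, one common workaround is to build full pathnames from the script's own location instead of relying on the CWD. The following is a sketch under assumptions: it targets a current Python 3 interpreter, the helper names are hypothetical, and settings.txt stands in for whatever data file a script keeps in its home directory.

```python
import os
import sys

def script_home(script_path):
    """Return the absolute directory that contains the given script file."""
    return os.path.dirname(os.path.abspath(script_path))

def home_file(script_path, filename):
    """Build a full path to a file stored next to the script."""
    return os.path.join(script_home(script_path), filename)

# sys.argv[0] is the script name as given on the command line
print(home_file(sys.argv[0], 'settings.txt'))

# a relative script path is expanded against the CWD before the split
example = os.path.join('System', 'whereami.py')
print(script_home(example))
```

Since abspath expands relative names against the CWD, a script using this approach should compute its home directory at startup, before any os.chdir call.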

2.7.2 CWD and Command Lines

This distinction between the CWD and import search paths explains why many scripts in this book designed to operate in the current working directory (instead of one whose name is passed in) are run with command lines like this:

C:\temp>python %X%\PyTools\cleanpyc-py.py    process cwd

In this example, the Python script file itself lives in the directory C:\PP2ndEd\examples\PP2E\PyTools, but because it is run from C:\temp, it processes the files located in C:\temp (i.e., in the CWD, not in the script's home directory). To process files elsewhere with such a script, simply cd to the directory to be processed to change the CWD:

C:\temp>cd C:\PP2ndEd\examples 
C:\PP2ndEd\examples>python %X%\PyTools\cleanpyc-py.py    process cwd

Because the CWD is always implied, a cd tells the script which directory to process in no less certain terms than passing a directory name to the script explicitly, like this:

C:\...\PP2E\PyTools>python find.py *.py C:\temp    process named dir

In this command line, the CWD is the directory containing the script to be run (notice that the script filename has no directory path prefix); but since this script processes a directory named explicitly on the command line (C:\temp), the CWD is irrelevant. Finally, if we want to run such a script located in some other directory to process files located in some other directory, we can simply give directory paths to both:

C:\temp>python %X%\PyTools\find.py *.cxx C:\PP2ndEd\examples\PP2E

Here, the script has import visibility to files in its PP2E\PyTools home directory and processes files in the PP2E root, but the CWD is something else entirely (C:\temp). This last form is more to type, of course, but watch for a variety of CWD and explicit script-path command lines like these in this book.

Whenever you see a %X% in command lines like those in the preceding examples, it refers to the value of the shell environment variable named X. It's just a shorthand for the full directory pathname of the PP2E book examples package root directory, which I use to point to scripts' files. On my machines, it is preset in my PP2E\Config setup-pp* files like this:

set X=C:\PP2ndEd\examples\PP2E --DOS
setenv X /home/mark/PP2ndEd/examples/PP2E --Unix/csh

That is, it is assigned and expanded to the directory where PP2E lives on the system. See the Config\setup-pp* files for more details, and see later in this chapter for more on shell variables. You can instead type full paths everywhere you see %X% in this book, but your fingers and your keyboard are probably both better off if you set X to your examples root.
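The same kind of shorthand can be expanded inside a script. As a sketch for a current Python 3 interpreter, os.path.expandvars substitutes $NAME and ${NAME} forms in a string (on Windows it also accepts the %NAME% form); the value assigned to X here is only a stand-in for a real examples root.

```python
import os

os.environ['X'] = os.path.join('PP2ndEd', 'examples', 'PP2E')   # stand-in value

# expand the variable reference embedded in a path string
script = os.path.expandvars(os.path.join('$X', 'PyTools', 'find.py'))
print(script)

# the same path, built by fetching the variable directly
same = os.path.join(os.environ['X'], 'PyTools', 'find.py')
print(script == same)
```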

 

2.8 Command-Line Arguments

The sys module is also where Python makes available the words typed on the command line used to start a Python script. These words are usually referred to as command-line arguments, and show up in sys.argv, a built-in list of strings. C programmers may notice its similarity to the C "argv" array (an array of C strings). It's not much to look at interactively, because no command-line arguments are passed to start up Python in this mode:

>>> sys.argv
['']

To really see what arguments are about, we need to run a script from the shell command line. Example 2-2 shows an unreasonably simple one that just prints the argv list for inspection.

Example 2-2. PP2E\System\testargv.py
import sys
print sys.argv

Running this script prints the command-line arguments list; note that the first item is always the name of the executed Python script file itself, no matter how the script was started (see Executable Scripts on Unix later in this chapter):

C:\...\PP2E\System>python testargv.py
['testargv.py']
 
C:\...\PP2E\System>python testargv.py spam eggs cheese
['testargv.py', 'spam', 'eggs', 'cheese']
 
C:\...\PP2E\System>python testargv.py -i data.txt -o results.txt
['testargv.py', '-i', 'data.txt', '-o', 'results.txt']

The last command here illustrates a common convention. Much like function arguments, command-line options are sometimes passed by position, and sometimes by name using a "-name value" word pair. For instance, the pair -i data.txt means the -i option's value is data.txt (e.g., an input filename). Any words can be listed, but programs usually impose some sort of structure on them.

Command-line arguments play the same role in programs that function arguments do in functions: they are simply a way to pass information to a program that can vary per program run. Because they don't have to be hardcoded, they allow scripts to be more generally useful. For example, a file-processing script can use a command-line argument as the name of the file it should process; see the more.py script we met in Example 2-1 for a prime example. Other scripts might accept processing mode flags, Internet addresses, and so on.

Once you start using command-line arguments regularly, though, you'll probably find it inconvenient to keep writing code that fishes through the list looking for words. More typically, programs translate the arguments list on startup into structures more conveniently processed. Here's one way to do it: the script in Example 2-3 scans the argv list looking for -optionname optionvalue word pairs, and stuffs them into a dictionary by option name for easy retrieval.

Example 2-3. PP2E\System\testargv2.py
# collect command-line options in a dictionary

def getopts(argv):
    opts = {}
    while argv:
        if argv[0][0] == '-':               # find "-name value" pairs
            opts[argv[0]] = argv[1]         # dict key is "-name" arg
            argv = argv[2:]
        else:
            argv = argv[1:]
    return opts

if __name__ == '__main__':
    from sys import argv                    # example client code
    myargs = getopts(argv)
    if myargs.has_key('-i'):
        print myargs['-i']
    print myargs

You might import and use such a function in all your command-line tools. When run by itself, this file just prints the formatted argument dictionary:

C:\...\PP2E\System>python testargv2.py
{}
 
C:\...\PP2E\System>python testargv2.py -i data.txt -o results.txt
data.txt
{'-o': 'results.txt', '-i': 'data.txt'}

Naturally, we could get much more sophisticated here in terms of argument patterns, error checking, and the like. We could also use standard and more advanced command-line processing tools in the Python library to parse arguments; see module getopt in the library manual for another option. In general, the more configurable your scripts, the more you must invest in command-line processing logic.
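For comparison, here is a sketch of how the getopt module might handle the same -i and -o options, written for a current Python 3 interpreter. The parse_options wrapper is a hypothetical name. Note that getopt expects the argument words without the script name, so a real script would pass sys.argv[1:].

```python
import getopt

def parse_options(args):
    """Parse -i and -o options (each taking a value) with the getopt module."""
    try:
        pairs, rest = getopt.getopt(args, 'i:o:')   # ':' marks options that need a value
    except getopt.GetoptError as err:
        return None, str(err)                       # unknown option or missing value
    return dict(pairs), rest

opts, rest = parse_options(['-i', 'data.txt', '-o', 'results.txt', 'extra'])
print(opts, rest)

bad, message = parse_options(['-i'])                # option given without its value
print(bad, message)
```

Unlike the hand-coded scanner, getopt reports an option that is missing its value as an error instead of failing with an index error, and it hands back any leftover words separately.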

Executable Scripts on Unix

Unix and Linux users: you can also make text files of Python source code directly executable by adding a special line at the top with the path to the Python interpreter and giving the file executable permission. For instance, type this code into a text file called "myscript":

#!/usr/bin/python
print 'And nice red uniforms'

The first line is normally taken as a comment by Python (it starts with a #); but when this file is run, the operating system sends lines in this file to the interpreter listed after #! on line 1. If this file is made directly executable with a shell command of the form chmod +x myscript, it can be run directly, without typing python in the command, as though it were a binary executable program:

% myscript a b c
And nice red uniforms

When run this way, sys.argv will still have the script's name as the first word in the list: ["myscript", "a", "b", "c"], exactly as if the script had been run with the more explicit and portable command form python myscript a b c. Making scripts directly executable is really a Unix trick, not a Python feature, but it's worth pointing out that it can be made a bit less machine-dependent by listing the Unix env command at the top instead of a hardcoded path to the Python executable:

#!/usr/bin/env python
print 'Wait for it...'

When coded this way, the operating system will employ your environment variable settings to locate your Python interpreter (your PATH variable, on most platforms). If you run the same script on many machines, you need only change your environment settings on each machine, not edit Python script code. Of course, you can always run Python files with a more explicit command line:

% python myscript a b c

This assumes that the python interpreter program is on your system's search path setting (else you need to type its full path), but works on any Python platform with a command line. Since this is more portable, I generally use this convention in the book's examples, but consult your Unix man pages for more details on any of the topics mentioned here. Even so, these special #! lines will show up in many examples in this book just in case readers want to run them as executables on Unix or Linux; on other platforms, they are simply ignored as Python comments. Note that on Windows NT/2000, you can usually type a script's filename directly (without the "python" word) to make it go too, and you don't have to add a #! line at the top.

 

2.9 Shell Environment Variables

Shell variables, sometimes known as environment variables, are made available to Python scripts as os.environ, a Python dictionary-like object with one entry per variable setting in the shell. Shell variables live outside the Python system; they are often set at your system prompt or within startup files, and typically serve as systemwide configuration inputs to programs.

In fact, by now you should be familiar with a prime example: the PYTHONPATH module search path setting is a shell variable used by Python to import modules. By setting it once in your system startup files, its value is available every time a Python program is run. Shell variables can also be set by programs to serve as inputs to other programs in an application; because their values are normally inherited by spawned programs, they can be used as a simple form of interprocess communication.

2.9.1 Fetching Shell Variables

In Python, the surrounding shell environment becomes a simple preset object, not special syntax. Indexing os.environ by the desired shell variable's name string (e.g., os.environ['USER']) is the moral equivalent of adding a dollar sign before a variable name in most Unix shells (e.g., $USER), using surrounding percent signs on DOS (%USER%), and calling getenv("USER") in a C program. Let's start up an interactive session to experiment:

>>> import os
>>> os.environ.keys( )
['WINBOOTDIR', 'PATH', 'USER', 'PP2HOME', 'CMDLINE', 'PYTHONPATH', 'BLASTER', 
'X', 'TEMP', 'COMSPEC', 'PROMPT', 'WINDIR', 'TMP']
>>> os.environ['TEMP']
'C:\\windows\\TEMP'

Here, the keys method returns a list of variables set, and indexing fetches the value of shell variable TEMP on Windows. This works the same on Linux, but other variables are generally preset when Python starts up. Since we know about PYTHONPATH, let's peek at its setting within Python to verify its content:[6]

>>> os.environ['PYTHONPATH']
'C:\\PP2ndEd\\examples\\Part3;C:\\PP2ndEd\\examples\\Part2;C:\\PP2ndEd\\
examples\\Part2\\Gui;C:\\PP2ndEd\\examples'
>>>
>>> import string
>>> for dir in string.split(os.environ['PYTHONPATH'], os.pathsep):
...     print dir
...
C:\PP2ndEd\examples\Part3
C:\PP2ndEd\examples\Part2
C:\PP2ndEd\examples\Part2\Gui
C:\PP2ndEd\examples

PYTHONPATH is a string of directory paths separated by whatever character is used to separate items in such paths on your platform (e.g., ";" on DOS/Windows, ":" on Unix and Linux). To split it into its components, we pass string.split a delimiter os.pathsep, a portable setting that gives the proper separator for the underlying machine.
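A variable such as PYTHONPATH may not be set at all, in which case indexing os.environ raises a KeyError. The sketch below, for a current Python 3 interpreter, uses the dictionary-style get method to supply a default instead; the search_path_dirs function and the NO_SUCH_VARIABLE_HERE name are invented for illustration.

```python
import os

def search_path_dirs(varname='PATH'):
    """Return the directories listed in a path-style shell variable."""
    value = os.environ.get(varname, '')      # '' if the variable is not set
    return [d for d in value.split(os.pathsep) if d]

for directory in search_path_dirs():
    print(directory)

print(search_path_dirs('NO_SUCH_VARIABLE_HERE'))   # unset: empty list
```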

2.9.2 Changing Shell Variables

Like normal dictionaries, the os.environ object supports both key indexing and assignment. As usual, assignments change the value of the key:

>>> os.environ['TEMP'] = r'c:\temp'
>>> os.environ['TEMP']
'c:\\temp'

But something extra happens here. In recent Python releases, values assigned to os.environ keys in this fashion are automatically exported to other parts of the application. That is, key assignments change both the os.environ object in the Python program and the associated variable in the enclosing shell environment of the running program's process. Its new value becomes visible to the Python program, all linked-in C modules, and any programs spawned by the Python process. Internally, key assignments to os.environ call os.putenv -- a function that changes the shell variable outside the boundaries of the Python interpreter. To demonstrate how this works, we need a couple of scripts that set and fetch shell variables; the first is shown in Example 2-4.

Example 2-4. PP2E\System\Environment\setenv.py
import os
print 'setenv...',
print os.environ['USER'] # show current shell variable value
 
os.environ['USER'] = 'Brian'  # runs os.putenv behind the scenes
os.system('python echoenv.py')
 
os.environ['USER'] = 'Arthur' # changes passed to spawned programs
os.system('python echoenv.py') # and linked-in C library modules
 
os.environ['USER'] = raw_input('?') 
print os.popen('python echoenv.py').read( ) 

This setenv.py script simply changes a shell variable, USER, and spawns another script that echoes this variable's value, shown in Example 2-5.

Example 2-5. PP2E\System\Environment\echoenv.py
import os
print 'echoenv...', 
print 'Hello,', os.environ['USER']

No matter how we run echoenv.py, it displays the value of USER in the enclosing shell; when run from the command line, this value comes from whatever we've set the variable to in the shell itself:

C:\...\PP2E\System\Environment>set USER=Bob
 
C:\...\PP2E\System\Environment>python echoenv.py
echoenv... Hello, Bob

When spawned by another script like setenv.py, though, echoenv.py gets whatever USER settings its parent program has made:

C:\...\PP2E\System\Environment>python setenv.py
setenv... Bob
echoenv... Hello, Brian
echoenv... Hello, Arthur
?Gumby
echoenv... Hello, Gumby
 
C:\...\PP2E\System\Environment>echo %USER%
Bob

This works the same way on Linux. In general terms, a spawned program always inherits environment settings from its parents. "Spawned" programs are programs started with Python tools such as os.spawnv on Windows, the os.fork/exec combination on Unix and Linux, and os.popen and os.system on a variety of platforms -- all programs thus launched get the environment variable settings that exist in the parent at launch time.[7]

Setting shell variables like this before starting a new program is one way to pass information into the new program. For instance, a Python configuration script might tailor the PYTHONPATH variable to include custom directories, just before launching another Python script; the launched script will have the custom search path because shell variables are passed down to children (in fact, watch for such a launcher script to appear at the end of Chapter 4).

Notice the last command in the preceding example, though -- the USER variable is back to its original value after the top-level Python program exits. Assignments to os.environ keys are passed outside the interpreter and down the spawned programs chain, but never back up to parent program processes (including the system shell). This is also true in C programs that use the putenv library call, and isn't a Python limitation per se. It's also likely to be a nonissue if a Python script is at the top of your application. But keep in mind that shell settings made within a program only endure for that program's run, and that of its spawned children.
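The downward-only flow of settings can also be checked from a single script. The following is a minimal Python 3 sketch, not the book's code: it assumes the subprocess module as the spawning tool, and the variable name DEMO_USER and the helper echo_in_child are hypothetical names chosen so as not to disturb a real USER setting.

```python
import os
import subprocess
import sys

# Child program: report the variable it inherited (DEMO_USER is a made-up name)
ECHO = "import os; print('Hello,', os.environ['DEMO_USER'])"

def echo_in_child():
    """Spawn a child Python process and return what it printed."""
    result = subprocess.run([sys.executable, '-c', ECHO],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

os.environ['DEMO_USER'] = 'Brian'     # exported to the process environment
first = echo_in_child()

os.environ['DEMO_USER'] = 'Arthur'    # later changes reach later children
second = echo_in_child()

print(first)
print(second)
```

Each child sees the value in effect at the moment it is launched; the shell that started this script is unaffected when the script exits.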

 

2.10 Standard Streams

Module sys is also the place where the standard input, output, and error streams of your Python programs live:

>>> for f in (sys.stdin, sys.stdout, sys.stderr): print f
...
<open file '<stdin>', mode 'r' at 762210>
<open file '<stdout>', mode 'w' at 762270>
<open file '<stderr>', mode 'w' at 7622d0>

The standard streams are simply pre-opened Python file objects that are automatically connected to your program's standard streams when Python starts up. By default, they are all tied to the console window where Python (or a Python program) was started. Because the print statement and the raw_input function are really nothing more than user-friendly interfaces to the standard output and input streams, using them is similar to using stdout and stdin in sys directly:

>>> print 'hello stdout world'
hello stdout world
 
>>> sys.stdout.write('hello stdout world' + '\n')
hello stdout world
 
>>> raw_input('hello stdin world>')
hello stdin world>spam
'spam'
 
>>> print 'hello stdin world>',; sys.stdin.readline( )[:-1]
hello stdin world>eggs
 
'eggs'
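To make the equivalence concrete, here is a rough Python 3 sketch of what a prompt-and-read call does in terms of stream methods. The function name read_reply is hypothetical, and the two StringIO objects are stand-ins for the keyboard and screen so that the example runs unattended.

```python
import io
import sys

def read_reply(prompt, stdin=None, stdout=None):
    """Roughly what raw_input does: write a prompt, read a line, drop the eoln."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    stdout.write(prompt)
    stdout.flush()
    return stdin.readline().rstrip('\n')

# Stand-ins for the keyboard and screen
fake_keyboard = io.StringIO('spam\n')
fake_screen = io.StringIO()
reply = read_reply('hello stdin world>', fake_keyboard, fake_screen)
```

One difference worth noting: at end-of-file this sketch returns an empty string, whereas raw_input raises EOFError.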

Standard Streams on Windows

Windows users: if you click a .py Python program's filename in a Windows file explorer to start it (or launch it with os.system), a DOS console box automatically pops up to serve as the program's standard stream. If your program makes windows of its own, you can avoid this console pop-up window by naming your program's source-code file with a .pyw extension, not .py. The .pyw extension simply means a .py source file without a DOS pop-up on Windows.

One caveat: in the Python 1.5.2 release, .pyw files can only be run, not imported -- the .pyw is not recognized as a module name. If you want a program to both be run without a DOS console pop-up and be importable elsewhere, you need both .py and .pyw files; the .pyw may simply serve as top-level script logic that imports and calls the core logic in the .py. See Section 9.4 in Chapter 9, for an example.

Also note that because printed output goes to this DOS pop-up when a program is clicked, scripts that simply print text and exit will generate an odd "flash" -- the DOS console box pops up, output is printed into it, and the pop-up goes immediately away (not the most user-friendly of features!). To keep the DOS pop-up box around so you can read printed output, simply add a raw_input( ) call at the bottom of your script to pause for an Enter key press before exiting.

2.10.1 Redirecting Streams to Files and Programs

Technically, standard output (and print) text appears in the console window where a program was started, standard input (and raw_input) text comes from the keyboard, and standard error is used to print Python error messages to the console window. At least that's the default. It's also possible to redirect these streams both to files and other programs at the system shell, and to arbitrary objects within a Python script. On most systems, such redirections make it easy to reuse and combine general-purpose command-line utilities.

2.10.1.1 Redirecting streams to files

Redirection is useful for things like canned (precoded) test inputs: we can apply a single test script to any set of inputs by simply redirecting the standard input stream to a different file each time the script is run. Similarly, redirecting the standard output stream lets us save and later analyze a program's output; for example, testing systems might compare the saved standard output of a script with a file of expected output, to detect failures.

Although it's a powerful paradigm, redirection turns out to be straightforward to use. For instance, consider the simple read-evaluate-print loop program in Example 2-6.

Example 2-6. PP2E\System\Streams\teststreams.py
# read numbers till eof and show squares

def interact( ):
    print 'Hello stream world'                      # print sends to sys.stdout
    while 1:
        try:
            reply = raw_input('Enter a number>')    # raw_input reads sys.stdin
        except EOFError:
            break                                   # raises an except on eof
        else:                                       # input given as a string
            num = int(reply)
            print "%d squared is %d" % (num, num ** 2)
    print 'Bye'

if __name__ == '__main__':
    interact( )                                     # when run, not imported

As usual, the interact function here is automatically executed when this file is run, not when it is imported. By default, running this file from a system command line connects its standard streams to the console where you typed the Python command. The script simply reads numbers until it reaches end-of-file in the standard input stream (on Windows, end-of-file is usually the two-key combination Ctrl+Z; on Unix, type Ctrl+D instead[8] ):

C:\...\PP2E\System\Streams>python teststreams.py
Hello stream world
Enter a number>12
12 squared is 144
Enter a number>10
10 squared is 100
Enter a number>

But on both Windows and Unix-like platforms, we can redirect the standard input stream to come from a file with the < filename shell syntax. Here is a command session in a DOS console box on Windows that forces the script to read its input from a text file, input.txt. It's the same on Linux, but replace the DOS type command with a Unix cat command:

C:\...\PP2E\System\Streams>type input.txt
8
6
 
C:\...\PP2E\System\Streams>python teststreams.py < input.txt
Hello stream world
Enter a number>8 squared is 64
Enter a number>6 squared is 36
Enter a number>Bye

Here, the input.txt file automates the input we would normally type interactively -- the script reads from this file instead of the keyboard. Standard output can be similarly redirected to go to a file, with the > filename shell syntax. In fact, we can combine input and output redirection in a single command:

C:\...\PP2E\System\Streams>python teststreams.py < input.txt > output.txt
 
C:\...\PP2E\System\Streams>type output.txt
Hello stream world
Enter a number>8 squared is 64
Enter a number>6 squared is 36
Enter a number>Bye

This time, the Python script's input and output are both mapped to text files, not the interactive console session.
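The same file-to-file mapping can be arranged from inside a program, which is how a simple test driver might do it. The sketch below is Python 3 and assumes the subprocess module; run_with_files and SQUARES are hypothetical names, and SQUARES is a cut-down stand-in for teststreams.py rather than the script itself.

```python
import os
import subprocess
import sys
import tempfile

# A small stand-in for teststreams.py, in Python 3 syntax
SQUARES = """
while True:
    try:
        reply = input('Enter a number>')
    except EOFError:
        break
    num = int(reply)
    print('%d squared is %d' % (num, num ** 2))
print('Bye')
"""

def run_with_files(source, input_text):
    """Run source with stdin and stdout both mapped to files; return the output."""
    with tempfile.TemporaryDirectory() as workdir:
        in_path = os.path.join(workdir, 'input.txt')
        out_path = os.path.join(workdir, 'output.txt')
        with open(in_path, 'w') as f:
            f.write(input_text)
        # Equivalent in spirit to: python script.py < input.txt > output.txt
        with open(in_path) as stdin, open(out_path, 'w') as stdout:
            subprocess.run([sys.executable, '-c', source],
                           stdin=stdin, stdout=stdout, check=True)
        with open(out_path) as f:
            return f.read()

saved = run_with_files(SQUARES, '8\n6\n')
print(saved, end='')
```

A testing system could compare the returned text with a file of expected output to detect failures.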

2.10.1.2 Chaining programs with pipes

On Windows and Unix-like platforms, it's also possible to send the standard output of one program to the standard input of another, using the | shell character between two commands. This is usually called a "pipe" operation -- the shell creates a pipeline that connects the output and input of two commands. Let's send the output of the Python script to the standard "more" command-line program's input to see how this works:

C:\...\PP2E\System\Streams>python teststreams.py < input.txt | more
 
Hello stream world
Enter a number>8 squared is 64
Enter a number>6 squared is 36
Enter a number>Bye

Here, teststreams's standard input comes from a file again, but its output (written by print statements) is sent to another program, not a file or window. The receiving program is more -- a standard command-line paging program available on Windows and Unix-like platforms. Because Python ties scripts into the standard stream model, though, Python scripts can be used on both ends -- one Python script's output can always be piped into another Python script's input:

C:\...\PP2E\System\Streams>type writer.py
print "Help! Help! I'm being repressed!"
print 42
 
C:\...\PP2E\System\Streams>type reader.py
print 'Got this" "%s"' % raw_input( )
import sys
data = sys.stdin.readline( )[:-1]
print 'The meaning of life is', data, int(data) * 2
 
C:\...\PP2E\System\Streams>python writer.py | python reader.py
Got this" "Help! Help! I'm being repressed!"
The meaning of life is 42 84

This time, two Python programs are connected. Script reader gets input from script writer; both scripts simply read and write, oblivious to stream mechanics. In practice, such chaining of programs is a simple form of cross-program communication. It makes it easy to reuse utilities written to communicate via stdin and stdout in ways we never anticipated. For instance, a Python program that sorts stdin text could be applied to any data source we like, including the output of other scripts. Consider the Python command-line utility scripts in Examples 2-7 and 2-8, which sort and sum lines in the standard input stream.

Example 2-7. PP2E\System\Streams\sorter.py
import sys
lines = sys.stdin.readlines( )              # sort stdin input lines,
lines.sort( )                               # send result to stdout
for line in lines: print line,              # for further processing

Example 2-8. PP2E\System\Streams\adder.py
import sys, string
sum = 0
while 1:
    try:
        line = raw_input( )                 # or call sys.stdin.readlines( ):
    except EOFError:                        # or sys.stdin.readline( ) loop
        break
    else:
        sum = sum + string.atoi(line)       # int(line[:-1]) treats 042 as octal
print sum

We can apply such general-purpose tools in a variety of ways at the shell command line, to sort and sum arbitrary files and program outputs:

C:\...\PP2E\System\Streams>type data.txt 
123
000
999
042
 
C:\...\PP2E\System\Streams>python sorter.py < data.txt    sort a file
000
042
123
999
 
C:\...\PP2E\System\Streams>type data.txt | python adder.py    sum program output
1164
 
C:\...\PP2E\System\Streams>type writer2.py 
for data in (123, 0, 999, 42):
 print '%03d' % data
 
C:\...\PP2E\System\Streams>python writer2.py | python sorter.py   sort py output
000
042
123
999
 
C:\...\PP2E\System\Streams>python writer2.py | python sorter.py | python adder.py 
1164

The last command here connects three Python scripts by standard streams -- the output of each prior script is fed to the input of the next via pipeline shell syntax.
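A pipeline like the last one can also be assembled by a Python program rather than by the shell. The following Python 3 sketch assumes the subprocess module; run_pipeline is a hypothetical helper, and WRITER, SORTER, and ADDER are one-line stand-ins for writer2.py, sorter.py, and adder.py, not the scripts themselves.

```python
import subprocess
import sys

# One-line stand-ins for writer2.py, sorter.py, and adder.py
WRITER = "for n in (123, 0, 999, 42): print('%03d' % n)"
SORTER = "import sys; sys.stdout.writelines(sorted(sys.stdin))"
ADDER = "import sys; print(sum(int(line) for line in sys.stdin))"

def run_pipeline(*sources):
    """Run each source with `python -c`, piping each stdout into the next stdin."""
    procs = []
    prev_out = None
    for src in sources:
        proc = subprocess.Popen(
            [sys.executable, '-c', src],
            stdin=prev_out if prev_out is not None else subprocess.DEVNULL,
            stdout=subprocess.PIPE, text=True)
        if prev_out is not None:
            prev_out.close()          # the child now owns that end of the pipe
        prev_out = proc.stdout
        procs.append(proc)
    output = prev_out.read()          # text produced by the last stage
    prev_out.close()
    for proc in procs:
        proc.wait()
    return output

total = run_pipeline(WRITER, SORTER, ADDER)
print(total, end='')
```

Here the parent program plays the role of the shell: it creates the pipes and hands each stage the output end of the stage before it.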

If you look closely, you'll notice that sorter reads all of stdin at once with the readlines method, but adder reads one line at a time. If the input source is another program, some platforms run programs connected by pipes in parallel. On such systems, reading line-by-line works better if the data streams being passed about are large -- readers need not wait until writers are completely finished to get busy processing data. Because raw_input just reads stdin, the line-by-line scheme used by adder can always be coded with sys.stdin too:

C:\...\PP2E\System\Streams>type adder2.py
import sys, string
sum = 0
while 1:
    line = sys.stdin.readline( )
    if not line: break
    sum = sum + string.atoi(line[:-1])
print sum

Changing sorter to read line-by-line may not be a big performance boost, though, because the list sort method requires the list to already be complete. As we'll see in Chapter 17, manually coded sort algorithms are likely to be much slower than the Python list sorting method.
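In Python 3, the line-at-a-time scheme is usually written by iterating over the stream itself, which fetches lines as they arrive instead of loading the whole input first. A minimal sketch follows; add_lines is a hypothetical name, and the StringIO object simply stands in for sys.stdin so the example can run without a pipe.

```python
import io

def add_lines(stream):
    """Sum the integers found one per line in any file-like object."""
    total = 0
    for line in stream:               # file objects yield one line at a time
        if line.strip():              # skip blank lines
            total += int(line)        # int ignores the trailing eoln
    return total

data = io.StringIO('123\n000\n999\n042\n')   # stand-in for sys.stdin
print(add_lines(data))
```

Because the function takes the stream as an argument, the same code can be handed sys.stdin, an open file, or any other object that yields lines.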

2.10.1.3 Redirected streams and user interaction

At the start of the last section, we piped teststreams.py output into the standard more command-line program with a command like this:

C:\...\PP2E\System\Streams>python teststreams.py < input.txt | more

But since we already wrote our own "more" paging utility in Python near the start of this chapter, why not set it up to accept input from stdin too? For example, if we change the last three lines of file more.py listed earlier in this chapter to this:

if __name__ == '__main__':                 # when run, not when imported
    if len(sys.argv) == 1:                 # page stdin if no cmd args
        more(sys.stdin.read( ))
    else:
        more(open(sys.argv[1]).read( ))

Then it almost seems as if we should be able to redirect the standard output of teststreams.py into the standard input of more.py:

C:\...\PP2E\System\Streams>python teststreams.py < input.txt | python ..\more.py
Hello stream world
Enter a number>8 squared is 64
Enter a number>6 squared is 36
Enter a number>Bye

This technique works in general for Python scripts. Here, teststreams.py takes input from a file again. And, as in the last section, one Python program's output is piped to another's input -- the more.py script in the parent ("..") directory.

2.10.1.3.1 Reading keyboard input

But there's a subtle problem lurking in the preceding more.py command. Really, chaining only worked there by sheer luck: if the first script's output is long enough for more to have to ask the user if it should continue, the script will utterly fail. The problem is that the augmented more.py uses stdin for two disjoint purposes. It reads a reply from an interactive user on stdin by calling raw_input, but now also accepts the main input text on stdin. When the stdin stream is really redirected to an input file or pipe, we can't use it to input a reply from an interactive user; it contains only the text of the input source. Moreover, because stdin is redirected before the program even starts up, there is no way to know what it meant prior to being redirected in the command line.

If we intend to accept input on stdin and use the console for user interaction, we have to do a bit more. Example 2-9 shows a modified version of the more script that pages the standard input stream if called with no arguments, but also makes use of lower-level and platform-specific tools to converse with a user at a keyboard if needed.

Example 2-9. PP2E\System\moreplus.py
#############################################################
# split and interactively page a string, file, or stream of
# text to stdout; when run as a script, page stdin or file 
# whose name is passed on cmdline; if input is stdin, can't
# use it for user reply--use platform-specific tools or gui;
#############################################################
 
import sys, string
 
def getreply( ):
    """
    read a reply key from an interactive user
    even if stdin redirected to a file or pipe
    """
    if sys.stdin.isatty( ):                       # if stdin is console
        return raw_input('?')                     # read reply line from stdin
    else:
        if sys.platform[:3] == 'win':             # if stdin was redirected
            import msvcrt                         # can't use to ask a user
            msvcrt.putch('?')
            key = msvcrt.getche( )                # use windows console tools
            msvcrt.putch('\n')                    # getch( ) does not echo key
            return key
        elif sys.platform[:5] == 'linux':         # use linux console device
            print '?',                            # strip eoln at line end
            console = open('/dev/tty')
            line = console.readline( )[:-1]
            return line
        else:
            print '[pause]'                       # else just pause--improve me
            import time                           # see also modules curses, tty
            time.sleep(5)                         # or copy to temp file, rerun
            return 'y'                            # or gui popup, tk key bind

def more(text, numlines=10):
    """
    split multi-line string to stdout
    """
    lines = string.split(text, '\n')
    while lines:
        chunk = lines[:numlines]
        lines = lines[numlines:]
        for line in chunk: print line
        if lines and getreply( ) not in ['y', 'Y']: break

if __name__ == '__main__':                        # when run, not when imported
    if len(sys.argv) == 1:                        # if no command-line arguments
        more(sys.stdin.read( ))                   # page stdin, no raw_inputs
    else:
        more(open(sys.argv[1]).read( ))           # else page filename argument

Most of the new code in this version shows up in its getreply function. The file isatty method tells us if stdin is connected to the console; if it is, we simply read replies on stdin as before. Unfortunately, there is no portable way to input a string from a console user independent of stdin, so we must wrap the non-stdin input logic of this script in a sys.platform test:

· On Windows, the built-in msvcrt module supplies low-level console input and output calls (e.g., msvcrt.getch( ) reads a single key press).

· On Linux, the system device file named /dev/tty gives access to keyboard input (we can read it as though it were a simple file).

· On other platforms, we simply run a built-in time.sleep call to pause for five seconds between displays (this is not at all ideal, but is better than not stopping at all, and serves until a better nonportable solution can be found).

Of course, we only have to add such extra logic to scripts that intend to interact with console users and take input on stdin. In a GUI application, for example, we could instead pop up dialogs, bind keyboard-press events to run callbacks, and so on (we'll meet GUIs in Chapter 6).
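The decision at the top of getreply hinges entirely on the isatty test, which can be tried in isolation. This is a small Python 3 sketch; stdin_is_console and reply_source are hypothetical helpers, and the getattr guard is an added precaution for file-like objects that lack an isatty method.

```python
import io
import sys

def stdin_is_console(stream=None):
    """True if the stream looks like an interactive console."""
    stream = sys.stdin if stream is None else stream
    isatty = getattr(stream, 'isatty', None)    # not every file-like object has it
    return bool(isatty and isatty())

def reply_source(stream=None):
    """Name the place a pager should read user replies from."""
    if stdin_is_console(stream):
        return 'stdin'
    return 'platform-specific console tools'

# An in-memory stream behaves like redirected input: it is not a console
print(reply_source(io.StringIO('piped text\n')))
```

Run interactively with no redirection, reply_source() would normally report 'stdin'; with input piped or redirected, it reports the fallback.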

Armed with the reusable getreply function, though, we can safely run our moreplus utility in a variety of ways. As before, we can import and call this module's function directly, passing in whatever string we wish to page:

>>> from moreplus import more
>>> more(open('System.txt').read( ))
This directory contains operating system interface examples.
 
Many of the examples in this unit appear elsewhere in the examples
distribution tree, because they are actually used to manage other
programs. See the README.txt files in the subdirectories here
for pointers.

Also as before, when run with a command-line argument, this script interactively pages through the named file's text:

C:\...\PP2E\System>python moreplus.py System.txt
This directory contains operating system interface examples.
 
Many of the examples in this unit appear elsewhere in the examples
distribution tree, because they are actually used to manage other
programs. See the README.txt files in the subdirectories here
for pointers.
 
C:\...\PP2E\System>python moreplus.py moreplus.py
#############################################################
# split and interactively page a string, file, or stream of
# text to stdout; when run as a script, page stdin or file
# whose name is passed on cmdline; if input is stdin, can't
# use it for user reply--use platform-specific tools or gui;
#############################################################
 
import sys, string
 
def getreply( ):
?n

But now the script also correctly pages text redirected into stdin from either a file or command pipe, even if that text is too long to fit in a single display chunk. On most shells, we send such input via redirection or pipe operators like these:

C:\...\PP2E\System>python moreplus.py < moreplus.py
#############################################################
# split and interactively page a string, file, or stream of
# text to stdout; when run as a script, page stdin or file
# whose name is passed on cmdline; if input is stdin, can't
# use it for user reply--use platform-specific tools or gui;
#############################################################
 
import sys, string
 
def getreply( ):
?n
 
C:\...\PP2E\System>type moreplus.py | python moreplus.py
#############################################################
# split and interactively page a string, file, or stream of
# text to stdout; when run as a script, page stdin or file
# whose name is passed on cmdline; if input is stdin, can't
# use it for user reply--use platform-specific tools or gui;
#############################################################
 
import sys, string
 
def getreply( ):
?n

This works the same on Linux, but again use the cat command instead of type. Finally, piping one Python script's output into this script's input now works as expected, without botching user interaction (and not just because we got lucky):

C:\......\System\Streams>python teststreams.py < input.txt | python ..\moreplus.py
Hello stream world
Enter a number>8 squared is 64
Enter a number>6 squared is 36
Enter a number>Bye

Here, the standard output of one Python script is fed to the standard input of another Python script located in the parent directory: moreplus.py reads the output of teststreams.py.

All of the redirections in such command lines work only because scripts don't care what standard input and output really are -- interactive users, files, or pipes between programs. For example, when run as a script, moreplus.py simply reads stream sys.stdin; the command-line shell (e.g., DOS on Windows, csh on Linux) attaches such streams to the source implied by the command line before the script is started. Scripts use the preopened stdin and stdout file objects to access those sources, regardless of their true nature.

And for readers keeping count, we have run this single more pager script in four different ways: by importing and calling its function, by passing a filename command-line argument, by redirecting stdin to a file, and by piping a command's output to stdin. By supporting importable functions, command-line arguments, and standard streams, Python system tools code can be reused in a wide variety of modes.

2.10.2 Redirecting Streams to Python Objects

All of the above standard stream redirections work for programs written in any language that hooks into the standard streams, and rely more on the shell's command-line processor than on Python itself. Command-line redirection syntax like < filename and | program is evaluated by the shell, not Python. A more Pythonesque form of redirection can be done within scripts themselves, by resetting sys.stdin and sys.stdout to file-like objects.

The main trick behind this mode is that anything that looks like a file in terms of methods will work as a standard stream in Python. The object's protocol, not the object's specific datatype, is all that matters. That is:

· Any object that provides file-like read methods can be assigned to sys.stdin to make input come from that object's read methods.

· Any object that defines file-like write methods can be assigned to sys.stdout; all standard output will be sent to that object's methods.

Because print and raw_input simply call the write and readline methods of whatever objects sys.stdout and sys.stdin happen to reference, we can use this trick to both provide and intercept standard stream text with objects implemented as classes. Example 2-10 shows a utility module that demonstrates this concept.

Example 2-10. PP2E\System\Streams\redirect.py
##########################################################
# file-like objects that save all standard output text in 
# a string, and provide standard input text from a string;
# redirect runs a passed-in function with its output and
# input streams reset to these file-like class objects;
##########################################################
 
import sys, string                                # get built-in modules

class Output:                                     # simulated output file
    def __init__(self):
        self.text = ''                            # empty string when created
    def write(self, string):                      # add a string of bytes
        self.text = self.text + string
    def writelines(self, lines):                  # add each line in a list
        for line in lines: self.write(line)

class Input:                                      # simulated input file
    def __init__(self, input=''):                 # default argument
        self.text = input                         # save string when created
    def read(self, *size):                        # optional argument
        if not size:                              # read N bytes, or all
            res, self.text = self.text, ''
        else:
            res, self.text = self.text[:size[0]], self.text[size[0]:]
        return res
    def readline(self):
        eoln = string.find(self.text, '\n')       # find offset of next eoln
        if eoln == -1:                            # slice off through eoln
            res, self.text = self.text, ''
        else:
            res, self.text = self.text[:eoln+1], self.text[eoln+1:]
        return res

def redirect(function, args, input):              # redirect stdin/out
    savestreams = sys.stdin, sys.stdout           # run a function object
    sys.stdin   = Input(input)                    # return stdout text
    sys.stdout  = Output( )
    try:
        apply(function, args)
    except:
        sys.stderr.write('error in function! ')
        sys.stderr.write("%s, %s\n" % (sys.exc_type, sys.exc_value))
    result = sys.stdout.text
    sys.stdin, sys.stdout = savestreams
    return result

This module defines two classes that masquerade as real files:

· Output provides the write method protocol expected of output files, but saves all output as it is written, in an in-memory string.

· Input provides the protocol expected of input files, but provides input on demand from an in-memory string, passed in at object construction time.

The redirect function at the bottom of this file combines these two objects to run a single function with input and output redirected entirely to Python class objects. The passed-in function so run need not know or care that its print statements, raw_input calls, and stdin and stdout method calls are talking to a class instead of a real file, pipe, or user.

To demonstrate, import and run the interact function at the heart of the teststreams script we've been running from the shell (to use the redirection utility function, we need to deal in terms of functions, not files). When run directly, the function reads from the keyboard and writes to the screen, just as if it were run as a program without redirection:

C:\...\PP2E\System\Streams>python
>>> from teststreams import interact
>>> interact( )
Hello stream world
Enter a number>2
2 squared is 4
Enter a number>3
3 squared is 9
Enter a number
>>>

Now, let's run this function under the control of the redirection function in redirect.py, and pass in some canned input text. In this mode, the interact function takes its input from the string we pass in ('4\n5\n6\n' -- three lines with explicit end-of-line characters), and the result of running the function is a string containing all the text written to the standard output stream:

>>> from redirect import redirect
>>> output = redirect(interact, ( ), '4\n5\n6\n')
>>> output
'Hello stream world\012Enter a number>4 squared is 16\012Enter a number>
5 squared is 25\012Enter a number>6 squared is 36\012Enter a number>Bye\012'

The result is a single, long string, containing the concatenation of all text written to standard output. To make this look better, we can split it up with the standard string module:

>>> from string import split
>>> for line in split(output, '\n'): print line
...
Hello stream world
Enter a number>4 squared is 16
Enter a number>5 squared is 25
Enter a number>6 squared is 36
Enter a number>Bye

Better still, we can reuse the more.py module we saw earlier in this chapter; it's less to type and remember, and is already known to work well:

>>> from PP2E.System.more import more
>>> more(output)
Hello stream world
Enter a number>4 squared is 16
Enter a number>5 squared is 25
Enter a number>6 squared is 36
Enter a number>Bye

This is an artificial example, of course, but the techniques illustrated are widely applicable. For example, it's straightforward to add a GUI interface to a program written to interact with a command-line user. Simply intercept standard output with an object like the Output class shown earlier, and throw the text string up in a window. Similarly, standard input can be reset to an object that fetches text from a graphical interface (e.g., a popped-up dialog box). Because classes are plug-and-play compatible with real files, we can use them in any tool that expects a file. Watch for a GUI stream-redirection module named guiStreams in Chapter 9.
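The redirect.py module above targets the Python of its day (apply, sys.exc_type, the print statement). For comparison, here is a condensed sketch of the same idea in Python 3. It is an adaptation, not the book's code: the classes are trimmed to the methods print and input actually need, a no-op flush is added because some callers flush their output stream, and errors in the called function propagate to the caller instead of being reported on stderr.

```python
import sys

class Output:
    """Simulated output file: saves everything written in a string."""
    def __init__(self):
        self.text = ''
    def write(self, text):
        self.text += text
    def flush(self):                  # some callers flush; nothing to do here
        pass

class Input:
    """Simulated input file: serves lines from a string."""
    def __init__(self, text=''):
        self.text = text
    def readline(self):
        eoln = self.text.find('\n')
        if eoln == -1:                # no eoln left: return the rest
            line, self.text = self.text, ''
        else:                         # slice off through the eoln
            line, self.text = self.text[:eoln + 1], self.text[eoln + 1:]
        return line

def redirect(function, args, text):
    """Run function(*args) with stdin/stdout reset; return the output text."""
    saved = sys.stdin, sys.stdout
    sys.stdin, sys.stdout = Input(text), Output()
    try:
        function(*args)
        return sys.stdout.text
    finally:
        sys.stdin, sys.stdout = saved     # restore even if function fails

def interact():
    """Python 3 rendering of the teststreams loop."""
    print('Hello stream world')
    while True:
        try:
            reply = input('Enter a number>')
        except EOFError:
            break
        num = int(reply)
        print('%d squared is %d' % (num, num ** 2))
    print('Bye')

output = redirect(interact, (), '4\n5\n6\n')
print(output, end='')
```

Neither class inherits from a real file type; supplying write and readline is enough for print and input to use them.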

2.10.3 Other Redirection Options

Earlier in this chapter, we also studied the built-in os.popen function, which provides a way to redirect another command's streams from within a Python program. As we saw, this function runs a shell command line (e.g., a string we would normally type at a DOS or csh prompt), but returns a Python file-like object connected to the command's input or output stream. Because of that, the os.popen tool can be considered another way to redirect streams of spawned programs, and a cousin to the techniques we just met: Its effect is much like the shell | command-line pipe syntax for redirecting streams to programs (in fact its name means "pipe open"), but it is run within a script and provides a file-like interface to piped streams. It's similar in spirit to the redirect function, but is based on running programs (not calling functions), and the command's streams are processed in the spawning script as files (not tied to class objects).

By passing in the desired mode flag, we redirect a spawned program's input or output stream to a file object in the calling script:

C:\...\PP2E\System\Streams>type hello-out.py
print 'Hello shell world'
 
C:\...\PP2E\System\Streams>type hello-in.py
input = raw_input( )
open('hello-in.txt', 'w').write('Hello ' + input + '\n')
 
C:\...\PP2E\System\Streams>python
>>> import os
>>> pipe = os.popen('python hello-out.py') # 'r' is default--read stdout
>>> pipe.read( )
'Hello shell world\012'
 
>>> pipe = os.popen('python hello-in.py', 'w')
>>> pipe.write('Gumby\n') # 'w'--write to program stdin
>>> pipe.close( ) # \n at end is optional
>>> open('hello-in.txt').read( )
'Hello Gumby\012'

The popen call is also smart enough to run the command string as an independent process on Unix and Linux. There are additional popen-like tools in the Python library that allow scripts to connect to more than one of the command's streams. For instance, the popen2 module includes a function for hooking into both a command's input and output streams (popen2.popen2), and another for connecting to standard error as well (popen2.popen3):

import popen2
childStdout, childStdin = popen2.popen2('python hello-in-out.py')
childStdin.write(input)
output = childStdout.read( )
 
childStdout, childStdin, childStderr = popen2.popen3('python hello-in-out.py')

These two calls work much like os.popen, but connect additional streams. When I originally wrote this, these calls only worked on Unix-like platforms, not on Windows, because they relied on a fork call in Python 1.5.2. As of the Python 2.0 release, they now work well on Windows too.

Speaking of which: on Unix-like platforms, the combination of the calls os.fork, os.pipe, os.dup, and some os.exec variants can be used to start a new independent program with streams connected to the parent program's streams (that's how popen2 works its magic). As such, it's another way to redirect streams, and a low-level equivalent to tools like os.popen. See Chapter 3 for more on all these calls, especially its section on pipes.

Python 2.0 now also makes the popen2 and popen3 calls available in the os module. (For example, os.popen2 is the same as popen2.popen2, except that the order of stdin and stdout in the call's result tuple is swapped.) In addition, the 2.0 release extends the print statement to include an explicit file to which output is to be sent. A statement of the form print >>file stuff prints stuff to file, instead of stdout. The net effect is similar to simply assigning sys.stdout to an object.

 

Capturing the stderr Stream

We've been focusing on stdin and stdout redirection, but stderr can be similarly reset to files, pipes, and objects. This is straightforward within a Python script. For instance, assigning sys.stderr to another instance of a class like Output in the preceding example allows your script to intercept text written to standard error too. The popen3 call mentioned previously also allows stderr to be intercepted within a script.
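As a rough illustration of the in-script approach, here is a minimal sketch written in Python 3 syntax. The Output class below is a hypothetical, stripped-down stand-in for the class referred to above, and capture_stderr is a helper name invented for this sketch.

```python
import sys

class Output:
    """Minimal file-like object that collects written text
    (hypothetical stand-in for the Output class mentioned above)."""
    def __init__(self):
        self.text = ''
    def write(self, string):
        self.text += string
    def flush(self):
        pass                          # nothing is buffered here

def capture_stderr(function):
    """Run function with sys.stderr temporarily reset; return captured text."""
    saved = sys.stderr
    sys.stderr = buffer = Output()
    try:
        function()
    finally:
        sys.stderr = saved            # always restore the real stream
    return buffer.text

captured = capture_stderr(lambda: sys.stderr.write('spam error\n'))
print(repr(captured))
```

The try/finally matters: without it, an exception in the called function would leave sys.stderr pointing at the collector object.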

Redirecting standard error from a shell command line is a bit more complex, and less portable. On most Unix-like systems, we can usually capture stderr output by using shell-redirection syntax of the form command 2>&1. This won't work on Windows 9x platforms, though, and can even vary per Unix shell; see your shell's manpages for more details.

 

2.11 File Tools

External files are at the heart of much of what we do with shell utilities. For instance, a testing system may read its inputs from one file, store program results in another file, and check expected results by loading yet another file. Even user interface and Internet-oriented programs may load binary images and audio clips from files on the underlying computer. It's a core programming concept.

In Python, the built-in open function is the primary tool scripts use to access the files on the underlying computer system. Since this function is an inherent part of the Python language, you may already be familiar with its basic workings. Technically, open gives direct access to the stdio filesystem calls in the system's C library -- it returns a new file object that is connected to the external file, and has methods that map more or less directly to file calls on your machine. The open function also provides a portable interface to the underlying filesystem -- it works the same on every platform Python runs on.

Other file-related interfaces in Python allow us to do things such as manipulate lower-level descriptor-based files (module os), store objects away in files by key (modules anydbm and shelve), and access SQL databases. Most of these are larger topics addressed in Chapter 16. In this section, we take a brief tutorial look at the built-in file object, and explore a handful of more advanced file-related topics. As usual, you should consult the library manual's file object entry for further details and methods we don't have space to cover here.

2.11.1 Built-in File Objects

For most purposes, the open function is all you need to remember to process files in your scripts. The file object returned by open has methods for reading data (read, readline, readlines), writing data (write, writelines), freeing system resources (close), moving about in the file (seek), forcing data to be transferred out of buffers (flush), fetching the underlying file handle (fileno), and more. Since the built-in file object is so easy to use, though, let's jump right in to a few interactive examples.

2.11.1.1 Output files

To make a new file, call open with two arguments: the external name of the file to be created, and a mode string "w" (short for "write"). To store data on the file, call the file object's write method with a string containing the data to store, and then call the close method to close the file if you wish to open it again within the same program or session:

C:\temp>python
>>> file = open('data.txt', 'w') # open output file object: creates
>>> file.write('Hello file world!\n') # writes strings verbatim
>>> file.write('Bye file world.\n')
>>> file.close( )  # closed on gc and exit too

And that's it -- you've just generated a brand new text file on your computer, no matter which computer you type this code on:

C:\temp>dir data.txt /B
data.txt
 
C:\temp>type data.txt
Hello file world!
Bye file world.

There is nothing unusual about the new file at all; here, I use the DOS dir and type commands to list and display the new file, but it shows up in a file explorer GUI too.

2.11.1.1.1 Opening

In the open function call shown in the preceding example, the first argument can optionally specify a complete directory path as part of the filename string; if we pass just a simple filename without a path, the file will appear in Python's current working directory. That is, it shows up in the place where the code is run -- here, directory C:\temp on my machine is implied by the bare filename data.txt, so this really creates a file at C:\temp\data.txt. See Section 2.7 earlier in this chapter for a refresher on this topic.

Also note that when opening in "w" mode, Python either creates the external file if it does not yet exist, or erases the file's current contents if it is already present on your machine (so be careful out there).

2.11.1.1.2 Writing

Notice that we added an explicit \n end-of-line character to lines written to the file; unlike the print statement, file write methods write exactly what they are passed, without any extra formatting. The string passed to write shows up byte-for-byte on the external file.

Output files also sport a writelines method, which simply writes all the strings in a list one at a time, without any extra formatting added. For example, here is a writelines equivalent to the two write calls shown earlier:

file.writelines(['Hello file world!\n', 'Bye file world.\n'])

This call isn't as commonly used (and can be emulated with a simple for loop), but is convenient in scripts that save output in a list to be written later.
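To make the equivalence concrete, here is a small sketch (Python 3 syntax assumed; the scratch directory and the file names data1.txt and data2.txt are made up for the demonstration) that writes the same list both ways and compares the results.

```python
import os
import tempfile

lines = ['Hello file world!\n', 'Bye file world.\n']
tempdir = tempfile.mkdtemp()                  # scratch directory for the demo
path1 = os.path.join(tempdir, 'data1.txt')    # hypothetical file names
path2 = os.path.join(tempdir, 'data2.txt')

file = open(path1, 'w')
file.writelines(lines)                        # one call writes every string
file.close()

file = open(path2, 'w')
for line in lines:                            # the equivalent manual loop
    file.write(line)
file.close()

def contents(path):
    """Return the full text of a file, closing it afterward."""
    file = open(path)
    try:
        return file.read()
    finally:
        file.close()

same = contents(path1) == contents(path2)
print(same)
```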

2.11.1.1.3 Closing

The file close method used earlier finalizes file contents and frees up system resources. For instance, closing forces buffered output data to be flushed out to disk. Normally, files are automatically closed when the file object is garbage collected by the interpreter (i.e., when it is no longer referenced), and when the Python session or program exits. Because of that, close calls are often optional. In fact, it's common to see file-processing code in Python like this:

open('somefile.txt', 'w').write("G'day Bruce\n")

Since this expression makes a temporary file object, writes to it immediately, and does not save a reference to it, the file object is reclaimed and closed right away without ever having called the close method explicitly.

But note that this auto-close-on-reclaim behavior is not guaranteed to survive in future Python releases. Moreover, the JPython Java-based Python implementation discussed later does not reclaim files as immediately as the standard Python system (it uses Java's garbage collector). If your script makes many files and your platform limits the number of open files per program, explicit close calls are a robust habit to form.
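If you want the close to happen no matter what, one common pattern is to wrap the file work in a try/finally statement. This is just a sketch in Python 3 syntax; write_greeting is a name invented here, and the file lands in a scratch directory.

```python
import os
import tempfile

def write_greeting(path, text):
    """Write text to path, closing the file even if the write fails."""
    file = open(path, 'w')
    try:
        file.write(text)
    finally:
        file.close()                  # runs on success and on exceptions
    return file.closed

path = os.path.join(tempfile.mkdtemp(), 'somefile.txt')
closed = write_greeting(path, "G'day Bruce\n")
print(closed)
```

Because the close no longer depends on when the file object is reclaimed, code like this behaves the same way under any Python implementation.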

2.11.1.2 Input files

Reading data from external files is just as easy as writing, but there are more methods that let us load data in a variety of modes. Input text files are opened with either a mode flag of "r" (for "read") or no mode flag at all (it defaults to "r" if omitted). Once opened, we can read the lines of a text file with the readlines method:

>>> file = open('data.txt', 'r') # open input file object
>>> for line in file.readlines( ): # read into line string list
...     print line,  # lines have '\n' at end
...
Hello file world!
Bye file world.

The readlines method loads the entire contents of the file into memory, and gives it to our scripts as a list of line strings that we can step through in a loop. In fact, there are many ways to read an input file:

· file.read( ) returns a string containing all the bytes stored in the file.

· file.read(N) returns a string containing the next N bytes from the file.

· file.readline( ) reads through the next \n and returns a line string.

· file.readlines( ) reads the entire file and returns a list of line strings.

Let's run these method calls to read files, lines, and bytes:

>>> file.seek(0) # go back to the front of file
>>> file.read( )  # read entire file into string
'Hello file world!\012Bye file world.\012'
 
>>> file.seek(0)
>>> file.readlines( )
['Hello file world!\012', 'Bye file world.\012']
 
>>> file.seek(0)
>>> file.readline( )
'Hello file world!\012'
>>> file.readline( )
'Bye file world.\012'
 
>>> file.seek(0)
>>> file.read(1), file.read(8)
('H', 'ello fil')

All these input methods let us be specific about how much to fetch. Here are a few rules of thumb about which to choose:

· read( ) and readlines( ) load the entire file into memory all at once. That makes them handy for grabbing a file's contents with as little code as possible. It also makes them very fast, but costly for huge files -- loading a multi-gigabyte file into memory is not generally a good thing to do.

· On the other hand, because the readline( ) and read(N) calls fetch just part of the file (the next line, or N-byte block), they are safer for potentially big files, but a bit less convenient, and usually much slower. If speed matters and your files aren't huge, read or readlines may be better choices.
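For potentially large files, the read(N) approach usually takes the form of a loop like the one sketched below. It is written in Python 3 syntax, where binary reads return bytes objects. The read_blocks name, the block size, and the scratch file are all assumptions made for the illustration.

```python
import os
import tempfile

def read_blocks(path, blocksize=1024):
    """Return the file's bytes as a list of blocks of at most blocksize bytes."""
    blocks = []
    file = open(path, 'rb')
    try:
        while True:
            block = file.read(blocksize)      # next block, or empty at end
            if not block:
                break
            blocks.append(block)
    finally:
        file.close()
    return blocks

# Build a small scratch file to read back (34 bytes in all).
data = b'Hello file world!\nBye file world.\n'
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
out = open(path, 'wb')
out.write(data)
out.close()

blocks = read_blocks(path, 10)
print([len(block) for block in blocks])       # [10, 10, 10, 4]
```

A real script would process each block as it arrives rather than collecting them all in a list; the list is used here only so the result can be inspected.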

By the way, the seek(0) call used repeatedly here means "go back to the start of the file." In files, all read and write operations take place at the current position; files normally start at offset 0 when opened and advance as data is transferred. The seek call simply lets us move to a new position for the next transfer operation. Python's seek method also accepts an optional second argument having one of three values -- 0 for absolute file positioning (the default), 1 to seek relative to the current position, and 2 to seek relative to the file's end. When seek is passed only an offset argument of 0 as above, it's roughly a file rewind operation.
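The three positioning modes can be sketched as follows. This assumes Python 3, where nonzero seeks relative to the current position or the end are only allowed on files opened in binary mode, so the file is opened with 'rb' and reads return bytes. The scratch file is made up for the demonstration.

```python
import os
import tempfile

data = b'Hello file world!\nBye file world.\n'    # 34 bytes
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
out = open(path, 'wb')
out.write(data)
out.close()

file = open(path, 'rb')
file.seek(6)              # mode 0 (default): absolute offset 6
first = file.read(4)      # b'file', position is now 10
file.seek(1, 1)           # mode 1: skip 1 byte from the current position
second = file.read(5)     # b'world'
file.seek(-7, 2)          # mode 2: 7 bytes back from the end of the file
third = file.read(6)      # b'world.'
file.close()

print(first, second, third)
```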

2.11.1.3 Other file object modes

Besides "w" and "r", most platforms support an "a" open mode string, meaning "append." In this output mode, write methods add data to the end of the file, and the open call will not erase the current contents of the file:

>>> file = open('data.txt', 'a') # open in append mode: doesn't erase
>>> file.write('The Life of Brian') # added at end of existing data
>>> file.close( )
>>>
>>> open('data.txt').read( ) # open and read entire file
'Hello file world!\012Bye file world.\012The Life of Brian'

Most files are opened using the sorts of calls we just ran, but open actually allows up to three arguments for more specific processing needs -- the filename, the open mode, and a buffer size. All but the first of these are optional: if omitted, the open mode argument defaults to "r" (input), and the buffer size policy is to enable buffering on most platforms. Here are a few things you should know about all three open arguments:

Filename

As mentioned, filenames can include an explicit directory path to refer to files in arbitrary places on your computer; if they do not, they are taken to be names relative to the current working directory (described earlier). In general, any filename form you can type in your system shell will work in an open call. For instance, a filename argument r'..\temp\spam.txt' on Windows means spam.txt in the temp subdirectory of the current working directory's parent -- up one, and down to directory temp.

Open mode

The open function accepts other modes too, some of which are not demonstrated in this book (e.g., r+, w+, and a+ to open for updating, and any mode string with a "b" to designate binary mode). For instance, mode r+ means both reads and writes are allowed on the file, and wb writes data in binary mode (more on this in the next section). Generally, whatever you could use as a mode string in the C language's fopen call on your platform will work in the Python open function, since it really just calls fopen internally. (If you don't know C, don't sweat this point.) Notice that the contents of files are always strings in Python programs regardless of mode: read methods return a string, and we pass a string to write methods.

Buffer size

The open call also takes an optional third buffer size argument, which lets you control stdio buffering for the file -- the way that data is queued up before being transferred to boost performance. If passed, 0 means file operations are unbuffered (data is transferred immediately), 1 means they are line buffered, any other positive value means use a buffer of approximately that size, and a negative value means to use the system default (which you get if no third argument is passed, and generally means buffering is enabled). The buffer size argument works on most platforms, but is currently ignored on platforms that don't provide the setvbuf system call.
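The effect of the third argument is easiest to see by checking a file's size on disk before it is closed. The sketch below assumes Python 3, which accepts a buffer size of 0 only for binary-mode files. The file names are invented, and the exact moment buffered data reaches the file can vary with platform and implementation.

```python
import os
import tempfile

tempdir = tempfile.mkdtemp()
unbuffered_path = os.path.join(tempdir, 'unbuffered.bin')   # hypothetical names
buffered_path = os.path.join(tempdir, 'buffered.bin')

unbuffered = open(unbuffered_path, 'wb', 0)          # 0: no buffering
unbuffered.write(b'spam\n')
size_unbuffered = os.path.getsize(unbuffered_path)   # data is already on file
unbuffered.close()

buffered = open(buffered_path, 'wb', 4096)           # roughly a 4K buffer
buffered.write(b'spam\n')
size_before_flush = os.path.getsize(buffered_path)   # still queued in memory
buffered.flush()                                     # force the transfer
size_after_flush = os.path.getsize(buffered_path)
buffered.close()

print(size_unbuffered, size_before_flush, size_after_flush)
```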

2.11.1.4 Binary data files

The preceding examples all process simple text files. On most platforms, Python scripts can also open and process files containing binary data -- JPEG images, audio clips, and anything else that can be stored in files. The primary difference in terms of code is the mode argument passed to the built-in open function:

>>> file = open('data.txt', 'wb') # open binary output file
>>> file = open('data.txt', 'rb') # open binary input file

Once you've opened binary files in this way, you may read and write their contents using the same methods just illustrated: read, write, and so on. (readline and readlines don't make sense here, though: binary data isn't line-oriented.)

In all cases, data transferred between files and your programs is represented as Python strings within scripts, even if it is binary data. This works because Python string objects can always contain character bytes of any value (though some may look odd if printed). Interestingly, even a byte of value zero can be embedded in a Python string; it's called \0 in escape-code notation, and does not terminate strings in Python as it does in C. For instance:

>>> data = "a\0b\0c"
>>> data
'a\000b\000c'
>>> len(data)
5

Instead of relying on a terminator character, Python keeps track of a string's length explicitly. Here, data references a string of length 5, that happens to contain two zero-value bytes; they print in octal escape form as \000. Because no character codes are reserved, it's okay to read binary data with zero bytes (and other values) into a string in Python.

2.11.1.5 End-of-line translations on Windows

Strictly speaking, on some platforms you may not need the "b" at the end of the open mode argument to process binary files; the "b" is simply ignored, so modes "r" and "w" work just as well. In fact, the "b" in mode flag strings is usually only required for binary files on Windows. To understand why, though, you need to know how lines are terminated in text files.

For historical reasons, the end of a line of text in a file is represented by different characters on different platforms: it's a single \n character on Unix and Linux, but the two-character sequence \r\n on Windows.[9] That's why files moved between Linux and Windows may look odd in your text editor after transfer -- they may still be stored using the original platform's end-of-line convention. For example, most Windows editors handle text in Unix format, but Notepad is a notable exception -- text files copied from Unix or Linux usually look like one long line when viewed in Notepad, with strange characters inside (\n).

Python scripts don't normally need to care, because the Windows port (really, the underlying C compiler on Windows) automatically maps the DOS \r\n sequence to a single \n. It works like this -- when scripts are run on Windows:

· For files opened in text mode, \r\n is translated to \n when input.

· For files opened in text mode, \n is translated to \r\n when output.

· For files opened in binary mode, no translation occurs on input or output.

· On Unix-like platforms, no translations occur, regardless of open modes.

There are two important consequences of all these rules to keep in mind. First, the end of line character is almost always represented as a single \n in all Python scripts, regardless of how it is stored in external files on the underlying platform. By mapping to and from \n on input and output, the Windows port hides the platform-specific difference.

The second consequence of the mapping is more subtle: if you mean to process binary data files on Windows, you generally must be careful to open those files in binary mode ("rb", "wb"), not text mode ("r", "w"). Otherwise, the translations listed previously could very well corrupt data as it is input or output. Binary data may well contain, purely by chance, bytes with the same values as the DOS end-of-line characters, \r and \n. If you process such binary files in text mode on Windows, \r bytes may be incorrectly discarded when read, and \n bytes may be erroneously expanded to \r\n when written. The net effect is that your binary data will be trashed when read and written -- probably not quite what you want! For example, on Windows:

>>> len('a\0b\rc\r\nd') # 4 escape code bytes
8
>>> open('temp.bin', 'wb').write('a\0b\rc\r\nd')  # write binary data to file
 
>>> open('temp.bin', 'rb').read( ) # intact if read as binary
'a\000b\015c\015\012d'
 
>>> open('temp.bin', 'r').read( ) # loses a \r in text mode!
'a\000b\015c\012d'
 
>>> open('temp.bin', 'w').write('a\0b\rc\r\nd')  # adds a \r in text mode!
>>> open('temp.bin', 'rb').read( )
'a\000b\015c\015\015\012d'

This is only an issue when running on Windows, but using binary open modes "rb" and "wb" for binary files everywhere won't hurt on other platforms, and will help make your scripts more portable (you never know when a Unix utility may wind up seeing action on your PC).

There are other times you may want to use binary file open modes too. For instance, in Chapter 5, we'll meet a script called fixeoln_one that translates between DOS and Unix end-of-line character conventions in text files. Such a script also has to open text files in binary mode to see what end-of-line characters are truly present on the file; in text mode, they would already be translated to \n by the time they reached the script.
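To give the idea some shape, here is a small sketch of how a script might inspect a file's true end-of-line characters by reading it in binary mode. This is not the fixeoln_one script itself. The eol_counts name and the sample file are invented for the illustration, and the code uses Python 3 syntax, where binary reads return bytes.

```python
import os
import tempfile

def eol_counts(path):
    """Return (DOS line ends, Unix line ends) found in the named file."""
    file = open(path, 'rb')           # binary mode: no end-of-line translation
    try:
        data = file.read()
    finally:
        file.close()
    dos = data.count(b'\r\n')         # two-character DOS sequences
    unix = data.count(b'\n') - dos    # bare newlines not preceded by \r
    return dos, unix

# A scratch file with mixed line ends: two DOS-style, one Unix-style.
path = os.path.join(tempfile.mkdtemp(), 'mixed.txt')
out = open(path, 'wb')
out.write(b'spam\r\neggs\nham\r\n')
out.close()

counts = eol_counts(path)
print(counts)                         # (2, 1)
```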

2.11.2 File Tools in the os Module

The os module contains an additional set of file-processing functions that are distinct from the built-in file object tools demonstrated in previous examples. For instance, here is a very partial list of os file-related calls:

os.open(path, flags, mode)

Opens a file, returns its descriptor

os.read(descriptor, N)

Reads at most N bytes, returns a string

os.write(descriptor, string)

Writes bytes in string to the file

os.lseek(descriptor, position)

Moves to position in the file

Technically, os calls process files by their descriptors -- integer codes or "handles" that identify files in the operating system. Because the descriptor-based file tools in os are lower-level and more complex than the built-in file objects created with the built-in open function, you should generally use the latter for all but very special file-processing needs.[10]

To give you the general flavor of this tool-set, though, let's run a few interactive experiments. Although built-in file objects and os module descriptor files are processed with distinct toolsets, they are in fact related -- the stdio filesystem used by file objects simply adds a layer of logic on top of descriptor-based files.

In fact, the fileno file object method returns the integer descriptor associated with a built-in file object. For instance, the standard stream file objects have descriptors 0, 1, and 2; calling the os.write function to send data to stdout by descriptor has the same effect as calling the sys.stdout.write method:

>>> import sys
>>> for stream in (sys.stdin, sys.stdout, sys.stderr):
...     print stream.fileno( ),
...
0 1 2
 
>>> sys.stdout.write('Hello stdio world\n') # write via file method
Hello stdio world
 
>>> import os
>>> os.write(1, 'Hello descriptor world\n')  # write via os module
Hello descriptor world
23

Because file objects we open explicitly behave the same way, it's also possible to process a given real external file on the underlying computer, through the built-in open function, tools in module os, or both:

>>> file = open(r'C:\temp\spam.txt', 'w')  # create external file
>>> file.write('Hello stdio file\n')  # write via file method
>>>
>>> fd = file.fileno( )
>>> print fd
3
>>> os.write(fd, 'Hello descriptor file\n')  # write via os module
22
>>> file.close( )
>>>
C:\WINDOWS>type c:\temp\spam.txt  # both writes show up
Hello descriptor file
Hello stdio file

2.11.2.1 Open mode flags

So why the extra file tools in os? In short, they give more low-level control over file processing. The built-in open function is easy to use, but is limited by the underlying stdio filesystem that it wraps -- buffering, open modes, and so on, are all per stdio defaults.[11] Module os lets scripts be more specific; for example, the following opens a descriptor-based file in read-write and binary modes, by performing a binary "or" on two mode flags exported by os:

>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> os.read(fdfile, 20)
'Hello descriptor fil'
>>> os.lseek(fdfile, 0, 0)  # go back to start of file
0
>>> os.read(fdfile, 100)  # binary mode retains "\r\n"
'Hello descriptor file\015\012Hello stdio file\015\012'
 
>>> os.lseek(fdfile, 0, 0)
0
>>> os.write(fdfile, 'HELLO')  # overwrite first 5 bytes
5

On some systems, such open flags let us specify more advanced things like exclusive access (O_EXCL) and nonblocking modes (O_NONBLOCK) when a file is opened. Some of these flags are not portable across platforms (another reason to use built-in file objects most of the time); see the library manual or run a dir(os) call on your machine for an exhaustive list of other open flags available.
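Because flags like O_BINARY exist only on some platforms, a script that wants to run everywhere has to allow for their absence. The following sketch (Python 3 syntax, where descriptor reads and writes deal in bytes) shows one way to do that; the scratch file name and the 0o600 permission value are arbitrary choices for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'spam.txt')

binary = getattr(os, 'O_BINARY', 0)          # defined on Windows only
flags = os.O_RDWR | os.O_CREAT | binary      # read-write, create if missing

fd = os.open(path, flags, 0o600)             # returns an integer descriptor
written = os.write(fd, b'Hello descriptor file\n')   # number of bytes written
os.lseek(fd, 0, 0)                           # go back to start of file
start = os.read(fd, 5)                       # read at most 5 bytes
os.lseek(fd, 0, 0)
os.write(fd, b'HELLO')                       # overwrite first 5 bytes
os.lseek(fd, 0, 0)
data = os.read(fd, 100)
os.close(fd)                                 # descriptors are closed via os too

print(written, start, data)
```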

We saw earlier how to go from file object to file descriptor with the fileno file method; we can also go the other way -- the os.fdopen call wraps a file descriptor in a file object. Because conversions work both ways, we can generally use either tool set -- file object, or os module:

>>> objfile = os.fdopen(fdfile)
>>> objfile.seek(0)
>>> objfile.read( )
'HELLO descriptor file\015\012Hello stdio file\015\012'

2.11.2.2 Other os file tools

The os module also includes an assortment of file tools that accept a file pathname string, and accomplish file-related tasks such as renaming (os.rename), deleting (os.remove), and changing the file's owner and permission settings (os.chown, os.chmod). Let's step through a few examples of these tools in action:

>>> os.chmod('spam.txt', 0777)  # enabled all accesses

This os.chmod file permissions call passes a nine-bit bitstring, composed of three sets of three bits each. From left to right, the three sets represent the file's owning user, the file's group, and all others. Within each set, the three bits reflect read, write, and execute access permissions. When a bit is "1" in this string, it means that the corresponding operation is allowed for the accessor. For instance, octal 0777 is a string of nine "1" bits in binary, so it enables all three kinds of accesses, for all three user groups; octal 0600 means that the file can be only read and written by the user that owns it (when written in binary, 0600 octal is really bits 110 000 000).

This scheme stems from Unix file permission settings, but works on Windows as well. If it's puzzling, either check a Unix manpage for chmod, or see the fixreadonly example in Chapter 5, for a practical application (it makes read-only files copied off a CD-ROM writable).
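If the bit arithmetic is easier to follow in code, here is a small sketch that decodes a permission value into the familiar rwx form. It uses Python 3 syntax, in which octal literals are written 0o777 rather than 0777. The describe function is a name invented for this sketch, and the S_IRUSR and S_IWUSR constants come from the standard stat module.

```python
import stat

def describe(mode):
    """Render the low nine permission bits as an ls-style string."""
    flags = 'rwxrwxrwx'                   # user, group, others
    chars = []
    for i in range(9):
        mask = 1 << (8 - i)               # leftmost bit first
        chars.append(flags[i] if mode & mask else '-')
    return ''.join(chars)

owner_rw = stat.S_IRUSR | stat.S_IWUSR    # owner read + owner write

print(describe(0o777))                    # rwxrwxrwx
print(describe(0o600))                    # rw-------
print(owner_rw == 0o600, bin(0o600))      # True 0b110000000
```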

>>> os.rename(r'C:\temp\spam.txt', r'C:\temp\eggs.txt') # (from, to)
>>>
>>> os.remove(r'C:\temp\spam.txt') # delete file
Traceback (innermost last):
 File "<stdin>", line 1, in ?
OSError: [Errno 2] No such file or directory: 'C:\\temp\\spam.txt'
>>>
>>> os.remove(r'C:\temp\eggs.txt')

The os.rename call used here changes a file's name; the os.remove file deletion call deletes a file from your system, and is synonymous with os.unlink; the latter reflects the call's name on Unix, but was obscure to users of other platforms. The os module also exports the stat system call:

>>> import os
>>> info = os.stat(r'C:\temp\spam.txt')
>>> info
(33206, 0, 2, 1, 0, 0, 41, 968133600, 968176258, 968176193)
 
>>> import stat
>>> info[stat.ST_MODE], info[stat.ST_SIZE]
(33206, 41)
 
>>> mode = info[stat.ST_MODE]
>>> stat.S_ISDIR(mode), stat.S_ISREG(mode)
(0, 1)

The os.stat call returns a tuple of values giving low-level information about the named file, and the stat module exports constants and functions for querying this information in a portable way. For instance, indexing an os.stat result on offset stat.ST_SIZE returns the file's size, and calling stat.S_ISDIR with the mode item from an os.stat result checks whether the file is a directory. As shown earlier, though, both of these operations are available in the os.path module too, so it's rarely necessary to use os.stat except for low-level file queries:

>>> path = r'C:\temp\spam.txt'
>>> os.path.isdir(path), os.path.isfile(path), os.path.getsize(path)
(0, 1, 41)
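The correspondence between the two interfaces can be checked directly. The sketch below assumes Python 3, where os.stat returns an object that still supports the tuple-style indexing shown above. It builds its own 41-byte scratch file rather than relying on C:\temp\spam.txt.

```python
import os
import stat
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'spam.txt')
out = open(path, 'wb')
out.write(b'Hello descriptor file\r\nHello stdio file\r\n')   # 41 bytes
out.close()

info = os.stat(path)
mode = info[stat.ST_MODE]                    # low-level query by index
size = info[stat.ST_SIZE]
is_dir = stat.S_ISDIR(mode)
is_file = stat.S_ISREG(mode)

# The os.path tools answer the same questions with less code.
same_size = size == os.path.getsize(path)
same_kind = (is_dir, is_file) == (os.path.isdir(path), os.path.isfile(path))

print(size, is_dir, is_file, same_size, same_kind)
```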

2.11.3 File Scanners

Unlike some shell-tool languages, Python doesn't have an implicit file-scanning loop procedure, but it's simple to write a general one that we can reuse for all time. The module in Example 2-11 defines a general file-scanning routine, which simply applies a passed-in Python function to each line in an external file.

Example 2-11. PP2E\System\Filetools\scanfile.py
def scanner(name, function): 
    file = open(name, 'r') # create a file object
    while 1:
        line = file.readline( ) # call file methods
        if not line: break # until end-of-file
        function(line) # call a function object
    file.close( ) 

The scanner function doesn't care what line-processing function is passed in, and that accounts for most of its generality -- it is happy to apply any single-argument function that exists now or in the future to all the lines in a text file. If we code this module and put it in a directory on PYTHONPATH, we can use it any time we need to step through a file line-by-line. Example 2-12 is a client script that does simple line translations.

Example 2-12. PP2E\System\Filetools\commands.py
#!/usr/local/bin/python
from sys import argv
from scanfile import scanner
 
def processLine(line): # define a function
    if line[0] == '*':  # applied to each line
        print "Ms.", line[1:-1]
    elif line[0] == '+': 
        print "Mr.", line[1:-1] # strip 1st and last char
    else:
        raise 'unknown command', line # raise an exception
 
filename = 'data.txt'
if len(argv) == 2: filename = argv[1] # allow file name cmd arg
scanner(filename, processLine) # start the scanner

If, for no readily obvious reason, the text file hillbillies.txt contains the following lines:

*Granny
+Jethro
*Elly-Mae
+"Uncle Jed"

then our commands script could be run as follows:

C:\...\PP2E\System\Filetools>python commands.py hillbillies.txt
Ms. Granny
Mr. Jethro
Ms. Elly-Mae
Mr. "Uncle Jed"

As a rule of thumb, though, we can usually speed things up by shifting processing from Python code to built-in tools. For instance, if we're concerned with speed (and memory space isn't tight), we can make our file scanner faster by using the readlines method to load the file into a list all at once, instead of the manual readline loop in Example 2-11:

def scanner(name, function): 
    file = open(name, 'r') # create a file object
    for line in file.readlines( ): # get all lines at once 
        function(line) # call a function object
    file.close( ) 

And if we have a list of lines, we can work more magic with the map built-in function. Here's a minimalist's version; the for loop is replaced by map, and we let Python close the file for us when it is garbage-collected (or the script exits):

def scanner(name, function): 
    map(function, open(name, 'r').readlines( ))
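A word of caution for readers on much later Pythons: in Python 3, map returns a lazy iterator, so a map-based scanner would not call the function at all unless its result were consumed. A variant that works there is sketched below. It relies on Python 3 file objects being directly iterable by line, which is an assumption beyond the releases this book covers, and the scratch file simply mimics the hillbillies.txt data.

```python
import os
import tempfile

def scanner(name, function):
    """Apply function to each line of the named text file."""
    file = open(name, 'r')
    try:
        for line in file:             # file objects yield lines on demand
            function(line)
    finally:
        file.close()

# Usage: collect the lines by passing a list's append method.
path = os.path.join(tempfile.mkdtemp(), 'hillbillies.txt')
out = open(path, 'w')
out.write('*Granny\n+Jethro\n*Elly-Mae\n+"Uncle Jed"\n')
out.close()

collected = []
scanner(path, collected.append)
print(collected)
```

Like the readline loop in Example 2-11, this version never holds more than one line in memory at a time.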

But what if we also want to change a file while scanning it? Example 2-13 shows two approaches: one uses explicit files, and the other uses the standard input/output streams to allow for redirection on the command line.

Example 2-13. PP2E\System\Filetools\filters.py
def filter_files(name, function): # filter file through function
    input = open(name, 'r') # create file objects
    output = open(name + '.out', 'w') # explicit output file too
    for line in input.readlines( ):
        output.write(function(line)) # write the modified line
    input.close( ) 
    output.close( ) # output has a '.out' suffix
 
def filter_stream(function):
    import sys # no explicit files
    while 1: # use standard streams
        line = sys.stdin.readline( ) # or: raw_input( )
        if not line: break
        print function(line), # or: sys.stdout.write( )
 
if __name__ == '__main__': 
    filter_stream(lambda line: line) # copy stdin to stdout if run

Since the standard streams are preopened for us, they're often easier to use. This module is more useful when imported as a library (clients provide the line-processing function); when run standalone it simply parrots stdin to stdout:

C:\...\PP2E\System\Filetools>python filters.py < ..\System.txt
This directory contains operating system interface examples.
 
Many of the examples in this unit appear elsewhere in the examples
distribution tree, because they are actually used to manage other
programs. See the README.txt files in the subdirectories here
for pointers.

Brutally observant readers may notice that this last file is named filters.py (with an "s"), not filter.py. I originally named it the latter, but changed its name when I realized that a simple import of the filename (e.g., "import filter") assigns the module to a local name "filter," thereby hiding the built-in filter function. This is a built-in functional programming tool, not used very often in typical scripts; but be careful to avoid picking built-in names for module files. I will if you will.

2.11.4 Making Files Look Like Lists

One last file-related trick has proven popular enough to merit an introduction here. Although file objects only export method calls (e.g., file.read( )), it is easy to use classes to make them look more like data structures, and hide some of the underlying file call details. The module in Example 2-14 defines a FileList object that "wraps" a real file to add sequential indexing support.

Example 2-14. PP2E\System\Filetools\filelist.py
class FileList:
    def __init__(self, filename):
        self.file = open(filename, 'r') # open and save file
    def __getitem__(self, i): # overload indexing
        line = self.file.readline( )
        if line:
            return line # return the next line
        else:
            raise IndexError # end 'for' loops, 'in'
    def __getattr__(self, name):
        return getattr(self.file, name) # other attrs from real file

This class defines three specially named methods:

· The __init__ method is called whenever a new object is created.

· The __getitem__ method intercepts indexing operations.

· The __getattr__ method handles undefined attribute references.

This class mostly just extends the built-in file object to add indexing. Most standard file method calls are simply delegated (passed off) to the wrapped file by __getattr__. Each time a FileList object is indexed, though, its __getitem__ method returns the next line in the actual file. Since for loops work by repeatedly indexing objects, this class lets us iterate over a wrapped file as though it were an in-memory list:

>>> from filelist import FileList
>>> for line in FileList('hillbillies.txt'):
...     print '>', line,
...
> *Granny
> +Jethro
> *Elly-Mae
> +"Uncle Jed"

This class could be made much more sophisticated and list-like too. For instance, we might overload the + operation to concatenate a file onto the end of an output file, allow random indexing operations that seek among the file's lines to resolve the specified offset, and so on. But since coding all such extensions takes more space than I have available here, I'll leave them as suggested exercises.
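As a starting point for the random-indexing exercise, here is one possible sketch, not a definitive solution. CachedFileList is a hypothetical name; it reads forward only as far as it must and remembers the lines it has seen, so that an index really selects a line. It is written in Python 3 syntax and handles non-negative indexes only.

```python
import os
import tempfile

class CachedFileList:
    """Hypothetical FileList variant: index i really means line i."""
    def __init__(self, filename):
        self.file = open(filename, 'r')       # open and save file
        self.lines = []                       # lines read so far
    def __getitem__(self, i):                 # overload indexing
        while len(self.lines) <= i:           # read ahead until line i is cached
            line = self.file.readline()
            if not line:
                raise IndexError(i)           # end 'for' loops, 'in'
            self.lines.append(line)
        return self.lines[i]
    def __getattr__(self, name):
        return getattr(self.file, name)       # other attrs from real file

path = os.path.join(tempfile.mkdtemp(), 'hillbillies.txt')
out = open(path, 'w')
out.write('*Granny\n+Jethro\n*Elly-Mae\n+"Uncle Jed"\n')
out.close()

wrapped = CachedFileList(path)
third = wrapped[2]                            # reads and caches lines 0..2
first = wrapped[0]                            # served from the cache
everything = [line for line in wrapped]       # 'for' still works by indexing
wrapped.close()                               # delegated to the real file

print(repr(third), repr(first), len(everything))
```

The price of the cache is memory: once every line has been visited, the whole file is held in the list, so this trade makes sense only for files of modest size.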

2.12 Directory Tools

One of the more common tasks in the shell utilities domain is applying an operation to a set of files in a directory -- a "folder" in Windows-speak. By running a script on a batch of files, we can automate (that is, script) tasks we might have to otherwise run repeatedly by hand.

For instance, suppose you need to search all of your Python files in a development directory for a global variable name (perhaps you've forgotten where it is used). There are many platform-specific ways to do this (e.g., the grep command in Unix), but Python scripts that accomplish such tasks will work on every platform where Python works -- Windows, Unix, Linux, Macintosh, and just about any other in common use today. Simply copy your script to any machine you wish to use it on, and it will work, regardless of which other tools are available there.

2.12.1 Walking One Directory

The most common way to go about writing such tools is to first grab hold of a list of the names of the files you wish to process, and then step through that list with a Python for loop, processing each file in turn. The trick we need to learn here, then, is how to get such a directory list within our scripts. There are at least three options: running shell listing commands with os.popen, matching filename patterns with glob.glob, and getting directory listings with os.listdir. They vary in interface, result format, and portability.

2.12.1.1 Running shell listing commands with os.popen

Quick: How did you go about getting directory file listings before you heard of Python? If you're new to shell tools programming, the answer may be: "Well, I started a Windows file explorer and clicked on stuff," but I'm thinking in terms of less GUI-oriented command-line mechanisms here (and answers submitted in Perl and Tcl only get partial credit).

On Unix, directory listings are usually obtained by typing ls in a shell; on Windows, they can be generated with a dir command typed in an MS-DOS console box. Because Python scripts may use os.popen to run any command line we can type in a shell, this call is also the most general way to grab a directory listing inside a Python program. We met os.popen earlier in this chapter; it runs a shell command string and gives us a file object from which we can read the command's output. To illustrate, let's first assume the following directory structures (yes, I have both dir and ls commands on my Windows laptop; old habits die hard):

C:\temp>dir /B
about-pp.html
python1.5.tar.gz
about-pp2e.html
about-ppr2e.html
newdir
 
C:\temp>ls
about-pp.html about-ppr2e.html python1.5.tar.gz
about-pp2e.html newdir
 
C:\temp>ls newdir
more temp1 temp2 temp3

The newdir name is a nested subdirectory in C:\temp here. Now, scripts can grab a listing of file and directory names at this level by simply spawning the appropriate platform-specific command line, and reading its output (the text normally thrown up on the console window):

C:\temp>python
>>> import os
>>> os.popen('dir /B').readlines( )
['about-pp.html\012', 'python1.5.tar.gz\012', 'about-pp2e.html\012', 
'about-ppr2e.html\012', 'newdir\012']

Lines read from a shell command come back with a trailing end-line character, but it's easy enough to slice off:

>>> for line in os.popen('dir /B').readlines( ): 
...     print line[:-1]
...
about-pp.html
python1.5.tar.gz
about-pp2e.html
about-ppr2e.html
newdir
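Slicing with [:-1] assumes that every line really does end in a single end-line character. As a small sketch of a more defensive alternative, written for Python 3 and using an echo command only because it exists in both Unix shells and the Windows command prompt, rstrip removes the newline when present and leaves the text alone otherwise:

```python
import os

# 'echo spam' stands in for a listing command such as ls or dir /B
pipe = os.popen('echo spam')
names = [line.rstrip('\n') for line in pipe]   # drop the end-of-line, if there is one
pipe.close()
print(names)
```

The same list comprehension works unchanged for real listing commands; only the command string is platform-specific.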

Both dir and ls commands let us be specific about filename patterns to be matched and directory names to be listed; again, we're just running shell commands here, so anything you can type at a shell prompt goes:

>>> os.popen('dir *.html /B').readlines( )
['about-pp.html\012', 'about-pp2e.html\012', 'about-ppr2e.html\012']
 
>>> os.popen('ls *.html').readlines( )
['about-pp.html\012', 'about-pp2e.html\012', 'about-ppr2e.html\012']
 
>>> os.popen('dir newdir /B').readlines( )
['temp1\012', 'temp2\012', 'temp3\012', 'more\012']
 
>>> os.popen('ls newdir').readlines( )
['more\012', 'temp1\012', 'temp2\012', 'temp3\012']

These calls use general tools and all work as advertised. As we noted earlier, though, the downsides of os.popen are that it is nonportable (it doesn't work well in a Windows GUI application in Python 1.5.2 and earlier, and requires using a platform-specific shell command), and it incurs a performance hit to start up an independent program. The following two alternative techniques do better on both counts.

2.12.1.2 The glob module

The term "globbing" comes from the * wildcard character in filename patterns -- per computing folklore, a * matches a "glob" of characters. In less poetic terms, globbing simply means collecting the names of all entries in a directory -- files and subdirectories -- whose names match a given filename pattern. In Unix shells, globbing expands filename patterns within a command line into all matching filenames before the command is ever run. In Python, we can do something similar by calling the glob.glob built-in with a pattern to expand:

>>> import glob
>>> glob.glob('*')
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html', 'about-ppr2e.html',
'newdir']
 
>>> glob.glob('*.html')
['about-pp.html', 'about-pp2e.html', 'about-ppr2e.html']
 
>>> glob.glob('newdir/*')
['newdir\\temp1', 'newdir\\temp2', 'newdir\\temp3', 'newdir\\more']

The glob call accepts the usual filename pattern syntax used in shells (e.g., ? means any one character, * means any number of characters, and [] is a character selection set).[12] The pattern should include a directory path if you wish to glob in something other than the current working directory, and the module accepts either Unix or DOS-style directory separators (/ or \). This call also is implemented without spawning a shell command, and so is likely to be faster and more portable across all Python platforms than the os.popen schemes shown earlier.
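To see the three pattern operators side by side, here is a minimal sketch written for Python 3. The matches helper and the filenames are made up for illustration, and the files live in a throwaway temporary directory rather than C:\temp:

```python
import glob
import os
import tempfile

def matches(directory, pattern):
    """Return the sorted base names in directory that match a glob pattern."""
    found = glob.glob(os.path.join(directory, pattern))
    return sorted(os.path.basename(path) for path in found)

# demo on a throwaway directory; the filenames are made up
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ('temp1.txt', 'temp2.txt', 'temp10.txt', 'notes.html'):
        open(os.path.join(demo_dir, name), 'w').close()
    print(matches(demo_dir, '*.txt'))          # * matches any run of characters
    print(matches(demo_dir, 'temp?.txt'))      # ? matches exactly one character
    print(matches(demo_dir, 'temp[12].txt'))   # [] matches one character from the set
```

Note that temp10.txt satisfies the * pattern but not the ? one, since ? stands for exactly one character.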

Technically speaking, glob is a bit more powerful than described so far. In fact, using it to list files in one directory is just one use of its pattern-matching skills. For instance, it can also be used to collect matching names across multiple directories, simply because each level in a passed-in directory path can be a pattern too:

C:\temp>python
>>> import glob
>>> for name in glob.glob('*examples/L*.py'): print name
...
cpexamples\Launcher.py
cpexamples\Launch_PyGadgets.py
cpexamples\LaunchBrowser.py
cpexamples\launchmodes.py
examples\Launcher.py
examples\Launch_PyGadgets.py
examples\LaunchBrowser.py
examples\launchmodes.py
 
>>> for name in glob.glob(r'*\*\visitor_find*.py'): print name
...
cpexamples\PyTools\visitor_find.py
cpexamples\PyTools\visitor_find_quiet2.py
cpexamples\PyTools\visitor_find_quiet1.py
examples\PyTools\visitor_find.py
examples\PyTools\visitor_find_quiet2.py
examples\PyTools\visitor_find_quiet1.py

In the first call here, we get back filenames from two different directories that matched the *examples pattern; in the second, both of the first directory levels are wildcards, so Python collects all possible ways to reach the base filenames. Using os.popen to spawn shell commands only achieves the same effect if the underlying shell or listing command does too.
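The directory-level wildcard can be tried without the book's examples tree. The following is only a sketch for Python 3: the find_launchers helper is hypothetical, and the three directories it searches are created on the fly in a temporary location:

```python
import glob
import os
import tempfile

def find_launchers(root):
    """Match L*.py in every directory under root whose name ends in 'examples'."""
    pattern = os.path.join(root, '*examples', 'L*.py')   # wildcard at the directory level too
    return sorted(os.path.relpath(path, root) for path in glob.glob(pattern))

# demo: three sibling directories, only two of which match '*examples'
with tempfile.TemporaryDirectory() as root:
    for folder in ('examples', 'cpexamples', 'docs'):
        os.mkdir(os.path.join(root, folder))
        open(os.path.join(root, folder, 'Launcher.py'), 'w').close()
    for name in find_launchers(root):
        print(name)
```

The file under docs is skipped because its directory name fails the first pattern component, even though its base name matches the second.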

2.12.1.3 The os.listdir call

The os module's listdir call provides yet another way to collect filenames in a Python list. It takes a simple directory name string, not a filename pattern, and returns a list containing the names of all entries in that directory -- both simple files and nested directories -- for use in the calling script:

>>> os.listdir('.')
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html', 'about-ppr2e.html',
'newdir']
 
>>> os.listdir(os.curdir)
['about-pp.html', 'python1.5.tar.gz', 'about-pp2e.html', 'about-ppr2e.html',
'newdir']
 
>>> os.listdir('newdir')
['temp1', 'temp2', 'temp3', 'more']

This too is done without resorting to shell commands, and so is portable to all major Python platforms. The result is not in any particular order (but can be sorted with the list sort method), returns base filenames without their directory path prefixes, and includes names of both files and directories at the listed level.
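Because listdir results arrive unordered, without path prefixes, and with files and directories mixed together, scripts often tidy them up first. One possible way to do so is sketched here for Python 3; the split_listing name is an assumption of this example, not a library call:

```python
import os
import tempfile

def split_listing(directory):
    """Return (files, subdirs) as two sorted lists of base names."""
    files, subdirs = [], []
    for name in sorted(os.listdir(directory)):    # impose an order on the raw result
        path = os.path.join(directory, name)      # listdir names lack the directory prefix
        if os.path.isdir(path):
            subdirs.append(name)
        else:
            files.append(name)
    return files, subdirs

# demo on a throwaway directory
with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, 'newdir'))
    for name in ('about-pp2e.html', 'about-pp.html'):
        open(os.path.join(demo_dir, name), 'w').close()
    print(split_listing(demo_dir))
```

The os.path.join step matters: testing os.path.isdir on the bare name would silently check the current working directory instead of the one being listed.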

To compare all three listing techniques, let's run them side by side on an explicit directory here. They differ in some ways but are mostly just variations on a theme -- os.popen sorts names and returns end-of-lines, glob.glob accepts a pattern and returns filenames with directory prefixes, and os.listdir takes a simple directory name and returns names without directory prefixes:

>>> os.popen('ls C:\PP2ndEd').readlines( )
['README.txt\012', 'cdrom\012', 'chapters\012', 'etc\012', 'examples\012',
'examples.tar.gz\012', 'figures\012', 'shots\012']
 
>>> glob.glob('C:\PP2ndEd\*')
['C:\\PP2ndEd\\examples.tar.gz', 'C:\\PP2ndEd\\README.txt', 
'C:\\PP2ndEd\\shots', 'C:\\PP2ndEd\\figures', 'C:\\PP2ndEd\\examples',
'C:\\PP2ndEd\\etc', 'C:\\PP2ndEd\\chapters', 'C:\\PP2ndEd\\cdrom']
 
>>> os.listdir('C:\PP2ndEd')
['examples.tar.gz', 'README.txt', 'shots', 'figures', 'examples', 'etc',
'chapters', 'cdrom']

Of these three, glob and listdir are generally better options if you care about script portability, and listdir seems fastest in recent Python releases (but gauge its performance yourself -- implementations may change over time).
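If you do want to gauge performance yourself, the standard timeit module is one way to go about it. This sketch for Python 3 times both portable calls on the current directory; the repeat count of 200 is an arbitrary choice, and the numbers will vary by machine, directory size, and Python release:

```python
import glob
import os
import timeit

target = os.curdir     # directory to list; any existing directory works

# time 200 calls of each technique
listdir_time = timeit.timeit(lambda: os.listdir(target), number=200)
glob_time = timeit.timeit(lambda: glob.glob(os.path.join(target, '*')), number=200)

print('os.listdir: %.4f seconds' % listdir_time)
print('glob.glob:  %.4f seconds' % glob_time)
```

The two calls are not strictly equivalent: a '*' pattern skips names that begin with a dot, while os.listdir returns them, so treat the comparison as a rough guide.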

2.12.1.4 Splitting and joining listing results

In the last example, I pointed out that glob returns names with directory paths, but listdir gives raw base filenames. For convenient processing, scripts often need to split glob results into base files, or expand listdir results into full paths. Such translations are easy if we let the os.path module do all the work for us. For example, a script that intends to copy all files elsewhere will typically need to first split off the base filenames from glob results so it can add different directory names on the front:

>>> dirname = r'C:\PP2ndEd'
>>> for file in glob.glob(dirname + '/*'):
...     head, tail = os.path.split(file)
...     print head, tail, '=>', ('C:\\Other\\' + tail)
...
C:\PP2ndEd examples.tar.gz => C:\Other\examples.tar.gz
C:\PP2ndEd README.txt => C:\Other\README.txt
C:\PP2ndEd shots => C:\Other\shots
C:\PP2ndEd figures => C:\Other\figures
C:\PP2ndEd examples => C:\Other\examples
C:\PP2ndEd etc => C:\Other\etc
C:\PP2ndEd chapters => C:\Other\chapters
C:\PP2ndEd cdrom => C:\Other\cdrom

Here, the names after the => represent names that files might be moved to. Conversely, a script that means to process all files in a different directory than the one it runs in will probably need to prepend listdir results with the target directory name, before passing filenames on to other tools:

>>> for file in os.listdir(dirname):
...     print os.path.join(dirname, file)
...
C:\PP2ndEd\examples.tar.gz
C:\PP2ndEd\README.txt
C:\PP2ndEd\shots
C:\PP2ndEd\figures
C:\PP2ndEd\examples
C:\PP2ndEd\etc
C:\PP2ndEd\chapters
C:\PP2ndEd\cdrom
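The split and join steps combine naturally into a single helper. The retarget function below is a hypothetical name used only for this sketch (written for Python 3); it maps a path onto a different directory without touching any files:

```python
import os

def retarget(path, newdir):
    """Return path's base name joined onto a different directory."""
    head, tail = os.path.split(path)      # head is the directory part, tail the base name
    return os.path.join(newdir, tail)     # join inserts the platform's separator

source = os.path.join('PP2ndEd', 'README.txt')
print(source, '=>', retarget(source, 'Other'))
```

Letting os.path.join supply the separator avoids hardcoding a backslash, so the same code prints Windows-style or Unix-style paths depending on where it runs.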

2.12.2 Walking Directory Trees

Notice, though, that all of the preceding techniques only return the names of files in a single directory. What if you want to apply an operation to every file in every directory and subdirectory in a directory tree?

For instance, suppose again that we need to find every occurrence of a global name in our Python scripts. This time, though, our scripts are arranged into a module package: a directory with nested subdirectories, which may have subdirectories of their own. We could rerun our hypothetical single-directory searcher in every directory in the tree manually, but that's tedious, error-prone, and just plain no fun.

Luckily, in Python it's almost as easy to process a directory tree as it is to inspect a single directory. We can either collect names ahead of time with the find module, write a recursive routine to traverse the tree, or use a tree-walker utility built into the os module. Such tools can be used to search, copy, compare, and otherwise process arbitrary directory trees on any platform that Python runs on (and that's just about everywhere).

2.12.2.1 The find module

The first way to go hierarchical is to collect a list of all names in a directory tree ahead of time, and step through that list in a loop. Like the single-directory tools we just met, a call to the find.find built-in returns a list of both file and directory names. Unlike the tools described earlier, find.find also returns pathnames of matching files nested in subdirectories, all the way to the bottom of a tree:

C:\temp>python
>>> import find
>>> find.find('*')
['.\\about-pp.html', '.\\about-pp2e.html', '.\\about-ppr2e.html', 
'.\\newdir', '.\\newdir\\more', '.\\newdir\\more\\xxx.txt',
'.\\newdir\\more\\yyy.txt', '.\\newdir\\temp1', '.\\newdir\\temp2',
'.\\newdir\\temp3', '.\\python1.5.tar.gz']
 
>>> for line in find.find('*'): print line
...
.\about-pp.html
.\about-pp2e.html
.\about-ppr2e.html
.\newdir
.\newdir\more
.\newdir\more\xxx.txt
.\newdir\more\yyy.txt
.\newdir\temp1
.\newdir\temp2
.\newdir\temp3
.\python1.5.tar.gz

We get back a list of full pathnames, each of which includes the top-level directory's path. By default, find collects names matching the passed-in pattern in the tree rooted at the current working directory, known as ".". If we want a more specific list, we can pass in both a filename pattern and a directory tree root to start at; here's how to collect HTML filenames at "." and below:

>>> find.find('*.html', '.')
['.\\about-pp.html', '.\\about-pp2e.html', '.\\about-ppr2e.html']

Incidentally, find.find is also the Python library's equivalent to platform-specific shell commands such as find -print on Unix and Linux, and dir /B /S on DOS and Windows. Since we can usually run such shell commands in a Python script with os.popen, the following does the same work as find.find, but is inherently nonportable, and must start up a separate program along the way:

>>> import os
>>> for line in os.popen('dir /B /S').readlines( ): print line,
...
C:\temp\about-pp.html
C:\temp\python1.5.tar.gz
C:\temp\about-pp2e.html
C:\temp\about-ppr2e.html
C:\temp\newdir
C:\temp\newdir\temp1
C:\temp\newdir\temp2
C:\temp\newdir\temp3
C:\temp\newdir\more
C:\temp\newdir\more\xxx.txt
C:\temp\newdir\more\yyy.txt

If the find calls don't seem to work in your Python, try changing the import statement used to load the module from import find to from PP2E.PyTools import find. Alas, the Python standard library's find module has been marked as "deprecated" as of Python 1.6. That means it may be deleted from the standard Python distribution in the future, so pay attention to the next section; we'll use its topic later to write our own find module -- one that is also shipped on this book's CD (see http://examples.oreilly.com/python2).

2.12.2.2 The os.path.walk visitor

To make it easy to apply an operation to all files in a tree, Python also comes with a utility that scans trees for us, and runs a provided function at every directory along the way. The os.path.walk function is called with a directory root, function object, and optional data item, and walks the tree at the directory root and below. At each directory, the function object passed in is called with the optional data item, the name of the current directory, and a list of filenames in that directory (obtained from os.listdir). Typically, the function we provide scans the filenames list to process files at each directory level in the tree.

That description might sound horribly complex the first time you hear it, but os.path.walk is fairly straightforward once you get the hang of it. In the following code, for example, the lister function is called from os.path.walk at each directory in the tree rooted at ".". Along the way, lister simply prints the directory name, and all the files at the current level (after prepending the directory name). It's simpler in Python than in English:

>>> import os
>>> def lister(dummy, dirname, filesindir):
...     print '[' + dirname + ']'
...     for fname in filesindir:
...         print os.path.join(dirname, fname)         # handle one file
...
>>> os.path.walk('.', lister, None)
[.]
.\about-pp.html
.\python1.5.tar.gz
.\about-pp2e.html
.\about-ppr2e.html
.\newdir
[.\newdir]
.\newdir\temp1
.\newdir\temp2
.\newdir\temp3
.\newdir\more
[.\newdir\more]
.\newdir\more\xxx.txt
.\newdir\more\yyy.txt

In other words, we've coded our own custom and easily changed recursive directory listing tool in Python. Because this may be something we would like to tweak and reuse elsewhere, let's make it permanently available in a module file, shown in Example 2-15, now that we've worked out the details interactively.

Example 2-15. PP2E\System\Filetools\lister_walk.py
# list file tree with os.path.walk
import sys, os
 
def lister(dummy, dirName, filesInDir):              # called at each dir
    print '[' + dirName + ']'
    for fname in filesInDir:                         # includes subdir names
        path = os.path.join(dirName, fname)          # add dir name prefix
        if not os.path.isdir(path):                  # print simple files only
            print path

if __name__ == '__main__':
    os.path.walk(sys.argv[1], lister, None)          # dir name in cmdline

This is the same code, except that directory names are filtered out of the filenames list by consulting the os.path.isdir test, to avoid listing them twice (see -- it's been tweaked already). When packaged this way, the code can also be run from a shell command line. Here it is being launched from a different directory, with the directory to be listed passed in as a command-line argument:

C:\...\PP2E\System\Filetools>python lister_walk.py C:\Temp
[C:\Temp]
C:\Temp\about-pp.html
C:\Temp\python1.5.tar.gz
C:\Temp\about-pp2e.html
C:\Temp\about-ppr2e.html
[C:\Temp\newdir]
C:\Temp\newdir\temp1
C:\Temp\newdir\temp2
C:\Temp\newdir\temp3
[C:\Temp\newdir\more]
C:\Temp\newdir\more\xxx.txt
C:\Temp\newdir\more\yyy.txt

The walk paradigm also allows functions to tailor the set of directories visited by changing the file list argument in place. The library manual documents this further, but it's probably more instructive to simply know what walk truly looks like. Here is its actual Python-coded implementation for Windows platforms, with comments added to help demystify its operation:

def walk(top, func, arg):                    # top is the current dirname
    try:
        names = os.listdir(top)              # get all file/dir names here
    except os.error:                         # they have no path prefix
        return
    func(arg, top, names)                    # run func with names list here
    exceptions = ('.', '..')
    for name in names:                       # step over the very same list
        if name not in exceptions:           # but skip self/parent names
            name = join(top, name)           # add path prefix to name
            if isdir(name):
                walk(name, func, arg)        # descend into subdirs here

Notice that walk generates filename lists at each level with os.listdir, a call that collects both file and directory names in no particular order, and returns them without their directory paths. Also note that walk uses the very same list returned by os.listdir and passed to the function you provide, to later descend into subdirectories (variable names). Because lists are mutable objects that can be changed in place, if your function modifies the passed-in filenames list, it will impact what walk does next. For example, deleting directory names will prune traversal branches, and sorting the list will order the walk.
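To watch the in-place trick at work, here is a sketch for Python 3, where os.path.walk itself is no longer available. It re-creates a callback-style walker modeled on the listing above rather than calling the library; the pruning_visitor function, the skipme directory name, and the symbolic-link guard are all assumptions added for this example:

```python
import os
import tempfile

def walk(top, func, arg):
    """Callback-style tree walker modeled on the listing above."""
    try:
        names = os.listdir(top)
    except OSError:                              # unreadable directory: skip it
        return
    func(arg, top, names)                        # func may change names in place
    for name in names:                           # the same list object func just saw
        path = os.path.join(top, name)
        # extra guard, not in the listing above: do not follow symbolic links
        if os.path.isdir(path) and not os.path.islink(path):
            walk(path, func, arg)

def pruning_visitor(visited, dirname, names):
    visited.append(dirname)
    names.sort()                                 # sorting in place orders the walk
    if 'skipme' in names:
        names.remove('skipme')                   # removing a name prunes that branch

# demo: the 'skipme' branch is never visited
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'keep', 'deeper'))
    os.makedirs(os.path.join(root, 'skipme', 'hidden'))
    visited = []
    walk(root, pruning_visitor, visited)
    for dirname in visited:
        print(os.path.relpath(dirname, root))
```

Rebinding the argument (names = sorted(names), say) would have no effect on the traversal, because the walker keeps its own reference to the original list; only in-place changes such as sort and remove are seen.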

2.12.2.3 Recursive os.listdir traversals

The os.path.walk tool does tree traversals for us, but it's sometimes more flexible, and hardly any more work, to do it ourselves. The following script recodes the directory listing script with a manual recursive traversal function. The mylister function in Example 2-16 is almost the same as lister in the prior script, but calls os.listdir to generate file paths manually, and calls itself recursively to descend into subdirectories.

Example 2-16. PP2E\System\Filetools\lister_recur.py
# list files in dir tree by recursion
import sys, os
 
def mylister(currdir):
    print '[' + currdir + ']'
    for file in os.listdir(currdir):                 # list files here
        path = os.path.join(currdir, file)           # add dir path back
        if not os.path.isdir(path):
            print path
        else:
            mylister(path)                           # recur into subdirs

if __name__ == '__main__':
    mylister(sys.argv[1])                            # dir name in cmdline

This version is packaged as a script too (this is definitely too much code to type at the interactive prompt); its output is identical when run as a script:

C:\...\PP2E\System\Filetools>python lister_recur.py C:\Temp
[C:\Temp]
C:\Temp\about-pp.html
C:\Temp\python1.5.tar.gz
C:\Temp\about-pp2e.html
C:\Temp\about-ppr2e.html
[C:\Temp\newdir]
C:\Temp\newdir\temp1
C:\Temp\newdir\temp2
C:\Temp\newdir\temp3
[C:\Temp\newdir\more]
C:\Temp\newdir\more\xxx.txt
C:\Temp\newdir\more\yyy.txt

But this file is just as useful when imported and called elsewhere:

C:\temp>python
>>> from PP2E.System.Filetools.lister_recur import mylister
>>> mylister('.')
[.]
.\about-pp.html
.\python1.5.tar.gz
.\about-pp2e.html
.\about-ppr2e.html
[.\newdir]
.\newdir\temp1
.\newdir\temp2
.\newdir\temp3
[.\newdir\more]
.\newdir\more\xxx.txt
.\newdir\more\yyy.txt
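One example of the extra flexibility a hand-coded traversal buys is returning results instead of printing them. The collect function below is a hypothetical variation on mylister, sketched for Python 3; it sorts names at each level and hands back a list of file paths for the caller to use:

```python
import os
import tempfile

def collect(currdir):
    """Return the paths of all simple files at and below currdir."""
    found = []
    for name in sorted(os.listdir(currdir)):     # sorted for a repeatable order
        path = os.path.join(currdir, name)       # add dir path back
        if os.path.isdir(path):
            found.extend(collect(path))          # recur into subdirs
        else:
            found.append(path)
    return found

# demo on a throwaway tree
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'newdir', 'more'))
    for parts in (('about-pp.html',), ('newdir', 'more', 'xxx.txt')):
        open(os.path.join(root, *parts), 'w').close()
    for path in collect(root):
        print(os.path.relpath(path, root))
```

Callers can then filter, count, or copy the collected names without touching the traversal code.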

We will make better use of most of this section's techniques in later examples in Chapter 5, and this book at large. For example, scripts for copying and comparing directory trees use the tree-walker techniques listed previously. Watch for these tools in action along the way. If you are interested in directory processing, also see the coverage of Python's old grep module in Chapter 5; it searches files, and can be applied to all files in a directory when combined with the glob module, but simply prints results and does not traverse directory trees by itself.

2.12.3 Rolling Your Own find Module

Over the last eight years, I've learned to trust Python's Benevolent Dictator. Guido generally does the right thing, and if you don't think so, it's usually only because you haven't yet realized how your own position is flawed. Trust me on this. On the other hand, it's not completely clear why the standard find module I showed you seems to have fallen into deprecation; it's a useful tool. In fact, I use it a lot -- it is often nice to be able to grab a list of files to process in a single function call, and step through it in a for loop. The alternatives -- os.path.walk, and recursive functions -- are more code-y, and tougher for beginners to digest.

I suppose the find module's followers (if there be any) could have defended it in long, drawn-out debates on the Internet, that would have spanned days or weeks, been joined by a large cast of heroic combatants, and gone just about nowhere. I decided to spend ten minutes whipping up a custom alternative instead. The module in Example 2-17 uses the standard os.path.walk call described earlier to reimplement a find operation for Python.

Example 2-17. PP2E\PyTools\find.py
#!/usr/bin/python
########################################################
# custom version of the now deprecated find module 
# in the standard library--import as "PyTools.find";
# equivalent to the original, but uses os.path.walk,
# has no support for pruning subdirs in the tree, and
# is instrumented to be runnable as a top-level script;
# results list sort differs slightly for some trees;
# exploits tuple unpacking in function argument lists;
########################################################
 
import fnmatch, os
 
def find(pattern, startdir=os.curdir):
    matches = []
    os.path.walk(startdir, findvisitor, (matches, pattern))
    matches.sort( )
    return matches

def findvisitor((matches, pattern), thisdir, nameshere):
    for name in nameshere:
        if fnmatch.fnmatch(name, pattern):
            fullpath = os.path.join(thisdir, name)
            matches.append(fullpath)

if __name__ == '__main__':
    import sys
    namepattern, startdir = sys.argv[1], sys.argv[2]
    for name in find(namepattern, startdir): print name

There's not much to this file; but calling its find function provides the same utility as the deprecated find standard module, and is noticeably easier than rewriting all of this file's code every time you need to perform a find-type search. To process every Python file in a tree, for instance, I simply type:

from PP2E.PyTools import find
for name in find.find('*.py'):
    ...do something with name...

As a more concrete example, I use the following simple script to clean out any old output text files located anywhere in the book examples tree:

C:\...\PP2E>type PyTools\cleanoutput.py
import os                                    # delete old output files in tree
from PP2E.PyTools.find import find           # only need full path if I'm moved
for filename in find('*.out.txt'):           # use cat instead of type in Linux
    print filename
    if raw_input('View?') == 'y':
        os.system('type ' + filename)
    if raw_input('Delete?') == 'y':
        os.remove(filename)
 
C:\temp\examples>python %X%\PyTools\cleanoutput.py
.\Internet\Cgi-Web\Basics\languages.out.txt
View?
Delete?
.\Internet\Cgi-Web\PyErrata\AdminTools\dbaseindexed.out.txt
View?
Delete?y

To achieve such code economy, the custom find module calls os.path.walk to register a function to be called per directory in the tree, and simply adds matching filenames to the result list along the way.
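For readers on Python 3, where both os.path.walk and tuple unpacking in function argument lists have been removed, the same per-directory matching idea can be sketched with the os.walk generator instead. This is a swapped-in technique and an assumption of this example, not the module shipped with the book:

```python
import fnmatch
import os
import tempfile

def find(pattern, startdir=os.curdir):
    """Return sorted paths of files and directories whose names match pattern."""
    matches = []
    for thisdir, subdirs, files in os.walk(startdir):    # one step per directory
        for name in subdirs + files:                     # match both kinds of names
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(thisdir, name))
    matches.sort()
    return matches

# demo on a throwaway tree
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'pkg'))
    for parts in (('top.py',), ('pkg', 'mod.py'), ('pkg', 'notes.txt')):
        open(os.path.join(root, *parts), 'w').close()
    for name in find('*.py', root):
        print(os.path.relpath(name, root))
```

Like the original, this version matches directory names as well as filenames, and sorts the result before returning it.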

New here, though, is the fnmatch module -- a standard Python module that performs Unix-like pattern matching against filenames, and was also used by the original find. This module supports common operators in name pattern strings: * (to match any number of characters), ? (to match any single character), and [...] and [!...] (to match any character inside the bracket pairs, or not); other characters match themselves.[13] To make sure that this alternative's results are similar, I also wrote the test module shown in Example 2-18.
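The pattern operators are easy to try in isolation, since fnmatch works on plain strings and never touches the filesystem. A minimal sketch for Python 3 follows; the select helper and the names list are made up, and fnmatchcase is used in place of fnmatch so that results do not depend on the platform's filename case rules:

```python
import fnmatch

names = ['find.py', 'find.pyc', 'temp1', 'temp2', 'tempX', 'README.txt']

def select(pattern):
    """Return the entries of names that match pattern, case-sensitively."""
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]

print(select('*.py'))         # * matches any number of characters
print(select('find.py?'))     # ? matches any single character
print(select('temp[12]'))     # [...] matches one character in the set
print(select('temp[!12]'))    # [!...] matches one character not in the set
```

A pattern must match the whole name, which is why '*.py' does not pick up find.pyc.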

Example 2-18. PP2E\PyTools\find-test.py
############################################################
# test custom find; the builtin find module is deprecated:
# if it ever goes away completely, replace all "import find"
# with "from PP2E.PyTools import find" (or add PP2E\PyTools
# to your path setting and just "import find"); this script 
# takes 4 seconds total time on my 650mhz Win98 notebook to
# run 10 finds over a directory tree of roughly 1500 names; 
############################################################
 
import sys, os, string
for dir in sys.path[:]:                      # loop over a copy: entries are removed below
    if string.find(os.path.abspath(dir), 'PyTools') != -1:
        print 'removing', repr(dir)
        sys.path.remove(dir)     # else may import both finds from PyTools, '.'!
 
import find  # get deprecated builtin (for now)
import PP2E.PyTools.find # later use: from PP2E.PyTools import find
print find
print PP2E.PyTools.find
 
assert find.find != PP2E.PyTools.find.find # really different?
assert string.find(str(find), 'Lib') != -1 # should be after path remove
assert string.find(str(PP2E.PyTools.find), 'PyTools') != -1 
 
startdir = r'C:\PP2ndEd\examples\PP2E'
for pattern in ('*.py', '*.html', '*.c', '*.cgi', '*'):
    print pattern, '=>'
    list1 = find.find(pattern, startdir)
    list2 = PP2E.PyTools.find.find(pattern, startdir)
    print len(list1), list1[-1]
    print len(list2), list2[-1]
    print list1 == list2,; list1.sort( ); print list1 == list2

There is some magic at the top of this script that I need to explain. To make sure that it can load both the standard library's find module and the custom one in PP2E\PyTools, it must delete the entry (or entries) on the module search path that point to the PP2E\PyTools directory, and import the custom version with a full package directory -- PP2E.PyTools.find. If not, we'd always get the same find module, the one in PyTools, no matter where this script is run from.

Here's why. Recall that Python always adds the directory containing a script being run to the front of sys.path. If we didn't delete that entry here, the import find statement would always load the custom find in PyTools, because the custom find.py module is in the same directory as the find-test.py script. The script's home directory would effectively hide the standard library's find. If that doesn't make sense, go back and reread Section 2.7 earlier in this chapter.

Below is the output of this tester, along with a few command-line invocations; unlike the original find, the custom version in Example 2-17 can be run as a command-line tool too. If you study the test output closely, you'll notice that the custom find differs only in an occasional sort order that I won't go into further here (the original find module used a recursive function, not os.path.walk); the "0 1" lines mean that results differ in order, but not content. Since find callers don't generally depend on precise filename result ordering, this is trivial:

C:\temp>python %X%\PyTools\find-test.py
removing 'C:\\PP2ndEd\\examples\\PP2E\\PyTools'
<module 'find' from 'C:\Program Files\Python\Lib\find.pyc'>
<module 'PP2E.PyTools.find' from 'C:\PP2ndEd\examples\PP2E\PyTools\find.pyc'>
*.py =>
657 C:\PP2ndEd\examples\PP2E\tounix.py
657 C:\PP2ndEd\examples\PP2E\tounix.py
0 1
*.html =>
37 C:\PP2ndEd\examples\PP2E\System\Filetools\template.html
37 C:\PP2ndEd\examples\PP2E\System\Filetools\template.html
1 1
*.c =>
46 C:\PP2ndEd\examples\PP2E\Other\old-Integ\embed.c
46 C:\PP2ndEd\examples\PP2E\Other\old-Integ\embed.c
0 1
*.cgi =>
24 C:\PP2ndEd\examples\PP2E\Internet\Cgi-Web\PyMailCgi\onViewSubmit.cgi
24 C:\PP2ndEd\examples\PP2E\Internet\Cgi-Web\PyMailCgi\onViewSubmit.cgi
1 1
* =>
1519 C:\PP2ndEd\examples\PP2E\xferall.linux.csh
1519 C:\PP2ndEd\examples\PP2E\xferall.linux.csh
0 1
 
C:\temp>python %X%\PyTools\find.py *.cxx C:\PP2ndEd\examples\PP2E
C:\PP2ndEd\examples\PP2E\Extend\Swig\Shadow\main.cxx
C:\PP2ndEd\examples\PP2E\Extend\Swig\Shadow\number.cxx
 
C:\temp>python %X%\PyTools\find.py *.asp C:\PP2ndEd\examples\PP2E
C:\PP2ndEd\examples\PP2E\Internet\Other\asp-py.asp
 
C:\temp>python %X%\PyTools\find.py *.i C:\PP2ndEd\examples\PP2E
C:\PP2ndEd\examples\PP2E\Extend\Swig\Environ\environ.i
C:\PP2ndEd\examples\PP2E\Extend\Swig\Shadow\number.i
C:\PP2ndEd\examples\PP2E\Extend\Swig\hellolib.i
 
C:\temp>python %X%\PyTools\find.py setup*.csh C:\PP2ndEd\examples\PP2E
C:\PP2ndEd\examples\PP2E\Config\setup-pp-embed.csh
C:\PP2ndEd\examples\PP2E\Config\setup-pp.csh
C:\PP2ndEd\examples\PP2E\EmbExt\Exports\ClassAndMod\setup-class.csh
C:\PP2ndEd\examples\PP2E\Extend\Swig\setup-swig.csh
 
[filename sort scheme]
C:\temp> python
>>> l = ['ccc', 'bbb', 'aaa', 'aaa.xxx', 'aaa.yyy', 'aaa.xxx.nnn']
>>> l.sort( )
>>> l
['aaa', 'aaa.xxx', 'aaa.xxx.nnn', 'aaa.yyy', 'bbb', 'ccc']

Finally, if an example in this book fails in a future Python release because there is no find to be found, simply change find-module imports in the source code to say from PP2E.PyTools import find instead of import find. The former form will find the custom find module in the book's example package directory tree; the old module in the standard Python library is ignored (if it is still there at all). And if you are brave enough to add the PP2E\PyTools directory itself to your PYTHONPATH setting, all original import find statements will continue to work unchanged.

Better still, do nothing at all -- most find-based examples in this book automatically pick the alternative by catching import exceptions, just in case they aren't located in the PyTools directory:

try:
    import find
except ImportError:
    from PP2E.PyTools import find

The find module may be gone, but it need not be forgotten.

Python Versus csh

If you are familiar with other common shell script languages, it might be useful to see how Python compares. Here is a simple script in a Unix shell language called csh that mails all the files in the current working directory having a suffix of .py (i.e., all Python source files) to a hopefully fictitious address:

#!/bin/csh
foreach x (*.py)
    echo $x
    mail eric@halfabee.com -s $x < $x
end

The equivalent Python script looks similar:

#!/usr/bin/python
import os, glob
for x in glob.glob('*.py'):
    print x
    os.system('mail eric@halfabee.com -s %s < %s' % (x, x))

but is slightly more verbose. Since Python, unlike csh, isn't meant just for shell scripts, system interfaces must be imported, and called explicitly. And since Python isn't just a string-processing language, character strings must be enclosed in quotes as in C.

Although this can add a few extra keystrokes in simple scripts like this, being a general-purpose language makes Python a better tool, once we leave the realm of trivial programs. We could, for example, extend the preceding script to do things like transfer files by FTP, pop up a GUI message selector and status bar, fetch messages from an SQL database, and employ COM objects on Windows -- all using standard Python tools.

Python scripts also tend to be more portable to other platforms than csh. For instance, if we used the Python SMTP interface to send mail rather than relying on a Unix command-line mail tool, the script would run on any machine with Python and an Internet link (as we'll see in Chapter 11, SMTP requires only sockets). And as in C, we don't need $ to evaluate variables; what else would you expect in a free language?
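As a minimal sketch of that portable variant, the code below uses the standard smtplib and email modules in place of the Unix mail command. It is written in Python 3 syntax, and the function names, the server name, and the sender address are all placeholders introduced for illustration; this is not the book's Chapter 11 code:

```python
import glob
import smtplib
from email.message import EmailMessage

def build_message(path, sender, recipient):
    # One message per file: the filename is the subject, the text is the body.
    with open(path, encoding='utf-8', errors='replace') as f:
        body = f.read()
    msg = EmailMessage()
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = path
    msg.set_content(body)
    return msg

def mail_python_files(server, sender, recipient, pattern='*.py'):
    # server, sender, and recipient are placeholders supplied by the caller.
    with smtplib.SMTP(server) as smtp:
        for path in glob.glob(pattern):
            print(path)
            smtp.send_message(build_message(path, sender, recipient))

# Example call (needs a reachable SMTP server, so it is left commented out):
# mail_python_files('smtp.example.com', 'me@example.com', 'eric@halfabee.com')
```

Because nothing here depends on an external mail program, the same script should behave alike on any platform where Python can open a socket to the mail server.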

 

[1] They may also work their way into your subconscious. Python newcomers sometimes appear on Internet discussion forums to express joy after "dreaming in Python" for the first time. All possible Freudian interpretations aside, it's not bad as dream motifs go; after all, there are worse languages to dream in.

[2] I also wrote the latter as a replacement for the reference appendix that appeared in the first edition of this book; it's meant to be a supplement to the text you're reading. Since I'm its author, though, I won't say more here . . . except that you should be sure to pick up a copy for friends, coworkers, old college roommates, and every member of your extended family the next time you're at the bookstore (yes, I'm kidding).

[3] It's not impossible that Python sees PYTHONPATH differently than you do. A syntax error in your system shell configuration files may botch the setting of PYTHONPATH, even if it looks fine to you. On Windows, for example, if a space appears around the = of a DOS set command in your autoexec.bat file (e.g., set NAME = VALUE), you will actually set NAME to an empty string, not VALUE!

[4] os.linesep comes back as \015\012 here -- the octal escape code equivalent of \r\n, reflecting the carriage-return + line-feed line terminator convention on Windows. See the discussion of end-of-line translations in Section 2.11 later in this chapter.

[5] The Python execfile built-in function also runs a program file's code, but within the same process that called it. It's similar to an import in that regard, but works more as if the file's text had been pasted into the calling program at the place where the execfile call appears (unless explicit global or local namespace dictionaries are passed). Unlike imports, execfile unconditionally reads and executes a file's code (it may be run more than once per process), and no module object is generated by the file's execution.

[6] For color, these results reflect an old path setting used during development; this variable now contains just the single directory containing the PP2E root.

[7] This is by default. Some program-launching tools also let scripts pass environment settings different from their own to child programs. For instance, the os.spawnve call is like os.spawnv, but accepts a dictionary argument representing the shell environment to be passed to the started program. Some os.exec* variants (ones with an "e" at the end of their names) similarly accept explicit environments; see the os.exec call formats in Chapter 3 for more details.

[8] Notice that raw_input raises an exception to signal end-of-file, but file read methods simply return an empty string for this condition. Because raw_input also strips the end-of-line character at the end of lines, an empty string result means an empty line, so an exception is necessary to specify the end-of-file condition. File read methods retain the end-of-line character, and denote an empty line as \n instead of "". This is one way in which reading sys.stdin directly differs from raw_input. The latter also accepts a prompt string that is automatically printed before input is accepted.

[9] Actually, it gets worse: on the Mac, lines in text files are terminated with a single \r (not \n or \r\n). Whoever said proprietary software was good for the consumer probably wasn't speaking about users of multiple platforms, and certainly wasn't talking about programmers.

[10] For instance, to process pipes, described in Chapter 3. The Python os.pipe call returns two file descriptors, which can be processed with os module tools or wrapped in a file object with os.fdopen.

[11] To be fair to the built-in file object, the open function accepts a mode "rb+", which is equivalent to the combined mode flags used here, and can also be made nonbuffered with a buffer size argument. Whenever possible, use open, not os.open.

[12] In fact, glob just uses the standard fnmatch module to match name patterns; see the fnmatch description later in this chapter in Section 2.12.3 for more details.

[13] Unlike the re module, fnmatch supports only common Unix shell matching operators, not full-blown regular expression patterns; to understand why this matters, see Chapter 18.
