

Perl Cookbook


7. File Access

I the heir of all the ages, in the foremost files of time.

- Alfred, Lord Tennyson, Locksley Hall

7.0. Introduction

Nothing is more central to data processing than the file. As with everything else in Perl, easy things are easy and hard things are possible. Common tasks (opening, reading data, writing data) use simple I/O functions and operators, whereas fancier functions do hard things like non-blocking I/O and file locking.

This chapter deals with the mechanics of file access: opening a file, telling subroutines which files to work with, locking files, and so on. Chapter 8, File Contents, deals with techniques for working with the contents of a file: reading, writing, shuffling lines, and other operations you can do once you have access to the file.

Here's Perl code for printing all lines in the file /usr/local/widgets/data that contain the word "blue":

open(INPUT, "< /usr/local/widgets/data")
    or die "Couldn't open /usr/local/widgets/data for reading: $!\n";

while (<INPUT>) {
    print if /blue/;
}
close(INPUT);

Getting a Handle on the File

Central to Perl's file access is the filehandle, like INPUT in the preceding program. This is a symbol you use to represent the file when you read and write. Because filehandles aren't variables (they don't have a $, @, or % type marker on their names, but they are part of Perl's symbol table just as subroutines and variables are), storing filehandles in variables and passing them to subroutines won't always work. You should use the odd-looking *FH notation, indicating a typeglob, the basic unit of Perl's symbol table:

$var = *STDIN;
mysub($var, *LOGFILE);
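As a rough sketch of the typeglob notation, the following stores a filehandle in a scalar and hands it to a subroutine. The count_lines helper, the SRC handle, and the sample text are hypothetical. Opening an in-memory file through a scalar reference assumes Perl 5.8 or later and the three-argument form of open.

```perl
use strict;
use warnings;

# count_lines is a hypothetical helper: it takes an indirect filehandle
# (a typeglob stored in a scalar) and counts the records read from it.
sub count_lines {
    my ($fh) = @_;
    my $count = 0;
    $count++ while <$fh>;
    return $count;
}

# Assumption: in-memory files (Perl 5.8+) keep the sketch self-contained.
my $text = "red\nblue\ngreen\n";
open(SRC, "<", \$text)  or die "Couldn't open in-memory file: $!";

my $var = *SRC;                 # store the typeglob in a scalar
my $n   = count_lines($var);    # pass it like any other value
close(SRC)              or die "Couldn't close in-memory file: $!";

print "Read $n lines\n";
```

Inside count_lines, the scalar $fh goes wherever a named filehandle would, including between the angle brackets.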

When you store filehandles in variables like this, you don't use them directly. They're called indirect filehandles because they indirectly refer to the real filehandle. Two modules, IO::File (standard since 5.004) and FileHandle (standard since 5.000), can create anonymous filehandles.

When we use IO::File or IO::Handle in our examples, you could obtain identical results by using FileHandle instead, since it's now just a wrapper module.

Here's how we'd write the "blue"-finding program with the IO::File module using purely object-oriented notation:

use IO::File;

$input = IO::File->new("< /usr/local/widgets/data")
    or die "Couldn't open /usr/local/widgets/data for reading: $!\n";

while (defined($line = $input->getline())) {
    # $line keeps its trailing newline, so each match prints on its own line
    STDOUT->print($line) if $line =~ /blue/;
}
$input->close();

As you see, it's much more readable to use filehandles directly. It's also a lot faster.

But here's a little secret for you: you can skip all that arrow and method-call business altogether. Unlike most objects, you don't actually have to use IO::File objects in an object-oriented way. They're essentially just anonymous filehandles, so you can use them anywhere you'd use a regular indirect filehandle. Recipe 7.16 covers these modules and the *FH notation. We use both IO::File and symbolic filehandles in this chapter.
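A minimal sketch of that shortcut, assuming a scratch file created with File::Temp in place of /usr/local/widgets/data (the file contents and the @matches array are made up for illustration). The IO::File object is read with the ordinary line-input operator and closed with the ordinary close function:

```perl
use strict;
use warnings;
use IO::File;
use File::Temp qw(tempfile);

# Build a small scratch file so the sketch is self-contained.
my ($tmp, $path) = tempfile(UNLINK => 1);
print $tmp "red widget\nblue widget\ngreen widget\n";
close($tmp)                 or die "Couldn't close $path: $!";

my $input = IO::File->new($path, "r")
    or die "Couldn't open $path for reading: $!\n";

my @matches;
while (<$input>) {          # no method calls: the object acts as a filehandle
    push @matches, $_ if /blue/;
}
close($input)               or die "Couldn't close $path: $!";

print @matches;
```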

Standard FileHandles

Every program starts out with three global filehandles already opened: STDIN, STDOUT, and STDERR. STDIN (standard input) is the default source of input, STDOUT (standard output) is the default destination for output, and STDERR (standard error) is the default place to send warnings and errors. For interactive programs, STDIN is the keyboard, and STDOUT and STDERR are the screen:

while (<STDIN>) {                   # reads from STDIN
    unless (/\d/) {
        warn "No digit found.\n";   # writes to STDERR
    }
    print "Read: ", $_;             # writes to STDOUT
}
END { close(STDOUT)                 or die "couldn't close STDOUT: $!" }

Filehandles live in packages. That way, two packages can have filehandles with the same name and be separate, just as they can with subroutines and variables. The open function associates a filehandle with a file or program, after which you use that filehandle for I/O. When done, close the filehandle to break the association.

Files are accessed at the operating system level through numeric file descriptors. You can learn a filehandle's descriptor number using the fileno function. Perl's filehandles are sufficient for most file operations, but Recipe 7.19 tells you how to deal with the situation where you're given a file descriptor and want to turn it into a filehandle you can use.
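For illustration, this short sketch asks fileno for the descriptors behind the three standard filehandles. The %fd hash is a name invented here. On Unix-like systems the numbers are conventionally 0, 1, and 2, though that depends on how the program was started.

```perl
use strict;
use warnings;

# Map each standard filehandle to its operating-system descriptor number.
my %fd = (
    STDIN  => fileno(STDIN),
    STDOUT => fileno(STDOUT),
    STDERR => fileno(STDERR),
);

for my $name (qw(STDIN STDOUT STDERR)) {
    printf "%-6s is file descriptor %d\n", $name, $fd{$name};
}
```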

I/O Operations

Perl's most common operations for file interaction are open, print, <FH> to read a record, and close. These are wrappers around routines from the C buffered input/output library called stdio. Perl's I/O functions are documented in Chapter 3 of Programming Perl, perlfunc(1), and your system's stdio(3S) manpages. The next chapter details I/O operations like <>, print, seek, and tell.

The most important I/O function is open. In its simplest form, it takes two arguments: the filehandle and a string containing the filename and access mode. To open /tmp/log for writing and to associate it with the filehandle LOGFILE, say:

open(LOGFILE, "> /tmp/log")     or die "Can't write /tmp/log: $!";

The three most common access modes are < for reading, > for overwriting, and >> for appending. The open function is discussed in more detail in Recipe 7.1.
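The three modes can be tried side by side in a small sketch. It assumes a scratch directory from File::Temp rather than /tmp, and the file name and log entries are invented for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# A scratch directory stands in for /tmp; the file name is hypothetical.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/log";

# ">" creates the file, or truncates it if it already exists
open(LOGFILE, "> $path")    or die "Can't write $path: $!";
print LOGFILE "first entry\n";
close(LOGFILE)              or die "Can't close $path: $!";

# ">>" keeps the existing contents and adds to the end
open(LOGFILE, ">> $path")   or die "Can't append to $path: $!";
print LOGFILE "second entry\n";
close(LOGFILE)              or die "Can't close $path: $!";

# "<" opens for reading only
open(LOGFILE, "< $path")    or die "Can't read $path: $!";
my @entries = <LOGFILE>;
close(LOGFILE)              or die "Can't close $path: $!";

print @entries;
```

Had the second open used > instead of >>, the first entry would have been lost.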

When opening a file or making virtually any other system call,[1] checking the return value is indispensable. Not every open succeeds; not every file is readable; not every piece of data you print can reach its destination. Most programmers check open, seek, tell, and close in robust programs. You might also want to check other functions. The Perl documentation lists return values from all functions and operators. If a system call fails, it returns undef, except for wait, waitpid, and syscall, which return -1 on failure. The system error message or number is available in the $! variable. This is often used in die or warn messages.

[1] The term system call denotes a call into your operating system. It is unrelated to the C and Perl function that's actually named system.
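Here is one way such a check might look when the failure should be reported rather than fatal. The path is deliberately nonexistent, and the MISSING handle and the $opened, $errno, and $errstr variables are names made up for this sketch.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir     = tempdir(CLEANUP => 1);
my $missing = "$dir/no-such-file";      # hypothetical name; never created

my ($opened, $errno, $errstr) = (0, 0, "");
if (open(MISSING, "< $missing")) {
    $opened = 1;
    close(MISSING)  or die "Couldn't close $missing: $!";
} else {
    # $! is dual-valued: a number in numeric context, a message in string context
    $errno  = $! + 0;
    $errstr = "$!";
    warn "Couldn't open $missing for reading: $errstr (errno $errno)\n";
}
```

Copy $! right after the failing call, because a later system call may overwrite it.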

To read a record in Perl, use the circumfix operator <FILEHANDLE>, whose behavior is also available through the readline function. A record is normally a line, but you can change the record terminator, as detailed in Chapter 8. If FILEHANDLE is omitted, Perl opens and reads from the filenames in @ARGV or from STDIN if there aren't any. Customary and curious uses of this are described in Recipe 7.7.

Abstractly, files are simply streams of bytes. Each filehandle has associated with it a number representing the current byte position in the file, returned by the tell function and set by the seek function. In Recipe 7.10, we rewrite a file without closing and reopening by using seek to move back to the start, rewinding it.
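A small sketch of the byte position in action, assuming a scratch file from File::Temp (which opens it for both reading and writing) and binmode to rule out newline translation. The variable names and file contents are illustrative only.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);            # SEEK_SET, SEEK_CUR, SEEK_END
use File::Temp qw(tempfile);

my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh);                   # one byte per character, no newline translation

print $fh "first line\nsecond line\n";
my $end = tell($fh);            # position just past everything written

seek($fh, 0, SEEK_SET)  or die "Couldn't rewind $path: $!";
my $start = tell($fh);          # back at the beginning of the file

my $first = <$fh>;              # reading one record advances the position
my $after = tell($fh);

close($fh)              or die "Couldn't close $path: $!";

print "end=$end start=$start after=$after\n";
```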

When you no longer have use for a filehandle, close it. The close function takes a single filehandle and returns true if the filehandle could be successfully flushed and closed, false otherwise. You don't need to explicitly close every filehandle. When you open a filehandle that's already open, Perl implicitly closes it first. When your program exits, any open filehandles also get closed.

These implicit closes are for convenience, not stability, because they don't tell you whether the system call succeeded or failed. Not all closes succeed. Even a close on a read-only file can fail. For instance, you could lose access to the device because of a network outage. It's even more important to check the close if the file was opened for writing. Otherwise you wouldn't notice if the disk filled up.

close(FH)           or die "FH didn't close: $!";

The prudent programmer even checks the close on the standard output stream at the program's end, in case STDOUT was redirected from the command line and the output filesystem filled up. Admittedly, your run-time system should take care of this for you, but it doesn't.

Checking standard error, though, is probably of dubious value. After all, if STDERR fails to close, what are you planning to do about it?

STDOUT is the default destination for output from the print, printf, and write functions. Change this with select, which takes the new default output filehandle and returns the previous one. The new output filehandle should have been opened before calling select:

$old_fh = select(LOGFILE);                  # switch to LOGFILE for output
print "Countdown initiated ...\n";
select($old_fh);                            # return to original output
print "You have 30 seconds to reach minimum safety distance.\n";

Some of Perl's special variables change the behavior of the currently selected output filehandle. Most important is $|, which controls output buffering for each filehandle. Buffering is explained in Recipe 7.12.
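Because $| applies to whichever filehandle is currently selected, a common idiom is to select the handle, set $|, and then restore the previous selection. The sketch below assumes a scratch log file from File::Temp. The $log, $path, and $size names are invented here, and the size check assumes the operating system reports flushed data immediately.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($log, $path) = tempfile(UNLINK => 1);

my $old_fh = select($log);      # $| now refers to $log
$| = 1;                         # flush after every print to $log
select($old_fh);                # STDOUT is the default destination again

print $log "Countdown initiated ...\n";

# With autoflush on, the line is already in the file even though
# $log has not been closed yet.
my $size = -s $path;
print "Log file holds $size bytes so far\n";

close($log)                     or die "Couldn't close $path: $!";
```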

Perl provides functions for buffered and unbuffered input and output. Although there are some exceptions, you shouldn't mix calls to buffered and unbuffered I/O functions. The following table shows the two sets of functions you should not mix. Functions on a particular row are only loosely associated; for instance, sysread doesn't have the same semantics as <FILE>, but they are on the same row because they both read input from a filehandle.

Action          Buffered            Unbuffered
opening         open, sysopen       sysopen
closing         close               close
input           <FILE>, readline    sysread
output          print               syswrite
repositioning   seek, tell          sysseek

Repositioning is addressed in Chapter 8, but we also use it in Recipe 7.10.
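To make the unbuffered column of the table concrete, here is a hedged sketch that stays entirely within that column: sysopen, syswrite, sysread, and close. The scratch directory, the raw.dat file name, and the 4096-byte read size are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY O_WRONLY O_CREAT O_TRUNC);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/raw.dat";      # hypothetical file name

# Write without going through the buffered print function.
sysopen(OUT, $path, O_WRONLY | O_CREAT | O_TRUNC, 0600)
    or die "Couldn't open $path for writing: $!";
my $written = syswrite(OUT, "unbuffered\n");
defined($written)       or die "Couldn't write to $path: $!";
close(OUT)              or die "Couldn't close $path: $!";

# Read it back with sysread rather than <IN> or readline.
sysopen(IN, $path, O_RDONLY)
    or die "Couldn't open $path for reading: $!";
my $nread = sysread(IN, my $buf, 4096);     # read up to 4096 bytes
defined($nread)         or die "Couldn't read from $path: $!";
close(IN)               or die "Couldn't close $path: $!";

print "wrote $written bytes, read $nread bytes\n";
```

syswrite and sysread return the number of bytes actually transferred, which can be fewer than requested, so robust code checks the count as well as definedness.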