


Chapter 7. File Access

I the heir of all the ages, in the foremost files of time.

Alfred, Lord Tennyson, Locksley Hall

7.0. Introduction

Nothing is more central to data processing than the file. As with everything else in Perl, easy things are easy and hard things are possible. Common tasks (opening files, reading data, writing data) use simple I/O functions and operators, whereas fancier functions do hard things like non-blocking I/O and file locking.

This chapter deals with the mechanics of file access: opening a file, telling subroutines which files to work with, locking files, and so on. Chapter 8 deals with techniques for working with the contents of a file: reading, writing, shuffling lines, and other operations you can do once you have access to the file.

Here's Perl code for printing all lines from the file /acme/widgets/data that contain the word "blue":

open(INPUT, "<", "/acme/widgets/data")
    or die "Couldn't open /acme/widgets/data for reading: $!\n";
while (<INPUT>) {
    print if /blue/;
}
close(INPUT);

7.0.1. Getting a Handle on the File

Central to file access in Perl is the filehandle, like INPUT in the previous code example. Filehandles are symbols inside your Perl program that you associate with an external file, usually using the open function. Whenever your program performs an input or output operation, it provides that operation with an internal filehandle, not an external filename. It's the job of open to make that association, and of close to break it. Actually, any of several functions can be used to open files, and handles can refer to entities beyond mere files on disk; see Recipe 7.1 for details.

While users think of open files in terms of those files' names, Perl programs do so using their filehandles. But as far as the operating system itself is concerned, an open file is nothing more than a file descriptor, which is a small, non-negative integer. The fileno function divulges the system file descriptor of its filehandle argument. Filehandles are enough for most file operations, but for when they aren't, Recipe 7.9 turns a system file descriptor into a filehandle you can use from Perl.
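For instance, fileno can show you the descriptors lurking behind the three standard handles; this quick sketch should print the usual low numbers on POSIX systems:

printf("STDIN  is file descriptor %d\n", fileno(STDIN));   # typically 0
printf("STDOUT is file descriptor %d\n", fileno(STDOUT));  # typically 1
printf("STDERR is file descriptor %d\n", fileno(STDERR));  # typically 2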

Like the names for labels, subroutines, and packages, those for filehandles are unadorned symbols like INPUT, not variables like $input. However, with a few syntactic restrictions, Perl also accepts in lieu of a named filehandle a scalar expression that evaluates to a filehandle—or to something that passes for a filehandle, such as a typeglob, a reference to a typeglob, or an IO object. Typically, this entails storing the filehandle's typeglob in a scalar variable and then using that variable as an indirect filehandle. Code written this way can be simpler than code using named filehandles, because now that you're working with regular variables instead of names, certain untidy and unobvious issues involving quoting, scoping, and packages all become clearer.
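For example, here's one minimal way to load a scalar with a named filehandle and then use it indirectly; *INPUT stores INPUT's typeglob, and \*INPUT (a typeglob reference) would work equally well:

open(INPUT, "<", "/acme/widgets/data")
    or die "Couldn't open /acme/widgets/data for reading: $!\n";
my $fh = *INPUT;        # or \*INPUT for a typeglob reference
while (<$fh>) {
    print if /blue/;
}
close($fh);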

As of the v5.6 release, Perl can be coaxed into implicitly initializing variables used as indirect filehandles. If you supply a function expecting to initialize a filehandle (like open) with an undefined scalar, that function automatically allocates an anonymous typeglob and stores its reference into the previously undefined variable—a tongue-twisting description normally abbreviated to something more along the lines of, "Perl autovivifies filehandles passed to open as undefined scalars."

my $input;                            # new lexical starts out undef
open($input, "<", "/acme/widgets/data")
    or die "Couldn't open /acme/widgets/data for reading: $!\n";
while (<$input>) {
    print if /blue/;
}
close($input);                        # also occurs when $input GC'd

For more about references and their autovivification, see Chapter 11. That chapter deals more with customary data references, though, than it does with exotics like the typeglob references seen here.

Having open autovivify a filehandle is only one of several ways to get indirect filehandles. We show different ways of loading up variables with named filehandles and several esoteric equivalents for later use as indirect filehandles in Recipe 7.5.

Some recipes in this chapter use filehandles along with the standard IO::Handle module, and sometimes with the IO::File module. Object constructors from these classes return new objects for use as indirect filehandles anywhere a regular handle would go, such as with built-ins like print, readline, close, <FH>, etc. You can likewise invoke any IO::Handle method on your regular, unblessed filehandles. This includes autovivified handles and even named ones like INPUT or STDIN, although none of these has been blessed as an object.
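For example, here's a sketch using the IO::File constructor in place of open; the resulting object goes anywhere a regular handle would:

use IO::File;

my $fh = IO::File->new("/acme/widgets/data", "<")
    or die "Couldn't open /acme/widgets/data for reading: $!\n";
while (<$fh>) {
    print if /blue/;
}
$fh->close;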

Method invocation syntax is visually noisier than the equivalent Perl function call, and incurs some performance penalty compared with a function call (where an equivalent function exists). We generally restrict our method use to those providing functionality that would otherwise be difficult or impossible to achieve in pure Perl without resorting to modules.

For example, the blocking method enables or disables blocking on a filehandle, a pleasant alternative to the Fcntl wizardry that at least one of the authors and probably most of the readership would prefer not having to know. This forms the basis of Recipe 7.20.

Most methods are in the IO::Handle class, which IO::File inherits from, and can even be applied directly to filehandles that aren't objects. They need only be something that Perl will accept as a filehandle. For example:

STDIN->blocking(0);                  # invoke on named handle
open($fh, "<", $filename) or die;    # first autovivify handle, then...
$fh->blocking(0);                    # invoke on unblessed typeglob ref

Like most names in Perl, including those of subroutines and global variables, named filehandles reside in packages. That way, two packages can have filehandles of the same name. When unqualified by package, a named filehandle has a full name that starts with the current package. Writing INPUT is really main::INPUT in the main package, but it's SomeMod::INPUT if you're in a hypothetical SomeMod package.

The built-in filehandles STDIN, STDOUT, and STDERR are special. If they are left unqualified, the main package rather than the current one is used. This is the same exception to normal rules for finding the full name that occurs with built-in variables like @ARGV and %ENV, a topic discussed in the Introduction to Chapter 12.
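A contrived sketch of those rules: a handle opened inside a hypothetical SomeMod package can be reached from elsewhere by its fully qualified name:

package SomeMod;
open(INPUT, "<", "/acme/widgets/data")       # really SomeMod::INPUT
    or die "Couldn't open: $!";

package main;
while (<SomeMod::INPUT>) {                   # fully qualified from outside
    print if /blue/;
}
close(SomeMod::INPUT);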

Unlike named filehandles, which are global symbols within the package, autovivified filehandles implicitly allocated by Perl are anonymous (i.e., nameless) and have no package of their own. More interestingly, they are also like other references in being subject to automatic garbage collection. When a variable holding them goes out of scope and no other copies or references to that variable or its value have been saved away somewhere more lasting, the garbage collection system kicks in, and Perl implicitly closes the handle for you (if you haven't yet done so yourself). This is important in large or long-running programs, because the operating system imposes a limit on how many underlying file descriptors any process can have open—and usually also on how many descriptors can be open across the entire system.

In other words, just as real system memory is a finite resource that you can exhaust if you don't carefully clean up after yourself, the same is true of system file descriptors. If you keep opening new filehandles forever without ever closing them, you'll eventually run out, at which point your program will die if you're lucky or careful, and malfunction if you're not. The implicit close during garbage collection of autoallocated filehandles spares you the headaches that can result from less than perfect bookkeeping.

For example, these two functions both autovivify filehandles into distinct lexical variables of the same name:

sub versive {
    open(my $fh, "<", $SOURCE)
        or die "can't open $SOURCE: $!";
    return $fh;
}

sub apparent {
    open(my $fh, ">", $TARGET)
        or die "can't open $TARGET: $!";
    return $fh;
}

my($from, $to) = ( versive( ), apparent( ) );

Normally, the handles in $fh would be closed implicitly when each function returns. But since both functions return those values, the handles will stay open a while longer. They remain open until explicitly closed, or until the $from and $to variables and any copies you make all go out of scope—at which point Perl dutifully tidies up by closing them if they've been left open.

An even more valuable benefit shows up with buffered handles whose internal buffers contain unwritten data. Because a flush precedes the close, this guarantees that all data finally makes it to where you thought it was going in the first place.[11] For global filehandle names, this implicit flush and close is deferred to final program exit, but it is not forgotten.[12]

[11]Or at least tries to; currently, no error is reported if the implicit write syscall should fail at this stage, which might occur if, for example, the filesystem the open file was on has run out of space.

[12]Unless you exit by way of an uncaught signal, by exec'ing another program, or by calling POSIX::_exit( ).

7.0.2. Standard Filehandles

Every program starts with three standard filehandles already open: STDIN, STDOUT, and STDERR. STDIN, typically pronounced standard in, represents the default source for data flowing into a program. STDOUT, typically pronounced standard out, represents the default destination for data flowing out from a program. Unless otherwise redirected, standard input will be read directly from your keyboard, and standard output will be written directly to your screen.

One need not be so direct about matters, however. Here we tell the shell to redirect your program's standard input to datafile and its standard output to resultsfile, all before your program even starts:

% program < datafile > resultsfile

Suppose something goes wrong in your program that you need to report. If your standard output has been redirected, the person running your program probably wouldn't notice a message that appeared in this output. These are the precise circumstances for which STDERR, typically pronounced standard error, was devised. Like STDOUT, STDERR is initially directed to your screen, but if you redirect STDOUT to a file or pipe, STDERR's destination remains unchanged. That way you always have a standard way to get warnings or errors through to where they're likely to do some good.

Unlike STDERR, which stands ready when STDOUT has been redirected, there's no preopened filehandle for reading from the user when STDIN has been redirected. That's because this need arises much less frequently than does the need for a coherent and reliable diagnostic stream. Rarely, your program may need to ask something of whoever ran it and read their response, even in the face of redirection. The more(1) and less(1) programs do this, for example, because their STDINs are often pipes from other programs whose long output you want to see a page at a time. On Unix systems, open the special file /dev/tty, which represents the controlling terminal for this login session. The open fails if the program has no controlling tty, which is the system's way of reporting that there's no one for your program to communicate with.
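Here's a minimal sketch of that technique, prompting on the controlling terminal even when both standard streams have been redirected (the prompt text is only for illustration):

open(my $tty_in,  "<", "/dev/tty") or die "no controlling tty: $!";
open(my $tty_out, ">", "/dev/tty") or die "no controlling tty: $!";
print $tty_out "Delete all files? [y/n] ";
chomp(my $answer = <$tty_in>);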

This arrangement makes it easy to plug the output from one program into the input of the next, and so on down the line.

% first | second | third

That means to feed the output of the first program into the input of the second, and the output of the second into the input of the third. You might not realize it at first, but this is the same logic as seen when stacking function calls like third(second(first( ))), although the shell's pipeline is a bit easier to read because the transformations proceed from left to right instead of from the inside of the expression outward.

Under the uniform I/O interface of standard input and output, each program can be independently developed, tested, updated, and executed without risk of one program interfering with another, yet they still interoperate easily. They act as tools or parts used to build larger constructs, or as separate stages in a larger manufacturing process. Like having a huge stock of ready-made, interchangeable parts on hand, they can be reliably assembled into larger sequences of arbitrary length and complexity. If those larger sequences are given names by being placed into executable scripts, indistinguishable from the store-bought parts, they can then go on to take part in still larger sequences as though they were basic tools themselves.

An environment where every data-transformation program does one thing well and where data flows from one program to the next through redirectable standard input and output streams is one that strongly encourages a level of power, flexibility, and reliability in software design that could not be achieved otherwise. This, in a nutshell, is the so-called tool-and-filter philosophy that underlies the design of not only the Unix shell but the entire operating system. Although problem domains do exist where this model breaks down—and Perl owes its very existence to plugging one of several infelicities the model forces on you—it is a model that has nevertheless demonstrated its fundamental soundness and scalability for nearly 30 years.

7.0.3. I/O Operations

Perl's most common operations for file interaction are open, print, <FH> to read a record, and close. Perl's I/O functions are documented in Chapter 29 of Programming Perl, and in the perlfunc(1) and perlopentut(1) manpages. The next chapter details I/O operations like <FH>, print, seek, and tell. This chapter focuses on open and how you access the data, rather than what you do with the data.

Arguably the most important I/O function is open. You typically pass it two or three arguments: the filehandle, a string containing the access mode indicating how to open the file (for reading, writing, appending, etc.), and a string containing the filename. If two arguments are passed, the second contains both the access mode and the filename jammed together. We use this conflation of mode and path to good effect in Recipe 7.14.

To open /tmp/log for writing and to associate it with the filehandle LOGFILE, say:

open(LOGFILE, "> /tmp/log")     or die "Can't write /tmp/log: $!";

The three most common access modes are < for reading, > for overwriting, and >> for appending. The open function is discussed in more detail in Recipe 7.1. Access modes can also include I/O layers like :raw and :encoding(iso-8859-1). Later in this Introduction we discuss I/O layers to control buffering, deferring until Chapter 8 the use of I/O layers to convert the contents of files as they're read.
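For instance, here's how those layers might ride along with the mode in a three-arg open (the filenames are only for illustration):

open(my $raw_fh, "<:raw", "image.png")                      # untranslated octets
    or die "Can't open image.png: $!";
open(my $latin_fh, "<:encoding(iso-8859-1)", "latin1.txt")  # decode Latin1 into characters
    or die "Can't open latin1.txt: $!";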

When opening a file or making virtually any other system call,[13] checking the return value is indispensable. Not every open succeeds; not every file is readable; not every piece of data you print reaches its destination. Robust programs check the return values of open, seek, tell, and close; you might want to check other functions, too.

[13]The term system call denotes a call into your operating system kernel. It is unrelated to the C and Perl function that's actually named system. We'll therefore often call these syscalls, after the C and Perl function named syscall.

If a function is documented to return an error under such and such conditions, and you don't check for these conditions, then this will certainly come back to haunt you someday. The Perl documentation lists return values from all functions and operators. Pay special attention to the glyph-like annotations in Chapter 29 of Programming Perl that are listed on the righthand side next to each function call entry—they tell you at a glance which variables are set on error and which conditions trigger exceptions.

Typically, a function that's a true system call fails by returning undef, except for wait, waitpid, and syscall, which all return -1 on failure. You can find the system error message as a string and its corresponding numeric code in the $! variable. This is often used in die or warn messages.
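A small illustration: in string context $! yields the error message, and in numeric context the errno value, so one failed syscall can be reported both ways (the path here is only for illustration):

unless (open(my $fh, "<", "/no/such/file")) {
    warn "open failed: errno ", $! + 0, " ($!)\n";
}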

The most common input operation in Perl is <FH>, the line input operator. Instead of sitting in the middle of its operands the way infix operators do, the line input operator surrounds its filehandle operand, making it more of a circumfix operator, like parentheses. It's also known as the angle operator because of the left- and right-angle brackets that compose it, or as the readline function, since that's the underlying Perl core function it calls.

A record is normally a line, but you can change the record terminator, as detailed in Chapter 8. If FH is omitted, it defaults to the special filehandle, ARGV. When you read from this handle, Perl opens and reads in succession data from those filenames listed in @ARGV, or from STDIN if @ARGV is empty. Customary and curious uses of this are described in Recipe 7.14.
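This default is what makes the classic filter idiom so short; the following sketch prints every line from each file named on the command line, or from STDIN if none was named:

while (<>) {
    print;
}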

At one abstraction level, files are simply streams of octets; that is, of eight-bit bytes. Of course, hardware may impose other organizations, such as blocks and sectors for files on disk or individual IP packets for a TCP connection on a network, but the operating system thankfully hides such low-level details from you.

At a higher abstraction level, files are a stream of logical characters independent of any particular underlying physical representation. Because Perl programs most often deal with text strings containing characters, this is the default set by open when accessing filehandles. See the Introduction to Chapter 8 or Recipe 8.11 for how and when to change that default.

Each filehandle has a numeric value associated with it, typically called its seek offset, representing the position at which the next I/O operation will occur. If you're thinking of files as octet streams, it's how many octets you are from the beginning of the file, with the starting offset represented by 0. This position is implicitly updated whenever you read or write non-zero-length data on a handle. It can also be updated explicitly with the seek function.

Text files are a slightly higher level of abstraction than octet streams. The number of octets need not be identical to the number of characters. Unless you take special action, Perl's filehandles are logical streams of characters, not physical streams of octets. The only time those two numbers (characters and octets) are the same in text files is when each character read or written fits comfortably in one octet (because all code points are below 256), and when no special processing for end of line (such as conversion between "\cJ\cM" and "\n") occurs. Only then do logical character position and physical byte position work out to be the same.

This is the sort of file you have with ASCII or Latin1 text files under Unix, where no fundamental distinction exists between text and binary files, which significantly simplifies programming. Unfortunately, 7-bit ASCII text is no longer prevalent, and even 8-bit encodings of ISO 8859-n are quickly giving way to multibyte-encoded Unicode text.

In other words, because encoding layers such as ":utf8" and translation layers such as ":crlf" can change the number of bytes transferred between your program and the outside world, you cannot sum up how many characters you've transferred to infer your current file position in bytes. As explained in Chapter 1, characters are not bytes—at least, not necessarily and not dependably. Instead, you must use the tell function to retrieve your current file position. For the same reason, only values returned from tell (and the number 0) are guaranteed to be suitable for passing to seek.
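For example, here's a sketch of the safe way to reread a line: capture the position with tell beforehand, then hand that same value back to seek (SEEK_SET comes from the Fcntl module):

use Fcntl qw(:seek);

open(my $fh, "<", "/acme/widgets/data") or die "open: $!";
my $pos  = tell($fh);                          # position before the read
my $line = <$fh>;
seek($fh, $pos, SEEK_SET) or die "seek: $!";   # back to reread the same line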

In Recipe 7.17, we read the entire contents of a file opened in update mode into memory, change our internal copy, and then seek back to the beginning of that file to write those modifications out again, thereby overwriting what we started with.

When you no longer have use for a filehandle, close it. The close function takes a single filehandle and returns true if the filehandle could be successfully flushed and closed, and returns false otherwise. You don't need to explicitly close every filehandle. When you open a filehandle that's already open, Perl implicitly closes it first. When your program exits, any open filehandles also get closed.

These implicit closes are for convenience, not stability, because they don't tell you whether the syscall succeeded or failed. Not all closes succeed, and even a close on a read-only file can fail. For instance, you could lose access to the device because of a network outage. It's even more important to check the close if the file was opened for writing; otherwise, you wouldn't notice if the filesystem filled up.

close(FH)           or die "FH didn't close: $!";

Closing filehandles as soon as you're done with them can also aid portability to non-Unix platforms, because some have problems in areas such as reopening a file before closing it and renaming or removing a file while it's still open. These operations pose no problem to POSIX systems, but others are less accommodating.

The paranoid programmer even checks the close on standard output stream at the program's end, lest STDOUT had been redirected from the command line and the output filesystem filled up. Admittedly, your runtime system should take care of this for you, but it doesn't.
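One common way to be that paranoid is an explicit check at exit; a minimal sketch:

END {
    close(STDOUT) or die "can't close STDOUT: $!";
}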

Checking standard error, though, is more problematic. After all, if STDERR fails to close, what are you planning to do about it? Well, you could determine why the close failed to see whether there's anything you might do to correct the situation. You could even load up the Sys::Syslog module and call syslog( ), which is what system daemons do, since they don't otherwise have access to a good STDERR stream.

STDOUT is the default filehandle used by the print, printf, and write functions if no filehandle argument is passed. Change this default with select, which takes the new default output filehandle and returns the previous one. The new output filehandle must have already been opened before calling select:

$old_fh = select(LOGFILE);                  # switch to LOGFILE for output
print "Countdown initiated ...\n";
select($old_fh);                            # return to original output
print "You have 30 seconds to reach minimum safety distance.\n";

Some of Perl's special variables change the behavior of the currently selected output filehandle. Most important is $|, which controls output buffering for each filehandle. Flushing output buffers is explained in Recipe 7.19.
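For example, here are two equivalent ways to turn on autoflush for a handle: the classic select-and-$| one-liner, or the tidier IO::Handle method (we assume LOGFILE is already open):

select((select(LOGFILE), $| = 1)[0]);   # flip $| on LOGFILE, restore old default

use IO::Handle;
LOGFILE->autoflush(1);                  # same effect, easier to read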

Perl has functions for buffered and unbuffered I/O. Although there are some exceptions (see the following table), you shouldn't mix calls to buffered and unbuffered I/O functions. That's because buffered functions may keep data in their buffers that the unbuffered functions can't know about. The following table shows the two sets of functions you should not mix. Functions on a particular row are only loosely associated; for instance, sysread doesn't have the same semantics as <FH>, but they are on the same row because they both read from a filehandle. Repositioning is addressed in Chapter 8, but we also use it in Recipe 7.17.

Action           Buffered           Unbuffered
input            <FH>, readline     sysread
output           print              syswrite
repositioning    seek, tell         sysseek

As of Perl v5.8 there is a way to mix these functions: I/O layers. You still can't turn on buffering for the unbuffered functions, but you can turn off buffering for the buffered ones. Perl now lets you select the implementation of I/O you wish to use. One possible choice is :unix, which makes Perl use unbuffered syscalls rather than your stdio library or Perl's portable reimplementation of stdio, called perlio. Enable the unbuffered I/O layer when you open the file with:

open(FH, "<:unix", $filename)  or die;

Having opened the handle with the unbuffered layer, you can now mix calls to Perl's buffered and unbuffered I/O functions with impunity because with that I/O layer, in reality there are no buffered I/O functions. When you print, Perl is then really using the equivalent of syswrite. More information can be found in Recipe 7.19.


