Input/Output and Command-Line Processing (Learning the Korn Shell, 2nd Edition)

The past few chapters have gone into detail about various shell programming techniques, mostly focused on the flow of data and control through shell programs. In this chapter, we'll switch the focus to two related topics. The first is the shell's mechanisms for doing file-oriented input and output. We present information that expands on what you already know about the shell's basic I/O redirectors.

Second, we zoom in and talk about I/O at the line and word level. This is a fundamentally different topic, since it involves moving information between the domains of files/terminals and shell variables. print and command substitution are two ways of doing this that we've seen so far.

Our discussion of line and word I/O then leads into a more detailed explanation of how the shell processes command lines. This information is necessary so that you can understand exactly how the shell deals with quotation, and so that you can appreciate the power of an advanced command called eval, which we cover at the end of the chapter.

7.1. I/O Redirectors

In Chapter 1 you learned about the shell's basic I/O redirectors, <, >, and |. Although these are enough to get you through 95% of your Unix life, you should know that the Korn shell supports a total of 20 I/O redirectors. Table 7-1 lists them, including the three we've already seen. Although some of the rest are useful, others are mainly for systems programmers. We will wait until the next chapter to discuss the last three, which, along with >| and <<<, are not present in most Bourne shell versions.

Table 7-1. I/O redirectors

Redirector	Function
> file	Direct standard output to file
< file	Take standard input from file
cmd1 | cmd2	Pipe; take standard output of cmd1 as standard input to cmd2
>> file	Direct standard output to file; append to file if it already exists
>| file	Force standard output to file even if noclobber is set
<> file	Open file for both reading and writing on standard input[90]
<< label	Here-document; see text
<<- label	Here-document variant; see text
<<< word	Here-string; see text
n> file	Direct output file descriptor n to file
n< file	Set file as input file descriptor n
<&n	Duplicate standard input from file descriptor n
>&n	Duplicate standard output to file descriptor n
<&n-	Move file descriptor n to standard input
>&n-	Move file descriptor n to standard output
<&-	Close the standard input
>&-	Close the standard output
|&	Background process with I/O from parent shell
n<&p	Move input from coprocess to file descriptor n
n>&p	Move output to coprocess to file descriptor n

[90] Normally, files opened with < are opened read-only.

Notice that some of the redirectors in Table 7-1 contain a digit n and that their descriptions contain the term file descriptor; we'll cover that in a little while. (In fact, any redirector that starts with < or > may be used with a file descriptor; this is omitted from the table for simplicity.)

The first two new redirectors, >> and >|, are simple variations on the standard output redirector >. The >> appends to the output file (instead of overwriting it) if it already exists; otherwise it acts exactly like >. A common use of >> is for adding a line to an initialization file (such as .profile or .mailrc) when you don't want to bother with a text editor. For example:

$ cat >> .mailrc
> alias fred frederick@longmachinename.longcompanyname.com
> ^D
$

As we saw in Chapter 1, cat without an argument uses standard input as its input. This allows you to type the input and end it with CTRL-D on its own line. The alias line will be appended to the file .mailrc if it already exists; if it doesn't, the file is created with that one line.

Recall from Chapter 3 that you can prevent the shell from overwriting a file with > file by typing set -o noclobber. The >| operator overrides noclobber -- it's the "Do it anyway, darn it!" redirector.
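To make the difference concrete, here is a minimal sketch of > and >| with noclobber turned on. The scratch directory and the file name notes.txt are made up for illustration, and printf stands in for print so that the lines also run in other Bourne-style shells.

```shell
# Sketch: noclobber makes ">" refuse to overwrite; ">|" overrides it.
# The scratch directory and file name are hypothetical.
scratch=$(mktemp -d)
target=$scratch/notes.txt

printf 'first\n' > "$target"        # create the file
set -o noclobber                    # protect existing files from ">"

# The subshell contains the failed redirection and its error message.
if (printf 'second\n' > "$target") 2> /dev/null; then
    overwritten=yes
else
    overwritten=no                  # the file exists, so ">" is refused
fi

printf 'third\n' >| "$target"       # ">|" writes despite noclobber
set +o noclobber                    # restore the default behavior
contents=$(cat "$target")
rm -rf "$scratch"
```

After this runs, overwritten is no and the file's only line is the one written with >|.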

Unix systems allow you to open files read-only, write-only, and read-write. The < redirector opens the input file read-only; if a program attempts to write on standard input, it will receive an error. Similarly, the > redirector opens the output file write-only; attempting to read from standard output generates an error. The <> redirector opens a file for both reading and writing, by default on standard input. It is up to the invoked program to notice this and take advantage of the fact, but it is useful in the case where a program may want to update data in a file "in place." This operator is most often used for writing networking clients; see Section 7.1.4, later in this chapter, for an example.
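As a rough illustration of an "in place" update, the following sketch opens a file read-write and overwrites its second line. It attaches <> to file descriptor 3 instead of standard input (file descriptors are covered later in this section), and the file name and contents are invented. It also assumes that read leaves the file offset just past the line it consumed, which is what we would expect for a regular file.

```shell
# Sketch: update a file in place through one read-write descriptor.
scratch=$(mktemp -d)
data=$scratch/data.txt              # hypothetical data file
printf 'abc\ndef\nghi\n' > "$data"

# Reads and writes on descriptor 3 share a single file offset.
{
    read -r first <&3               # consume the first line: "abc"
    printf 'XYZ\n' >&3              # overwrite the next four bytes: "def\n"
} 3<> "$data"

updated=$(cat "$data")
rm -rf "$scratch"
```

Because the replacement text is exactly as long as the line it overwrites, the third line is left untouched.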

7.1.1. Here-Documents

The << label redirector essentially forces the input to a command to be the shell program's text, which is read until there is a line that contains only label. The input in between is called a here-document. Here-documents aren't very interesting when used from the command prompt. In fact, it's the same as the normal use of standard input except for the label. We could have used a here-document in the previous example of >>, like this (EOF, for "end of file," is an often-used label):

$ cat >> .mailrc << EOF
> alias fred frederick@longmachinename.longcompanyname.com
> EOF
$

Here-documents are meant to be used from within shell scripts; they let you specify "batch" input to programs. A common use of here-documents is with simple text editors like ed(1). Task 7-1 uses a here-document in this way.

Task 7-1

The s file command in mail(1) saves the current message in file. If the message came over a network (such as the Internet), it has several prepended header lines that give information about network routing. You need this information because you're trying to solve some network routing problems. Write a shell script that extracts just the header lines from the file.

We can use ed to delete the body lines, leaving just the header. To do this, we need to know something about the syntax of mail messages, specifically, that there is always a blank line between the header lines and the message text. The ed command /^$/,$d does the trick: it means, "Delete from the first blank line [91] through the last line of the file." We also need the ed commands w (write the changed file) and q (quit). Here is the code that solves the task:

[91] The line has to be completely empty; no spaces or TABs. That's OK: mail message headers are separated from their bodies by exactly this kind of blank line.

ed "$1" << \EOF
/^$/,$d
w
q
EOF

Normally, the shell does parameter (variable) substitution, command substitution, and arithmetic substitution on text in a here-document, meaning that you can use shell variables and commands to customize the text. This evaluation is disabled if any part of the delimiter is quoted, as done in the previous example. (This prevents the shell from treating $d as a variable substitution.)
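The effect of quoting the delimiter is easy to see side by side. This is only a small sketch: the variable name and its value are made up, and cat simply echoes back what the shell hands it.

```shell
name=fred                           # hypothetical variable

# Unquoted delimiter: the shell substitutes inside the here-document.
expanded=$(cat << EOF
user=$name sum=$((2 + 3))
EOF
)

# Quoted delimiter: the text is passed through untouched.
literal=$(cat << 'EOF'
user=$name sum=$((2 + 3))
EOF
)

printf '%s\n' "$expanded" "$literal"
```

The first line printed has the variable and the arithmetic filled in; the second still contains the dollar signs exactly as typed.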

Often though, you do want the shell to perform its evaluations: perhaps the most common use of here-documents is for providing templates for form generators or program text for program generators. Task 7-2 is a simple task for system administrators that shows how this works.

Task 7-2

Write a script that sends a mail message to a set of users saying that a new version of a certain program has been installed in a certain directory.

You can get a list of all users on the system in various ways; perhaps the easiest is to use cut to extract the first field of /etc/passwd, the file that contains all user account information. Fields in this file are separated by colons (:).[92]

[92] There are a few possible problems with this; for example, /etc/passwd usually contains information on "accounts" that aren't associated with people, like uucp, lp, and daemon. We'll ignore such problems for the purpose of this example.

Given such a list of users, the following code does the trick:

pgmname=$1
for user in $(cut -f1 -d: /etc/passwd); do
    mail $user << EOF
Dear $user,

A new version of $pgmname has been installed in $(whence "$pgmname").

Regards,
Your friendly neighborhood sysadmin.
EOF
done

The shell substitutes the appropriate values for the name of the program and its directory.

The redirector << has two variations. First, you can prevent the shell from doing parameter, command and arithmetic substitution by surrounding the label in single or double quotes. (Actually, it's enough to quote just one character in the label.) We saw this in the solution to Task 7-1.

The second variation is <<-, which deletes leading TABs (but not spaces) from the here-document and the label line. This allows you to indent the here-document's text, making the shell script more readable:

pgmname=$1
for user in $(cut -f1 -d: /etc/passwd); do
    # The here-document lines below must be indented with TABs, not spaces.
    mail $user <<- EOF
	Dear $user,

	A new version of $pgmname has been installed in $(whence "$pgmname").

	Regards,

	Your friendly neighborhood sysadmin.
	EOF
done

Of course, you need to choose your label so that it doesn't appear as an actual input line.

7.1.2. Here-Strings

A common idiom in shell programming is to use print to generate some text to be further processed by one or more commands:

# start with a mild interrogation
print -r "$name, $rank, $serial_num" | interrogate -i mild

This could be rewritten to use a here-document, which is slightly more efficient, although not necessarily any easier to read:

# start with a mild interrogation
interrogate -i mild << EOF
$name, $rank, $serial_num
EOF

Starting with ksh93n,[93] the Korn shell provides a new form of here-document, using three less-than signs:

[93] Thanks to David Korn for providing me prerelease access to the version with this feature. ADR.

program <<< WORD

In this form, the text of WORD (followed by a trailing newline) becomes the input to the program. For example:

# start with a mild interrogation
interrogate -i mild <<< "$name, $rank, $serial_num"

This notation originated in the Unix version of the rc shell, where it is called a "here string." It was later picked up by the Z shell, zsh (see Appendix A), from which the Korn shell borrowed it. This notation is simple, easy to use, efficient, and visually distinguishable from regular here-documents.
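Here is a short sketch comparing the pipe form with the here-string form. It needs a shell that supports <<< (ksh93n or later; bash and zsh also accept it). The variable values are invented, tr stands in for a real consumer of the text, and printf is used in place of print.

```shell
name=Smith rank=Major serial_num=12345     # hypothetical values

# Pipe form: a separate command generates the text.
piped=$(printf '%s\n' "$name, $rank, $serial_num" | tr 'a-z' 'A-Z')

# Here-string form: the word itself becomes standard input.
direct=$(tr 'a-z' 'A-Z' <<< "$name, $rank, $serial_num")

# The here-string adds a trailing newline, so "abc" arrives as 4 bytes.
bytes=$(( $(wc -c <<< "abc") ))
```

Both forms deliver the same text to tr; the byte count shows the newline that the shell appends to the word.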

7.1.3. File Descriptors

The next few redirectors in Table 7-1 depend on the notion of a file descriptor. This is a low-level Unix I/O concept that is vital to understand when programming in C or C++. It appears at the shell level when you want to do anything that doesn't involve standard input, standard output and standard error. You can get by with a few basic facts about them; for the whole story, look at the open(2), creat(2), read(2), write(2), dup(2), dup2(2), fcntl(2), and close(2) entries in the Unix manual. (As the manual entries are aimed at the C programmer, their relationship to the shell concepts won't necessarily be obvious.)

File descriptors are integers starting at 0 that index an array of file information within a process. When a process starts, it has three file descriptors open. These correspond to the three standards: standard input (file descriptor 0), standard output (1), and standard error (2). If a process opens Unix files for input or output, they are assigned to the next available file descriptors, starting with 3.

By far the most common use of file descriptors with the Korn shell is in saving standard error in a file. For example, if you want to save the error messages from a long job in a file so that they don't scroll off the screen, append 2> file to your command. If you also want to save standard output, append > file1 2> file2.
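A quick way to convince yourself that the two streams really are separate is to run a command that writes to both. In this sketch, long_job is a made-up stand-in for a real job, and the log file names are arbitrary.

```shell
scratch=$(mktemp -d)

# Hypothetical stand-in for a long job that writes to both streams.
long_job() {
    printf 'progress report\n'
    printf 'something went wrong\n' >&2
}

# Standard output goes to one file, standard error to another.
long_job > "$scratch/out.log" 2> "$scratch/err.log"

out=$(cat "$scratch/out.log")
err=$(cat "$scratch/err.log")
rm -rf "$scratch"
```

Each log ends up holding only the lines written to its own stream.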

This leads to Task 7-3.

Task 7-3

You want to start a long job in the background (so that your terminal is freed up) and save both standard output and standard error in a single log file. Write a function that does this.

We'll call this function start. The code is very terse:

function start {
    "$@" > logfile 2>&1 &
}

This line executes whatever command and parameters follow start. (The command cannot contain pipes or output redirectors.) It first sends the command's standard output to logfile.

Then, the redirector 2>&1 says, "Send standard error (file descriptor 2) to the same place as standard output (file descriptor 1)." 2>&1 is actually a combination of two redirectors in Table 7-1: n> file and >&n. Since standard output is redirected to logfile, standard error will go there too. The final & puts the job in the background so that you get your shell prompt back.

As a small variation on this theme, we can send both standard output and standard error into a pipe instead of a file: command 2>&1 | ... does this. (Why this works is described shortly.) Here is a function that sends both standard output and standard error to the logfile (as above) and to the terminal:

function start {
    "$@" 2>&1 | tee logfile &
}

The command tee(1) takes its standard input and copies it to standard output and the file given as argument.

These functions have one shortcoming: you must remain logged in until the job completes. Although you can always type jobs (see Chapter 1) to check on progress, you can't leave your office for the day unless you want to risk a breach of security or waste electricity. We'll see how to solve this problem in Chapter 8.

The other file-descriptor-oriented redirectors (e.g., <&n) are usually used for reading input from (or writing output to) more than one file at the same time. We'll see an example later in this chapter. Otherwise, they're mainly meant for systems programmers, as are <&- (force standard input to close) and >&- (force standard output to close), <&n- (move file descriptor n to standard input) and >&n- (move file descriptor n to standard output).
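As a taste of reading from more than one file at once, the sketch below opens two files on descriptors 3 and 4, reads them in step, and then closes both. The file names and contents are invented, and the sketch uses exec and read ahead of their full descriptions elsewhere in the book.

```shell
scratch=$(mktemp -d)
printf 'a\nb\n' > "$scratch/left"    # hypothetical input files
printf '1\n2\n' > "$scratch/right"

# Open both files at once on descriptors 3 and 4.
exec 3< "$scratch/left" 4< "$scratch/right"

paired=""
# <&n makes descriptor n the standard input of a single read.
while read -r l <&3 && read -r r <&4; do
    paired="$paired$l$r "
done

exec 3<&- 4<&-                       # n<&- closes descriptor n
rm -rf "$scratch"
```

Each pass through the loop takes one line from each file, so the lines are paired up in order.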

Finally, we should just note that 0< is the same as <, and 1> is the same as >. (In fact, 0 is the default for any operator that begins with <, and 1 is the default for any operator that begins with >.)

7.1.3.1. Redirector ordering

The shell processes I/O redirections in a specific order. Once you understand how this works, you can take advantage of it, particularly for managing the disposition of standard output and standard error.

The first thing the shell does is set up the standard input and output for pipelines as indicated by the | character. After that, it processes the changing of individual file descriptors. As we just saw, the most common idiom that takes advantage of this is to send both standard output and standard error down the same pipeline to a pager program, such as more or less.[94]

[94] less is a nonstandard but commonly available paging program that has more features than more.

$ mycommand -h fred -w wilma 2>&1 | more

In this example, the shell first sets the standard output of mycommand to be the pipe to more. It then redirects standard error (file descriptor 2) to be the same as standard output (file descriptor 1), i.e., the pipe.
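The following sketch shows the effect with a stand-in command. Here mycommand is a hypothetical function that writes one line to each stream, and tr plays the part of the program at the far end of the pipe.

```shell
# Hypothetical command that writes one line to each stream.
mycommand() {
    printf 'normal output\n'
    printf 'error output\n' >&2
}

# Standard error is discarded here, so only standard output is piped.
stdout_only=$(mycommand 2> /dev/null | tr 'a-z' 'A-Z')

# With 2>&1, standard error joins standard output in the pipe.
both=$(mycommand 2>&1 | tr 'a-z' 'A-Z')
```

Only in the second case does the error line pass through tr and come out in uppercase.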

When working with just redirectors, they are processed left-to-right, as they occur on the command line. An example similar to the following has been in the shell man page since the original Version 7 Bourne shell:

program > file1 2>&1          # Standard output and standard error to file1
program 2>&1 > file1          # Standard error to terminal and standard output to file1

In the first case, standard output is sent to file1, and standard error is then sent to where standard output is, i.e., file1. In the second case, standard error is sent to where standard output is, which is still the terminal. The standard output is then redirected to file1, but only the standard output. If you understand this, you probably know all you need to know about file descriptors.
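You can check the two orderings yourself with a small experiment. In this sketch, program is a made-up function that writes one line to each stream, and command substitution plays the role of the terminal by capturing whatever still reaches the original standard output.

```shell
scratch=$(mktemp -d)

program() {                          # hypothetical command
    printf 'out\n'
    printf 'err\n' >&2
}

# Case 1: both streams end up in file1; nothing reaches the "terminal".
first=$(program > "$scratch/file1" 2>&1)

# Case 2: standard error is duplicated before standard output moves.
second=$(program 2>&1 > "$scratch/file2")

file1=$(cat "$scratch/file1")
file2=$(cat "$scratch/file2")
rm -rf "$scratch"
```

In the first case the captured text is empty and file1 holds both lines; in the second, the error line is captured and file2 holds only the normal output.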

7.1.4. Special Filenames

Normally, when you provide a pathname after an I/O redirector such as < or >, the shell tries to open an actual file that has the given filename. However, the shell treats two kinds of pathnames specially.

The first kind of pathname is /dev/fd/N, where N is the file descriptor number of an already open file. For example:

# assume file descriptor 6 is already open on a file
print 'something meaningful' > /dev/fd/6   # same as 1>&6

This works even on systems that don't have a /dev/fd directory. This kind of pathname may also be used with the various file attribute test operators of the [[...]] command.

The second kind of pathname allows access to Internet services via either the TCP or UDP protocol. The pathnames are:

/dev/tcp/host/port: Using TCP, connect to remote host host on remote port port. The host may be given as an IP address in dotted-decimal notation (1.2.3.4) or as a hostname (www.oreilly.com). Similarly, the port for the desired service may be a symbolic name (typically as found in /etc/services) or a numeric port number.[95]

/dev/udp/host/port: This is the same, but using UDP.

[95] The ability to use hostnames was added in ksh93f; use of service names was added in ksh93m.

To use these files for two-way I/O, open a new file descriptor using the exec command (which is described in Chapter 9), using the "read and write" operator, <>. Then use read -u and print -u to read from and write to the new file descriptor. (The read command and the -u option to read and print are described later in this chapter.)

The following example, courtesy of David Korn, shows how to do this. It implements the whois(1) program, which provides information about the registration of Internet domain names:

host=rs.internic.net
port=43
exec 3<> /dev/tcp/$host/$port
print -u3 -f "%s\r\n" "$@"
cat <&3

Using the exec built-in command (see Chapter 9), this program uses the "read-and-write" operator, <>, to open a two-way connection to the host rs.internic.net on TCP port 43, which provides the whois service. (The script could have used port=whois as well.) It then uses the print command to send the argument strings to the whois server. Finally, it reads the returned result using cat. Here is a sample run:

$ whois.ksh kornshell.com

Whois Server Version 1.3

Domain names in the .com, .net, and .org domains can now be registered
with many different competing registrars. Go to http://www.internic.net
for detailed information.

   Domain Name: KORNSHELL.COM
   Registrar: NETWORK SOLUTIONS, INC.
   Whois Server: whois.networksolutions.com
   Referral URL: http://www.networksolutions.com
   Name Server: NS4.PAIR.COM
   Name Server: NS0.NS0.COM
   Updated Date: 02-dec-2001


>>> Last update of whois database: Sun, 10 Feb 2002 05:19:14 EST <<<

The Registry database contains ONLY .COM, .NET, .ORG, .EDU domains and
Registrars.

Network programming is beyond the scope of this book. But for most things, you will probably want to use TCP connections instead of UDP connections if you do write any networking programs in ksh.
