Reading and Writing Files (sed & awk, Second Edition)

5.11.1. Checking Out Reference Pages

Like many programs, a sed script often starts out small, and is simple to write and simple to read. In testing the script, you may discover specific cases for which the general rules do not apply. To account for these, you add lines to your script, making it longer, more complex, and more complete. While the amount of time you spend refining your script may cancel out the time saved by not doing the editing manually, at least during that time your mind has been engaged by your own seeming sleight-of-hand: "See! The computer did it."

We encountered one such problem in preparing a formatted copy of command pages that the writer had typed as a text file without any formatting information. Although the files had no formatting codes, headings were used consistently to identify the format of the command pages. A sample file is shown below.

******************************************************************

NAME:	DBclose - closes a database

SYNTAX:
	void	DBclose(fdesc)
		DBFILE *fdesc;

USAGE:
	fdesc	- pointer to database file descriptor

DESC: 
DBclose() closes a file when given its database file descriptor.  
Your pending writes to that file will be completed before the
file is closed.  All of your update locks are removed. 
*fdesc becomes invalid.

Other users are not affected when you call DBclose().  Their update
locks and pending writes are not changed.

Note that there is no default file as there is in BASIC.  
*fdesc must specify an open file.

DBclose() is analogous to the CLOSE statement in BASIC.

RETURNS:
	There is no return value

******************************************************************

The task was to format this document for the laser printer, using the reference header macros we had developed. Because there were perhaps forty of these command pages, it would have been utter drudgery to go through and add codes by hand. However, with that many pages, and even though the writer was generally consistent in entering them, there were enough differences from command to command that several passes were needed.

We'll examine the process of building this sed script. In a sense, this is a process of looking carefully at each line of a sample input file and determining whether or not an edit must be made on that line. Then we look at the rest of the file for similar occurrences. We try to find specific patterns that mark the lines or range of lines that need editing.

For instance, by looking at the first line, we know we need to eliminate the row of asterisks separating each command. We specify an address for any line beginning and ending with an asterisk and look for zero or more asterisks in between. The regular expression uses an asterisk as a literal and as a metacharacter:

/^\*\**\*$/d
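As a quick check of this address, the sketch below feeds three made-up lines to sed. The sample lines are assumptions chosen for illustration: a row of asterisks, an ordinary line, and a line that merely begins with an asterisk. Only the first should be deleted.

```shell
# Only a line made up entirely of asterisks (two or more) matches the address.
out=$(printf '%s\n' '**********' 'NAME: DBclose - closes a database' '* a bulleted note' |
  sed '/^\*\**\*$/d')
printf '%s\n' "$out"
```

Because the pattern is anchored at both ends, a line that only starts with an asterisk is left alone.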

This command deletes entire lines of asterisks anywhere they occur in the file. We saw that blank lines were used to separate paragraphs, but replacing every blank line with a paragraph macro would cause other problems. In many cases, the blank lines can be removed because spacing has been provided in the macro. This is a case where we put off deleting or replacing blank lines on a global basis until we have dealt with specific cases. For instance, some blank lines separate labeled sections, and we can use them to define the end of a range of lines. The script, then, is designed to delete unwanted blank lines as the last operation.

Tabs were a similar problem. Tabs were used to indent syntax lines and, in some cases, after the colon following a label, such as "NAME". Our first thought was to remove all tabs by replacing them with eight spaces, but there were tabs we wanted to keep, such as those inside the syntax line. So we removed only specific cases: a tab at the beginning of a line and a tab following a colon. (In the commands below, • stands for a literal tab character.)

/^•/s///
/:•/s//:/
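One way to try these two commands from a shell prompt is sketched below. Because a literal tab is awkward to type on a command line, the sketch stores one in a shell variable named tab (a name chosen here for illustration) and interpolates it into the script. The empty pattern in each substitution reuses the regular expression from its address.

```shell
tab=$(printf '\t')   # a literal tab, standing in for the tab in the listing

out=$(printf 'NAME:\tDBclose - closes a database\n\tvoid\tDBclose(fdesc)\n' |
  sed -e "/^$tab/s///" -e "/:$tab/s//:/")
printf '%s\n' "$out"
```

The tab after the colon and the single leading tab are removed, while the tab inside the syntax line is kept.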

The next line we come across has the name of the command and a description.

NAME:	DBclose - closes a database

We need to replace it with the macro .Rh 0. Its syntax is:

.Rh 0 "command" "description"

We insert the macro at the beginning of the line, remove the hyphen, and surround the arguments with quotation marks.

/NAME:/ {
	s//.Rh 0 "/
	s/ - /" "/
	s/$/"/
	}

We can jump ahead of ourselves a bit here and look at what this portion of our script does to the sample line:

.Rh 0 "DBclose" "closes a database"
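To reproduce this result in isolation, the following sketch runs just the NAME block on the sample line as it looks after the tab following the colon has been stripped. It assumes nothing beyond the three substitutions shown above.

```shell
# The empty pattern in the first substitution reuses /NAME:/ from the address.
out=$(printf 'NAME:DBclose - closes a database\n' |
  sed '/NAME:/ {
	s//.Rh 0 "/
	s/ - /" "/
	s/$/"/
}')
printf '%s\n' "$out"
```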

The next part that we examine begins with "SYNTAX." What we need to do here is put in the .Rh macro, plus some additional troff requests for indentation, a font change, and no-fill and no-adjust. (The indentation is required because we stripped the tabs at the beginning of the line.) These requests must go in before and after the syntax lines, turning the capabilities on and off. To do this, we define an address that specifies the range of lines between two patterns, the label and a blank line. Then, using the change command, we replace the label and the blank line with a series of formatting requests.

/SYNTAX:/,/^$/ {
	/SYNTAX:/c\
.Rh Syntax\
.in +5n\
.ft B\
.nf\
.na
	/^$/c\
.in -5n\
.ft R\
.fi\
.ad b
	}

Following the change command, each line of input ends with a backslash except the last line. As a side effect of the change command, the current line is deleted from the pattern space.
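The sketch below is a cut-down version of this range, using only two of the requests so that the effect of the change command is easy to see. The input lines are a simplified stand-in for the sample page.

```shell
# Within the range, the label line and the closing blank line are each
# replaced by formatting requests; the syntax line itself passes through.
out=$(printf '%s\n' 'SYNTAX:' 'void DBclose(fdesc)' '' 'USAGE:' |
  sed '/SYNTAX:/,/^$/ {
/SYNTAX:/c\
.Rh Syntax\
.nf
/^$/c\
.fi
}')
printf '%s\n' "$out"
```

Each line of replacement text except the last ends with a backslash, and the line after the range (USAGE: here) is not touched.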

The USAGE portion is next, consisting of one or more descriptions of variable items. Here we want to format each item as an indented paragraph with a hanging italicized label. First, we output the .Rh macro; then we search for lines having two parts separated by a tab and a hyphen. Each part is saved, using backslash-parentheses, and recalled during the substitution.

/USAGE:/,/^$/ {
	/USAGE:/c\
.Rh Usage
	/\(.*\)•- \(.*\)/s//.IP "\\fI\1\\fR" 15n\
\2./
	}

This is a good example of the power of regular expressions. Let's look ahead, once again, and preview the output for the sample.

.Rh Usage
.IP "\fIfdesc\fR" 15n
pointer to database file descriptor.
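The substitution can also be tried on its own, as in this sketch. It builds the script in a shell variable (script, a name chosen for illustration) so that a literal tab can be spliced between the two single-quoted halves. The input is the sample usage line after its leading tab has been removed.

```shell
tab=$(printf '\t')   # a literal tab to splice into the pattern

# \1 and \2 recall the saved parts; \\ yields a literal backslash for troff,
# and the backslash at the end of the first line puts a newline in the output.
script='s/\(.*\)'"$tab"'- \(.*\)/.IP "\\fI\1\\fR" 15n\
\2./'

out=$(printf 'fdesc\t- pointer to database file descriptor\n' | sed "$script")
printf '%s\n' "$out"
```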

The next part we come across is the description. We notice that blank lines are used in this portion to separate paragraphs. In specifying the address for this portion, we use the next label, "RETURNS."

/DESC:/,/RETURNS/ {
	/DESC:/i\
.LP
	s/DESC: *$/.Rh Description/
	s/^$/.LP/
}

The first thing we do is insert a paragraph macro because the preceding USAGE section consisted of indented paragraphs. (We could have used the variable-list macros from the -mm package in the USAGE section; if so, we would insert the .LE at this point.) This is done only once, which is why it is keyed to the "DESC" label. Then we substitute the label "DESC" with the .Rh macro and replace all blank lines in this section with a paragraph macro.
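A minimal sketch of this block on a shortened description follows. The two paragraph lines are placeholders rather than text from the sample page.

```shell
# .LP is inserted once, before the label line; the label becomes .Rh,
# and every blank line inside the range becomes a paragraph macro.
out=$(printf '%s\n' 'DESC:' 'First paragraph.' '' 'Second paragraph.' '' 'RETURNS:' |
  sed '/DESC:/,/RETURNS/ {
/DESC:/i\
.LP
s/DESC: *$/.Rh Description/
s/^$/.LP/
}')
printf '%s\n' "$out"
```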

When we tested this portion of the sed script on our sample file, it didn't work because there was a single space following the DESC label. We changed the regular expression to look for zero or more spaces following the label. Although this worked for the sample file, there were other problems when we used a larger sample. The writer was inconsistent in his use of the "DESC" label. Mostly, it occurred on a line by itself; sometimes, though, it was included at the start of the second paragraph. So we had to add another pattern to deal with this case. It searches for the label followed by a space and whatever text remains on the line, which it saves and moves to the line after the macro.

s/DESC: *$/.Rh Description/
s/DESC: \(.*\)/.Rh Description\
\1/

In the second case, the reference header macro is output followed by a newline.
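The two substitutions can be compared side by side with a sketch like the one below, which feeds sed one line of each form: the label followed only by a space, and the label followed by text taken from the sample description.

```shell
# First form: label alone (with a trailing space). Second form: label plus text.
out=$(printf '%s\n' 'DESC: ' 'DESC: DBclose() closes a file.' |
  sed -e 's/DESC: *$/.Rh Description/' -e 's/DESC: \(.*\)/.Rh Description\
\1/')
printf '%s\n' "$out"
```

The order matters: the label-only substitution runs first, so the second pattern only ever sees lines that still carry text after the label.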

The next section, labeled "RETURNS," is handled in the same way as the SYNTAX section.

We do make minor content changes, replacing the label "RETURNS" with "Return Value" and consequently adding this substitution:

s/There is no return value\.*/None./

The very last thing we do is delete remaining blank lines.

/^$/d
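To see why this command belongs at the end, the sketch below runs the same commands in two orders on a small made-up input. The "> " prefix is only a marker, assumed here for illustration, showing which lines a range ending at a blank line has claimed.

```shell
input=$(printf '%s\n' 'USAGE:' 'fdesc - pointer' '' 'DESC:' 'text')

# Blank lines deleted last: the range ends at the blank line, as intended.
late=$(printf '%s\n' "$input" |
  sed -e '/USAGE:/,/^$/{' -e '/./s/^/> /' -e '}' -e '/^$/d')

# Blank lines deleted first: the range never sees its end and runs to end of file.
early=$(printf '%s\n' "$input" |
  sed -e '/^$/d' -e '/USAGE:/,/^$/{' -e '/./s/^/> /' -e '}')

printf 'late:\n%s\nearly:\n%s\n' "$late" "$early"
```

In the second ordering, the DESC lines are marked as well, because the blank line that should have closed the range was already gone.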

Our script is put in a file named refsed. Here it is in full:

# refsed -- add formatting codes to reference pages
/^\*\**\*$/d
/^•/s///
/:•/s//:/
/NAME:/ {
	s//.Rh 0 "/
	s/ - /" "/
	s/$/"/
}
/SYNTAX:/,/^$/ {
	/SYNTAX:/c\
.Rh Syntax\
.in +5n\
.ft B\
.nf\
.na
	/^$/c\
.in -5n\
.ft R\
.fi\
.ad b
}
/USAGE:/,/^$/ {
	/USAGE:/c\
.Rh Usage
	/\(.*\)•- \(.*\)/s//.IP "\\fI\1\\fR" 15n\
\2./
}
/DESC:/,/RETURNS/ {
	/DESC:/i\
.LP
	s/DESC: *$/.Rh Description/
	s/DESC: \(.*\)/.Rh Description\
\1/
	s/^$/.LP/
}
/RETURNS:/,/^$/ {
	/RETURNS:/c\
.Rh "Return Value"
	s/There is no return value\.*/None./
}
/^$/d

As we have remarked, you should not have sed overwrite the original. It is best to redirect the output of sed to another file or let it go to the screen. If the sed script does not work properly, you will find that it is generally easier to change the script and re-run it on the original file than to write a new script to correct the problems caused by a previous run.

$ sed -f refsed refpage  
.Rh 0 "DBclose" "closes a database"
.Rh Syntax
.in +5n
.ft B
.nf
.na
void	DBclose(fdesc)
	DBFILE *fdesc;
.in -5n
.ft R
.fi
.ad b
.Rh Usage
.IP "\fIfdesc\fR" 15n
pointer to database file descriptor.
.LP
.Rh Description
DBclose() closes a file when given its database file descriptor.  
Your pending writes to that file will be completed before the
file is closed.  All of your update locks are removed. 
*fdesc becomes invalid.
.LP
Other users are not affected when you call DBclose().  Their update
locks and pending writes are not changed.
.LP
Note that there is no default file as there is in BASIC.  
*fdesc must specify an open file.
.LP
DBclose() is analogous to the CLOSE statement in BASIC.
.LP
.Rh "Return Value"
None.
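The advice about not overwriting the original can be followed with a workflow along these lines. This is a sketch only: it uses a scratch directory, a two-line stand-in for refpage, a one-command stand-in for refsed, and an output name (refpage.new) chosen for illustration.

```shell
workdir=$(mktemp -d)
printf '%s\n' '**********' 'NAME: DBclose - closes a database' > "$workdir/refpage"
printf '%s\n' '/^\*\**\*$/d' > "$workdir/refsed"

# Write the result to a new file; the original refpage is left untouched.
sed -f "$workdir/refsed" "$workdir/refpage" > "$workdir/refpage.new"

out=$(cat "$workdir/refpage.new")
kept=$(grep -c '' "$workdir/refpage")   # line count of the untouched original
printf '%s\n' "$out"
rm -rf "$workdir"
```

If the output is wrong, the script can be corrected and run again on the unchanged original.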
