Substitution (sed & awk, Second Edition)

5.3.1.1. Correcting index entries

Later, in the awk section of this book, we will present a program for formatting an index, such as the one for this book. The first step in creating an index is to place index codes in the document files. We use an index macro named .XX, which takes a single argument, the index entry. A sample index entry might be:

.XX "sed, substitution command"

Each index entry appears on a line by itself. When you run an index, you get a collection of index entries with page numbers that are then sorted and merged in a list. An editor poring over that list will typically find errors and inconsistencies that need to be corrected. It is, in short, a pain to have to track down the file where an index entry resides and then make the correction, particularly when there are dozens of entries to be corrected.

Sed can be a great help in making these edits across a group of files. You can simply create a list of edits in a sed script and then run it on all the files. A key point is that the substitute command needs an address that limits it to lines beginning with ".XX"; your script should not make changes in the text itself.

Let's say that we wanted to change the index entry above to "sed, substitute command". The following command would do it:

/^\.XX /s/sed, substitution command/sed, substitute command/

The address matches all lines that begin with ".XX " and only on those lines does it attempt to make the replacement. You might wonder, why not specify a shorter regular expression? For example:

/^\.XX /s/substitution/substitute/

The answer is simply that there could be other entries which use the word "substitution" correctly and which we would not want to change.
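As a rough illustration of why both the address and the longer pattern matter, the sketch below runs the first command against three made-up lines. The sample text is an assumption for demonstration only, not taken from any real document file.

```shell
# Hypothetical sample: the entry to fix, another entry that uses
# "substitution" correctly, and a line of body text.
sample='.XX "sed, substitution command"
.XX "regular expressions, substitution in"
Use the sed, substitution command to edit text.'

# The address restricts the edit to lines beginning with ".XX ";
# the full pattern leaves the second entry alone.
result=$(printf '%s\n' "$sample" |
    sed '/^\.XX /s/sed, substitution command/sed, substitute command/')

printf '%s\n' "$result"
```

Only the first line should change. The body text contains the same phrase but is skipped because it does not begin with ".XX ".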

We can go a step further and provide a shell script that creates a list of index entries prepared for editing as a series of sed substitute commands.

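#! /bin/sh
# index.edit -- compile list of index entries for editing.
# -h (GNU and BSD grep) omits file name prefixes when several files
# are given; "$@" preserves file names that contain spaces.
grep -h "^\.XX" "$@" | sort -u |
sed '
s/^\.XX \(.*\)$/\/^\\.XX \/s\/\1\/\1\//'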

The index.edit shell script uses grep to extract all lines containing index entries from any number of files specified on the command line. It passes this list through sort which, with the -u option, sorts and removes duplicate entries. The list is then piped to sed, and the one-line sed script builds a substitution command.

Let's look at it more closely. Here's just the regular expression:

^\.XX \(.*\)$

It matches the entire line, saving the index entry for recall. Here's just the replacement string:

\/^\\.XX \/s\/\1\/\1\/

It generates a substitute command beginning with an address: a slash, a caret, and then two backslashes, which output the single backslash that protects the dot in the ".XX" that follows. Then comes a space, and another slash completes the address. Each slash is itself preceded by a backslash because the slash also delimits the outer substitute command. Next we output an "s" followed by a slash, and then recall the saved portion to be used as a regular expression. That is followed by another slash, and again we recall the saved substring as the replacement string. A final slash ends the command.
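To see the replacement string at work, the one-line sed script can be tried on a single entry. This is a minimal sketch; the variable names are introduced here for illustration.

```shell
# One index entry, as it would arrive from grep and sort.
entry='.XX "sed, substitution command"'

# The same one-line script used in index.edit.
generated=$(printf '%s\n' "$entry" |
    sed 's/^\.XX \(.*\)$/\/^\\.XX \/s\/\1\/\1\//')

printf '%s\n' "$generated"
# prints: /^\.XX /s/"sed, substitution command"/"sed, substitution command"/
```

The saved entry, including its quotation marks, appears twice: once as the pattern and once as the replacement.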

When the index.edit script is run on a file, it creates a listing similar to this:

$ index.edit ch05
/^\.XX /s/"append command(a)"/"append command(a)"/
/^\.XX /s/"change command"/"change command"/
/^\.XX /s/"change command(c)"/"change command(c)"/
/^\.XX /s/"commands:sed, summary of"/"commands:sed, summary of"/
/^\.XX /s/"delete command(d)"/"delete command(d)"/
/^\.XX /s/"insert command(i)"/"insert command(i)"/
/^\.XX /s/"line numbers:printing"/"line numbers:printing"/
/^\.XX /s/"list command(l)"/"list command(l)"/

This output could be captured in a file. Then you can delete the entries that don't need to change and you can make changes by editing the replacement string. At that point, you can use this file as a sed script to correct the index entries in all document files.
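The workflow just described might look something like the following sketch. The scratch directory, the file name index.sed, the ".new" suffix, and the sample contents are all assumptions made for the demonstration, and mktemp is assumed to be available.

```shell
# Hypothetical scratch directory and file names, for illustration only.
workdir=$(mktemp -d)

# A stand-in document file: one index entry and one line of body text.
printf '%s\n' '.XX "change command"' \
    'The change command replaces lines.' > "$workdir/ch05"

# Stand-in for the captured index.edit output after hand editing:
# unwanted lines deleted, replacement string of the remaining line changed.
cat > "$workdir/index.sed" <<'EOF'
/^\.XX /s/"change command"/"change command(c)"/
EOF

# Apply the edited script to each document file, writing a corrected copy.
for file in "$workdir"/ch05; do
    sed -f "$workdir/index.sed" "$file" > "$file.new"
done

corrected=$(cat "$workdir/ch05.new")
printf '%s\n' "$corrected"
```

Writing to a separate copy rather than over the original leaves room to compare the two files before replacing anything.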

When doing a large book with lots of entries, you might use grep again to extract particular entries from the output of index.edit and direct them into their own file for editing. This saves you from having to wade through numerous entries.
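For example, to work only on the entries that mention the change command, the listing could be filtered as in this sketch. The listing is a shortened stand-in for real index.edit output, and the search string is arbitrary.

```shell
# Stand-in for the output of index.edit on a chapter.
listing='/^\.XX /s/"append command(a)"/"append command(a)"/
/^\.XX /s/"change command"/"change command"/
/^\.XX /s/"change command(c)"/"change command(c)"/
/^\.XX /s/"delete command(d)"/"delete command(d)"/'

# Keep only the substitute commands for entries of interest.
subset=$(printf '%s\n' "$listing" | grep 'change command')

printf '%s\n' "$subset"
```

In practice the output of grep would be redirected into its own file, which then becomes the sed script to edit.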

There is one small failing in this program. It should look for metacharacters that might appear literally in index entries and protect them in regular expressions. For instance, an asterisk in an index entry will be interpreted not as a literal asterisk but as a metacharacter. To make that change effectively requires the use of several advanced commands, so we'll put off improving this script until the next chapter.
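The following sketch shows the failure, using a made-up entry that contains an asterisk. The entry and the hand-edited replacement are hypothetical.

```shell
# A hypothetical index entry containing a literal asterisk.
entry='.XX "asterisk (*) metacharacter"'

# The command index.edit would generate for it, with the replacement
# string edited by hand. In the pattern, "(*" means "zero or more
# left parentheses", so the pattern no longer matches the entry.
script='/^\.XX /s/"asterisk (*) metacharacter"/"asterisk (*) character"/'

after=$(printf '%s\n' "$entry" | sed "$script")
printf '%s\n' "$after"
```

The entry comes out unchanged, and sed reports no error, so a correction of this kind can be missed without anyone noticing.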
