[Chapter 5] 5.3 Substitution

We have already demonstrated many uses of the substitute command. Let's look carefully at its syntax:

[address]s/pattern/replacement/flags

where the flags that modify the substitution are:

n: A number (1 to 512) indicating that a replacement should be made for only the nth occurrence of the pattern.
g: Make changes globally on all occurrences in the pattern space. Normally only the first occurrence is replaced.
p: Print the contents of the pattern space.
w file: Write the contents of the pattern space to file.

The substitute command is applied to the lines matching the address. If no address is specified, it is applied to all lines that match the pattern, a regular expression. If a regular expression is supplied as an address, and no pattern is specified, the substitute command matches what is matched by the address. This can be useful when the substitute command is one of multiple commands applied at the same address. For an example, see the section "Checking Out Reference Pages" later in this chapter.

Unlike addresses, which require a slash (/) as a delimiter, the regular expression can be delimited by any character except a newline. Thus, if the pattern contained slashes, you could choose another character, such as an exclamation mark, as the delimiter.

s!/usr/mail!/usr2/mail!

Note that the delimiter appears three times and is required after the replacement. Regardless of which delimiter you use, if it does appear in the regular expression, or in the replacement text, use a backslash (\) to escape it.
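As a rough illustration of the delimiter choice, the following sketch runs the same substitution twice, once with slashes (each one escaped) and once with exclamation marks. The sample path and the variable names are made up for the example.

```shell
# Hypothetical sample input: a single path.
path='/usr/mail/inbox'

# Slash as delimiter: every slash in the pattern and in the
# replacement has to be escaped with a backslash.
with_slashes=$(printf '%s\n' "$path" | sed 's/\/usr\/mail/\/usr2\/mail/')

# Exclamation mark as delimiter: the slashes need no escaping.
with_bang=$(printf '%s\n' "$path" | sed 's!/usr/mail!/usr2/mail!')

printf '%s\n' "$with_slashes" "$with_bang"
```

Both commands should produce the same line; the second form is simply easier to read.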

Once upon a time, computers stored text in fixed-length records. A line ended after so many characters (typically 80), and then the next line started. There was no explicit character in the data to mark the end of one line and the beginning of the next; every line had the same (fixed) number of characters. Modern systems are more flexible; they use a special character (referred to as newline) to mark the end of the line. This allows lines to be of arbitrary[3] length.

[3] Well, more or less. Many UNIX programs have internal limits on the length of the lines that they will process. Most GNU programs, though, do not have such limits.

Since newline is just another character when stored internally, a regular expression can use "\n" to match an embedded newline. This occurs, as you will see in the next chapter, in the special case when another line is appended to the current line in the pattern space. (See Chapter 2, Understanding Basic Operations, for a discussion of line addressing and Chapter 3, Understanding Regular Expression Syntax, for a discussion of regular expression syntax.)

The replacement is a string of characters that will replace what is matched by the regular expression. (See the section "The Extent of the Match" in Chapter 3.) In the replacement section, only the following characters have special meaning:

&: Replaced by the string matched by the regular expression.
\n: Replaced by the nth substring (n is a single digit) previously specified in the pattern using "\(" and "\)".
\: Used to escape the ampersand (&), the backslash (\), and the substitution command's delimiter when they are used literally in the replacement section. In addition, it can be used to escape the newline and create a multiline replacement string.

Thus, besides metacharacters in regular expressions, sed also has metacharacters in the replacement. See the next section, "Replacement Metacharacters," for examples of using them.

Flags can be used in combination where it makes sense. For instance, gp makes the substitution globally on the line and prints the line. The global flag is by far the most commonly used. Without it, the replacement is made only for the first occurrence on the line. The print flag and the write flag both provide the same functionality as the print and write commands (which are discussed later in this chapter) with one important difference. These actions are contingent upon a successful substitution occurring. In other words, if the replacement is made, the line is printed or written to file. Because the default action is to pass through all lines, regardless of whether any action is taken, the print and write flags are typically used when the default output is suppressed (the -n option). In addition, if a script contains multiple substitute commands that match the same line, multiple copies of that line will be printed or written to file.
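The interaction between the print flag and the -n option can be sketched as follows. The two-line input is invented for the example, and the variable names are arbitrary.

```shell
# Hypothetical two-line input.
input='one fish two fish
red bird'

# -n suppresses the default output; the p flag prints a line only
# when the substitution on that line succeeded.
changed=$(printf '%s\n' "$input" | sed -n 's/fish/FISH/gp')

# Two substitute commands with the p flag that both succeed on the
# same line each print the pattern space as it stands at that point.
twice=$(printf '%s\n' "$input" | sed -n -e 's/one/1/p' -e 's/two/2/p')

printf '%s\n' "$changed" "---" "$twice"
```

In the second command the first input line is printed twice, once after each successful substitution, while "red bird" is never printed because nothing on it was replaced.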

The numeric flag can be used in the rare instances where the regular expression repeats itself on a line and the replacement must be made for only one of those occurrences by position. For instance, a line, perhaps containing tbl input, might contain multiple tabs. Let's say that there are three tabs per line, and you'd like to replace the second tab with ">". The following substitute command would do it:

s/•/>/2

"•" represents an actual tab character, which is otherwise invisible on the screen. If the input is a one-line file such as the following:

Column1•Column2•Column3•Column4

the output produced by running the script on this file will be:

Column1•Column2>Column3•Column4

Note that without the numeric flag, the substitute command would replace only the first tab. (Therefore "1" can be considered the default numeric flag.)
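To try the numeric flag without typing literal tabs, one option is to let printf generate them. This is only a sketch: the variable names are arbitrary, and the tab is spliced into the sed script through a shell variable rather than typed directly.

```shell
# Build the sample line and a tab character with printf, which
# expands \t in its format string.
line=$(printf 'Column1\tColumn2\tColumn3\tColumn4')
tab=$(printf '\t')

# Numeric flag 2: replace only the second tab on the line.
second=$(printf '%s\n' "$line" | sed "s/$tab/>/2")

# No numeric flag: only the first tab is replaced.
first=$(printf '%s\n' "$line" | sed "s/$tab/>/")

printf '%s\n' "$second" "$first"
```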

5.3.1 Replacement Metacharacters

The replacement metacharacters are backslash (\), ampersand (&), and \n. The backslash is generally used to escape the other metacharacters but it is also used to include a newline in a replacement string.

We can do a variation on the previous example to replace the second tab on each line with a newline. Here "•" stands for an actual tab character.

s/•/\
/2

Note that no spaces are permitted after the backslash. This script produces the following result:

Column1•Column2
Column3•Column4
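The same substitution can be tried from the shell roughly as follows. As a convenience for the example, the tab is generated with printf and spliced into an otherwise single-quoted sed script; the variable names are arbitrary.

```shell
tab=$(printf '\t')
line=$(printf 'Column1\tColumn2\tColumn3\tColumn4')

# The backslash must be the last character on its line; the newline
# that follows it becomes part of the replacement string.
split=$(printf '%s\n' "$line" | sed 's/'"$tab"'/\
/2')

printf '%s\n' "$split"
```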

Another example comes from the conversion of a file for troff to an ASCII input format for Ventura Publisher. It converts the following line for troff:

.Ah "Major Heading"

to a similar line for Ventura Publisher:

@A HEAD = Major Heading

The twist in this problem is that the line needs to be preceded and followed by blank lines. It is an example of writing a multiline replacement string.

/^\.Ah/{
s/\.Ah */\
\
@A HEAD = /
s/"//g
s/$/\
/
}

The first substitute command replaces ".Ah" with two newlines and "@A HEAD =". A backslash at the end of the line is necessary to escape the newline. The second substitution removes the quotation marks. The last command matches the end of line in the pattern space (not the embedded newlines) and adds a newline after it.
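One way to see the embedded newlines is to run the script on a single heading line and then mark the empty output lines. The second sed command and the "(blank)" marker are additions made only for this illustration.

```shell
# Run the heading script on one sample line, then make the empty
# lines visible by replacing each of them with "(blank)".
converted=$(printf '.Ah "Major Heading"\n' | sed '/^\.Ah/{
s/\.Ah */\
\
@A HEAD = /
s/"//g
s/$/\
/
}' | sed 's/^$/(blank)/')

printf '%s\n' "$converted"
```

With this input the heading line should come out preceded by two marked lines and followed by one.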

In the next example, the backslash is used to escape the ampersand, which appears literally in the replacement section.

s/ORA/O'Reilly \& Associates, Inc./g

It's easy to forget about the ampersand appearing literally in the replacement string. If we had not escaped it in this example, the output would have been "O'Reilly ORA Associates, Inc."
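The difference is easy to check from the shell. In this sketch the sed script is placed in double quotes so that the apostrophe in "O'Reilly" needs no special handling; inside double quotes the shell turns "\\" into the single backslash that sed then sees.

```shell
# Escaped ampersand: sed receives \& and inserts a literal "&".
escaped=$(printf 'ORA\n' | sed "s/ORA/O'Reilly \\& Associates, Inc./g")

# Unescaped ampersand: "&" is replaced by the matched text, "ORA".
unescaped=$(printf 'ORA\n' | sed "s/ORA/O'Reilly & Associates, Inc./g")

printf '%s\n' "$escaped" "$unescaped"
```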

As a metacharacter, the ampersand (&) represents the extent of the pattern match, not the line that was matched. You might use the ampersand to match a word and surround it by troff requests. The following example surrounds a word with point-size requests:

s/UNIX/\\s-2&\\s0/g

Because backslashes are also replacement metacharacters, two backslashes are necessary to output a single backslash. The "&" in the replacement string refers to "UNIX." If the input line is:

on the UNIX Operating System.

then the substitute command produces:

on the \s-2UNIX\s0 Operating System.

The ampersand is particularly useful when the regular expression matches variations of a word. It allows you to specify a variable replacement string that corresponds to what was actually matched. For instance, let's say that you wanted to surround with parentheses any cross reference to a numbered section in a document. In other words, any reference such as "See Section 1.4" or "See Section 12.9" should appear in parentheses, as "(See Section 12.9)." A regular expression can match the different combinations of numbers, so we use "&" in the replacement string and surround whatever was matched.

s/See Section [1-9][0-9]*\.[1-9][0-9]*/(&)/

The ampersand makes it possible to reference the entire match in the replacement string.
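A quick check of this command on a few invented lines might look like the following. The third line is included to show that a section number beginning with zero falls outside the pattern.

```shell
# Hypothetical sample text.
text='See Section 1.4 for details.
See Section 12.9 as well.
See Section 0.5 is not matched.'

marked=$(printf '%s\n' "$text" |
  sed 's/See Section [1-9][0-9]*\.[1-9][0-9]*/(&)/')

printf '%s\n' "$marked"
```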

Now let's look at the metacharacters that allow us to select any individual portion of a string that is matched and recall it in the replacement string. A pair of escaped parentheses is used in sed to enclose any part of a regular expression and save it for recall. Up to nine "saves" are permitted for a single line. "\n" is used to recall the portion of the match that was saved, where n is a number from 1 to 9 referencing a particular "saved" string in order of use.

For example, to put the section numbers in boldface when they appeared as a cross reference, we could write the following substitution:

s/\(See Section \)\([1-9][0-9]*\.[1-9][0-9]*\)/\1\\fB\2\\fP/

Two pairs of escaped parentheses are specified. The first captures "See Section" (because this is a fixed string, it could have been simply retyped in the replacement string). The second captures the section number. The replacement string recalls the first saved substring as "\1" and the second as "\2," which is surrounded by bold-font requests.
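Run against an invented sample sentence, the command behaves roughly like this. Note that each "\\" in the replacement yields one backslash in the output.

```shell
bold=$(printf 'See Section 12.9 for details.\n' |
  sed 's/\(See Section \)\([1-9][0-9]*\.[1-9][0-9]*\)/\1\\fB\2\\fP/')

# %s prints the result without interpreting its backslashes.
printf '%s\n' "$bold"
```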

We can use a similar technique to match parts of a line and swap them. For instance, let's say there are two parts of a line separated by a colon. We can match each part, putting them within escaped parentheses and swapping them in the replacement.

$ cat test1
first:second
one:two
$ sed 's/\(.*\):\(.*\)/\2:\1/' test1
second:first
two:one

The larger point is that you can recall a saved substring in any order, and multiple times, as you'll see in the next example.
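As a small sketch of both points, the first command below recalls the second saved substring twice. The second command uses an invented three-field line to show a side effect worth knowing: because ".*" matches as much as it can, the first pair of parentheses captures everything up to the last colon.

```shell
# Recall \2 twice and \1 once, in a different order than saved.
both=$(printf 'first:second\n' | sed 's/\(.*\):\(.*\)/\2:\1:\2/')

# With two colons on the line, \1 holds "a:b" and \2 holds "c".
greedy=$(printf 'a:b:c\n' | sed 's/\(.*\):\(.*\)/\2:\1/')

printf '%s\n' "$both" "$greedy"
```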

5.3.1.1 Correcting index entries

Later, in the awk section of this book, we will present a program for formatting an index, such as the one for this book. The first step in creating an index is to place index codes in the document files. We use an index macro named .XX, which takes a single argument, the index entry. A sample index entry might be:

.XX "sed, substitution command"

Each index entry appears on a line by itself. When you run an index, you get a collection of index entries with page numbers that are then sorted and merged in a list. An editor poring over that list will typically find errors and inconsistencies that need to be corrected. It is, in short, a pain to have to track down the file where an index entry resides and then make the correction, particularly when there are dozens of entries to be corrected.

Sed can be a great help in making these edits across a group of files. One can simply create a list of edits in a sed script and then run it on all the files. A key point is that the substitute command needs an address that limits it to lines beginning ".XX". Your script should not make changes in the text itself.

Let's say that we wanted to change the index entry above to "sed, substitute command." The following command would do it:

/^\.XX /s/sed, substitution command/sed, substitute command/

The address matches all lines that begin with ".XX " and only on those lines does it attempt to make the replacement. You might wonder, why not specify a shorter regular expression? For example:

/^\.XX /s/substitution/substitute/

The answer is simply that there could be other entries which use the word "substitution" correctly and which we would not want to change.
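The effect of the address can be checked with a two-line sample in which the same phrase occurs once in an index entry and once in running text. The sample text is made up for the example.

```shell
# Hypothetical document fragment: an index entry and a text line.
doc='.XX "sed, substitution command"
The sed, substitution command is described here.'

fixed=$(printf '%s\n' "$doc" |
  sed '/^\.XX /s/sed, substitution command/sed, substitute command/')

printf '%s\n' "$fixed"
```

Only the line beginning with ".XX " should change; the text line passes through untouched.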

We can go a step further and provide a shell script that creates a list of index entries prepared for editing as a series of sed substitute commands.

#! /bin/sh
# index.edit -- compile list of index entries for editing.
grep "^\.XX" "$@" | sort -u |
sed '
s/^\.XX \(.*\)$/\/^\\.XX \/s\/\1\/\1\//'

The index.edit shell script uses grep to extract all lines containing index entries from any number of files specified on the command line. It passes this list through sort which, with the -u option, sorts and removes duplicate entries. The list is then piped to sed, and the one-line sed script builds a substitution command.

Let's look at it more closely. Here's just the regular expression:

^\.XX \(.*\)$

It matches the entire line, saving the index entry for recall. Here's just the replacement string:

\/^\\.XX \/s\/\1\/\1\/

It generates a substitute command beginning with an address: a slash, then a caret, followed by two backslashes (to output one backslash to protect the dot in the ".XX" that follows), then a space, then another slash to complete the address. Each slash is escaped with a backslash because the slash is also the delimiter of the substitute command that produces this output. Next we output an "s" followed by a slash, and then recall the saved portion to be used as a regular expression. That is followed by another slash, and again we recall the saved substring as the replacement string. A final slash ends the command.

When the index.edit script is run on a file, it creates a listing similar to this:

$ index.edit ch05
/^\.XX /s/"append command(a)"/"append command(a)"/
/^\.XX /s/"change command"/"change command"/
/^\.XX /s/"change command(c)"/"change command(c)"/
/^\.XX /s/"commands:sed, summary of"/"commands:sed, summary of"/
/^\.XX /s/"delete command(d)"/"delete command(d)"/
/^\.XX /s/"insert command(i)"/"insert command(i)"/
/^\.XX /s/"line numbers:printing"/"line numbers:printing"/
/^\.XX /s/"list command(l)"/"list command(l)"/

This output could be captured in a file. Then you can delete the entries that don't need to change and you can make changes by editing the replacement string. At that point, you can use this file as a sed script to correct the index entries in all document files.
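The whole round trip might be sketched as follows. Everything here is an assumption made for illustration: the temporary directory created with mktemp -d, the file names ch05 and edits.sed, the sample entry, and the use of printf to stand in for the hand-editing step.

```shell
# Scratch directory with a tiny sample document.
workdir=$(mktemp -d)
printf '%s\n' '.XX "sed, substitution command"' 'Body text.' > "$workdir/ch05"

# Step 1: generate the edit list with the one-line script from index.edit.
grep '^\.XX' "$workdir/ch05" | sort -u |
sed 's/^\.XX \(.*\)$/\/^\\.XX \/s\/\1\/\1\//' > "$workdir/edits.sed"
generated=$(cat "$workdir/edits.sed")

# Step 2: stand-in for manual editing; only the replacement half changes.
printf '%s\n' \
  '/^\.XX /s/"sed, substitution command"/"sed, substitute command"/' \
  > "$workdir/edits.sed"

# Step 3: apply the edited list to the document as a sed script.
corrected=$(sed -f "$workdir/edits.sed" "$workdir/ch05")

printf '%s\n' "$generated" "$corrected"
rm -rf "$workdir"
```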

When doing a large book with lots of entries, you might use grep again to extract particular entries from the output of index.edit and direct them into their own file for editing. This saves you from having to wade through numerous entries.

There is one small failing in this program. It should look for metacharacters that might appear literally in index entries and protect them in regular expressions. For instance, if an index entry contains an asterisk, it will be interpreted not as a literal asterisk but as a metacharacter. To make that change effectively requires the use of several advanced commands, so we'll put off improving this script until the next chapter.
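The problem can be seen with a hypothetical entry that contains an asterisk. The first command below has the form index.edit would generate, with only the replacement edited; the second has the asterisk in the regular expression escaped by hand. The entry text and the replacement are invented for the example.

```shell
# Hypothetical index entry containing a literal asterisk.
entry='.XX "a*b pattern"'

# Generated form: "a*" means zero or more a's, so the regular
# expression no longer matches the entry it was built from.
unprotected=$(printf '%s\n' "$entry" |
  sed '/^\.XX /s/"a*b pattern"/"a-star-b pattern"/')

# Escaping the asterisk makes it match literally.
protected=$(printf '%s\n' "$entry" |
  sed '/^\.XX /s/"a\*b pattern"/"a-star-b pattern"/')

printf '%s\n' "$unprotected" "$protected"
```

The unprotected command leaves the entry unchanged without any warning, which is what makes this kind of failure easy to miss.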