[Chapter 34] 34.15 Making Edits Across Line Boundaries

Most programs that use regular expressions (26.4) are able to match a pattern only on a single line of input. This makes it difficult to find or change a phrase, for instance, because it can start near the end of one line and finish near the beginning of the next line. Other patterns might be significant only when repeated on multiple lines.

sed has the ability to load more than one line into the pattern space. This allows you to match (and change) patterns that extend over multiple lines. In this article, we show how to create a multiline pattern space and manipulate its contents.

The multiline Next command, N, creates a multiline pattern space by reading a new line of input and appending it to the contents of the pattern space. The original contents of the pattern space and the new input line are separated by a newline. The embedded newline character can be matched in patterns by the escape sequence \n. In a multiline pattern space, the metacharacter ^ matches only at the very beginning of the pattern space and $ matches only at the very end; neither matches at an embedded newline. After the Next command is executed, control passes to the subsequent commands in the script.
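As a small illustration of the paragraph above, the following sketch joins each pair of input lines. The four-word input and the variable name joined are made up for the example; only N and \n come from the text.

```shell
# N appends the next input line to the pattern space; the two lines are
# separated by an embedded newline, which \n matches in the pattern.
joined=$(printf 'one\ntwo\nthree\nfour\n' | sed 'N;s/\n/+/')

# Each pair of input lines comes out as a single line.
printf '%s\n' "$joined"
```

Because the input has an even number of lines, N always finds a line to read; what happens when it does not is a separate issue.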

The Next command differs from the next command, n, which outputs the contents of the pattern space and then reads a new line of input. The next command does not create a multiline pattern space.
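The difference is easiest to see side by side. In this sketch (the input letters and the variable names with_n and with_N are illustrative only), the same substitution follows first n and then N:

```shell
# n prints the current line and replaces the pattern space with the next
# line, so the substitution only ever sees one line (every second one here).
with_n=$(printf 'a\nb\nc\nd\n' | sed 'n;s/^/> /')

# N appends the next line instead, so ^ matches once, at the start of the
# two-line pattern space.
with_N=$(printf 'a\nb\nc\nd\n' | sed 'N;s/^/> /')

printf 'after n:\n%s\n\nafter N:\n%s\n' "$with_n" "$with_N"
```

With n the marker lands on lines b and d; with N it lands on lines a and c, because b and d are only the tail of a two-line pattern space.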

For our first example, let's suppose that we wanted to change "Owner and Operator Guide" to "Installation Guide" but we found that it appears in the file on two lines, splitting between Operator and Guide. For instance, here are a few lines of sample text:

Consult Section 3.1 in the Owner and Operator
Guide for a description of the tape drives
available on your system.

The following script looks for Operator at the end of a line, reads the next line of input, and then makes the replacement:

/Operator$/{
    N
    s/Owner and Operator\nGuide/Installation Guide/
}

In this example, we know where the two lines split and where to specify the embedded newline. When the script is run on the sample file, it produces two lines of output, one of which combines the first and second lines and is too long to show here. This happens because the substitute command matches the embedded newline but does not replace it. Unfortunately, you cannot portably use \n to insert a newline in the replacement string (some versions, such as GNU sed, accept it as an extension). You must either use a backslash to escape a literal newline, as follows:

s/Owner and Operator\nGuide /Installation Guide\
/

or use the \(..\) operators (34.10) to keep the newline:

s/Owner and Operator\(\n\)Guide /Installation Guide\1/

This command restores the newline after Installation Guide. It is also necessary to match a blank space following Guide so the new line won't begin with a space. Now we can show the output:

Consult Section 3.1 in the Installation Guide
for a description of the tape drives
available on your system.
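To check that the two ways of keeping the newline behave the same, both can be run on the three-line sample. This is only a sketch: the variable names sample, out_escaped, and out_saved are invented here, and the scripts are passed inline rather than from a file.

```shell
sample='Consult Section 3.1 in the Owner and Operator
Guide for a description of the tape drives
available on your system.'

# Variant 1: a backslash-escaped literal newline in the replacement.
out_escaped=$(printf '%s\n' "$sample" | sed '/Operator$/{
N
s/Owner and Operator\nGuide /Installation Guide\
/
}')

# Variant 2: save the matched newline with \( \) and restore it with \1.
out_saved=$(printf '%s\n' "$sample" | sed '/Operator$/{
N
s/Owner and Operator\(\n\)Guide /Installation Guide\1/
}')

printf '%s\n' "$out_escaped"
```

Both variables should hold the same three lines shown above.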

Remember, you don't have to replace the newline, but if you don't, it can make for some long lines.

What if there are other occurrences of "Owner and Operator Guide" that break over multiple lines in different places? You could change the address to match Owner, the first word in the pattern instead of the last, and then modify the regular expression to look for a space or a newline between words, as shown below:

/Owner/{
N
s/Owner *\n*and *\n*Operator *\n*Guide/Installation Guide/
}

The asterisk (*) indicates that the space or newline is optional. This seems like hard work though, and indeed there is a more general way. We can read the next line into the pattern space and then use a substitute command to remove the embedded newline, wherever it is:

s/Owner and Operator Guide/Installation Guide/
/Owner/{
N
s/ *\n/ /
s/Owner and Operator Guide */Installation Guide\
/
}

The first line of the script matches Owner and Operator Guide when it appears on a line by itself. (See the discussion at the end of the article about why this is necessary.) If we match the string Owner, we read the next line into the pattern space and replace the embedded newline with a space. Then we attempt to match the whole pattern and make the replacement followed by a newline. This script will match Owner and Operator Guide regardless of how it is broken across two lines. Here's our expanded test file:

Consult Section 3.1 in the Owner and Operator
Guide for a description of the tape drives
available on your system.

Look in the Owner and Operator Guide shipped with your system.

Two manuals are provided, including the Owner and
Operator Guide and the User Guide.

The Owner and Operator Guide is shipped with your system.

Running the above script on the sample file produces the following result:

% sed -f sedscr sample
Consult Section 3.1 in the Installation Guide
for a description of the tape drives
available on your system.

Look in the Installation Guide shipped with your system.

Two manuals are provided, including the Installation Guide
and the User Guide.

The Installation Guide is shipped with your system.
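The run above can be reproduced without creating the sedscr and sample files. In this sketch the function name fix_guide and the variables sample and result are hypothetical; the sed script itself is the one shown earlier, passed inline.

```shell
# fix_guide (a made-up name) applies the article's script to standard input.
fix_guide() {
    sed 's/Owner and Operator Guide/Installation Guide/
/Owner/{
N
s/ *\n/ /
s/Owner and Operator Guide */Installation Guide\
/
}'
}

sample='Consult Section 3.1 in the Owner and Operator
Guide for a description of the tape drives
available on your system.

Look in the Owner and Operator Guide shipped with your system.

Two manuals are provided, including the Owner and
Operator Guide and the User Guide.

The Owner and Operator Guide is shipped with your system.'

result=$(printf '%s\n' "$sample" | fix_guide)
printf '%s\n' "$result"
```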

In this sample script, it might seem redundant to have two substitute commands that match the pattern. The first command matches it when the pattern is found already on one line, and the second matches the pattern after two lines have been read into the pattern space. Why the first command is necessary is perhaps best demonstrated by removing that command from the script and running it on the sample file:

% sed -f sedscr2 sample
Consult Section 3.1 in the Installation Guide
for a description of the tape drives
available on your system.

Look in the Installation Guide
shipped with your system.
Two manuals are provided, including the Installation Guide
and the User Guide.

Do you see the two problems? The most obvious problem is that the last line did not print. The last line matches Owner, and when N is executed, there is not another input line to read, so sed quits without running the rest of the script. Most versions of sed do not even output the line (GNU sed prints the pattern space before quitting, but the line still goes unedited). To be safe, use the Next command as follows:

$!N

This excludes the last line ($) from the Next command. As it is in our script, by matching Owner and Operator Guide on the last line, we avoid matching Owner and applying the N command. However, if the word Owner appeared on the last line without the rest of the phrase, we'd have the same problem unless we used the $!N syntax.
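A minimal sketch of the $!N idiom on an input with an odd number of lines follows. The three-letter input and the variable name safe are invented for the example.

```shell
# $!N runs N on every line except the last, so the commands after it
# still run on a leftover final line and that line is printed normally.
safe=$(printf 'a\nb\nc\n' | sed '$!N;s/\n/+/;s/$/./')

# The first two lines are joined; the third is processed on its own.
printf '%s\n' "$safe"
```

Here the final line c still receives the trailing period. With a plain N in place of $!N, what happens to that line depends on the version of sed, as described above.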

The second problem is a little less conspicuous. It has to do with the occurrence of Owner and Operator Guide in the second paragraph. In the input file, it is found on a line by itself:

Look in the Owner and Operator Guide shipped with your system.

In the output shown above, the blank line following shipped with your system is missing. The reason for this is that this line matches Owner and the next line, a blank line, is appended to the pattern space. The substitute command removes the embedded newline, and the blank line has in effect vanished. (If the line were not blank, the newline would still be removed but the text would appear on the same line with shipped with your system.) The best solution seems to be to avoid reading the next line when the pattern can be matched on one line. So, that is why the first instruction attempts to match the case where the string appears all on one line.
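The vanishing blank line can be isolated with a shorter input. In this sketch the three-line text in para and the variable names unguarded and guarded are made up; the two scripts are the article's script without and with its first substitute command.

```shell
para='Look in the Owner and Operator Guide shipped with your system.

Two manuals are provided.'

# Without the one-line substitution, N swallows the blank line.
unguarded=$(printf '%s\n' "$para" | sed '/Owner/{
N
s/ *\n/ /
s/Owner and Operator Guide */Installation Guide\
/
}')

# With it, the phrase is replaced first, /Owner/ no longer matches,
# and N is never run on that line.
guarded=$(printf '%s\n' "$para" | sed 's/Owner and Operator Guide/Installation Guide/
/Owner/{
N
s/ *\n/ /
s/Owner and Operator Guide */Installation Guide\
/
}')

printf 'unguarded:\n%s\n\nguarded:\n%s\n' "$unguarded" "$guarded"
```

The unguarded output splits the first line after Installation Guide and loses the blank line; the guarded output keeps the original line structure.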

- DD from O'Reilly & Associates' sed & awk, Chapter 6