6.3 Pattern-Matching Rules

In making global replacements, UNIX editors such as vi allow you to search not just for fixed strings of characters, but also for variable patterns of words, referred to as regular expressions.

When you specify a literal string of characters, the search might turn up other occurrences that you didn't want to match. The problem with searching for words in a file is that a word can be used in different ways. Regular expressions help you conduct a search for words in context. Note that regular expressions can be used with the vi search commands / and ? as well as in the ex :g and :s commands.

For the most part, the same regular expressions work with other UNIX programs such as grep, sed, and awk.[2]

[2] Much more information on regular expressions can be found in the two O'Reilly books sed & awk , by Dale Dougherty and Arnold Robbins, and Mastering Regular Expressions , by Jeffrey E.F. Friedl.

Regular expressions are made up by combining normal characters with a number of special characters called metacharacters.[3] The metacharacters and their uses are listed below.

[3] Technically speaking, we should probably call these metasequences , since sometimes two characters together have special meaning, and not just single characters. Nevertheless, the term metacharacters is in common use in UNIX literature, so we follow that convention here.

6.3.1 Metacharacters Used in Search Patterns

.

Matches any single character except a newline. Remember that spaces are treated as characters. For example, p.p matches character strings such as pep , pip , and pcp .

*

Matches zero or more (as many as there are) of the single character that immediately precedes it. For example, bugs* will match bugs (one s) or bug (no s's).

The * can follow a metacharacter. For example, since . (dot) means any character, .* means "match any number of any character."

Here's a specific example of this. The command :s/End.*/End/ removes all characters after End (it replaces the remainder of the line with nothing).
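The behavior of . and * can be tried outside the editor. The following is a minimal sketch in Python, whose re module happens to treat . and * the same way as vi for these simple cases; the sample strings and variable names are made up for illustration, and count=1 is used to imitate :s without the g flag.

```python
import re

# "." matches any one character, so p.p finds pep, pip, and pcp but not pp.
dot_hits = re.findall(r"p.p", "pep pip pcp pp")
# dot_hits == ['pep', 'pip', 'pcp']

# "s*" means zero or more s characters, so bug, bugs, and bugsss all match.
star_hits = [bool(re.fullmatch(r"bugs*", w)) for w in ("bug", "bugs", "bugsss")]
# star_hits == [True, True, True]

# Similar in spirit to :s/End.*/End/ applied to a single line.
trimmed = re.sub(r"End.*", "End", "The End of the story", count=1)
# trimmed == 'The End'
```

Because .* takes as many characters as it can, everything from End to the end of the line is consumed, which is why the replacement leaves nothing after End.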

^

When used at the start of a regular expression, requires that the following regular expression be found at the beginning of the line; for example, ^Part matches Part when it occurs at the beginning of a line, and ^... matches the first three characters of a line. When not at the beginning of a regular expression, ^ stands for itself.

$

When used at the end of a regular expression, requires that the preceding regular expression be found at the end of the line; for example, here:$ matches only when here: occurs at the end of a line. When not at the end of a regular expression, $ stands for itself.
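To see the two anchors at work, here is a small Python sketch under the assumption that each string stands for one line of a file. The sample lines are invented. One difference to keep in mind: Python treats ^ and $ as anchors wherever they appear in a pattern, whereas vi lets them stand for themselves when they are not at the start or end.

```python
import re

lines = ["Part One", "A Part", "see here:", "here: it is"]

# ^Part matches only when Part begins the line.
starts_with_part = [s for s in lines if re.search(r"^Part", s)]
# starts_with_part == ['Part One']

# here:$ matches only when here: ends the line.
ends_with_here = [s for s in lines if re.search(r"here:$", s)]
# ends_with_here == ['see here:']

# ^... matches the first three characters of a line.
first_three = re.search(r"^...", "Part One").group(0)
# first_three == 'Par'
```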

\

Treats the following special character as an ordinary character. For example, \. matches an actual period instead of "any single character," and \* matches an actual asterisk instead of "any number of a character." The \ (backslash) prevents the interpretation of a special character. This prevention is called "escaping the character." (Use \\ to get a literal backslash.)
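The effect of escaping is easy to check with a short experiment. This Python sketch uses invented sample text; the backslash works the same way here as in vi for these three characters.

```python
import re

text = "3.14 3x14 2*3"

# Unescaped, the dot matches any character, so 3x14 is found too.
unescaped = re.findall(r"3.14", text)
# unescaped == ['3.14', '3x14']

# Escaped, the dot matches only an actual period.
escaped = re.findall(r"3\.14", text)
# escaped == ['3.14']

# \* matches an actual asterisk.
star = re.findall(r"2\*3", text)
# star == ['2*3']

# \\ matches one literal backslash.
backslash = re.findall(r"\\", r"C:\temp")
# backslash == ['\\'], that is, a list holding a single backslash character
```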

[ ]

Matches any one of the characters enclosed between the brackets. For example, [AB] matches either A or B , and p[aeiou]t matches pat , pet , pit , pot , or put . A range of consecutive characters can be specified by separating the first and last characters in the range with a hyphen. For example, [A-Z] will match any uppercase letter from A to Z , and [0-9] will match any digit from 0 to 9 .

You can include more than one range inside brackets, and you can specify a mix of ranges and separate characters. For example, [:;A-Za-z()] will match four different punctuation marks, plus all letters.

Most metacharacters lose their special meaning inside brackets, so you don't need to escape them if you want to use them as ordinary characters. Within brackets, the three metacharacters you still need to escape are \ - ] . The hyphen (- ) acquires meaning as a range specifier; to use an actual hyphen, you can also place it as the first character inside the brackets.

A caret (^ ) has special meaning only when it is the first character inside the brackets, but in this case the meaning differs from that of the normal ^ metacharacter. As the first character within brackets, a ^ reverses their sense: the brackets will match any one character not in the list. For example, [^a-z] matches any character that is not a lowercase letter.
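The bracket rules above can be sketched in Python, which handles simple bracket lists, ranges, and a leading ^ the same way. The sample strings are made up for illustration.

```python
import re

# One vowel between p and t: pyt is not matched.
vowel_words = re.findall(r"p[aeiou]t", "pat pet pit pot put pyt")
# vowel_words == ['pat', 'pet', 'pit', 'pot', 'put']

# A mix of separate characters and ranges inside one pair of brackets.
mixed = re.findall(r"[:;A-Za-z()]", "a1;(Z)")
# mixed == ['a', ';', '(', 'Z', ')']

# A leading ^ reverses the sense: anything that is not a lowercase letter.
not_lower = re.findall(r"[^a-z]", "abC9d!")
# not_lower == ['C', '9', '!']

# A hyphen placed first inside the brackets is an ordinary character.
literal_hyphen = re.findall(r"[-+]", "a-b+c")
# literal_hyphen == ['-', '+']
```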

\( \)

Saves the pattern enclosed between \( and \) into a special holding space or "hold buffer." Up to nine patterns can be saved in this way on a single line. For example, the pattern:

\(That\) or \(this\)

saves That in hold buffer number 1 and saves this in hold buffer number 2. The patterns held can be "replayed" in substitutions by the sequences \1 to \9 . For example, to rephrase That or this to read this or That , you could enter:

:%s/\(That\) or \(this\)/\2 or \1/

You can also use the \n notation (where n is a digit from 1 to 9) within a search or substitute string:

:s/\(abcd\)\1/alphabet-soup/

changes abcdabcd into alphabet-soup.[4]

[4] This works with vi, nvi, and vim, but not with elvis 2.0, vile 7.4, or vile 8.0.
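The two substitutions above can be reproduced as a rough sketch in Python. Note the dialect difference: Python writes the grouping as plain ( and ) rather than \( and \), but the \1 and \2 references behave the same way.

```python
import re

# Counterpart of :%s/\(That\) or \(this\)/\2 or \1/
swapped = re.sub(r"(That) or (this)", r"\2 or \1", "That or this")
# swapped == 'this or That'

# Counterpart of :s/\(abcd\)\1/alphabet-soup/
# The \1 inside the search pattern requires a second copy of whatever
# the first group matched.
soup = re.sub(r"(abcd)\1", "alphabet-soup", "abcdabcd")
# soup == 'alphabet-soup'

# A single abcd is left alone, because \1 has nothing to match.
untouched = re.sub(r"(abcd)\1", "alphabet-soup", "abcd")
# untouched == 'abcd'
```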

\< \>

Matches characters at the beginning (\<) or at the end (\>) of a word. The end or beginning of a word is determined either by a punctuation mark or by a space. For example, the expression \<ac will match only words that begin with ac, such as action. The expression ac\> will match only words that end with ac, such as maniac. Neither expression will match react. Note that unlike \(...\), these do not have to be used in matched pairs.
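As a rough illustration of word anchoring, here is a Python sketch. Python does not support \< and \>; it uses a single \b for both the start and the end of a word, so the patterns below are approximations of the vi ones, and the \w* pieces are added only to capture the whole word.

```python
import re

words = "action maniac react"

# Approximates \<ac : ac must sit at the start of a word.
begins_with_ac = re.findall(r"\bac\w*", words)
# begins_with_ac == ['action']

# Approximates ac\> : ac must sit at the end of a word.
ends_with_ac = re.findall(r"\w*ac\b", words)
# ends_with_ac == ['maniac']
```

Neither list contains react, because its ac is in the middle of the word.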

~

Matches whatever regular expression was used in the last search. For example, if you searched for The, you could search for Then with /~n. Note that you can use this pattern only in a regular search (with /).[5] It won't work as the pattern in a substitute command. It does, however, have a similar meaning in the replacement portion of a substitute command.

[5] This is a rather flaky feature of the original vi . After using it, the saved search pattern is set to the new text typed after the ~ , not the combined new pattern, as one might expect. Also, none of the clones behaves this way. So, while this feature exists, it has little to recommend its use.

Several of the clones support optional, extended regular expression syntaxes. See Section 8.4, "Extended Regular Expressions" in Chapter 8 for more information.

6.3.2 POSIX Bracket Expressions

We have just described the use of brackets for matching any one of the enclosed characters, such as [a-z] . The POSIX standard introduced additional facilities for matching characters that are not in the English alphabet. For example, the French è is an alphabetic character, but the typical character class [a-z] would not match it. Additionally, the standard provides for sequences of characters that should be treated as a single unit when matching and collating (sorting) string data.

POSIX also formalizes the terminology. Groups of characters within brackets are called a "bracket expression" in the POSIX standard. Within bracket expressions, besides literal characters such as a, !, and so on, you can have additional components. These are:

Character classes . A POSIX character class consists of keywords bracketed by [: and :] . The keywords describe different classes of characters such as alphabetic characters, control characters, and so on (see Table 6.1 ).
Collating symbols . A collating symbol is a multi-character sequence that should be treated as a unit. It consists of the characters bracketed by [. and .] .
Equivalence classes . An equivalence class lists a set of characters that should be considered equivalent, such as e and è . It consists of a named element from the locale, bracketed by [= and =] .

All three of these constructs must appear inside the square brackets of a bracket expression. For example, [[:alpha:]!] matches any single alphabetic character or the exclamation point, and [[.ch.]] matches the collating element ch but does not match just the letter c or the letter h. In a French locale, [[=e=]] might match any of e, è, or é. Classes and matching characters are shown in Table 6.1.

Table 6.1: POSIX Character Classes
Class	Matching Characters
`[:alnum:]`	Alphanumeric characters
`[:alpha:]`	Alphabetic characters
`[:blank:]`	Space and tab characters
`[:cntrl:]`	Control characters
`[:digit:]`	Numeric characters
`[:graph:]`	Printable and visible (non-space) characters
`[:lower:]`	Lowercase characters
`[:print:]`	Printable characters (includes whitespace)
`[:punct:]`	Punctuation characters
`[:space:]`	Whitespace characters
`[:upper:]`	Uppercase characters
`[:xdigit:]`	Hexadecimal digits
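The motivation for character classes can be demonstrated with a small experiment. Python's re module does not accept the [[:alpha:]] syntax, so this sketch uses Python's string methods as rough stand-ins; they follow Unicode character properties rather than the current locale, and they do not line up exactly with the POSIX classes (isdigit, for one, accepts more than 0 through 9).

```python
import re

E_GRAVE = "\u00e8"  # the French letter e-grave, written as an escape

# The range [a-z] does not include the accented letter.
in_ascii_range = re.fullmatch(r"[a-z]", E_GRAVE) is not None
# in_ascii_range == False

# An alphabetic test that knows about non-English letters accepts it,
# much as [[:alpha:]] would in a suitable locale.
is_alphabetic = E_GRAVE.isalpha()
# is_alphabetic == True

# Approximate stand-ins for a few of the classes in Table 6.1.
checks = {
    "alpha": "q".isalpha(),
    "digit": "7".isdigit(),
    "space": "\t".isspace(),
    "upper": "Q".isupper(),
}
# every value in checks is True
```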

You will have to do some research to determine if you have this facility in your version of vi . You may need to use a special option to enable POSIX compliance, have a particular environment variable set, or use a version of vi that is in an unusual directory.

vi on HP-UX 9.x (and newer) systems supports POSIX bracket expressions, as does /usr/xpg4/bin/vi on Solaris (but not /usr/bin/vi). This facility is also available in nvi and in elvis 2.1. As commercial UNIX vendors become standards-compliant, expect to see this feature become more widespread.

6.3.3 Metacharacters Used in Replacement Strings

When you make global replacements, the regular expressions above carry their special meaning only within the search portion (the first part) of the command.

For example, when you type this:

:%s/1\.  Start/2.  Next, start with $100/

note that the replacement string treats the characters . and $ literally, without your having to escape them. By the same token, let's say you enter:

:%s/[ABC]/[abc]/g

If you're hoping to replace A with a , B with b , and C with c , you'll be surprised. Since brackets behave like ordinary characters in a replacement string, this command will change every occurrence of A , B , or C to the five-character string [abc] .
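The same surprise can be reproduced outside vi. This Python sketch shows the brackets being inserted literally; the second call is a Python-specific way of computing the replacement from each match, shown only for contrast and not something vi offers in this form.

```python
import re

# Brackets in the replacement are ordinary characters.
literal = re.sub(r"[ABC]", "[abc]", "CAB")
# literal == '[abc][abc][abc]'

# In Python, a function can build the replacement from each match.
lowered = re.sub(r"[ABC]", lambda m: m.group(0).lower(), "CAB x")
# lowered == 'cab x'
```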

To solve problems like this, you need a way to specify variable replacement strings. Fortunately, there are additional metacharacters that have special meaning in a replacement string.

\n

Is replaced with text matched by the nth pattern previously saved by \( and \), where n is a number from 1 to 9, and previously saved patterns (kept in hold buffers) are counted from the left on the line. See the explanation for \( and \) earlier in this chapter.

\

Treats the following special character as an ordinary character. Backslashes are metacharacters in replacement strings as well as in search patterns. To specify a real backslash, type two in a row (\\).

&

Is replaced with the entire text matched by the search pattern when used in a replacement string. This is useful when you want to avoid retyping text:

:%s/Yazstremski/&, Carl/

The replacement will say Yazstremski, Carl . The & can also replace a variable pattern (as specified by a regular expression). For example, to surround each line from 1 to 10 with parentheses, type:

:1,10s/.*/(&)/

The search pattern matches the whole line, and the & "replays" the line, enclosed in the parentheses you typed.
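For comparison, here is a sketch of the same two edits in Python. Python has no & in replacement strings; the equivalent is \g<0>, which stands for the entire match. The sample text is invented, and count=1 imitates :s without the g flag.

```python
import re

# Counterpart of :%s/Yazstremski/&, Carl/ on one line.
named = re.sub(r"Yazstremski", r"\g<0>, Carl", "Yazstremski at bat", count=1)
# named == 'Yazstremski, Carl at bat'

# Counterpart of :1,10s/.*/(&)/ with each list item standing for a line.
# The pattern is anchored with ^ and $ so that it matches each line once.
wrapped = [re.sub(r"^.*$", r"(\g<0>)", line) for line in ["first", "second"]]
# wrapped == ['(first)', '(second)']
```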

~

Has a meaning similar to the one it has in a search pattern: the string found is replaced with the replacement text specified in the last substitute command. This is useful for repeating an edit. For example, you could say :s/thier/their/ on one line and repeat the change on another with :s/thier/~/. The search pattern doesn't need to be the same, though.

For example, you could say :s/his/their/ on one line and repeat the replacement on another with :s/her/~/.[6]

[6] Modern versions of the ed editor use % as the sole character in the replacement text to mean "the replacement text of the last substitute command."

\u or \l

Causes the next character in the replacement string to be changed to uppercase or lowercase, respectively. For example, to change yes, doctor into Yes, Doctor , you could say:

:%s/yes, doctor/\uyes, \udoctor/

This is a pointless example, though, since it's easier just to type the replacement string with initial caps in the first place. As with any regular expression, \u and \l are most useful with a variable string. Take, for example, the command we used earlier:

:%s/\(That\) or \(this\)/\2 or \1/

The result is this or That , but we need to adjust the cases. We'll use \u to uppercase the first letter in this (currently saved in hold buffer 2); we'll use \l to lowercase the first letter in That (currently saved in hold buffer 1):

:s/\(That\) or \(this\)/\u\2 or \l\1/

The result is This or that . (Don't confuse the number one with the lowercase l ; the one comes after.)

\U or \L and \e or \E

\U and \L are similar to \u or \l , but all following characters are converted to uppercase or lowercase until the end of the replacement string or until \e or \E is reached. If there is no \e or \E , all characters of the replacement text are affected by the \U or \L . For example, to uppercase Fortran , you could say:

:%s/Fortran/\UFortran/

or, using the & character to repeat the search string:

:%s/Fortran/\U&/
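Python's replacement strings have no \u, \l, \U, or \L, so the closest equivalent is to compute the replacement in a function. The sketch below mimics the two vi commands above; swap_and_fix_case is a hypothetical helper written for this example.

```python
import re

def swap_and_fix_case(match):
    """Mimic the replacement \\u\\2 or \\l\\1 (hypothetical helper)."""
    first, second = match.group(1), match.group(2)
    # Uppercase only the first letter of group 2, lowercase only the first of group 1.
    return second[0].upper() + second[1:] + " or " + first[0].lower() + first[1:]

fixed = re.sub(r"(That) or (this)", swap_and_fix_case, "That or this")
# fixed == 'This or that'

# Counterpart of :%s/Fortran/\U&/ : uppercase the whole match.
shouted = re.sub(r"Fortran", lambda m: m.group(0).upper(), "Fortran 77")
# shouted == 'FORTRAN 77'
```

Only the matched text is converted, so the 77 that follows Fortran is untouched.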

All pattern searches are case-sensitive. That is, a search for the will not find The . You can get around this by specifying both uppercase and lowercase in the pattern:

/[Tt]he

You can also instruct vi to ignore case by typing :set ic . See Chapter 7, Advanced Editing , for additional details.
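Both approaches to case can be sketched in Python. The re.IGNORECASE flag plays a role comparable to :set ic, though it applies to one call rather than to the whole editing session; the sample text is made up.

```python
import re

line = "The end"

# A plain search is case-sensitive: the does not find The.
exact = re.search(r"the", line) is not None
# exact == False

# Listing both cases in brackets finds it.
either_case = re.search(r"[Tt]he", line) is not None
# either_case == True

# Ignoring case altogether finds it as well.
ignoring_case = re.search(r"the", line, re.IGNORECASE) is not None
# ignoring_case == True
```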

6.3.4 More Substitution Tricks

You should know some additional important facts about the substitute command:

A simple :s is the same as :s//~/ . In other words, repeat the last substitution. This can save enormous amounts of time and typing when you are working your way through a document making the same change repeatedly, but you don't want to use a global substitution.
The :& command is the same as :s: it repeats the last substitution. If you think of the & as meaning "the same thing" (as in what was just matched), this command is relatively mnemonic. You can follow the & with a g, to make the substitution globally on the line, and even use it with a line range:
```
:%&g	repeat the last substitution everywhere
```
The [&] key can be used as a vi command to perform the :& command, i.e., to repeat the last substitution. This can save even more typing than :s [RETURN]; one keystroke versus three.
The :~ command is similar to the :& command, but with a subtle difference. The search pattern used is the last regular expression used in any command, not necessarily the one used in the last substitute command.

For example,[7] in the sequence:

[7] Thanks to Keith Bostic, in the nvi documentation, for this example.
```
:s/red/blue/
:/green
:~
```
the :~ is equivalent to :s/green/blue/.
Besides the / character, you may use any non-alphanumeric, non-whitespace character as your delimiter, except backslash, double-quote, and the vertical bar (\ , " , and | ). This is particularly handy when you have to make a change to a pathname.
```
:%s;/user1/tim;/home/tim;g
```
When the edcompatible option is enabled, vi remembers the flags (g for global and c for confirmation) used on the last substitute, and applies them to the next one.

This is most useful when you are moving through a file and you wish to make global substitutions. You can make the first change:
```
:s/old/new/g
:set edcompatible
```
After that, subsequent substitute commands will be global.

Despite the name, no known version of UNIX ed actually works this way.