6.3. Pattern-Matching Rules
In making global replacements, UNIX editors such as vi allow you to
search not just for fixed strings of characters,
but also for variable patterns of words, referred to as regular
expressions.
When you specify a literal string of characters, the search
might turn up other occurrences that you didn't want to match.
The problem with searching for words in a file is that a word
can be used in different ways.
Regular expressions
help you conduct a search for words in context.
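As a minimal sketch of why a literal search can overshoot, the following Python fragment compares a fixed-string search with a pattern that requires word boundaries. Python's re module only stands in for vi here, and its syntax differs slightly (\b plays the role of the \< and \> metacharacters described later in this section). The sample sentence is invented for illustration.

```python
import re

# Invented sample text; "the" also hides inside "other" and "then".
line = "the other day, then the rest"

# A fixed-string search finds every occurrence, wanted or not.
literal_hits = [m.start() for m in re.finditer(re.escape("the"), line)]

# Requiring word boundaries keeps only the standalone word.
word_hits = [m.start() for m in re.finditer(r"\bthe\b", line)]

print(literal_hits)
print(word_hits)
```

The literal search reports four positions, two of them inside other words; the bounded pattern reports only the two standalone occurrences.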
Note that regular expressions can be used with the vi search
commands / and ? as well as in the ex :g
and :s commands.
For the most part, the same regular
expressions work with other UNIX programs such as grep,
sed, and awk.[19]
Regular expressions are made up by combining normal characters with a number
of special characters called metacharacters.[20]
The metacharacters and their uses are listed below.
6.3.1. Metacharacters Used in Search Patterns
- .
-
Matches any single character except a newline.
Remember that spaces are treated as characters.
For example, p.p matches character strings such as
pep, pip, and pcp.
- *
-
Matches zero or more (as many as there are) occurrences of the single character
that immediately precedes it. For example, bugs* will
match bug (no s), bugs (one s), or bugss (two s's).
The * can follow a metacharacter.
For example, since . (dot) means any character,
.* means "match any number of any character."
Here's a specific
example of this. The command :s/End.*/End/ removes
all characters after End (it replaces the remainder of the
line with nothing).
- ^
-
When used at the start of a regular expression,
requires that the following regular expression be found at the beginning of
the line; for example, ^Part matches
Part when it occurs at the beginning of a line, and ^...
matches the first three characters of a line.
When not at the beginning of a regular expression, ^
stands for itself.
- $
-
When used at the end of a regular expression,
requires that the preceding regular expression be found at the end
of the line; for example, here:$
matches only when here: occurs at the end of a line.
When not at the end of a regular expression, $
stands for itself.
- \
-
Treats the following special character as an ordinary character.
For example,
\. matches an actual period instead of "any single
character," and \* matches an actual asterisk instead of
"any number of a character." The \ (backslash)
prevents the interpretation of a special character.
This prevention is called "escaping the character."
(Use \\ to get a literal backslash.)
- [ ]
-
Matches any one of the characters enclosed between the brackets.
For example,
[AB]
matches either
A
or
B,
and
p[aeiou]t
matches
pat, pet, pit, pot, or put.
A range of consecutive characters can be specified by separating
the first and last characters in the range with a hyphen.
For example, [A-Z] will match any uppercase
letter from A to Z, and [0-9] will match any
digit from 0 to 9.
You can include more than one
range inside brackets, and you can specify a mix of ranges and
separate characters. For example, [:;A-Za-z()]
will match four different punctuation marks, plus all letters.
Most metacharacters lose their special meaning inside brackets,
so you don't need to escape them if you want to use them as
ordinary characters. Within brackets, the three metacharacters
you still need to escape
are \ - ].
The hyphen (-)
acquires meaning as a range specifier; to use an actual hyphen,
you can also place it as the first character inside the
brackets.
A caret (^) has special meaning only when it is the
first character inside the brackets, but in this case the meaning
differs from that of the normal ^ metacharacter.
As the first character within brackets, a ^ reverses their sense: the brackets
will match any one character not in the list. For example,
[^a-z] matches any character that is not a lowercase letter.
- \( \)
-
Saves the pattern enclosed between \( and \)
into a special holding space or "hold buffer."
Up to nine patterns can be saved in this way on a single line.
For example, the pattern:
\(That\) or \(this\)
saves That in hold buffer number 1 and
saves this in hold buffer number 2.
The patterns held can be "replayed" in substitutions by the sequences
\1 to \9.
For example, to rephrase That or this to read
this or That, you could enter:
:%s/\(That\) or \(this\)/\2 or \1/
You can also use the \n
notation within a search or substitute string:
:s/\(abcd\)\1/alphabet-soup/
changes abcdabcd into
alphabet-soup.[21]
- \< \>
-
Matches characters at the beginning (\<) or at the end
(\>) of a word.
The end or beginning of a
word is determined either by a punctuation mark or by a space.
For example, the expression \<ac will match only words
that begin with ac, such as action.
The expression ac\> will match only words
that end with ac, such as maniac.
Neither expression will match react.
Note that unlike \(...\),
these do not have to be used in matched pairs.
- ~
-
Matches whatever regular expression was used in the last
search. For example, if you searched for The, you
could search for Then with /~n.
Note that you can use this pattern only in a regular search
(with /).[22]
It won't work as the pattern in a substitute command. It does,
however, have a similar
meaning in the replacement portion of a substitute command.
Several of the clones support optional, extended regular
expression syntaxes. See Section 8.4 for more information.
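The examples from the list above can be tried outside the editor. The following is a rough sketch using Python's re module, not vi itself. Two syntax differences are assumptions you must keep in mind: Python writes saved groups as ( ) rather than \( \), and uses \b where vi uses \< and \>. The helper name matches and the sample strings are made up for this illustration.

```python
import re

def matches(pattern, text):
    """Return the matched text, or None when the pattern is not found."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# "." matches any single character; "*" repeats the preceding character.
dot = [matches(r"p.p", w) for w in ("pep", "pip", "pcp", "pp")]
star = [matches(r"bugs*", w) for w in ("bug", "bugs", "bugss")]

# Like :s/End.*/End/, this drops everything after "End" on the line.
trimmed = re.sub(r"End.*", "End", "The End of the story")

# "^" and "$" anchor the match to the start and end of the line.
at_start = matches(r"^Part", "Part One")
not_at_start = matches(r"^Part", "In Part One")
at_end = matches(r"here:$", "look here:")

# A backslash makes "." literal; brackets match one listed character,
# and a leading "^" inside brackets negates the list.
literal_dot = [matches(r"1\.", s) for s in ("1.", "12")]
vowels = [matches(r"p[aeiou]t", w) for w in ("pat", "put", "pyt")]
non_lower = matches(r"[^a-z]", "abcXdef")

# vi's \(That\) or \(this\) is written (That) or (this) in Python;
# \1 and \2 replay the saved text in the replacement.
swapped = re.sub(r"(That) or (this)", r"\2 or \1", "That or this")

# A back-reference also works inside the search pattern itself.
soup = re.sub(r"(abcd)\1", "alphabet-soup", "abcdabcd")

# vi's \<ac and ac\> are written with \b in Python.
begins = [matches(r"\bac", w) for w in ("action", "react")]
ends = [matches(r"ac\b", w) for w in ("maniac", "react")]
```

Each variable holds the matched text (or None), so you can print any of them to check a pattern against your own expectations before using the vi equivalent on a real file.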
6.3.2. POSIX Bracket Expressions
We have just described
the use of brackets for matching
any one of the enclosed characters, such as [a-z].
The POSIX standard introduced additional facilities for matching
characters that are not
in the English alphabet. For example, the French è is an alphabetic
character, but the typical character class [a-z] would not
match it.
Additionally,
the standard provides for sequences of characters that should be
treated as a single unit when matching and collating (sorting) string data.
POSIX also formalizes the terminology. A group of characters within
brackets is called
a "bracket expression" in the POSIX
standard. Within bracket expressions, besides literal characters such as
a, !,
and so on, you can have additional components. These are:
Character classes.
A POSIX character class consists of keywords bracketed by
[: and :]. The
keywords describe different classes of characters such as alphabetic
characters, control characters, and so on (see Table 6.1).
Collating symbols.
A collating symbol is a multi-character sequence that should be treated
as a unit. It consists of the characters bracketed by [.
and .].
Equivalence classes.
An equivalence class lists a set of characters that should be considered
equivalent, such as e and
è.
It consists of a named element from the locale,
bracketed by [=
and =].
All three of these constructs must appear inside the square
brackets of a bracket expression.
For example, [[:alpha:]!] matches
any single alphabetic character or the exclamation point, and
[[.ch.]]
matches the collating element
ch, but does not match just the letter
c or the letter
h.
In a French locale,
[[=e=]] might match any of
e, è,
or é. Classes and matching
characters are shown in Table 6.1.
Table 6.1. POSIX Character Classes
Class | Matching Characters
[:alnum:] | Alphanumeric characters
[:alpha:] | Alphabetic characters
[:blank:] | Space and tab characters
[:cntrl:] | Control characters
[:digit:] | Numeric characters
[:graph:] | Printable and visible (non-space) characters
[:lower:] | Lowercase characters
[:print:] | Printable characters (includes whitespace)
[:punct:] | Punctuation characters
[:space:] | Whitespace characters
[:upper:] | Uppercase characters
[:xdigit:] | Hexadecimal digits
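Python's re module does not understand POSIX character classes, so the sketch below emulates a few of them by expanding each [:name:] into an explicit list of ASCII characters. This is only an approximation under stated assumptions: it covers the plain ASCII ("C" locale) meaning of nine classes from Table 6.1, it ignores collating symbols and equivalence classes, and both POSIX_CLASSES and expand_posix_classes are hypothetical names invented for this example.

```python
import re
import string

# ASCII ("C" locale) approximations of some classes from Table 6.1.
POSIX_CLASSES = {
    "alnum": string.ascii_letters + string.digits,
    "alpha": string.ascii_letters,
    "blank": " \t",
    "digit": string.digits,
    "lower": string.ascii_lowercase,
    "punct": string.punctuation,
    "space": string.whitespace,
    "upper": string.ascii_uppercase,
    "xdigit": string.hexdigits,
}

def expand_posix_classes(pattern):
    """Replace each [:name:] with the characters it stands for.

    Hypothetical helper: it assumes every [:name:] already sits inside
    a bracket expression, as POSIX requires, and raises KeyError for a
    class name that is not in the table above.
    """
    return re.sub(
        r"\[:([a-z]+):\]",
        lambda m: re.escape(POSIX_CLASSES[m.group(1)]),
        pattern,
    )

# [[:alpha:]!] matches one alphabetic character or an exclamation point.
alpha_or_bang = re.compile(expand_posix_classes("[[:alpha:]!]"))
results = {ch: bool(alpha_or_bang.fullmatch(ch)) for ch in "a!7è"}
print(results)
```

Note that this ASCII-only emulation rejects è, which is exactly the limitation the POSIX classes were introduced to remove: a locale-aware implementation of [:alpha:] would accept it in a French locale.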
You will have to do some research to determine if you have this
facility in your version of vi. You may
need to use a special option to enable POSIX compliance,
have a particular environment variable set, or use a version of
vi that is in an unusual directory.
vi on HP-UX 9.x (and newer) systems supports
POSIX bracket expressions,
as does /usr/xpg4/bin/vi on Solaris
(but not /usr/bin/vi).
This facility is also available in nvi, and in
elvis 2.1.
As commercial UNIX vendors become standards-compliant,
expect to see this feature become more widespread.
6.3.3. Metacharacters Used in Replacement Strings
When you make global replacements, the regular expressions above
carry their special meaning only within the search portion
(the first part) of the command.
For example, when you type this:
:%s/1\. Start/2. Next, start with $100/
note that the replacement string
treats the characters . and $
literally, without your
having to escape them.
By the same token, let's say you enter:
:%s/[ABC]/[abc]/g
If you're hoping to replace A with
a, B with b,
and C with c,
you'll be surprised. Since brackets behave like
ordinary characters in a replacement string, this command
will change every occurrence of A,
B, or C to the
five-character string [abc].
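The surprise is easy to reproduce outside the editor. The sketch below uses Python's re module as a stand-in for the substitute command; the sample text is invented, and the second call uses a Python function as the replacement, which is a Python feature rather than anything vi offers.

```python
import re

text = "A, B, and C"

# Brackets are literal in the replacement, as in :%s/[ABC]/[abc]/g.
surprise = re.sub(r"[ABC]", "[abc]", text)

# What was probably intended: lowercase whatever letter was matched.
intended = re.sub(r"[ABC]", lambda m: m.group(0).lower(), text)

print(surprise)
print(intended)
```

The first call turns every matched letter into the literal text [abc]; the second computes the replacement from the matched text, which is the kind of variable replacement this section goes on to describe.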
To solve problems like this,
you need a way to specify variable
replacement strings. Fortunately, there are additional metacharacters
that have special meaning in a replacement string.
- \n
-
Is replaced with text matched by the nth pattern
previously saved by \( and
\), where
n is a number from 1 to 9, and previously saved patterns
(kept in hold buffers) are counted
from the left on the line.
See the explanation for
\(
and \) earlier in this chapter.
- \
-
Treats the following special character as an ordinary character.
Backslashes are metacharacters in replacement strings
as well as in search patterns.
To specify a real backslash, type two in a row (\\).
- &
-
Is replaced with the entire text matched by the search pattern when
used in a replacement
string. This is useful when you want to avoid retyping text:
:%s/Yazstremski/&, Carl/
The replacement will say Yazstremski, Carl. The
& can
also replace a variable pattern (as specified by a regular
expression). For example, to surround each line from 1 to 10 with
parentheses, type:
:1,10s/.*/(&)/
The search pattern matches the whole line, and the &
"replays" the line, enclosed in the parentheses you typed.
- ~
-
Has a meaning similar to the one it has in a search pattern:
the string found is replaced with the replacement
text specified in the last substitute command. This is useful for
repeating an edit. For example, you could say
:s/thier/their/ on
one line and repeat the change on another with
:s/thier/~/. The search pattern
doesn't need to be the same, though.
For example, you could say :s/his/their/ on
one line and repeat the replacement on another with
:s/her/~/.[23]
- \u or \l
-
Causes the next character in the replacement string to be changed to
uppercase or lowercase, respectively. For example, to change
yes, doctor into Yes, Doctor, you could say:
:%s/yes, doctor/\uyes, \udoctor/
This is a pointless example, though, since it's easier
just to type the replacement string with initial caps in the
first place. As with any regular expression, \u and
\l are most useful with a variable string. Take, for
example, the command we used earlier:
:%s/\(That\) or \(this\)/\2 or \1/
The result is this or That, but we need to adjust the
cases. We'll use \u to uppercase the first letter in
this (currently saved in hold buffer 2);
we'll use \l to lowercase the first letter in
That (currently saved in hold buffer 1):
:s/\(That\) or \(this\)/\u\2 or \l\1/
The result is This or that. (Don't confuse the number one
with the lowercase l; the one comes after.)
- \U or \L and \e or \E
-
\U and \L
are similar to \u or \l,
but all following characters are
converted to uppercase or lowercase until the end of the
replacement string or until \e or \E is reached.
If there is no \e or \E, all characters of the
replacement text are affected by the \U or \L.
For example, to uppercase Fortran, you could say:
:%s/Fortran/\UFortran/
or, using the & character to repeat the search string:
:%s/Fortran/\U&/
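For comparison, here is a rough Python rendering of the replacement metacharacters above. It is a sketch under stated assumptions, not a description of how vi works internally: Python writes the whole match as \g<0> instead of &, and it has no \u, \l, \U, or \L, so the case changes are done with a replacement function. The function name swap_and_fix_case and the sample strings are made up for this example.

```python
import re

# "&" (the entire matched text) is written \g<0> in Python.
named = re.sub(r"Yazstremski", r"\g<0>, Carl", "Yazstremski")

# Like :s/.*/(&)/ on one line. count=1 stops Python from also
# replacing the empty match it finds at the end of the string.
wrapped = re.sub(r".*", r"(\g<0>)", "a whole line", count=1)

def swap_and_fix_case(m):
    # Mimics the vi replacement  \u\2 or \l\1 : swap the two saved
    # groups, uppercase the first letter of one, lowercase the other.
    first, second = m.group(1), m.group(2)
    return (second[0].upper() + second[1:]
            + " or "
            + first[0].lower() + first[1:])

swapped = re.sub(r"(That) or (this)", swap_and_fix_case, "That or this")

# Mimics \U& : uppercase the entire matched text.
shouted = re.sub(r"Fortran", lambda m: m.group(0).upper(), "Fortran 77")

print(named, wrapped, swapped, shouted, sep="\n")
```

The replacement functions make explicit what the one-letter vi escapes do implicitly: they receive the matched text and build the replacement from it.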
All pattern searches are case-sensitive. That is,
a search for the will
not find The. You can get around this by specifying both
uppercase and lowercase in the pattern:
/[Tt]he
You can also instruct vi to ignore case by typing
:set ic.
See Chapter 7 for additional details.
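The difference between the two approaches can be sketched in Python. The re.IGNORECASE flag is used here as a rough analogue of :set ic; that correspondence is an assumption made for illustration, and the sample lines are invented.

```python
import re

lines = ["the start", "The start", "THE START"]

# /the is case-sensitive; /[Tt]he accepts either case of the first letter.
exact = [bool(re.search(r"the", s)) for s in lines]
either = [bool(re.search(r"[Tt]he", s)) for s in lines]

# Ignoring case altogether, roughly what :set ic does for searches.
ignore = [bool(re.search(r"the", s, re.IGNORECASE)) for s in lines]

print(exact, either, ignore, sep="\n")
```

Note that [Tt]he still misses THE, because only the first letter was given in both cases; ignoring case covers every combination.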
6.3.4. More Substitution Tricks
You should know some additional important facts about the
substitute command:
A simple :s is the same as :s//~/.
In other words, repeat the last substitution.
This can save enormous amounts of time and typing when you
are working your way through a document making the same change
repeatedly, but you don't want to use a global substitution.
The :& command does the same thing. If you think of the & as meaning
"the same thing"
(as in what was just matched), this command is relatively mnemonic.
You can follow the & with a g
to make the substitution
globally on the line, and even use it with a line range:
:%&g repeat the last substitution everywhere
The & key can be used as a vi command
to perform the :& command, i.e., to repeat the
last substitution. This can save even more typing than
:s followed by RETURN: one keystroke versus three.
The :~ command is similar to the :&
command, but with a subtle difference.
The search pattern used is the last regular
expression used in any
command, not necessarily the one used in the last substitute command.
For example,[24]
in the sequence:
:s/red/blue/
:/green
:~
The :~
is equivalent to
:s/green/blue/.
Besides the / character,
you may use any non-alphanumeric, non-whitespace
character as your delimiter,
except backslash, double-quote, and the vertical bar
(\, ",
and |).
This is particularly handy when you have to make a change to
a pathname:
:%s;/user1/tim;/home/tim;g
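To make the delimiter rule concrete, here is a small Python sketch that splits a substitute command on whatever delimiter follows the s. It is a toy parser written for this illustration, not the algorithm any real vi uses: split_substitute is a hypothetical name, the input is assumed to have its address (such as %) already removed, and a backslash is assumed to protect the character after it.

```python
def split_substitute(cmd):
    """Split 's<d>pattern<d>replacement<d>flags' on its delimiter <d>."""
    if len(cmd) < 2 or cmd[0] != "s":
        raise ValueError("not a substitute command: %r" % cmd)
    delim = cmd[1]
    # Delimiters may not be alphanumeric, whitespace, backslash,
    # double-quote, or the vertical bar.
    if delim.isalnum() or delim.isspace() or delim in '\\"|':
        raise ValueError("invalid delimiter: %r" % delim)
    parts, current, i = [], [], 2
    while i < len(cmd):
        ch = cmd[i]
        if ch == "\\" and i + 1 < len(cmd):
            # Keep an escaped character (and its backslash) as-is.
            current.append(cmd[i:i + 2])
            i += 2
            continue
        if ch == delim and len(parts) < 2:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    while len(parts) < 3:
        parts.append("")
    return tuple(parts)

with_semicolons = split_substitute("s;/user1/tim;/home/tim;g")
with_slashes = split_substitute(r"s/\/user1\/tim/\/home\/tim/g")
print(with_semicolons)
print(with_slashes)
```

Both commands describe the same change, but with / as the delimiter every slash in the pathname has to be escaped, which is why a different delimiter is easier to type and to read.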
When the edcompatible option is
enabled, vi remembers the flags
(g for global and c for
confirmation) used on the last substitute, and applies them
to the next one.
This is most useful when you are moving through a file and
you wish to make global substitutions. You can make the
first change:
:s/old/new/g
:set edcompatible
After that, subsequent substitute commands will be global.
Despite the name, no known version of UNIX ed
actually works this way.
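The difference between :& and :~ described earlier in this section comes down to which remembered pattern each command reuses. The toy model below, written in Python, tracks just that state. It is a sketch based only on the behavior described in this section, not on any vi source code: the class and method names are invented, flags and line addressing are ignored, and each substitution changes only the first match on the line.

```python
import re

class SubstituteState:
    """Toy model of the state behind :s, :&, and :~ (not real vi code)."""

    def __init__(self):
        self.last_regex = None        # last pattern used by any command
        self.last_sub_pattern = None  # pattern of the last substitute
        self.last_replacement = None  # replacement of the last substitute

    def search(self, pattern):
        # A plain search such as /green updates only the last regex.
        self.last_regex = pattern

    def substitute(self, pattern, replacement, line):
        # :s/pattern/replacement/ updates all three pieces of state.
        self.last_regex = pattern
        self.last_sub_pattern = pattern
        self.last_replacement = replacement
        return re.sub(pattern, replacement, line, count=1)

    def repeat_amp(self, line):
        # :& reuses the pattern of the last substitute command.
        return re.sub(self.last_sub_pattern, self.last_replacement,
                      line, count=1)

    def repeat_tilde(self, line):
        # :~ reuses the last regex from any command.
        return re.sub(self.last_regex, self.last_replacement,
                      line, count=1)

state = SubstituteState()
first = state.substitute("red", "blue", "a red door")   # :s/red/blue/
state.search("green")                                   # :/green
amp = state.repeat_amp("red and green")                 # :&
tilde = state.repeat_tilde("red and green")             # :~
print(first, amp, tilde, sep="\n")
```

After the search for green, :& still replaces red, while :~ replaces green, matching the :s/green/blue/ equivalence given above.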
Copyright © 2003 O'Reilly & Associates. All rights reserved.