26.4 Using Metacharacters in Regular Expressions
There are three important parts to a regular expression:
A simple example that demonstrates all three parts is the regular expression: ^#* The caret (
There are two main types of regular expressions: simple regular expressions and extended regular expressions. (As we'll see later in the article, the boundaries between the two types have become blurred as regular expressions have evolved.) A few utilities like awk and egrep use the extended regular expression. Most use the simple regular expression. From now on, if I talk about a "regular expression" (without specifying simple or extended), I am describing a feature common to both types.
The commands that understand just simple regular expressions are: vi, sed, grep, csplit, dbx, more, ed, expr, lex, and pg. The utilities awk, nawk, and egrep understand extended regular expressions.
[The situation is complicated by the fact that simple regular expressions have evolved over time, and so there are versions of "simple regular expressions" that support extensions missing from extended regular expressions! Bruce explains the incompatibility at the end of his article. -TOR]
26.4.1 The Anchor Characters: ^ and $
Most UNIX text facilities are line-oriented. Searching for patterns
that span several lines is not easy to do.
You see, the end-of-line character is not included in the block of
text that is searched.
It is a separator.
Regular expressions examine the text between the separators.
If you want to search for a pattern that is at one end or the other,
you use anchors.
The caret (
The use of
It is one of those choices that other utilities go along with to
maintain consistency.
For instance,
26.4.2 Matching a Character with a Character Set
The simplest character set is a character.
The regular expression
Some characters have a special meaning in regular expressions.
If you want to search for such a character as itself, escape it with a
backslash (
26.4.3 Match any Character with . (Dot)
The dot (
26.4.4 Specifying a Range of Characters with [...]
If you want to match specific characters, you can use
square brackets,
the regular expression would be:
[To be specific:
A range is a contiguous series of characters, from low to high, in the
ASCII chart (51.3).
For example,
26.4.5 Exceptions in a Character Set
You can easily search for all characters except those in square
brackets by putting a
caret (
Like the anchors in places that can't be considered an anchor, the
right square bracket (
26.4.6 Repeating Character Sets with
| Regular Expression | Matches |
|---|---|
| * | Any line with a * |
| \* | Any line with a * |
| \\ | Any line with a \ |
| ^* | Any line starting with a * |
| ^A* | Any line |
| ^A\* | Any line starting with an A* |
| ^AA* | Any line starting with one A |
| ^AA*B | Any line starting with one or more A's followed by a B |
| ^A\{4,8\}B | Any line starting with four, five, six, seven, or eight A's followed by a B |
| ^A\{4,\}B | Any line starting with four or more A's followed by a B |
| ^A\{4\}B | Any line starting with an AAAAB |
| \{4,8\} | Any line with a {4,8} |
| A{4,8} | Any line with an A{4,8} |
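A few rows of the table can be tried directly with grep. The following is only a sketch: the sample lines are made up for illustration, and it assumes a grep whose simple regular expressions support the \{min,max\} modifier.

```shell
# Hypothetical sample lines, one per line (not from the original text).
sample='AB
AAAAB
AAAAAAAAAB
xAAAAB
A*B'

# ^AA*B : one or more A's at the start of the line, followed by a B.
star=$(printf '%s\n' "$sample" | grep '^AA*B')

# ^A\{4,8\}B : four to eight A's at the start of the line, followed by a B.
range=$(printf '%s\n' "$sample" | grep '^A\{4,8\}B')

# ^A\*B : the backslash turns * into an ordinary character.
literal=$(printf '%s\n' "$sample" | grep '^A\*B')

printf 'star:\n%s\nrange:\n%s\nliteral:\n%s\n' "$star" "$range" "$literal"
```

The line with nine A's matches the first pattern but not the second, because the ^ anchor leaves no way to skip the extra A.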
Searching for a word isn't quite as simple as it at first appears. The string the will match the word other. You can put spaces before and after the letters and use this regular expression: " the ". However, this does not match words at the beginning or the end of the line. And it does not match the case where there is a punctuation mark after the word.
There is an easy solution, at least in many versions of ed, ex, and vi. The characters \< and \> are similar to the ^ and $ anchors, as they don't occupy a position of a character. They do anchor the expression between them so that it matches only on a word boundary. The pattern to search for the words the and The would be: \<[tT]he\>.
Let's define a "word boundary." The character before the t or T must be either a newline character or anything except a letter, digit, or underscore (_). The character after the e must also be a character other than a digit, letter, or underscore, or it could be the end-of-line character.
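The word-boundary rule can be checked with a short experiment. This sketch assumes GNU grep, which accepts \< and \> as an extension; other versions of grep may not. The sample lines are invented for the test.

```shell
# Hypothetical sample lines (not from the original text).
sample='the cat
The end.
other things
bathe
see the_file
(the)'

# \<[tT]he\> : "the" or "The" only when it stands alone as a word.
words=$(printf '%s\n' "$sample" | grep '\<[tT]he\>')

# For contrast, count the lines that contain the bare string "the".
plain=$(printf '%s\n' "$sample" | grep -c 'the')

printf '%s\n' "$words"
printf 'lines containing the bare string: %s\n' "$plain"
```

The anchored pattern keeps "the cat", "The end.", and "(the)". It rejects "other" and "bathe" because a letter precedes the t, and "the_file" because an underscore follows the e.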
Another pattern that requires a special mechanism is searching for repeated words. The expression [a-z][a-z] will match any two lowercase letters. If you wanted to search for lines that had two adjoining identical letters, the above pattern wouldn't help. You need a way to remember what you found and see if the same pattern occurs again. In some programs, you can mark part of a pattern using \( and \). You can recall the remembered pattern with \ followed by a single digit. Therefore, to search for two identical letters, use: \([a-z]\)\1. You can have nine different remembered patterns. Each occurrence of \( starts a new pattern. The regular expression to match a five-letter palindrome (e.g., "radar") is: \([a-z]\)\([a-z]\)[a-z]\2\1.
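Both remembered-pattern expressions can be tried with grep, assuming a version that supports \( \) and \1 in the same expression. The word list below is made up for this sketch.

```shell
# Hypothetical word list (not from the original text).
sample='book
radar
level
cat
banana'

# \([a-z]\)\1 : a lowercase letter followed by the same letter again.
doubled=$(printf '%s\n' "$sample" | grep '\([a-z]\)\1')

# Five-letter palindrome: letters 4 and 5 must repeat letters 2 and 1.
palins=$(printf '%s\n' "$sample" | grep '\([a-z]\)\([a-z]\)[a-z]\2\1')

printf 'doubled:\n%s\npalindromes:\n%s\n' "$doubled" "$palins"
```

Note that banana shows up in the palindrome list: the pattern is not anchored, so it finds the palindrome "anana" inside the word.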
[Some versions of some programs can't handle \( \) in the same regular expression as \1, etc. In all versions of sed, you're safe if you use \( \) on the pattern side of an s command, and \1, etc., on the replacement side (34.10). -JP]
That completes a discussion of simple regular expressions. Before I discuss the extensions that extended expressions offer, I want to mention two potential problem areas.
The \< and \> characters were introduced in the vi editor. The other programs didn't have this ability at that time. Also, the \{min,max\} modifier is new, and earlier utilities didn't have this ability. This makes it difficult for the novice user of regular expressions, because it seems as if each utility has a different convention. Sun has retrofitted the newest regular expression library to all of their programs, so they all have the same ability. If you try to use these newer features on other vendors' machines, you might find they don't work the same way.
The other potential point of confusion is the extent of the pattern matches (26.6). Regular expressions match the longest possible pattern. That is, the regular expression A.*B matches AAB as well as AAAABBBBABCCCCBBBAAAB. This doesn't cause many problems using grep, because an oversight in a regular expression will just match more lines than desired. If you use sed, and your patterns get carried away, you may end up deleting or changing more than you want to.
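To see how far a match reaches, you can have sed mark it instead of deleting it. This is a small sketch on a made-up input line; the narrower pattern at the end is one common way to rein in the match, not something the discussion above prescribes.

```shell
# Hypothetical input line (not from the original text).
line='xxAAByyABzz'

# Wrap whatever A.*B matched in brackets; & stands for the matched text.
bracketed=$(printf '%s\n' "$line" | sed 's/A.*B/[&]/')

# Deleting with the same pattern removes everything up to the LAST B.
greedy=$(printf '%s\n' "$line" | sed 's/A.*B//')

# A[^B]*B cannot run past the first B, so much less is deleted.
narrow=$(printf '%s\n' "$line" | sed 's/A[^B]*B//')

printf '%s\n%s\n%s\n' "$bracketed" "$greedy" "$narrow"
```

The three output lines are xx[AAByyAB]zz, xxzz, and xxyyABzz: the first two show the match swallowing "yy" and the second AB as well.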
Two programs use extended regular expressions: egrep and awk. [perl uses expressions that are even more extended. -JP]
With these extensions, those special characters preceded by a backslash no longer have special meaning: \{, \}, \<, \>, \(, \), as well as \digit. There is a very good reason for this, which I will delay explaining to build up suspense.
The question mark (?) matches zero or one instances of the character set before it, and the plus sign (+) matches one or more copies of the character set. You can't use \{ and \} in extended regular expressions, but if you could, you might consider ? to be the same as \{0,1\} and + to be the same as \{1,\}.
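The equivalence can be checked by running each extended modifier next to its simple counterpart. This sketch uses grep -E, the POSIX spelling of egrep, for the extended patterns and plain grep for the \{ \} forms. The sample words are invented.

```shell
# Hypothetical sample words (not from the original text).
sample='color
colour
colouur
clr'

# Extended: ? means zero or one u, + means one or more u's.
opt_ext=$(printf '%s\n' "$sample" | grep -E 'colou?r')
plus_ext=$(printf '%s\n' "$sample" | grep -E 'colou+r')

# Simple: the same two ideas written with \{min,max\}.
opt_simple=$(printf '%s\n' "$sample" | grep 'colou\{0,1\}r')
plus_simple=$(printf '%s\n' "$sample" | grep 'colou\{1,\}r')

printf 'optional u:\n%s\none or more u:\n%s\n' "$opt_ext" "$plus_ext"
```

Each pair selects the same lines: color and colour for the optional u, colour and colouur for one or more.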
By now, you are wondering why the extended regular expressions are even worth using. Except for two abbreviations, there seem to be no advantages and a lot of disadvantages. Therefore, examples would be useful.
The three important characters in the extended regular expressions are (, |, and ). Parentheses are used to group expressions; the vertical bar acts as an OR operator. Together, they let you match a choice of patterns.
As an example, you can use egrep to print all From: and Subject: lines from your incoming mail:
% egrep '^(From|Subject): ' /usr/spool/mail/$USER
All lines starting with From: or Subject: will be printed. There is no easy way to do this with simple regular expressions. You could try something like ^[FS][ru][ob][mj]e*c*t*: and hope you don't have any lines that start with Sromeet:.
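The difference between the two approaches is easy to see on a small made-up mailbox. In this sketch the header lines are hypothetical, and grep -E stands in for egrep.

```shell
# Hypothetical header lines (not from the original text).
mail='From: alice
To: bob
Subject: lunch
Sromeet: oops
Fromage: cheese'

# Extended alternation: exactly From: or Subject: at the start of the line.
extended=$(printf '%s\n' "$mail" | grep -E '^(From|Subject): ')

# The simple-expression approximation also lets Sromeet: through.
simple=$(printf '%s\n' "$mail" | grep '^[FS][ru][ob][mj]e*c*t*:')

printf 'extended:\n%s\nsimple:\n%s\n' "$extended" "$simple"
```

The extended pattern prints only the From: and Subject: lines. The approximation prints those two plus the Sromeet: line.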
Extended expressions don't have the \< and \> characters. You can compensate by using the alternation mechanism. Matching the word "the" in the beginning, middle, or end of a sentence or at the end of a line can be done with the extended regular expression: (^| )the([^a-z]|$). There are two choices before the word: a space or the beginning of a line. Following the word, there must be something besides a lowercase letter or else the end of the line.
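Here is a quick check of that expression, again with grep -E standing in for egrep and with sample lines invented for the purpose.

```shell
# Hypothetical sample lines (not from the original text).
sample='the start
say the
in the, middle
other
bathe
The cap'

# (^| )the([^a-z]|$) : "the" preceded by line start or a space, and
# followed by a non-lowercase character or the end of the line.
found=$(printf '%s\n' "$sample" | grep -E '(^| )the([^a-z]|$)')

printf '%s\n' "$found"
```

The first three lines match. "other" and "bathe" fail the test before the word, and "The cap" fails because the expression only looks for a lowercase t.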
One extra bonus with extended regular expressions is the ability to use the *, +, and ? modifiers after a (...) grouping. Here are two ways to match "a simple problem", "an easy problem", as well as "a problem"; the second expression is more exact:
% egrep "a[n]? (simple|easy)? ?problem" data
% egrep "a[n]? ((simple|easy) )?problem" data
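To see why the second expression is more exact, run both against a line that should not match. The data below is hypothetical, and grep -E is used in place of egrep.

```shell
# Hypothetical data; the last line is missing a space and should be rejected.
data='a simple problem
an easy problem
a problem
a simpleproblem'

# First form: the adjective and the space after it are optional separately.
loose=$(printf '%s\n' "$data" | grep -E 'a[n]? (simple|easy)? ?problem')

# Second form: the adjective and its trailing space are optional as a unit.
exact=$(printf '%s\n' "$data" | grep -E 'a[n]? ((simple|easy) )?problem')

printf 'loose:\n%s\nexact:\n%s\n' "$loose" "$exact"
```

The first form accepts all four lines, including "a simpleproblem". The second form accepts only the three intended phrases.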
I promised to explain why the backslash characters don't work in extended regular expressions. Well, perhaps the \{...\} and \<...\> could be added to the extended expressions, but it might confuse people if those characters are added and the \(...\) are not. And there is no way to add that functionality to the extended expressions without changing the current usage. Do you see why? It's quite simple. If ( has a special meaning, then \( must be the ordinary character. This is the opposite of the simple regular expressions, where ( is ordinary and \( is special. The usage of the parentheses is incompatible, and any change could break old programs.
If the extended expression used (...|...) as regular characters, and \(...\|...\) for specifying alternate patterns, then it is possible to have one set of regular expressions that has full functionality. This is exactly what GNU Emacs (32.1) does, by the way; it combines all of the features of regular and extended expressions with one syntax.