A regular expression is a pattern. Some parts of the pattern match a single character of a particular type in the string. Other parts of the pattern match multiple characters. First, we'll visit the single-character patterns, and then the multiple-character patterns.
The simplest and most common pattern-matching character in regular expressions is a single character that matches itself. In other words, putting a letter a in a regular expression requires a corresponding letter a in the string.
The next most common pattern-matching character is the dot (.). This character matches any single character except newline (\n). For example, the pattern /a./ matches any two-character sequence that starts with a and is not a\n.
A pattern-matching character class is represented by a pair of open and close square brackets and a list of characters between the brackets. One and only one of these characters must be present at the corresponding part of the string for the pattern to match. For example, /[abcde]/ matches a string containing any one of the first five letters of the lowercase alphabet, while /[aeiouAEIOU]/ matches any of the five vowels in either lower- or uppercase. If you want to put a right bracket (]) in the list, put a backslash in front of it, or put it as the first character within the list.
Ranges of characters (like a through z) can be abbreviated by showing the end points of the range separated by a dash (-); to get a literal dash in the list, precede the dash with a backslash or place it at the end. Here are some other examples:
[0123456789] # match any single digit
[0-9] # same thing
[0-9\-] # match 0-9, or minus
[a-z0-9] # match any single lowercase letter or digit
[a-zA-Z0-9_] # match any single letter, digit, or underscore
There's also a negated character class, which is the same as a character class, but has a leading up arrow (or caret: ^) immediately after the left bracket. This character class matches any single character that is not in the list. For example:
[^0-9] # match any single non-digit
[^aeiouAEIOU] # match any single non-vowel
[^\^] # match single character except an up-arrow
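As a quick check of the classes above, the following sketch tries a few of them against sample strings. The helper names (has_digit and so on) are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns 1 if the string contains a match, else 0.
sub has_digit     { my ($s) = @_; return $s =~ /[0-9]/        ? 1 : 0 }
sub has_non_digit { my ($s) = @_; return $s =~ /[^0-9]/       ? 1 : 0 }
sub has_vowel     { my ($s) = @_; return $s =~ /[aeiouAEIOU]/ ? 1 : 0 }

for my $text ("42", "room 42", "rhythm") {
    printf "%-8s digit=%d non-digit=%d vowel=%d\n",
        $text, has_digit($text), has_non_digit($text), has_vowel($text);
}
```

Note that a negated class still has to match a character: "42" fails /[^0-9]/ because every character in it is a digit, not because the string is rejected as a whole.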
For your convenience, some common character classes are predefined, as described in Table 7.1.
Table 7.1: Predefined Character Class Abbreviations
Construct | Equivalent Class | Negated Construct | Equivalent Negated Class
\d (a digit) | [0-9] | \D (digits, not!) | [^0-9]
\w (word char) | [a-zA-Z0-9_] | \W (words, not!) | [^a-zA-Z0-9_]
\s (space char) | [ \r\t\n\f] | \S (space, not!) | [^ \r\t\n\f]
The \d pattern matches one digit. The \w pattern matches one word character, although the pattern is really matching any character that is legal in a Perl variable name. The \s pattern matches one space (whitespace), defined here as spaces, carriage returns, tabs, line feeds, and form feeds. The uppercase versions match the complements of these classes. Thus, \W matches one character that can't be in an identifier, \S matches one character that is not whitespace (including letters, punctuation marks, control characters, etc.), and \D matches any single non-digit character.
These abbreviated classes can be used as part of other character classes as well:
[\da-fA-F] # match one hex digit
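To see how the abbreviations overlap, here is a small sketch that labels a single character with every class it belongs to. The classify function is a hypothetical helper, and the sample characters are assumed to be plain ASCII.

```perl
use strict;
use warnings;

# Hypothetical helper: list the predefined classes a single character falls into.
sub classify {
    my ($char) = @_;
    my @names;
    push @names, "digit" if $char =~ /\d/;
    push @names, "word"  if $char =~ /\w/;
    push @names, "space" if $char =~ /\s/;
    push @names, "hex"   if $char =~ /[\da-fA-F]/;   # \d used inside a larger class
    return join ",", @names;
}

for my $char ("7", "F", "g", "_", "\t", "-") {
    my $shown = $char eq "\t" ? "\\t" : $char;        # make the tab visible
    printf "%-2s => %s\n", $shown, classify($char) || "(none)";
}
```

A digit is also a word character, so "7" lands in three of the four classes, while "-" lands in none of them.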
The true power of regular expressions comes into play when you can say "one or more of these" or "up to five of those." Let's talk about how these cases are handled.
The first (and probably most obvious) grouping pattern is sequence. In using this pattern, Perl matches abc as an a followed by a b followed by a c. This pattern seems simple, but we're giving it a name so we can talk about it later.
We've already seen the asterisk (*) as a grouping pattern. The asterisk indicates zero or more of the immediately previous character (or character class).
Two other grouping patterns that work in the same manner are the plus sign (+), meaning one or more of the immediately previous character, and the question mark (?), meaning zero or one of the immediately previous character. For example, the regular expression /fo+ba?r/ matches an f followed by one or more o's, followed by a b, followed by an optional a, followed by an r.
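A short sketch makes the /fo+ba?r/ description concrete. The sample strings are our own, picked to exercise each multiplier.

```perl
use strict;
use warnings;

my %result;
for my $s ("fobar", "foooobr", "fbar", "fobaar") {
    # + needs at least one o; ? allows at most one a
    $result{$s} = $s =~ /fo+ba?r/ ? "match" : "no match";
    print "$s: $result{$s}\n";
}
```

Here fbar fails because + insists on at least one o, and fobaar fails because ? permits only a single a before the r.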
In all three of these grouping patterns, the patterns are greedy. If such a multiplier has a chance to match between five and ten characters, it'll pick the ten-character string every time. For example,
$_ = "fred xxxxxxxxxx barney";
s/x+/boom/;
always replaces all consecutive x's with boom (resulting in fred boom barney), rather than just one or two x's, even though a shorter set of x's would also match the same regular expression.
If you need to say "five to ten" x's, you could get away with putting five x's followed by five x's each immediately followed by a question mark. But this looks ugly. Instead, an easier way exists: the general multiplier. The general multiplier consists of a pair of matching curly braces with one or two numbers inside, as in /x{5,10}/. The immediately preceding character (in this case, the letter x) must be found within the indicated number of repetitions (five through ten here).
If you leave off the second number, as in /x{5,}/, you indicate "that many or more" (five or more in this case), and if you leave off the comma, as in /x{5}/, you indicate "exactly this many" (five x's). To get five or fewer x's, you must put the zero in, as in /x{0,5}/.
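The four forms of the general multiplier can be compared by substituting each into the same string. This is only a sketch: the twelve-x test string and the variable names are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $text = "fred " . ("x" x 12) . " barney";   # twelve consecutive x's

# Copy $text, then substitute in the copy, once per multiplier form.
(my $range = $text) =~ s/x{5,10}/boom/;   # greedy: takes ten x's, leaves two
(my $exact = $text) =~ s/x{5}/boom/;      # takes exactly five, leaves seven
(my $open  = $text) =~ s/x{5,}/boom/;     # five or more: takes all twelve
(my $upto  = $text) =~ s/x{0,5}/boom/;    # zero x's already match at position 0

print "$_\n" for $range, $exact, $open, $upto;
```

The last line is worth a second look: because /x{0,5}/ is satisfied by zero x's, it matches the empty string at the very start of the text, and boom is inserted in front of fred rather than replacing any x's.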
So, the regular expression /a.{5}b/ matches the letter a separated from the letter b by any five non-newline characters at any point in the string. (Recall that a period matches any single non-newline character, and we're matching five here.) The five characters do not need to be the same. (We'll learn how to force them to be the same in the next section.)
We could dispense with *, +, and ? entirely, because they are completely equivalent to {0,}, {1,}, and {0,1}. But it's easier to type the equivalent single punctuation character, and more familiar as well.
If two multipliers occur in a single expression, the greedy rule is augmented with leftmost is greediest. For example:
$_ = "a xxx c xxxxxxxx c xxx d";
/a.*c.*d/;
In this case, the first .* in the regular expression matches all characters up to the second c, even though matching only the characters up to the first c would still allow the entire regular expression to match. Right now, this distinction is not important (the pattern would match either way), but later when we can look at parts of the regular expression that matched, the distinction will matter quite a bit.
We can force any multiplier to be nongreedy (or lazy) by following it with a question mark:
$_ = "a xxx c xxxxxxxx c xxx d";
/a.*?c.*d/;
Here, the a.*?c matches the fewest characters between the a and c, not the most characters. This means the leftmost c is matched, not the rightmost. You can put such a question-mark modifier after any of the multipliers (?, +, *, and {m,n}).
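One way to watch the difference between greedy and lazy matching is to replace whatever a.*c or a.*?c matched with a visible marker. The marker [] and the variable names below are our own choices for this sketch.

```perl
use strict;
use warnings;

my $text = "a xxx c xxxxxxxx c xxx d";

# Copy $text, then replace the matched span with [] in the copy.
(my $greedy = $text) =~ s/a.*c/[]/;    # .* runs to the rightmost c
(my $lazy   = $text) =~ s/a.*?c/[]/;   # .*? stops at the leftmost c

print "greedy: $greedy\n";   # greedy: [] xxx d
print "lazy:   $lazy\n";     # lazy:   [] xxxxxxxx c xxx d
```

Both patterns succeed on the same string; only the amount of text they consume differs.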
What if the string and regular expression were slightly altered, say, to:
$_ = "a xxx ce xxxxxxxx ci xxx d";
/a.*ce.*d/;
In this case, if the .* matches the most characters possible before the next c, the next regular expression character (e) doesn't match the next character of the string (i). In this case, we get automatic backtracking. The multiplier is unwound and retried, stopping at someplace earlier (in this case, at the earlier c, next to the e). A complex regular expression may involve many such levels of backtracking, leading to long execution times. In a case like this one, making that match lazy (with a trailing ?) will actually simplify the work that Perl has to perform.
Another grouping operator is a pair of open and close parentheses around any part of the pattern. This operator doesn't change whether the pattern matches, but instead causes the part of the string matched by the pattern to be remembered, so that it may be referenced later. So, for example, (a) still matches an a, and ([a-z]) still matches any single lowercase letter.
To recall a memorized part of a string, you must include a backslash followed by an integer. This pattern construct represents the same sequence of characters matched earlier in the same-numbered pair of parentheses (counting from one). For example:
/fred(.)barney\1/;
matches a string consisting of fred, followed by any single non-newline character, followed by barney, followed by that same single character. So, the pattern matches fredxbarneyx, but not fredxbarneyy. Compare that pattern with:
/fred.barney./;
in which the two unspecified characters can be the same, or different.
Where did the 1 come from? The 1 indicates the first parenthesized part of the regular expression. If there's more than one, the second part (counting the left parentheses from left to right) is referenced as \2, the third as \3, and so on. For example:
/a(.)b(.)c\2d\1/;
matches an a, a character (call it #1), a b, another character (call it #2), a c, the character #2, a d, and the character #1. So, the pattern matches axbycydx, for example.
The referenced part can be more than a single character. For example,
/a(.*)b\1c/;
matches an a, followed by any number of characters (even zero), followed by b, followed by that same sequence of characters, followed by c. So, the pattern would match aFREDbFREDc, or even abc, but not aXXbXXXc.
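The backreference examples above can be checked directly. In this sketch the hash names are hypothetical; the patterns and test strings come from the text.

```perl
use strict;
use warnings;

# \1 must repeat exactly what the first pair of parentheses matched.
my %single;
for my $s ("fredxbarneyx", "fredxbarneyy") {
    $single{$s} = $s =~ /fred(.)barney\1/ ? 1 : 0;
}

# The remembered part may be several characters long, or empty.
my %multi;
for my $s ("aFREDbFREDc", "abc", "aXXbXXXc") {
    $multi{$s} = $s =~ /a(.*)b\1c/ ? 1 : 0;
}

print "$_: $single{$_}\n" for sort keys %single;
print "$_: $multi{$_}\n"  for sort keys %multi;
```

In abc the parentheses remember an empty string, so \1 also matches nothing and the pattern still succeeds.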
Another grouping construct is alternation, as in a|b|c. This construct matches exactly one of the alternatives (a or b or c, in this case). This construct works even if the alternatives have multiple characters, as in /song|blue/, which matches either song or blue. (For single-character alternatives, you're definitely better off with a character class like /[abc]/.)
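Here is a brief sketch of alternation in use, filtering a list with grep. The sample titles are invented for the illustration.

```perl
use strict;
use warnings;

my @titles = ("a song of ice", "blue suede shoes", "yellow submarine");

# Keep the titles that contain either alternative.
my @hits = grep { /song|blue/ } @titles;

# For single characters, alternation and a character class select the same strings.
my @by_alternation = grep { /a|b|c/ } ("xay", "xyz");
my @by_class       = grep { /[abc]/ } ("xay", "xyz");

print "hits: ", join("; ", @hits), "\n";
print "alternation: @by_alternation, class: @by_class\n";
```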
What if we wanted to match songbird or bluebird? We could write /songbird|bluebird/, but that bird part shouldn't have to be in there twice. In fact, there's a way out, but we have to talk about the precedence of grouping patterns, which is covered later in the section "Precedence."
Several special notations anchor a pattern. Normally, when a pattern is matched against the string, the beginning of the pattern is dragged through the string from left to right, matching at the first possible opportunity. Anchors allow you to ensure that parts of the pattern line up with particular parts of the string.
The first pair of anchors requires that a particular part of the match be located either at a word boundary or not at a word boundary. The \b anchor requires a word boundary at the indicated point for the pattern to match. A word boundary is the place between characters that match \w and \W, or between characters matching \w and the beginning or ending of the string. Note that this description has little to do with English words and a lot more to do with C symbols, but that's as close as we get. For example:
/fred\b/; # matches fred, but not frederick
/\bmo/; # matches moe and mole, but not Elmo
/\bFred\b/; # matches Fred but not Frederick or alFred
/\b\+\b/; # matches "x+y" but not "++" or " + "
/abc\bdef/; # never matches (impossible for a boundary there)
Likewise, \B requires that there not be a word boundary at the indicated point. For example:
/\bFred\B/; # matches "Frederick" but not "Fred Flintstone"
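The boundary examples can be run as a batch. The hits helper below is hypothetical; it uses Perl's qr// operator to pass each pattern around as a value, and all test strings are our own.

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 or 0 for each string, depending on whether $re matches.
sub hits {
    my ($re, @strings) = @_;
    return map { ($_ =~ $re) ? 1 : 0 } @strings;
}

my @ends   = hits(qr/fred\b/,   "fred", "frederick", "alfred!");
my @starts = hits(qr/\bmo/,     "moe", "mole", "Elmo");
my @plus   = hits(qr/\b\+\b/,   "x+y", "++", " + ");
my @inside = hits(qr/\bFred\B/, "Frederick", "Fred Flintstone");

print "fred\\b:    @ends\n";     # 1 0 1
print "\\bmo:      @starts\n";   # 1 1 0
print "\\b\\+\\b:   @plus\n";    # 1 0 0
print "\\bFred\\B: @inside\n";   # 1 0
```

The string alfred! matches /fred\b/ because the exclamation point is not a word character, so a boundary falls between the d and the !.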
Two more anchors require that a particular part of the pattern be next to an end of the string. The caret (^) matches the beginning of the string. For example, ^a matches an a if, and only if, the a is the first character of the string. Outside a character class, Perl treats the caret as an anchor wherever it appears, so a^ does not match the two characters a and ^; that pattern demands the beginning of the string immediately after an a, which cannot happen. If you need the caret to be a literal caret, put a backslash in front of it.
The $, like the ^, anchors the pattern, but to the end of the string, not the beginning. In other words, c$ matches a c only if it occurs at the end of the string. A dollar sign anywhere else in the pattern is probably going to be interpreted as scalar variable interpolation, so you'll most likely need to backslash it to match a literal dollar sign in the string.
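The following sketch exercises both anchors, along with the backslashed forms that match a literal caret and a literal dollar sign. The variable names and sample strings are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $starts_with_a = "abc" =~ /^a/ ? 1 : 0;   # a is the first character
my $a_not_first   = "cab" =~ /^a/ ? 1 : 0;   # there is an a, but not at the start
my $ends_with_c   = "abc" =~ /c$/ ? 1 : 0;   # c is the last character
my $c_not_last    = "cab" =~ /c$/ ? 1 : 0;   # there is a c, but not at the end

# Backslashed, both characters lose their anchoring role.
my $literal_caret  = "2^3"      =~ /\^/  ? 1 : 0;
my $literal_dollar = 'cost: $5' =~ /\$5/ ? 1 : 0;   # single quotes keep $5 literal in the string

print "^a: $starts_with_a $a_not_first\n";
print "c\$: $ends_with_c $c_not_last\n";
print "literals: $literal_caret $literal_dollar\n";
```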
Other anchors are supported, including \A, \Z, and lookahead anchors created via (?=...) and (?!...). These anchors are described fully in Chapter 2 of Programming Perl and the perlre documentation.
So what happens when we get a|b* together? Is this a or b any number of times, or is it either a single a or any number of b's?
Well, just as operators have precedence, the grouping and anchoring patterns also have precedence. The precedence of patterns from highest to lowest is given in Table 7.2.
According to the table, * has a higher precedence than |. So /a|b*/ is interpreted as a single a, or any number of b's.
What if we want the other meaning, as in "any number of a's or b's"? We simply throw in a pair of parentheses. In this case, we enclose the part of the expression that the * operator should apply to inside parentheses, and we are done, as (a|b)*. If you want to clarify the first expression, you can redundantly parenthesize it with a|(b*).
When you use parentheses to affect precedence, they also trigger the memory, as shown earlier in this chapter. That is, this set of parentheses counts when you are figuring out whether something is \2, \3, or whatever. If you want to use parentheses without triggering memory, use the form (?:...) instead of (...). This form still allows for multipliers, but doesn't cause you to throw off your counting by using up another $4 or whatever. For example, /(?:Fred|Wilma) Flintstone/ does not store anything into $1; it's just there for grouping.
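To see how (?:...) changes the numbering, the sketch below collects the remembered parts of a match into an array. It relies on a Perl behavior not covered above: in list context, a successful match returns the remembered substrings in order. The second pair of parentheses around Flintstone is added only for this illustration.

```perl
use strict;
use warnings;

my $name = "Wilma Flintstone";

# Both pairs of parentheses remember: the first name is \1, the surname is \2.
my @with_memory = $name =~ /(Fred|Wilma) (Flintstone)/;

# (?:...) groups without remembering, so the surname becomes \1.
my @without_memory = $name =~ /(?:Fred|Wilma) (Flintstone)/;

print "with memory:    @with_memory\n";      # with memory:    Wilma Flintstone
print "without memory: @without_memory\n";   # without memory: Flintstone
```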
Here are some other examples of regular expressions, and the effect of parentheses:
abc* # matches ab, abc, abcc, abccc, abcccc, and so on
(abc)* # matches "", abc, abcabc, abcabcabc, and so on
^x|y # matches x at the beginning of line, or y anywhere
^(x|y) # matches either x or y at the beginning of a line
a|bc|d # a, or bc, or d
(a|b)(c|d) # ac, ad, bc, or bd
(song|blue)bird # songbird or bluebird
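A few of these precedence cases can be verified with a small sketch. The whole function is a hypothetical helper that wraps a pattern in ^(?:...)$ so that it must account for the entire string, which makes the effect of the parentheses easier to see.

```perl
use strict;
use warnings;

# Hypothetical helper: does $pattern, taken as a unit, match all of $s?
sub whole {
    my ($pattern, $s) = @_;
    return $s =~ /^(?:$pattern)$/ ? 1 : 0;
}

print "a|b*   on bbb:    ", whole('a|b*',   'bbb'),    "\n";   # 1: any number of b's
print "a|b*   on ab:     ", whole('a|b*',   'ab'),     "\n";   # 0: not a mix of a's and b's
print "(a|b)* on ab:     ", whole('(a|b)*', 'ab'),     "\n";   # 1: the * covers both
print "abc*   on abccc:  ", whole('abc*',   'abccc'),  "\n";   # 1: the * applies to c only
print "abc*   on abcabc: ", whole('abc*',   'abcabc'), "\n";   # 0
print "(abc)* on abcabc: ", whole('(abc)*', 'abcabc'), "\n";   # 1: the * applies to abc

# Without the wrapper, ^x|y anchors only the x alternative.
my $loose = "any" =~ /^x|y/   ? 1 : 0;   # y may appear anywhere
my $tight = "any" =~ /^(x|y)/ ? 1 : 0;   # x or y must come first
print "^x|y: $loose, ^(x|y): $tight\n";
```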