A regular expression is a pattern. Some parts of the pattern match a single character of a particular type in the string. Other parts of the pattern match multiple characters. First, we'll visit the single-character patterns and then the multiple-character patterns.
The simplest and most common pattern-matching character in regular expressions is a single character that matches itself. In other words, putting a letter a in a regular expression requires a corresponding letter a in the string.
The next most common pattern-matching character is the dot ".". This matches any single character except newline (\n). For example, the pattern /a./ matches any two-letter sequence that starts with a and is not "a\n".
A pattern-matching character class is represented by a pair of open and close square brackets and a list of characters between the brackets. One and only one of these characters must be present at the corresponding part of the string for the pattern to match. For example,
/[abcde]/
matches a string containing any one of the first five letters of the lowercase alphabet, while
/[aeiouAEIOU]/
matches any of the five vowels in either lower- or uppercase. If you want to put a right bracket (]) in the list, put a backslash in front of it, or put it as the first character within the list.
Ranges of characters (like a through z) can be abbreviated by showing the end points of the range separated by a dash (-); to get a literal dash in the list, precede the dash with a backslash or place it at the end. Here are some other examples:
[0123456789] # match any single digit
[0-9] # same thing
[0-9\-] # match 0-9, or minus
[a-z0-9] # match any single lowercase letter or digit
[a-zA-Z0-9_] # match any single letter, digit, or underscore
There's also a negated character class, which is the same as a character class, but has a leading up-arrow (or caret: ^) immediately after the left bracket. This character class matches any single character that is not in the list. For example:
[^0-9] # match any single non-digit
[^aeiouAEIOU] # match any single non-vowel
[^\^] # match any single character except an up-arrow
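The classes above can be tried out directly. The following is a minimal sketch, assuming a handful of one-character sample strings of our own choosing (the variable names @samples and %class_hits are hypothetical, not from the text); it records which classes each sample matches.

```perl
use strict;
use warnings;

# Hypothetical one-character samples, chosen only for illustration.
my @samples = ("b", "q", "7", "-", "^");

# For each sample, record which of the classes from the text it matches.
my %class_hits;
for my $s (@samples) {
    my @hits;
    push @hits, "abcde"          if $s =~ /[abcde]/;   # one of a..e
    push @hits, "digit-or-minus" if $s =~ /[0-9\-]/;   # range plus literal dash
    push @hits, "non-digit"      if $s =~ /[^0-9]/;    # negated class
    push @hits, "non-caret"      if $s =~ /[^\^]/;     # anything but an up-arrow
    $class_hits{$s} = \@hits;
    print "'$s' matches: @hits\n";
}
```

Note that the caret sample matches only the non-digit class: inside the brackets, a backslashed caret is just a character to be excluded.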
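For your convenience, some common character classes are predefined, as described in Table 7.1.
Table 7.1: Predefined Character Class Abbreviations
Construct        Equivalent Class  Negated Construct   Equivalent Negated Class
\d (a digit)     [0-9]             \D (a nondigit)     [^0-9]
\w (word char)   [a-zA-Z0-9_]      \W (nonword char)   [^a-zA-Z0-9_]
\s (space char)  [ \r\t\n\f]       \S (nonspace char)  [^ \r\t\n\f]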
The \d pattern matches one "digit." The \w pattern matches one "word character," although what it is really matching is any character that is legal in a Perl variable name. The \s pattern matches one "space" (whitespace), here defined as spaces, carriage returns (not often used in UNIX), tabs, line feeds, and form feeds. The uppercase versions match the complements of these classes. Thus, \W matches one character that can't be in an identifier, \S matches one character that is not whitespace (including letters, punctuation, control characters, and so on), and \D matches any single nondigit character.
These abbreviated classes can be used as part of other character classes as well:
[\da-fA-F] # match one hex digit
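As a rough illustration of the abbreviated classes, the sketch below pulls individual characters out of a made-up line of text. The sample line and the variable names are our own assumptions; the sample is plain ASCII, which is the setting the abbreviations above are described for.

```perl
use strict;
use warnings;

# Hypothetical input line, used only for illustration.
my $line = "id_42 = 0xBEEF\t# ok";

my @digits    = $line =~ /\d/g;           # every single digit
my @non_words = $line =~ /\W/g;           # every character not legal in an identifier
my @hex       = $line =~ /[\da-fA-F]/g;   # \d used inside a larger class

# Count the whitespace characters (spaces and the tab).
my $space_count = () = $line =~ /\s/g;

print "digits: @digits\n";
print "hex digits: @hex\n";
print "non-word characters: ", scalar(@non_words), "\n";
print "whitespace characters: $space_count\n";
```

The d in id_42 counts as a hex digit here, since each class looks at one character at a time and knows nothing about the surrounding word.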
The true power of regular expressions comes into play when you can say "one or more of these" or "up to five of those." Let's talk about how this is done.
The first (and probably least obvious) grouping pattern is sequence. This means that abc matches an a followed by a b followed by a c. Seems simple, but we're giving it a name so we can talk about it later.
We've already seen the asterisk (*) as a grouping pattern. The asterisk indicates zero or more of the immediately previous character (or character class).
Two other grouping patterns that work like this are the plus sign (+), meaning one or more of the immediately previous character, and the question mark (?), meaning zero or one of the immediately previous character. For example, the regular expression /fo+ba?r/ matches an f followed by one or more o's followed by a b, followed by an optional a, followed by an r.
In all three of these grouping patterns, the patterns are
greedy. If such a multiplier has a chance to match between five and ten characters, it'll pick the 10-character string every time. For example,
$_ = "fred xxxxxxxxxx barney";
s/x+/boom/;
always replaces all consecutive x's with boom (resulting in fred boom barney), rather than just one or two x's, even though a shorter set of x's would also match the same regular expression.
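To see the greedy behavior for ourselves, here is a small sketch that applies a greedy x+ substitution and, for contrast, a substitution of a single x. The variable names are hypothetical, and the /fo+ba?r/ probe strings are our own examples rather than anything from the text.

```perl
use strict;
use warnings;

my $text = "fred xxxxxxxxxx barney";

# Copy first, then substitute, so $text itself stays unchanged.
(my $greedy = $text) =~ s/x+/boom/;   # x+ takes all ten x's at once
(my $single = $text) =~ s/x/boom/;    # a lone x takes just the first one

print "$greedy\n";
print "$single\n";

# The + and ? multipliers together: one or more o's, an optional a.
my @fobar_hits = grep { /fo+ba?r/ } ("fobar", "foooobr", "fbar", "fobaar");
print "matched: @fobar_hits\n";
```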
If you need to say "five to ten" x's, you could get away with putting five x's followed by five x's each immediately followed by a question mark (that is, xxxxxx?x?x?x?x?). But this looks ugly. Instead, there's an easier way: the general multiplier. The general multiplier consists of a pair of matching curly braces with one or two numbers inside, as in /x{5,10}/. The immediately preceding character (in this case, the letter "x") must be found within the indicated number of repetitions (five through ten here).
If you leave off the second number, as in /x{5,}/, it means "that many or more" (five or more in this case), and if you leave off the comma, as in /x{5}/, it means "exactly this many" (five x's). To get five or fewer x's, you must put the zero in, as in /x{0,5}/.
So, the regular expression /a.{5}b/ matches the letter a separated from the letter b by any five non-newline characters at any point in the string. (Recall that a period matches any single non-newline character, and we're matching five here.) The five characters do not need to be the same. (We'll learn how to force them to be the same in the next section.)
We could dispense with *, +, and ? entirely, since they are completely equivalent to {0,}, {1,}, and {0,1}. But it's easier to type the equivalent single punctuation character, and more familiar as well.
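The brace multipliers can be checked with a few probe strings. This is only a sketch: the probe strings and variable names are our own, and the ^ and $ characters used here pin the pattern to the whole string so that the counts are tested exactly.

```perl
use strict;
use warnings;

# Which run lengths satisfy "five to ten x's"? The ^ and $ make the pattern
# cover the whole string; unpinned, /x{5,10}/ would also succeed inside a
# longer run of x's.
my %in_range;
for my $n (4, 5, 10, 11) {
    $in_range{$n} = ("x" x $n) =~ /^x{5,10}$/ ? 1 : 0;
}

# The punctuation multipliers next to their brace equivalents.
my @probes = ("b", "ab", "aab");
my $star       = join "", map { /^a*b$/     ? 1 : 0 } @probes;
my $star_brace = join "", map { /^a{0,}b$/  ? 1 : 0 } @probes;
my $plus       = join "", map { /^a+b$/     ? 1 : 0 } @probes;
my $plus_brace = join "", map { /^a{1,}b$/  ? 1 : 0 } @probes;
my $opt        = join "", map { /^a?b$/     ? 1 : 0 } @probes;
my $opt_brace  = join "", map { /^a{0,1}b$/ ? 1 : 0 } @probes;

# Exactly five arbitrary characters between a and b.
my $five_between = "a12345b" =~ /a.{5}b/ ? 1 : 0;
my $four_between = "a1234b"  =~ /a.{5}b/ ? 1 : 0;

print "star: $star / $star_brace\n";
print "plus: $plus / $plus_brace\n";
print "optional: $opt / $opt_brace\n";
```

Each pair of result strings comes out identical, which is what the equivalence between the two notations predicts.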
If two multipliers occur in a single expression, the
greedy rule is augmented with "leftmost is greediest." For example:
$_ = "a xxx c xxxxxxxx c xxx d";
/a.*c.*d/;
In this case, the first ".*" in the regular expression matches all characters up to the second c, even though matching only the characters up to the first c would still allow the entire regular expression to match. Right now, this doesn't make any difference (the pattern would match either way), but later when we can look at parts of the regular expression that matched, it'll matter quite a bit.
We can force any multiplier to be nongreedy (or lazy) by following it with a question mark:
$_ = "a xxx c xxxxxxxx c xxx d";
/a.*?c.*d/;
Here, the a.*?c now matches the fewest characters between the a and c, not the most characters. This means the leftmost c is matched, not the rightmost. You can put such a question-mark modifier after any of the multipliers (?, +, *, and {m,n}).
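The difference between the greedy and lazy forms becomes visible once we can see what the first multiplier actually matched. The sketch below wraps that multiplier in parentheses (which remember the matched text) purely as a way to inspect it; the patterns a(.*)c.*d and a(.*?)c.*d and the variable names are our own illustration.

```perl
use strict;
use warnings;

my $text = "a xxx c xxxxxxxx c xxx d";

# Parentheses are added only to expose what the first multiplier matched.
my ($greedy_first) = $text =~ /a(.*)c.*d/;    # runs up to the second c
my ($lazy_first)   = $text =~ /a(.*?)c.*d/;   # stops at the first c

print "greedy: [$greedy_first]\n";
print "lazy:   [$lazy_first]\n";
```

Both patterns match the string as a whole; only the division of labor between the two multipliers changes.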
What if the string and regular expression were slightly altered, say, to:
$_ = "a xxx ce xxxxxxxx ci xxx d";
/a.*ce.*d/;
In this case, if the .* matches the most characters possible before the next c, the next regular expression character (e) doesn't match the next character of the string (i). In this case, we get automatic backtracking: the multiplier is unwound and retried, stopping at someplace earlier (in this case, at the earlier c, next to the e). A complex regular expression may involve many such levels of backtracking, leading to long execution times. In this case, making that match lazy (with a trailing "?") will actually simplify the work that Perl has to perform, so you may want to consider that.
Another grouping operator is a pair of open and close parentheses around any part of a pattern. This doesn't change whether the pattern matches, but instead causes the part of the string matched by the pattern to be remembered, so that it may be referenced later. So for example, (a) still matches an a, and ([a-z]) still matches any single lowercase letter.
To recall a memorized part of a string, you must include a backslash followed by an integer. This pattern construct represents the same sequence of characters matched earlier in the same-numbered pair of parentheses (counting from one). For example,
/fred(.)barney\1/;
matches a string consisting of fred, followed by any single non-newline character, followed by barney, followed by that same single character. So, it matches fredxbarneyx, but not fredxbarneyy. Compare that with
/fred.barney./;
in which the two unspecified characters can be the same, or different; it doesn't matter.
Where did the 1 come from? It means the first parenthesized part of the regular expression. If there's more than one, the second part (counting the left parentheses from left to right) is referenced as \2, the third as \3, and so on. For example,
/a(.)b(.)c\2d\1/;
matches an a, a character (call it #1), a b, another character (call it #2), a c, the character #2, a d, and the character #1. So it matches axbycydx, for example.
The referenced part can be more than a single character. For example,
/a(.*)b\1c/;
matches an a, followed by any number of characters (even zero) followed by b, followed by that same sequence of characters followed by c. So, it would match aFREDbFREDc, or even abc, but not aXXbXXXc.
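Backreferences are easy to experiment with. The following sketch, under the assumption of a few sample patterns and strings of our own (fred/barney-style names are used only for illustration), pairs each pattern with a candidate string and reports whether it matches.

```perl
use strict;
use warnings;

# Each entry pairs a pattern containing a backreference with a candidate string.
my @cases = (
    [ qr/fred(.)barney\1/, "fredxbarneyx" ],   # same character twice: match
    [ qr/fred(.)barney\1/, "fredxbarneyy" ],   # different characters: no match
    [ qr/a(.)b(.)c\2d\1/,  "axbycydx"     ],   # \2 is y, \1 is x
    [ qr/a(.*)b\1c/,       "aFREDbFREDc"  ],   # multi-character memory
    [ qr/a(.*)b\1c/,       "abc"          ],   # the memory may be empty
    [ qr/a(.*)b\1c/,       "aXXbXXXc"     ],   # XX is not XXX: no match
);

my @results;
for my $case (@cases) {
    my ($re, $str) = @$case;
    my $ok = $str =~ $re ? 1 : 0;
    push @results, $ok;
    print "$str: ", ($ok ? "matches" : "does not match"), "\n";
}
```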
Another grouping construct is alternation, as in a|b|c. This means to match exactly one of the alternatives (a or b or c in this case). This works even if the alternatives have multiple characters, as in /song|blue/, which matches either song or blue. (For single character alternatives, you're definitely better off with a character class like /[abc]/.)
What if we wanted to match songbird or bluebird? We could write /songbird|bluebird/, but that bird part shouldn't have to be in there twice. In fact, there's a way out, but we have to talk about the precedence of grouping patterns, which is covered in Section 7.3.4, "Precedence."
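A quick sketch of alternation in action follows. The word list and variable names are hypothetical, picked only to show which strings each set of alternatives accepts.

```perl
use strict;
use warnings;

my @words = qw(songbird bluebird blackbird song);

# Multi-character alternatives: either word may appear anywhere in the string.
my @either_word = grep { /song|blue/ } @words;
# Spelling out both full words, with "bird" repeated in each alternative.
my @spelled_out = grep { /songbird|bluebird/ } @words;

# Single-character alternatives and the equivalent character class agree.
my @letters  = qw(a x c);
my @by_alt   = grep { /a|b|c/ } @letters;
my @by_class = grep { /[abc]/ } @letters;

print "song|blue: @either_word\n";
print "songbird|bluebird: @spelled_out\n";
print "a|b|c: @by_alt, [abc]: @by_class\n";
```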
Several special notations
anchor a pattern. Normally, when a pattern is matched against the string, the beginning of the pattern is dragged through the string from left to right, matching at the first possible opportunity. Anchors allow you to ensure that parts of the pattern line up with particular parts of the string.
The first pair of anchors require that a particular part of the match be located either at a word boundary or not at a word boundary. The \b anchor requires a word boundary at the indicated point for the pattern to match. A word boundary is the place between characters that match \w and \W, or between characters matching \w and the beginning or ending of the string. Note that this has little to do with English words and a lot more to do with C symbols, but that's as close as we get. For example:
/fred\b/; # matches fred, but not frederick
/\bmo/; # matches moe and mole, but not Elmo
/\bFred\b/; # matches Fred but not Frederick or alFred
/\b\+\b/; # matches "x+y" but not "++" or " + "
/abc\bdef/; # never matches (impossible for a boundary there)
Likewise, \B requires that there not be a word boundary at the indicated point. For example:
/\bFred\B/; # matches "Frederick" but not "Fred Flintstone"
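The boundary examples above can be run against a small list of names. In this sketch the list @names and the result variables are our own; the patterns are the ones from the examples.

```perl
use strict;
use warnings;

my @names = ("fred", "frederick", "Fred Flintstone", "Frederick", "alFred");

my @fred_at_end = grep { /fred\b/ }   @names;   # boundary right after fred
my @whole_fred  = grep { /\bFred\b/ } @names;   # Fred as a complete word
my @fred_inside = grep { /\bFred\B/ } @names;   # Fred starting a longer word

# A plus sign with a word character on each side.
my @plus_hits = grep { /\b\+\b/ } ("x+y", "++", " + ");

print "fred\\b: @fred_at_end\n";
print "\\bFred\\b: @whole_fred\n";
print "\\bFred\\B: @fred_inside\n";
print "\\b\\+\\b: @plus_hits\n";
```

Matching is case-sensitive here, which is why the lowercase fred pattern ignores the capitalized names.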
Two more anchors require that a particular part of the pattern be next to an end of the string. The caret (^) matches the beginning of the string if it is in a place that makes sense to match the beginning of the string. For example, ^a matches an a if, and only if, the a is the first character of the string. A caret placed later in the pattern, as in a^, is still treated as an anchor by Perl 5 rather than as a literal character, so a^ cannot match in an ordinary single-line match. If you need the caret to be a literal caret, at the beginning or anywhere else, put a backslash in front of it.
The dollar sign ($), like the ^, anchors the pattern, but to the end of the string, not the beginning. In other words, c$ matches a c only if it occurs at the end of the string (or just before a newline that ends the string). A dollar sign anywhere else in the pattern is probably going to be interpreted as the start of a scalar variable to interpolate, so you'll most likely need to backslash it to match a literal dollar sign in the string.
Other anchors are supported, including \A, \Z, and lookahead anchors created via (?=...) and (?!...). These are described fully in the perlre manpage.
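Here is a brief sketch of the two string-end anchors and of backslashing the anchor characters to match them literally. The sample strings and variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my @starts_with_a = grep { /^a/ } ("apple", "banana");   # a must come first
my @ends_with_c   = grep { /c$/ } ("zinc", "cat");       # c must come last

# $ also matches just before a newline that ends the string.
my $before_newline = "zinc\n" =~ /c$/ ? 1 : 0;

# Backslashed, the anchor characters stand for themselves.
my @literal_caret  = grep { /\^/ }  ("2^8", "28");
my @literal_dollar = grep { /\$5/ } ('costs $5', 'costs 5');

print "^a: @starts_with_a\n";
print "c\$: @ends_with_c\n";
print "literal caret: @literal_caret\n";
print "literal dollar: @literal_dollar\n";
```

The dollar-sign sample is written in single quotes so that Perl does not try to interpolate a variable into the string itself.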
So what happens when we get a|b* together? Is this a or b any number of times, or is it either a single a or any number of b's?
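Well, just as operators have precedence, the grouping and anchoring patterns also have precedence. The precedence of patterns from highest to lowest is given in the precedence table for this section: parentheses, then multipliers, then sequence and anchoring, then alternation.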
According to the table, * has a higher precedence than |. So /a|b*/ is interpreted as a single a, or any number of b's.
What if we want the other meaning, as in "any number of a's or b's"? We simply throw in a pair of parentheses. In this case, enclose the part of the expression that the * operator should apply to inside parentheses, and we've got it, as (a|b)*. If you want to clarify the first expression, you can redundantly parenthesize it with a|(b*).
When you use parentheses to affect precedence they also trigger the memory, as shown earlier in this chapter. That is, this set of parentheses counts when you are figuring out whether something is \2, \3, or whatever. If you want to use parentheses without triggering memory, use the form (?:...) instead of (...). This still allows for multipliers, but doesn't throw off your counting by using up \4 or whatever. For example, /(?:Fred|Wilma) Flintstone/ does not store anything for \1; it's just there for grouping.
Here are some other examples of regular expressions and the effect of parentheses:
abc* # matches ab, abc, abcc, abccc, abcccc, and so on
(abc)* # matches "", abc, abcabc, abcabcabc, and so on
^x|y # matches x at the beginning of line, or y anywhere
^(x|y) # matches either x or y at the beginning of a line
a|bc|d # a, or bc, or d
(a|b)(c|d) # ac, ad, bc, or bd
(song|blue)bird # songbird or bluebird
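The effect of precedence can be confirmed with a few test strings. This is a sketch under our own assumptions: the candidate words, the anchored variants of the patterns, and the variable names are illustrative and not part of the examples above.

```perl
use strict;
use warnings;

my @candidates = qw(songbird bluebird song bird);

# Alternation binds loosest, so this reads as (^song) or (bluebird$).
my @loose = grep { /^song|bluebird$/ } @candidates;
# Parentheses limit the alternation to song|blue.
my @tight = grep { /^(song|blue)bird$/ } @candidates;

# A multiplier applies to the single item before it unless parentheses group more.
my $star_on_c     = "abcabc" =~ /^abc*$/   ? 1 : 0;   # only c repeats: no match
my $star_on_group = "abcabc" =~ /^(abc)*$/ ? 1 : 0;   # abc repeats: match

# (?:...) groups without remembering, so the only memory is the surname.
my ($first_memory) = "Wilma Flintstone" =~ /(?:Fred|Wilma) (Flintstone)/;

print "loose: @loose\n";
print "tight: @tight\n";
print "first memory: $first_memory\n";
```

The loose pattern accepts song on its own because the end-of-string anchor belongs only to the bluebird alternative.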