4.6. Regular Expressions
Regular expressions are used in several
ways in Perl. They're used in conditionals to
determine whether a string matches a particular pattern.
They're also used to find patterns in strings and
replace the match with something else.
The
ordinary pattern match operator looks like
/pattern/.
It matches against the $_ variable by default. If
the pattern is found in the string, the operator returns true
(1); if there is no match, a false value
("") is returned.
The substitution operator looks like
s/pattern/replace/.
This operator also searches $_ by default. If it finds
the specified pattern, the matched text is replaced with
the string in replace. If
pattern is not matched, the string is left unchanged.
You may specify a variable other than $_ with the
=~ binding operator (or the negated
!~ binding operator, which returns true if the
pattern is not matched). For example:
$text =~ /sampo/;
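As a minimal sketch of the forms described above (the sample strings and variable names are illustrative assumptions, not part of the original text), the following matches and substitutes against $_ and then binds a pattern to another variable:

```perl
use strict;
use warnings;

# Match and substitute against $_, the default target.
local $_ = "the sampo was forged by Ilmarinen";
my $found_default = /sampo/ ? 1 : 0;    # pattern is present in $_
s/forged/hammered/;                     # modifies $_ in place
my $default_after = $_;

# Bind a pattern to another variable with =~ and !~.
my $text = "a mill called the sampo";
my $found_bound = ($text =~ /sampo/)   ? 1 : 0;   # pattern is present
my $missing     = ($text !~ /kantele/) ? 1 : 0;   # true because it is absent

print "$default_after\n";
```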
4.6.1. Pattern-Matching Operators
The following list defines
Perl's pattern-matching operators. Some of the
operators have alternative
"quoting" schemes and have a set of
modifiers that can be placed directly after the operators to affect
the match operation in some way.
- m/pattern/gimosx
-
Searches a string for a pattern match. Modifiers
are:
Modifier | Meaning
g | Match globally, i.e., find all occurrences.
i | Do case-insensitive pattern matching.
m | Treat string as multiple lines.
o | Compile pattern only once.
s | Treat string as single line.
x | Use extended regular expressions.
If / is the delimiter, then the initial
m is optional. With m, you can
use any pair of non-alphanumeric, non-whitespace characters as
delimiters.
- ?pattern?
-
This operator is just like the
m/pattern/
search, except it matches only once between calls to the
reset operator.
- qr/pattern/imosx
-
Creates a precompiled regular
expression from pattern, which can be
passed around in variables and interpolated into other regular
expressions. The modifiers are the same as those for
m// above, except that g does not apply.
- s/pattern/replacement/egimosx
-
Searches a string for
pattern and replaces any match with the
replacement text. Returns the number of
substitutions made, which can be more than one with the
/g modifier. Otherwise, it returns false
(0). If no string is specified via the
=~ or !~ operator, the
$_ variable is searched and modified. Modifiers
are:
Modifier | Meaning
e | Evaluate the right side as an expression.
g | Replace globally, i.e., all occurrences.
cg | Continue search after g failed. No longer supported for s/// as of Perl 5.8.
i | Do case-insensitive pattern matching.
m | Treat string as multiple lines.
o | Compile pattern only once.
s | Treat string as single line.
x | Use extended regular expressions.
Any non-alphanumeric, non-whitespace delimiter may replace the
slashes. If single quotes are used, no interpretation is done on the
replacement string (the /e modifier overrides
this, however).
- tr/pattern1/pattern2/cds
- y/pattern1/pattern2/cds
-
This operator scans a string character
by character and replaces all occurrences of the characters found in
pattern1 with the corresponding character
in pattern2. It returns the number of
characters replaced or deleted. If no string is specified via the
=~ or !~ operator, the
$_ string is translated. Modifiers are:
Modifier | Meaning
c | Complement pattern1.
d | Delete found but unreplaced characters.
s | Squash duplicate replaced characters.
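The operators in the list above can be sketched together as follows. This is only an illustration under assumed inputs; the strings and variable names are hypothetical.

```perl
use strict;
use warnings;

my $log = "Error: disk full; error: retry; ERROR: abort";

# m// with /g and /i in list context returns every match.
my @hits = $log =~ m/error/gi;

# With m, a different delimiter pair avoids escaping slashes.
my $path     = "/usr/local/bin";
my $is_local = ($path =~ m{^/usr/local}) ? 1 : 0;

# qr// builds a compiled pattern that can be interpolated later.
my $word = qr/[a-z]+/i;
my ($first, $second) = "hello world" =~ /($word)\s+($word)/;

# s///g returns the number of substitutions; /e evaluates the
# replacement as a Perl expression.
my $prices = "3 apples, 4 pears";
my $count  = ($prices =~ s/(\d+)/$1 * 2/ge);

# tr/// maps characters one for one and returns how many it handled.
my $dna = "gattaca";
my $n   = ($dna =~ tr/a-z/A-Z/);

# tr with /c and /d deletes every character NOT in the first list.
(my $digits = "555-867-5309") =~ tr/0-9//cd;

print "$prices / $dna / $digits\n";
```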
4.6.2. Regular Expression Syntax
The simplest kind of
regular expression is a literal string. More complicated patterns
involve the use of metacharacters to describe
all the different choices and variations that you want to build into
a pattern. Metacharacters don't match themselves,
but describe something else. The metacharacters are:
Metacharacter | Meaning
\ | Escapes the character(s) immediately following it
. | Matches any single character except a newline (unless /s is used)
^ | Matches at the beginning of the string (or line, if /m is used)
$ | Matches at the end of the string (or line, if /m is used)
* | Matches the preceding element 0 or more times
+ | Matches the preceding element 1 or more times
? | Matches the preceding element 0 or 1 times
{...} | Specifies a range of occurrences for the element preceding it
[...] | Matches any one of the class of characters contained within the brackets
(...) | Groups regular expressions
| | Matches either the expression preceding or following it
The . (single
dot) is a wildcard character. When used in a regular expression, it
can match any single character except the newline
character (\n). With the /s modifier on the pattern match
operator, the dot matches a newline as well. This
modifier treats the string to be matched against as a single
"long" string with embedded
newlines.
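A short sketch of the difference, using an assumed two-line string:

```perl
use strict;
use warnings;

my $two_lines = "first\nsecond";

# Without /s, the dot refuses to match the embedded newline.
my $plain  = ($two_lines =~ /first.second/)  ? 1 : 0;

# With /s, the dot matches "\n" like any other character.
my $single = ($two_lines =~ /first.second/s) ? 1 : 0;

print "plain=$plain single=$single\n";
```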
The
^ and $ metacharacters are used
as anchors in a regular expression. The ^ matches
at the beginning of the string. When the /m (multiline)
modifier is used, it matches at the beginning of the
string and after every newline (except the last, if there is one).
Outside a character class, ^ is always an anchor, so it must be
escaped (\^) to match a literal caret. As the first character in a
bracketed character class, it negates the class.
Similarly, $ matches at the end of the string (or just
before a newline at the end of the string). When /m is used,
it matches just
before every newline and at the end of the string. You need to escape
$ to match a literal dollar sign in all cases,
because if $ isn't at the end of
a pattern (or placed right before a ) or
|), Perl will attempt to do variable
interpolation. The same holds true for the @
sign, which Perl will interpret as the start of an array variable
unless it is backslashed.
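The anchoring rules can be sketched as follows. The sample text is an assumption chosen so that each line starts and ends with a different letter.

```perl
use strict;
use warnings;

my $text = "one\ntwo\nthree\n";

# Without /m, ^ matches only at the very start of the string.
my @starts   = $text =~ /^(\w)/g;

# With /m, ^ and $ also match at each embedded line boundary.
my @starts_m = $text =~ /^(\w)/mg;
my @ends_m   = $text =~ /(\w)$/mg;

# Literal $ and @ must be backslashed inside a pattern.
my $note       = 'cost: $5 to user@example.com';
my $has_dollar = ($note =~ /\$\d/)           ? 1 : 0;
my $has_at     = ($note =~ /user\@example/) ? 1 : 0;

print "@starts | @starts_m | @ends_m\n";
```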
The *,
+, and ? metacharacters are
called quantifiers. They specify the number of
times to match something. They act on the element immediately
preceding them, which could be a single character (including the
.), a grouped expression in parentheses, or a
character class. The {...} construct is a
generalized quantifier. You can put two numbers separated by a comma
within the braces to specify minimum and maximum numbers that the
preceding element can match.
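For instance, a brace quantifier might be used like this (a sketch with made-up test strings):

```perl
use strict;
use warnings;

# Between three and five digits, and nothing else.
my $re = qr/^\d{3,5}$/;
my @ok = grep { /$re/ } qw(12 123 12345 123456);

# A single number means "exactly"; an open upper bound means "at least".
my $exact    = ("aaaa" =~ /^a{4}$/)  ? 1 : 0;
my $at_least = ("aa"   =~ /^a{3,}$/) ? 1 : 0;

print "@ok\n";
```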
Parentheses are used
to group characters or expressions. They also have the side effect of
remembering what they matched so you can recall and reuse patterns
with a special group of variables.
The
| is the alternation operator in regular
expressions. It matches either what's on its left
side or right side, and it applies to whole expressions, not just
single characters. For
example:
/you|me|him|her/
looks for any of the four words. You should use parentheses to
provide boundaries for alternation:
/And(y|rew)/
This will match either "Andy" or
"Andrew".
4.6.4. Character Classes
The [...] construct
is used to list a set of characters (a character
class) of which one will match.
Brackets are often used when capitalization is uncertain in a match:
/[tT]here/
A dash (-) may be
used to indicate a range of characters in a character class:
/[a-zA-Z]/; # Match any single letter
/[0-9]/; # Match any single digit
To put a literal dash in the list, you must either use a backslash
before it (\-) or place it first or last in the class.
By placing a ^ as
the first element in the brackets, you create a negated character
class, i.e., it matches any character not in the list. For example:
/[^A-Z]/; # Matches any character other than an uppercase letter
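A brief sketch of these character class forms, using assumed sample strings:

```perl
use strict;
use warnings;

# A two-character class handles uncertain capitalization.
my @there = grep { /^[tT]here$/ } qw(there There THERE);

# Ranges can be combined within one class.
my $is_hex = ("c0ffee" =~ /^[0-9a-fA-F]+$/) ? 1 : 0;

# A backslashed dash is a literal character, not a range.
my $dash = ("a-b" =~ /^[a\-b]+$/) ? 1 : 0;

# A negated class is handy for stripping unwanted characters.
(my $letters = "Hello, World!") =~ s/[^A-Za-z]//g;

print "@there $letters\n";
```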
Some common character classes have their own predefined escape
sequences for your programming
convenience:
Code | Matches
\d | A digit, same as [0-9]
\D | A nondigit, same as [^0-9]
\w | A word character (alphanumeric), same as [a-zA-Z_0-9]
\W | A non-word character, same as [^a-zA-Z_0-9]
\s | A whitespace character, same as [ \t\n\r\f]
\S | A non-whitespace character, same as [^ \t\n\r\f]
\C | Match a character (byte)
\pP | Match P-named (Unicode) property
\PP | Match non-P
\X | Match extended Unicode sequence
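The most common of these escapes might be used as in the following sketch. The input line is an assumption, and it contains only ASCII characters, so the bracketed equivalents in the table apply.

```perl
use strict;
use warnings;

my $line = "order 66 shipped\tto bay_7";

my @numbers = $line =~ /(\d+)/g;     # runs of digits
my @words   = $line =~ /(\w+)/g;     # runs of word characters
my @fields  = split /\s+/, $line;    # split on any whitespace

# An escape can be one element of a larger bracketed class.
my $is_id = ("id_42-x" =~ /^[\w-]+$/) ? 1 : 0;

print scalar(@words), " words, ", scalar(@fields), " fields\n";
```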
Perl implements lc() and uc(), which you can use to change
the case of words or characters, but you can do the same inside an
interpolated string or pattern with escape
sequences:
Code | Effect
\l | Lowercase the next character
\u | Uppercase the next character
\L | Lowercase until \E
\U | Uppercase until \E
\Q | Disable pattern metacharacters until \E
\E | End case modification or \Q quoting
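As a sketch of how these escapes behave in interpolated strings, replacement text, and patterns (all names and strings here are illustrative assumptions):

```perl
use strict;
use warnings;

my $name = "pERL";

my $title       = "\u\L$name\E";     # lowercase all, then capitalize first
my $shout       = "\Ustop\E now";    # uppercase only up to \E
my $lower_first = "\lABC";           # lowercase a single character

# \u in a replacement capitalizes each captured word.
(my $caps = "hello big world") =~ s/(\w+)/\u$1/g;

# \Q...\E makes an interpolated string match literally.
my $literal  = "a.b";
my $quoted   = ("axb" =~ /^\Q$literal\E$/) ? 1 : 0;   # dot is literal
my $unquoted = ("axb" =~ /^$literal$/)     ? 1 : 0;   # dot is a wildcard

print "$title, $shout, $lower_first, $caps\n";
```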
The character class escapes (\d, \w, \s, and their negations) each
match a single character in (or not in) their class. A
\w matches only one character of a word. Using a
quantifier, you can match a whole word, for example, with
\w+. The abbreviated classes may also be used
within brackets as elements of other character classes.
4.6.6. Quantifiers
Quantifiers
are used to specify the number of instances of the previous element
that can match. For instance, you could say "match
any number of a's, including none"
(a*), or "match between 5 and 10
instances of the word 'owie'
((owie){5,10})".
Quantifiers,
by nature, are greedy. That is, the Perl regular expression
"engine" looks
for the biggest match possible (the farthest to the right) unless you
tell it not to. Say you are searching a string that reads:
a whatever foo, b whatever foo
and you want to find a and foo
with something in between. You might use:
/a.*foo/
A . followed by a * looks for
any character, any number of times, until foo is
found. But since Perl will look as far to the right as possible to
find foo, the first instance of
foo is swallowed up by the greedy
.* expression.
To avoid this,
every quantifier has a notation
that allows for minimal (nongreedy) matching. This
notation uses a question mark immediately following the quantifier to
make Perl stop at the earliest point where the rest of the pattern
can match (the shortest match from the starting position). The
following table lists the regular expression quantifiers
and their nongreedy forms:
Maximal | Minimal | Allowed range
{n,m} | {n,m}? | Must occur at least n times but no more than m times
{n,} | {n,}? | Must occur at least n times
{n} | {n}? | Must match exactly n times
* | *? | 0 or more times (same as {0,})
+ | +? | 1 or more times (same as {1,})
? | ?? | 0 or 1 time (same as {0,1})
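Applied to the sample string from the discussion above, the greedy and nongreedy forms can be compared in a short sketch (the variable names are illustrative):

```perl
use strict;
use warnings;

my $str = "a whatever foo, b whatever foo";

# Greedy: .* runs to the last "foo" in the string.
my ($greedy)  = $str =~ /(a.*foo)/;

# Minimal: .*? stops at the first "foo" it can reach.
my ($minimal) = $str =~ /(a.*?foo)/;

# The same applies to brace quantifiers.
my ($many) = "aaaaa" =~ /(a{2,4})/;
my ($few)  = "aaaaa" =~ /(a{2,4}?)/;

print "$greedy\n$minimal\n";
```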
4.6.7. Pattern Match Variables
Parentheses not only group elements
in a regular expression, they also remember the patterns they match.
Every match from a parenthesized element is saved to a special,
read-only variable indicated by a number. You can recall and reuse a
match by using these variables.
Within a pattern, each
parenthesized element saves its match to a numbered variable, in
order starting with 1. You can recall these
matches within the expression by using \1,
\2, and so on.
Outside of the matching pattern, the matched variables are recalled
with the usual dollar sign, i.e., $1,
$2, etc. The dollar sign notation should be used
in the replacement expression of a substitution and anywhere else you
might want to use the variables in your program. For example, to
implement "i before e, except after
c":
s/([^c])ei/$1ie/g;
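A short sketch of both notations follows. The date, the sentence, and the misspelled word are assumed sample data.

```perl
use strict;
use warnings;

# $1, $2, ... are available after a successful match.
my ($y, $m, $d);
if ("2002-06-15" =~ /^(\d{4})-(\d\d)-(\d\d)$/) {
    ($y, $m, $d) = ($1, $2, $3);
}

# \1 refers back to the first group inside the same pattern,
# here to find a doubled word.
my $doubled = ("this is is a test" =~ /\b(\w+) \1\b/) ? $1 : "";

# The substitution above, applied to two words.
(my $fixed = "freind")  =~ s/([^c])ei/$1ie/g;
(my $kept  = "receive") =~ s/([^c])ei/$1ie/g;

print "$y/$m/$d $doubled $fixed $kept\n";
```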
The backreferencing variables are:
- $+
-
Returns the last parenthesized pattern match
- $&
-
Returns the entire matched string
- $`
-
Returns everything before the matched string
- $'
-
Returns everything after the matched string
Using $&, $`, or $' anywhere in a program will slow it down
noticeably for all regular expressions.
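As a sketch of what each of these variables holds after a successful match (the sample string is an assumption, and the speed caveat above applies to any program that uses them):

```perl
use strict;
use warnings;

my $s = "hello there, neighbor";
my ($pre, $match, $post, $last);

if ($s =~ /the(r)e/) {
    $pre   = $`;    # text before the match
    $match = $&;    # the matched text itself
    $post  = $';    # text after the match
    $last  = $+;    # last parenthesized group that matched
}

print "[$pre][$match][$post][$last]\n";
```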
4.6.8. Extended Regular Expressions
Perl defines an
extended syntax for regular expressions. The syntax is a pair of
parentheses with a question mark as the first thing within the
parentheses. The character after the question mark gives the function
of the extension. The extensions are:
- (?#text)
-
A comment. The text is ignored.
- (?:...)
- (?imsx-imsx:...)
-
This groups things like (...) but
doesn't make backreferences.
- (?=...)
-
A zero-width positive lookahead assertion. For example,
/\w+(?=\t)/ matches a word followed by a tab,
without including the tab in $&.
- (?!...)
-
A zero-width negative lookahead assertion. For example,
/foo(?!bar)/ matches any occurrence of
foo that isn't followed by
bar.
- (?<=...)
-
A zero-width positive lookbehind assertion. For example,
/(?<=bad)boy/ matches the word
boy that follows bad, without
including bad in $&. This
works only for fixed-width lookbehind.
- (?{code})
-
An experimental regular expression feature to evaluate any embedded
Perl code. This evaluation always succeeds, and
code is not interpolated.
- (?<!...)
-
A zero-width negative lookbehind assertion. For example,
/(?<!bad)boy/ matches any occurrence of
boy that doesn't follow
bad. This works only for fixed-width lookbehind.
- (?>...)
-
Matches the substring that the standalone pattern would match if
anchored at the given position.
- (?(condition)yes-pattern|no-pattern)
- (?(condition)yes-pattern)
-
Matches a pattern determined by a condition.
condition should be either an integer,
which is true if the pair of parentheses corresponding to the integer
has matched, or a lookahead, lookbehind, or evaluate zero-width
assertion. no-pattern will be used to
match if the condition was not met, but it is optional.
- (?imsx-imsx)
-
One or more embedded pattern-match modifiers. Modifiers are switched
off if they follow a - (dash). The modifiers are
defined as follows:
Modifier | Meaning
i | Do case-insensitive pattern matching.
m | Treat string as multiple lines.
s | Treat string as single line.
x | Use extended regular expressions.
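Several of the extensions listed above can be sketched together. The strings and variable names below are illustrative assumptions, and the experimental (?{code}) construct is left out.

```perl
use strict;
use warnings;

# Lookahead: the tab must follow, but is not part of the match.
my ($key) = "key\tvalue" =~ /(\w+(?=\t))/;

# Negative lookahead: "foo" not followed by "bar".
my @not_bar = "foobar foobaz" =~ /(foo(?!bar)\w*)/g;

# Lookbehind and negative lookbehind.
(my $after_bad = "badboy goodboy") =~ s/(?<=bad)boy/girl/;
(my $not_bad   = "badboy goodboy") =~ s/(?<!bad)boy/girl/;

# (?:...) groups without capturing, so $1 is the surname.
my ($surname) = "Andrew Smith" =~ /And(?:y|rew) (\w+)/;

# Embedded modifiers: whole pattern, or one group only.
my $ci_all   = ("HELLO"  =~ /(?i)hello/)    ? 1 : 0;
my $ci_group = ("ABCDEF" =~ /(?i:abc)def/) ? 1 : 0;

# Conditional: require ")" only if "(" was matched as group 1.
my @balanced = grep { /^(\()?\d+(?(1)\))$/ } ('(42)', '42', '(42', '42)');

print "$key @not_bar $after_bad $not_bad $surname @balanced\n";
```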
Copyright © 2002 O'Reilly & Associates. All rights reserved.