Regular expressions are used in several ways in Perl. They're used in conditionals
to determine whether a string matches a particular pattern. They're also
used to find patterns in strings and replace the match with something else.
The ordinary pattern match operator looks like /pattern/. It matches against
the $_ variable by default. If the pattern is found in the string, the
operator returns true ("1"); if there is no match, a false value ("") is
returned.
The substitution operator looks like s/pattern/replace/. This operator
searches $_ by default. If it finds the specified pattern, it is replaced
with the string in replace. If pattern is not matched, nothing happens.
You may specify a variable other than $_ with the =~ binding operator (or
the negated !~ binding operator, which returns true if the pattern is not
matched). For example:
$text =~ /sampo/;
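The default and bound forms described above can be sketched as follows. The strings and variable names here are illustrative only and do not come from the original text.

```perl
use strict;
use warnings;

# Matching against $_ by default.
local $_ = "the sampo was forged by Ilmarinen";
my $found_default = /sampo/ ? 1 : 0;               # $_ contains "sampo"

# Matching against a named variable with =~ and !~.
my $text        = "Louhi hid the sampo";
my $found_bound = ($text =~ /sampo/)   ? 1 : 0;    # pattern is present
my $not_found   = ($text !~ /kantele/) ? 1 : 0;    # true because pattern is absent

# Substitution bound to a copy of the variable; the copy is modified in place.
(my $changed = $text) =~ s/sampo/mill/;

print "$found_default $found_bound $not_found\n";
print "$changed\n";
```

Copying into $changed before substituting leaves $text untouched, which is a common idiom when the original string is still needed.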
The following list defines Perl's pattern-matching operators. Some of the
operators have alternative "quoting" schemes and have a set of modifiers
that can be placed directly after the operators to affect the match
operation in some way.
- m/pattern/gimosx
  Searches a string for a pattern match.
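  Modifiers are:
    g  Match globally, i.e., find all occurrences.
    i  Do case-insensitive pattern matching.
    m  Treat the string as multiple lines.
    o  Compile the pattern only once.
    s  Treat the string as a single line.
    x  Use extended regular expressions.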
  If / is the delimiter, then the initial m is optional. With the m, you can
  use any pair of non-alphanumeric, non-whitespace characters as delimiters.
- ?pattern?
  This operator is just like the m/pattern/ search, except it matches only
  once between calls to the reset function.
- qr/pattern/imosx
  Creates a precompiled regular expression from pattern, which can be passed
  around in variables and interpolated into other regular expressions. The
  modifiers are the same as those for m// above, except that /g is not
  available.
- s/pattern/replacement/egimosx
  Searches a string for pattern, and replaces any match with the replacement
  text. Returns the number of substitutions made, which can be more than one
  with the /g modifier. If nothing is replaced, it returns false (0). If no
  string is specified via the =~ or !~ operator, the $_ variable is searched
  and modified.
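  Modifiers are:
    e  Evaluate the replacement as an expression.
    g  Replace globally, i.e., all occurrences.
    i  Do case-insensitive pattern matching.
    m  Treat the string as multiple lines.
    o  Compile the pattern only once.
    s  Treat the string as a single line.
    x  Use extended regular expressions.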
  Any non-alphanumeric, non-whitespace delimiter may replace the slashes.
  If single quotes are used, no interpretation is done on the replacement
  string (the /e modifier overrides this, however).
- tr/pattern1/pattern2/cds
- y/pattern1/pattern2/cds
  This operator scans a string, character by character, and replaces all
  occurrences of the characters found in pattern1 with the corresponding
  character in pattern2. It returns the number of characters replaced or
  deleted. If no string is specified via the =~ or !~ operator, the $_
  string is translated.
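  Modifiers are:
    c  Complement pattern1.
    d  Delete found but unreplaced characters.
    s  Squash duplicate replaced characters.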
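The operators in the list above can be tried out with a short script such as the following. This is only a sketch; the sample strings and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# m with alternative delimiters: handy when the pattern contains slashes.
my $path     = "/usr/local/bin/perl";
my $is_local = ($path =~ m{^/usr/local/}) ? 1 : 0;

# qr// builds a reusable pattern that can be interpolated into others.
my $word   = qr/perl/i;
my $at_end = ("I like Perl" =~ /$word\z/) ? 1 : 0;

# s///g returns the number of substitutions it made.
my $line  = "a-b-c-d";
my $count = ($line =~ s/-/+/g);          # three dashes are replaced

# tr/// maps characters one for one and returns how many it translated.
my $shout   = "hello world";
my $letters = ($shout =~ tr/a-z/A-Z/);   # ten lowercase letters

print "$is_local $at_end $line $count $shout $letters\n";
```

Note that the case-insensitive /i flag travels with the qr// object, so it still applies after $word is interpolated into the larger pattern.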
The simplest kind of regular expression is a literal string.
More complicated patterns involve the use of metacharacters to
describe all the different choices and variations that you want to
build into a pattern. Metacharacters don't match themselves, but describe
something else. The metacharacters are:
  \ | ( ) [ { ^ $ * + ? .
The "." (single dot) is a wildcard character. When used in a regular
expression, it can match any single character except the newline character
(\n). With the /s modifier on the pattern match operator, the dot matches
newlines as well. This modifier treats the string to be matched against as a
single "long" string with embedded newlines.
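The effect of /s on the dot can be seen in a small test like this one, where the two-line sample string is an assumption for illustration:

```perl
use strict;
use warnings;

my $text = "first\nsecond";

# Without /s the dot will not match the newline between the two words.
my $plain  = ($text =~ /first.second/)  ? 1 : 0;

# With /s the dot matches "\n" as well, so the pattern spans both lines.
my $single = ($text =~ /first.second/s) ? 1 : 0;

print "without /s: $plain, with /s: $single\n";
```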
The ^ and $ metacharacters are used as anchors in a regular expression.
The ^ matches the beginning of a line. This character should only appear at
the beginning of an expression to match the line beginning. When the /m
(multi-line) modifier is used, it will match at the beginning of the string
and after every newline (except the last, if there is one). Elsewhere in a
pattern, ^ is still treated as an anchor, so write \^ to match a literal
caret. The exception is when it is the first character in a bracketed
character class, in which case it negates the class.
Similarly, $ will match the end of a line (just before a newline character)
only if it is at the end of a pattern, unless /m is used, in which case it
matches just before every newline and at the end of a string. You need to
escape $ to match a literal dollar sign in all cases, because if $ isn't at
the end of a pattern (or placed right before a ) or ]), Perl will attempt to
do variable interpretation. The same holds true for the @ sign, which Perl
will interpret as an array variable start unless it is backslashed.
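A minimal sketch of the anchors with and without /m, and of escaping $ and @, might look like this. The sample strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $text = "one\ntwo\nthree";

# Without /m, ^ only matches at the very start of the string.
my @starts   = ($text =~ /^(\w+)/g);
# With /m, ^ also matches after each embedded newline.
my @starts_m = ($text =~ /^(\w+)/mg);
# With /m, $ matches before each newline and at the end of the string.
my @ends_m   = ($text =~ /(\w+)$/mg);

# Literal $ and @ must be backslashed inside a pattern.
my $price      = 'cost: $5';
my $has_dollar = ($price =~ /\$\d/) ? 1 : 0;
my $mail       = 'user@example.com';
my $has_at     = ($mail =~ /\@example/) ? 1 : 0;

print scalar(@starts), " ", scalar(@starts_m), " ", scalar(@ends_m), "\n";
print "$has_dollar $has_at\n";
```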
The *, +, and ? metacharacters are called quantifiers. They specify the
number of times to match something. They act on the element immediately
preceding them, which could be a single character (including the .), a
grouped expression in parentheses, or a character class. The {...}
construct is a generalized quantifier. You may put two numbers separated by
a comma within the braces to specify minimum and maximum numbers that the
preceding element can match.
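The quantifiers can be compared side by side in a short sketch. The test strings below are assumptions, not taken from the original text.

```perl
use strict;
use warnings;

# ? makes the preceding element optional.
my $opt_absent  = ("color"  =~ /^colou?r$/) ? 1 : 0;
my $opt_present = ("colour" =~ /^colou?r$/) ? 1 : 0;

# * allows zero or more; + requires at least one.
my $star = ("ac" =~ /^ab*c$/) ? 1 : 0;
my $plus = ("ac" =~ /^ab+c$/) ? 1 : 0;

# {min,max} bounds the repeat count of the preceding element.
my $in_range  = ("1234"  =~ /^\d{2,4}$/) ? 1 : 0;
my $too_long  = ("12345" =~ /^\d{2,4}$/) ? 1 : 0;

# A quantifier after parentheses repeats the whole group.
my $group = ("abab" =~ /^(ab){2}$/) ? 1 : 0;

print "$opt_absent $opt_present $star $plus $in_range $too_long $group\n";
```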
Parentheses are used to group characters or expressions. They also
have the side effect of remembering what they matched so you can
recall and reuse patterns with a special group of variables.
The | is the alternation operator in regular expressions. It matches either
what's on its left side or right side. It does not only affect single
characters. For example:
/you|me|him|her/
looks for any of the four words. You should use parentheses to provide
boundaries for alternation:
/And(y|rew)/
This will match either "Andy" or "Andrew".
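The reason parentheses matter for alternation can be seen in a sketch like the following. The name list and the "screw" test word are assumptions for illustration.

```perl
use strict;
use warnings;

my @names = ("Andy", "Andrew", "Anders");

# Parentheses limit the alternation to the suffix only.
my @hits = grep { /^And(y|rew)$/ } @names;

# Without parentheses the | splits the whole pattern in two:
# "starts with Andy" OR "ends with rew".
my $loose = ("screw" =~ /^Andy|rew$/) ? 1 : 0;

print "@hits\n";
print "$loose\n";
```

Because | has the lowest precedence in a pattern, the anchors in the second example each apply to only one branch.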
The following table lists the backslashed representations of characters
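that you can use in regular expressions:
  \a    Alarm (beep)
  \n    Newline
  \r    Carriage return
  \t    Tab
  \f    Formfeed
  \e    Escape
  \007  Any octal ASCII value
  \x7f  Any hexadecimal ASCII value
  \cx   Control-x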
The [...] construct is used to list a set of characters (a character class)
of which one will match. Brackets are often used when capitalization is
uncertain in a match:
/[tT]here/
A dash (-) may be used to indicate a range of characters in a character
class:
/[a-zA-Z]/; # match any single letter
/[0-9]/; # match any single digit
To put a literal dash in the list, use a backslash before it (\-) or place
it first or last in the list.
By placing a ^ as the first element in the brackets, you create a negated
character class, i.e., it matches any character not in the list. For
example:
/[^A-Z]/; # match any character other than an uppercase letter
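Character classes, ranges, and negation can be exercised together in a sketch such as this one. The sample words and variable names are illustrative assumptions.

```perl
use strict;
use warnings;

# A class lists alternatives for a single character position.
my @words = ("There", "there", "where");
my @hits  = grep { /^[tT]here$/ } @words;

# Ranges can be combined inside one class.
my $is_hex = ("3f" =~ /^[0-9a-f]+$/) ? 1 : 0;

# A backslashed dash is a literal dash, not a range.
my $has_dash = ("a-b" =~ /[\-]/) ? 1 : 0;

# A negated class: delete everything that is not an uppercase letter.
(my $upper_only = "Hello World") =~ s/[^A-Z]//g;

print "@hits\n";
print "$is_hex $has_dash $upper_only\n";
```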
Some common character classes have their own predefined escape sequences
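for your programming convenience:
  \d  A digit, same as [0-9]
  \D  A nondigit, same as [^0-9]
  \w  A word character (alphanumeric or underscore), same as [a-zA-Z_0-9]
  \W  A non-word character, same as [^a-zA-Z_0-9]
  \s  A whitespace character, same as [ \t\n\r\f]
  \S  A non-whitespace character, same as [^ \t\n\r\f]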
These elements match any single element in (or not in) their class. A \w
matches only one character of a word. Using a quantifier, you can match a
whole word, for example, with \w+. The abbreviated classes may also be used
within brackets as elements of other character classes.
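A brief sketch of the abbreviated classes follows, using \w for word characters and \d for digits. The sample strings are assumptions for illustration.

```perl
use strict;
use warnings;

my $text = "order 66, aisle_7!";

# \w+ grabs whole runs of word characters (letters, digits, underscore).
my @words = ($text =~ /\w+/g);

# \d+ grabs runs of digits.
my @numbers = ($text =~ /\d+/g);

# An abbreviated class can sit inside brackets next to other characters.
my @chunks = ("a-b c_d" =~ /[\w-]+/g);

print join("|", @words), "\n";
print join("|", @numbers), "\n";
print join("|", @chunks), "\n";
```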
Anchors don't match any characters; they match places within a string. The
two most common anchors are ^ and $, which match the beginning and end of a
line, respectively.
This table lists the anchoring patterns used to match certain
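boundaries in regular expressions:
  ^   Matches at the beginning of the string (or line, if /m is used)
  $   Matches at the end of the string (or line, if /m is used)
  \b  Matches at a word boundary (between \w and \W)
  \B  Matches except at a word boundary
  \A  Matches at the beginning of the string
  \Z  Matches at the end of the string or before a newline
  \z  Matches only at the end of the string
  \G  Matches where the previous m//g left off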
The $ and \Z assertions can match not only at the end of the string, but
also one character earlier than that, if the last character of the string
happens to be a newline.
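The trailing-newline behavior can be checked with a sketch like the one below. It also uses \z, which matches only at the absolute end of the string, for contrast. The sample string is an assumption.

```perl
use strict;
use warnings;

my $line = "end\n";    # a string whose last character is a newline

# $ and \Z both match just before the trailing newline.
my $dollar  = ($line =~ /end$/)  ? 1 : 0;
my $big_z   = ($line =~ /end\Z/) ? 1 : 0;

# \z insists on the very end of the string, so the newline gets in the way.
my $small_z = ($line =~ /end\z/) ? 1 : 0;

print "$dollar $big_z $small_z\n";
```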
Quantifiers are used to specify how many instances of the previous element
can match. For instance, you could say "match any number of a's, including
none" (a*), or match between five and ten instances of the word "owie"
((owie){5,10}).
Quantifiers, by nature, are greedy. That is, the way the Perl regular
expression "engine" works is that it will look for the biggest match
possible (the farthest to the right) unless you tell it not to. Say you are
searching a string that reads:
a whatever foo, b whatever foo
and you want to find a and foo with something in between. You might use:
/a.*foo/
A . followed by a * looks for any character, any number of times, until foo
is found. But since Perl will look as far to the right as possible to find
foo, the first instance of foo is swallowed up by the greedy .* expression.
All the quantifiers therefore have a notation that allows for minimal
matching, so they are non-greedy. This notation uses a question mark
immediately following the quantifier to force Perl to match as few times as
possible, so that the match ends at the earliest available point (farthest
to the left). The following table lists the regular expression quantifiers and
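their non-greedy forms:
  Maximal  Minimal  Allowed range
  {n,m}    {n,m}?   Must occur at least n times but no more than m times
  {n,}     {n,}?    Must occur at least n times
  {n}      {n}?     Must match exactly n times
  *        *?       0 or more times (same as {0,})
  +        +?       1 or more times (same as {1,})
  ?        ??       0 or 1 time (same as {0,1})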
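Using the sample string from the discussion above, the difference between the greedy and non-greedy forms can be sketched like this. The capturing parentheses and variable names are added here for illustration only.

```perl
use strict;
use warnings;

my $text = "a whatever foo, b whatever foo";

# Greedy: .* runs to the last "foo" in the string.
my ($greedy) = $text =~ /(a.*foo)/;

# Non-greedy: .*? stops at the first "foo" it can reach.
my ($minimal) = $text =~ /(a.*?foo)/;

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
```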
Parentheses not only serve to group elements in a regular expression,
they also remember the patterns they match.
Every match from a parenthesized
element is saved to a special, read-only variable indicated by a number.
You can recall and reuse a match by using these variables.
Within a pattern, each parenthesized element saves its match to a numbered
variable, in order starting with 1. You can recall these matches within the
expression by using \1, \2, and so on. Outside of the matching pattern, the
matched variables are recalled with the usual dollar sign, i.e., $1, $2,
etc. The dollar sign notation should be used in the replacement expression
of a substitution and anywhere else you might want to use them in your
program. For example, to implement "i before e, except after c":
s/([^c])ei/$1ie/g;
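The two notations can be seen side by side in a sketch such as the following, with \1 inside the pattern and $1, $2 outside it. The sample strings are assumptions for illustration.

```perl
use strict;
use warnings;

# \1 inside the pattern: find a word that is immediately repeated.
my $doubled = ("this is is a test" =~ /\b(\w+) \1\b/) ? $1 : "";

# $1 and $2 in the replacement: swap two captured words.
(my $swapped = "Doe, John") =~ s/(\w+), (\w+)/$2 $1/;

# The "i before e, except after c" substitution from the text.
(my $spelled = "beleive receive") =~ s/([^c])ei/$1ie/g;

print "$doubled\n";
print "$swapped\n";
print "$spelled\n";
```

In the last example "receive" is left alone because its "ei" follows a c, so [^c] cannot match there.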
The backreferencing variables are:
- $+
  Returns the last parenthesized pattern match
- $&
  Returns the entire matched string
- $`
  Returns everything before the matched string
- $'
  Returns everything after the matched string
Using $&, $`, or $' anywhere in your program will noticeably slow down all
of its regular expressions, because Perl must then save this information
for every match.
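What each of these variables holds after a successful match can be sketched as follows. The sample string and variable names are assumptions, and the performance caveat above still applies to real programs.

```perl
use strict;
use warnings;

my $text = "hello big world";
my ($before, $matched, $after, $last_group) = ("", "", "", "");

if ($text =~ /b(i)(g)/) {
    $before     = $`;    # text to the left of the match
    $matched    = $&;    # the whole matched string
    $after      = $';    # text to the right of the match
    $last_group = $+;    # the last parenthesized group that matched
}

print "[$before][$matched][$after][$last_group]\n";
```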
Perl defines an extended syntax for regular expressions.
The syntax is a pair of parentheses with a question mark as the first thing
within the parentheses.
The character after the question mark gives the function of the extension.
The extensions are:
- (?#text)
  A comment. The text is ignored.
- (?:...)
  This groups things like "(...)" but doesn't make backreferences.
- (?=...)
  A zero-width positive lookahead assertion. For example, /\w+(?=\t)/
  matches a word followed by a tab, without including the tab in $&.
- (?!...)
  A zero-width negative lookahead assertion. For example, /foo(?!bar)/
  matches any occurrence of "foo" that isn't followed by "bar".
- (?<=...)
  A zero-width positive lookbehind assertion. For example, /(?<=bad)boy/
  matches the word boy that follows bad, without including bad in $&. This
  only works for fixed-width lookbehind.
- (?<!...)
  A zero-width negative lookbehind assertion. For example, /(?<!bad)boy/
  matches any occurrence of "boy" that doesn't follow "bad". This only works
  for fixed-width lookbehind.
- (?>...)
  Matches the substring that the standalone pattern would match if anchored
  at the given position.
- (?(condition)yes-pattern|no-pattern)
- (?(condition)yes-pattern)
  Matches a pattern determined by a condition. The condition should be
  either an integer, which is "true" if the pair of parentheses
  corresponding to the integer has matched, or a lookahead, lookbehind, or
  evaluate zero-width assertion. The no-pattern will be used to match if the
  condition was not met, but it is also optional.
- (?imsx-imsx)
  One or more embedded pattern-match modifiers. Modifiers are switched off
  if they follow a - (dash).
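  The modifiers are defined as follows:
    i  Do case-insensitive pattern matching.
    m  Treat the string as multiple lines.
    s  Treat the string as a single line.
    x  Use extended regular expressions.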
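Several of the extensions listed above can be tried in a short sketch. The sample strings are assumptions for illustration, and a trailing space is added inside the lookbehinds so that they line up with the word boundary in the sample text.

```perl
use strict;
use warnings;

# (?=...) lookahead: the tab must follow but is not part of the capture.
my ($key) = "key\tvalue" =~ /(\w+)(?=\t)/;

# (?!...) negative lookahead: "foo" not followed by "bar".
my @not_bar = ("foobar foobaz" =~ /foo(?!bar)\w*/g);

# (?<=...) and (?<!...) lookbehind: fixed-width text before the match.
(my $after_bad  = "bad boy, good boy") =~ s/(?<=bad )boy/girl/;
(my $after_good = "bad boy, good boy") =~ s/(?<!bad )boy/girl/;

# (?:...) groups without capturing, so only the month is returned.
my @parts = ("2001-07" =~ /^(?:\d{4})-(\d{2})$/);

# (?i) switches on case-insensitive matching from inside the pattern.
my $inline = ("PERL" =~ /(?i)perl/) ? 1 : 0;

print "$key | @not_bar | $after_bad | $after_good | @parts | $inline\n";
```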
Copyright © 2001 O'Reilly & Associates. All rights reserved.