4.6. Regular Expressions

Regular expressions are used in several ways in Perl. They're used in conditionals to determine whether a string matches a particular pattern. They're also used to find patterns in strings and replace the match with something else.

The ordinary pattern match operator looks like /pattern/. It matches against the $_ variable by default. If the pattern is found in the string, the operator returns true (1); if there is no match, a false value ("") is returned.

The substitution operator looks like s/pattern/replace/. This operator also searches $_ by default. If it finds the specified pattern, the matched text is replaced with the string in replace. If pattern is not matched, nothing happens.

You may specify a variable other than $_ with the =~ binding operator (or the negated !~ binding operator, which returns true if the pattern is not matched). For example:

    $text =~ /sampo/;

4.6.1. Pattern-Matching Operators

The following list defines Perl's pattern-matching operators. Some of the operators have alternative "quoting" schemes and have a set of modifiers that can be placed directly after the operators to affect the match operation in some way.
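
As a minimal sketch of the match and substitution operators and the two binding operators described above, consider the following. The sample strings and the variable names ($found, $missing, @lines) are made up for illustration; only the word "sampo" comes from the text.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $text = "The sampo was forged by Ilmarinen.";

# =~ binds the match to $text instead of the default $_.
my $found   = ($text =~ /sampo/)   ? 1 : 0;
# !~ is true when the pattern is NOT found.
my $missing = ($text !~ /kantele/) ? 1 : 0;

# Inside this loop each element is aliased to $_, so a bare
# s/pattern/replace/ operates on it without a binding operator.
my @lines = ("sampo one", "no match here");
for (@lines) {
    s/sampo/mill/;    # replaces the first match; does nothing otherwise
}

print "found=$found missing=$missing\n";
print "$_\n" for @lines;
```

The second element of @lines is left untouched, which shows that a substitution with no match is a no-op.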
4.6.2. Regular Expression Syntax

The simplest kind of regular expression is a literal string. More complicated patterns involve the use of metacharacters to describe all the different choices and variations that you want to build into a pattern. Metacharacters don't match themselves, but describe something else. The metacharacters are:
The . (single dot) is a wildcard character. When used in a regular expression, it can match any single character. It does not match the newline character (\n) unless you use the /s modifier on the pattern match operator. This modifier treats the string to be matched against as a single "long" string with embedded newlines.

The ^ and $ metacharacters are used as anchors in a regular expression. The ^ matches the beginning of a line, and normally appears only at the beginning of an expression. When the /m (multiline) modifier is used, it matches at the beginning of the string and after every newline (except a newline that is the last character of the string). Outside a bracketed character class, ^ is always an anchor, so it must be escaped (\^) to match a literal caret. As the first character in a bracketed character class, it negates the class.

Similarly, $ matches the end of a line (just before a trailing newline character) when it is at the end of a pattern. When /m is used, it matches just before every newline and at the end of the string. You need to escape $ to match a literal dollar sign in all cases, because if $ isn't at the end of a pattern (or placed right before a ) or |), Perl will attempt to do variable interpolation. The same holds true for the @ sign, which Perl will interpret as the start of an array variable unless it is backslashed.

The *, +, and ? metacharacters are called quantifiers. They specify the number of times to match something. They act on the element immediately preceding them, which could be a single character (including the .), a grouped expression in parentheses, or a character class.

The {...} construct is a generalized quantifier. You can put two numbers separated by a comma within the braces to specify the minimum and maximum number of times that the preceding element can match.

Parentheses are used to group characters or expressions. They also have the side effect of remembering what they matched so you can recall and reuse patterns with a special group of variables.

The | is the alternation operator in regular expressions. It matches either what's on its left side or what's on its right side. It applies to whole sequences, not just single characters. For example:

    /you|me|him|her/

looks for any of the four words. You should use parentheses to provide boundaries for alternation:

    /And(y|rew)/

This will match either "Andy" or "Andrew".

4.6.3. Escaped Sequences

The following table lists the backslashed representations of characters that you can use in regular expressions:
4.6.4. Character Classes

The [...] construct is used to list a set of characters (a character class) of which one will match. Brackets are often used when capitalization is uncertain in a match:

    /[tT]here/

A dash (-) may be used to indicate a range of characters in a character class:

    /[a-zA-Z]/;  # Match any single letter
    /[0-9]/;     # Match any single digit

To put a literal dash in the list, use a backslash before it (\-) or place it first or last in the list.

By placing a ^ as the first element in the brackets, you create a negated character class, i.e., it matches any character not in the list. For example:

    /[^A-Z]/;    # Matches any character other than an uppercase letter

Some common character classes have their own predefined escape sequences for your programming convenience:
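
The bracket rules above (alternatives, ranges, a literal dash, and negation) can be tried out with a short sketch like the one below. The word list and the array names are hypothetical and exist only to give the patterns something to match.

```perl
use strict;
use warnings;

# Hypothetical sample words, chosen only for illustration.
my @words = ("There", "there", "where", "x-ray", "42");

# [tT] accepts either case of the first letter.
my @there     = grep { /^[tT]here$/ } @words;        # There, there

# A range in brackets: strings made up only of digits.
my @digits    = grep { /^[0-9]+$/ } @words;          # 42

# A backslashed dash is a literal dash, not a range.
my @dashed    = grep { /[a-z][\-][a-z]/ } @words;    # x-ray

# Negated class: strings containing at least one character
# that is not a lowercase letter.
my @not_lower = grep { /[^a-z]/ } @words;            # There, x-ray, 42

print "there:     @there\n";
print "digits:    @digits\n";
print "dashed:    @dashed\n";
print "not lower: @not_lower\n";
```

Note that a negated class still has to match one character; it does not match an empty position.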
While Perl implements lc() and uc(), which you can use for testing the proper case of words or characters, you can do the same with escape sequences:
These elements match any single element in (or not in) their class. A \w matches only one character of a word. Using a quantifier, you can match a whole word, for example, with \w+. The abbreviated classes may also be used within brackets as elements of other character classes.

4.6.5. Anchors

Anchors don't match any characters; they match places within a string. The two most common anchors are ^ and $, which match the beginning and end of a line, respectively. The following table lists the anchoring patterns used to match certain boundaries in regular expressions:
The $ and \Z assertions can match not only at the end of the string, but also one character earlier than that, if the last character of the string is a newline.

4.6.6. Quantifiers

Quantifiers are used to specify the number of instances of the previous element that can match. For instance, you could say "match any number of a's, including none" (a*), or "match between 5 and 10 instances of the word 'owie'" ((owie){5,10}).

Quantifiers, by nature, are greedy. That is, the Perl regular expression "engine" will look for the biggest match possible (the farthest to the right) unless you tell it not to. Say you are searching a string that reads:

    a whatever foo, b whatever foo

and you want to find a and foo with something in between. You might use:

    /a.*foo/

A . followed by a * looks for any character, any number of times, until foo is found. But since Perl will look as far to the right as possible to find foo, the first instance of foo is swallowed up by the greedy .* expression, and the match extends to the second foo.

For this reason, all the quantifiers have a notation that allows for minimal matching, so they are nongreedy. This notation uses a question mark immediately following the quantifier to force Perl to stop at the earliest available match (farthest to the left). The following table lists the regular expression quantifiers and their nongreedy forms:
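
To make the difference between greedy and minimal matching concrete, here is a small sketch that runs both forms against the sample string from the text. The variable names are arbitrary, and the capturing parentheses are added only so the matched text can be inspected.

```perl
use strict;
use warnings;

my $string = "a whatever foo, b whatever foo";

# Greedy: .* extends as far right as possible, so the match
# ends at the LAST foo and covers the whole string.
my ($greedy)  = $string =~ /(a.*foo)/;

# Nongreedy: .*? stops as soon as the rest of the pattern can
# match, so the match ends at the FIRST foo.
my ($minimal) = $string =~ /(a.*?foo)/;

# A bounded quantifier: between 5 and 10 repetitions of "owie".
my $five = "owie" x 5;
my $four = "owie" x 4;
my $five_ok = ($five =~ /^(owie){5,10}$/) ? 1 : 0;   # 1
my $four_ok = ($four =~ /^(owie){5,10}$/) ? 1 : 0;   # 0

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
print "five_ok=$five_ok four_ok=$four_ok\n";
```

Both patterns start matching at the same leftmost a; only the point where the match ends differs.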
4.6.7. Pattern Match Variables

Parentheses not only group elements in a regular expression, they also remember the patterns they match. Every match from a parenthesized element is saved to a special, read-only variable indicated by a number. You can recall and reuse a match by using these variables.

Within a pattern, each parenthesized element saves its match to a numbered variable, in order starting with 1. You can recall these matches within the expression by using \1, \2, and so on. Outside of the matching pattern, the matched variables are recalled with the usual dollar sign, i.e., $1, $2, etc. The dollar sign notation should be used in the replacement expression of a substitution and anywhere else you might want to use the variables in your program. For example, to implement "i before e, except after c":

    s/([^c])ei/$1ie/g;

The backreferencing variables are:
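
The numbered variables described above can be sketched as follows. The substitution is the one from the text; the input strings, the doubled-word pattern, and the variable names are hypothetical additions for illustration.

```perl
use strict;
use warnings;

# $1 in the replacement puts back the character captured by ([^c]).
my $fixed = "beleive receive";
$fixed =~ s/([^c])ei/$1ie/g;          # "receive" has c before ei, so it is untouched

# \1 inside the pattern refers back to what the first group matched,
# which here finds a word that is immediately repeated.
my $doubled = "";
if ("this is is a test" =~ /\b(\w+) \1\b/) {
    $doubled = $1;                    # outside the pattern, use $1
}

# Several groups are numbered left to right.
my ($year, $month) = ("", "");
if ("2002-06" =~ /([0-9]+)-([0-9]+)/) {
    ($year, $month) = ($1, $2);
}

print "fixed:   $fixed\n";
print "doubled: $doubled\n";
print "year=$year month=$month\n";
```

Copying $1 and $2 into named variables right after a successful match is a common habit, since the next successful match overwrites them.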
Backreferencing with these variables will slow down your program noticeably for all regular expressions.

4.6.8. Extended Regular Expressions

Perl defines an extended syntax for regular expressions. The syntax is a pair of parentheses with a question mark as the first thing within the parentheses. The character after the question mark gives the function of the extension. The extensions are:
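
As an illustration of the (?...) syntax, the sketch below uses three standard Perl extensions: (?:...) for grouping without capturing, (?=...) for a zero-width lookahead, and (?i) for an inline case-insensitivity modifier. These three were picked for this example and are not a complete list; the test strings and variable names are hypothetical.

```perl
use strict;
use warnings;

# (?:...) groups without capturing, so the alternation is still $1.
my ($suffix) = "Andrew" =~ /^(?:And)(y|rew)$/;         # "rew"

# (?=...) requires that "bar" follows, but does not include it
# in the matched text.
my ($before)  = "foobar" =~ /(foo)(?=bar)/;            # "foo"
my $no_follow = ("foobaz" =~ /foo(?=bar)/) ? 1 : 0;    # 0

# (?i) switches on case-insensitive matching from inside the pattern.
my $any_case  = ("SAMPO" =~ /(?i)sampo/) ? 1 : 0;      # 1

print "suffix=$suffix before=$before\n";
print "no_follow=$no_follow any_case=$any_case\n";
```

In each case the character after the question mark (:, =, or i) selects what the parenthesized construct does.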
Copyright © 2002 O'Reilly & Associates. All rights reserved.