Programming Perl

5.8. Alternation

Inside a pattern or subpattern, use the | metacharacter to specify a set of possibilities, any one of which could match. For instance:
/Gandalf|Saruman|Radagast/
matches Gandalf or Saruman or Radagast. The alternation extends only as far as the innermost enclosing parentheses (whether capturing or not):
/prob|n|r|l|ate/    # Match prob, n, r, l, or ate
/pro(b|n|r|l)ate/   # Match probate, pronate, prorate, or prolate
/pro(?:b|n|r|l)ate/ # Match probate, pronate, prorate, or prolate
The second and third forms match the same strings, but the second form captures the variant character in $1 and the third form does not.
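As a quick check of that difference, here is a minimal sketch that runs one sample word through both the capturing and the non-capturing form. The sample word "prorate" and the variable names are illustrative assumptions, not part of the original text.

```perl
use strict;
use warnings;

my $word = "prorate";   # illustrative sample input

# Capturing group: in list context the match returns the contents of $1.
my ($variant) = $word =~ /pro(b|n|r|l)ate/;

# Non-capturing group: the same strings match, but no group is recorded.
my $matched     = $word =~ /pro(?:b|n|r|l)ate/;
my $group_count = $#+;   # number of groups in the last successful match

print "capturing:     variant character is '$variant'\n";
print "non-capturing: matched = ", ($matched ? "yes" : "no"),
      ", groups recorded = $group_count\n";
```

Prefer the non-capturing form when you only need grouping, since it leaves the numbered variables free for the groups you do care about.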

At any given position, the Engine tries to match the first alternative, and then the second, and so on. The relative length of the alternatives does not matter, which means that in this pattern:
/(Sam|Samwise)/
$1 will never be set to Samwise no matter what string it's matched against, because Sam will always match first. When you have overlapping matches like this, put the longer ones at the beginning.
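The effect of ordering can be seen by matching the same string against both arrangements of the alternatives. This is only a sketch; the input string and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $text = "Samwise Gamgee";   # illustrative input

# Shorter alternative listed first: it succeeds before Samwise is ever tried.
my ($short_first) = $text =~ /(Sam|Samwise)/;

# Longer alternative listed first: it gets the first chance to match.
my ($long_first)  = $text =~ /(Samwise|Sam)/;

print "short first: $short_first\n";
print "long first:  $long_first\n";
```

Both patterns start matching at the same position in the string; only the order in which the alternatives are tried differs.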

But the ordering of the alternatives only matters at a given position. The outer loop of the Engine does left-to-right matching, so the following always matches the first Sam:

"'Sam I am,' said Samwise" =~ /(Samwise|Sam)/;   # $1 eq "Sam"
But you can force right-to-left scanning by making use of greedy quantifiers, as discussed earlier in "Quantifiers":
"'Sam I am,' said Samwise" =~ /.*(Samwise|Sam)/; # $1 eq "Samwise"
You can defeat any left-to-right (or right-to-left) matching by including any of the various positional assertions we saw earlier, such as \G, ^, and $. Here we anchor the pattern to the end of the string:
"'Sam I am,' said Samwise" =~ /(Samwise|Sam)$/;  # $1 eq "Samwise"
That example factors the $ out of the alternation (since we already had a handy pair of parentheses to put it after), but in the absence of parentheses you can also distribute the assertions to any or all of the individual alternatives, depending on how you want them to match. This little program displays lines that begin with either a __DATA__ or __END__ token:
while (<>) {
    print if /^__DATA__|^__END__/;
}
But be careful with that. Remember that the first and last alternatives (before the first | and after the last one) tend to gobble up the other elements of the regular expression on either side, out to the ends of the expression, unless there are enclosing parentheses. A common mistake is to ask for:
/^cat|dog|cow$/
when you really mean:
/^(cat|dog|cow)$/
The first matches "cat" at the beginning of the string, or "dog" anywhere, or "cow" at the end of the string. The second matches any string consisting solely of "cat" or "dog" or "cow". It also captures $1, which you may not want. You can also say:
/^cat$|^dog$|^cow$/
We'll show you another solution later.
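To see how far the anchors reach in each case, the following sketch tries a handful of made-up strings against an unparenthesized alternation and a grouped one. The test strings and variable names are assumptions for illustration, and the grouped pattern uses the non-capturing (?:...) form shown earlier in this section so that $1 is left alone.

```perl
use strict;
use warnings;

my $loose_re = qr/^cat|dog|cow$/;       # ^ applies only to cat, $ only to cow
my $tight_re = qr/^(?:cat|dog|cow)$/;   # both anchors apply to every alternative

# Hypothetical sample inputs chosen to separate the two behaviors.
for my $candidate ("dog", "hotdogs", "catalog", "scow", "bird") {
    printf "%-8s loose=%-3s tight=%s\n",
        $candidate,
        ($candidate =~ $loose_re ? "yes" : "no"),
        ($candidate =~ $tight_re ? "yes" : "no");
}
```

Only "dog" satisfies the grouped pattern, while the unparenthesized one also accepts "hotdogs", "catalog", and "scow".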

An alternative can be empty, in which case it always matches.

/com(pound|)/;      # Matches "compound" or "com"
/com(pound(s|)|)/;  # Matches "compounds", "compound", or "com"
This is much like using the ? quantifier, which matches 0 times or 1 time:
/com(pound)?/;      # Matches "compound" or "com"
/com(pound(s?))?/;  # Matches "compounds", "compound", or "com"
/com(pounds?)?/;    # Same, but doesn't use $2
There is one difference, though. When you apply the ? to a subpattern that captures into a numbered variable, that variable will be undefined if there's no string to go there. If you used an empty alternative, it would still be false, but would be a defined null string instead.
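That difference can be observed directly with defined. The sketch below matches the string "com" both ways; the input and variable names are illustrative assumptions.

```perl
use strict;
use warnings;

my $input = "com";   # illustrative input with nothing after "com"

# Optional group: the group takes no part in the match, so its capture is undef.
my ($optional) = $input =~ /com(pound)?/;

# Empty alternative: the group does match, but it matches the empty string.
my ($alternative) = $input =~ /com(pound|)/;

print "with ?:                 ",
      (defined $optional ? "defined" : "undefined"), "\n";
print "with empty alternative: ",
      (defined $alternative
          ? "defined, length " . length($alternative)
          : "undefined"), "\n";
```

This matters mostly under use warnings, where interpolating the undefined capture would produce an "uninitialized value" warning and the empty string would not.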

Copyright © 2002 O'Reilly & Associates. All rights reserved.