home | O'Reilly's CD bookshelfs | FreeBSD | Linux | Cisco | Cisco Exam  


Book Home Programming PerlSearch this book

5.10. Fancy Patterns

5.10.1. Lookaround Assertions

Sometimes you just need to sneak a peek. There are four regex extensions that help you do just that, and we call them lookaround assertions because they let you scout around in a hypothetical sort of way, without committing to matching any characters. What these assertions assert is that some pattern would (or would not) match if we were to try it. The Engine works it all out for us by actually trying to match the hypothetical pattern, and then pretending that it didn't match (if it did).

When the Engine peeks ahead from its current position in the string, we call it a lookahead assertion. If it peeks backward, we call it a lookbehind assertion. The lookahead patterns can be any regular expression, but the lookbehind patterns may only be fixed width, since they have to know where to start the hypothetical match from.

While these four extensions are all zero-width assertions, and hence do not consume characters (at least, not officially), you can in fact capture substrings within them if you supply extra levels of capturing parentheses.

(?=PATTERN) (positive lookahead)

When the Engine encounters (?=PATTERN), it looks ahead in the string to ensure that PATTERN occurs. If you'll recall, in our earlier duplicate word remover, we had to write a loop because the pattern ate too much each time through:

$_ = "Paris in THE THE THE THE spring.";

# remove duplicate words (and triplicate (and quadruplicate...))
1 while s/\b(\w+) \1\b/$1/gi;
Whenever you hear the phrase "ate too much", you should always think "lookahead assertion". (Well, almost always.) By peeking ahead instead of gobbling up the second word, you can write a one-pass duplicate word remover like this:
s/ \b(\w+) \s (?= \1\b ) //gxi;
Of course, this isn't quite right, since it will mess up valid phrases like "The clothes you DON DON't fit."

(?!PATTERN) (negative lookahead)

When the Engine encounters (?!PATTERN), it looks ahead in the string to ensure that PATTERN does not occur. To fix our previous example, we can add a negative lookahead assertion after the positive assertion to weed out the case of contractions:

s/ \b(\w+) \s (?= \1\b (?! '\w))//xgi;
That final \w is necessary to avoid confusing contractions with words at the ends of single-quoted strings. We can take this one step further, since earlier in this chapter we intentionally used "that that particular", and we'd like our program to not "fix" that for us. So we can add an alternative to the negative lookahead in order to pre-unfix that "that", (thereby demonstrating that any pair of parentheses can be used to cluster alternatives):
s/ \b(\w+) \s (?= \1\b (?! '\w | \s particular))//gix;
Now we know that that particular phrase is safe. Unfortunately, the Gettysburg Address is still broken. So we add another exception:
s/ \b(\w+) \s (?= \1\b (?! '\w | \s particular | \s nation))//igx;
This is just starting to get out of hand. So let's do an Official List of Exceptions, using a cute interpolation trick with the $" variable to separate the alternatives with the | character:
@thatthat = qw(particular nation);
local $" = '|';
s/ \b(\w+) \s (?= \1\b (?! '\w | \s (?: @thatthat )))//xig;

(?<=PATTERN) (positive lookbehind)

When the Engine encounters (?<=PATTERN), it looks backward in the string to ensure that PATTERN already occurred.

Our example still has a problem. Although it now lets Honest Abe say things like "that that nation", it also allows "Paris, in the the nation of France". We can add a positive lookbehind assertion in front of our exception list to make sure that we apply our @thatthat exceptions only to a real "that that".

s/ \b(\w+) \s (?= \1\b (?! '\w | (?<= that) \s (?: @thatthat )))//ixg;
Yes, it's getting terribly complicated, but that's why this section is called "Fancy Patterns", after all. If you need to complicate the pattern any more than we've done so far, judicious use of comments and qr// will help keep you sane. Or at least saner.

(?<!PATTERN) (negative lookbehind)

When the Engine encounters (?<!PATTERN), it looks backward in the string to ensure that PATTERN did not occur.

Let's go for a really simple example this time. How about the easy version of that old spelling rule, "I before E except after C"? In Perl, you spell it:

s/(?<!c)ei/ie/g
You'll have to weigh for yourself whether you want to handle any of the exceptions. (For example, "weird" is spelled weird, especially when you spell it "wierd".)

5.10.2. Nonbacktracking Subpatterns

As described in "The Little Engine That /Could(n't)?/", the Engine often backtracks as it proceeds through the pattern. You can block the Engine from backtracking back through a particular set of choices by creating a nonbacktracking subpattern. A nonbacktracking subpattern looks like (?>PATTERN), and it works exactly like a simple (?:PATTERN), except that once PATTERN has found a match, it suppresses backtracking on any of the quantifiers or alternatives inside the subpattern. (Hence, it is meaningless to use this on a PATTERN that doesn't contain quantifiers or alternatives.) The only way to get it to change its mind is to backtrack to something before the subpattern and reenter the subpattern from the left.

It's like going into a car dealership. After a certain amount of haggling over the price, you deliver an ultimatum: "Here's my best offer; take it or leave it." If they don't take it, you don't go back to haggling again. Instead, you backtrack clear out the door. Maybe you go to another dealership, and start haggling again. You're allowed to haggle again, but only because you reentered the nonbacktracking pattern again in a different context.

For devotees of Prolog or SNOBOL, you can think of this as a scoped cut or fence operator.

Consider how in "aaab" =~ /(?:a*)ab/, the a* first matches three a's, but then gives up one of them because the last a is needed later. The subgroup sacrifices some of what it wants in order for the whole match to succeed. (Which is like letting the car salesman talk you into giving him more of your money because you're afraid to walk away from the deal.) In contrast, the subpattern in "aaab" =~ /(?>a*)ab/ will never give up what it grabs, even though this behavior causes the whole match to fail. (As the song says, you have to know when to hold 'em, when to fold 'em, and when to walk away.)

Although (?>PATTERN) is useful for changing the behavior of a pattern, it's mostly used for speeding up the failure of certain matches that you know will fail anyway (unless they succeed outright). The Engine can take a spectacularly long time to fail, particular with nested quantifiers. The following pattern will succeed almost instantly:

$_ = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab";
/a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*[b]/;
But success is not the problem. Failure is. If you remove that final "b" from the string, the pattern will probably run for many, many years before failing. Many, many millennia. Actually, billions and billions of years.[13] You can see by inspection that the pattern can't succeed if there's no "b" on the end of the string, but the regex optimizer is not smart enough (as of this writing) to figure out that /[b]/ is equivalent to /b/. But if you give it a hint, you can get it to fail quickly while still letting it succeed where it can:
/(?>a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*)[b]/;
For a (hopefully) more realistic example, imagine a program that's supposed to read in a paragraph at a time and show just the lines that are continued, where contination lines are specified with trailing backslashes. Here's a sample from Perl's Makefile that uses this line-continuation convention:
# Files to be built with variable substitution before miniperl
# is available.
sh = Makefile.SH cflags.SH config_h.SH makeaperl.SH makedepend.SH \
        makedir.SH myconfig.SH writemain.SH
You could write your simple program this way:
#!/usr/bin/perl -00p
while ( /( (.+) ( (?<=\\) \n .* )+ ) /gx) {
    print "GOT $.: $1\n\n";
}
That works, but it's really quite slow. That's because the Engine backtracks a character at a time from the end of the line, shrinking what's in $1. This is pointless. And writing it without the extraneous captures doesn't help much. Using:
(.+(?:(?<=\\)\n.*)+)
for a pattern is somewhat faster, but not much. This is where a nonbacktracking subpattern helps a lot. The pattern:
((?>.+)(?:(?<=\\)\n.*)+)
does the same thing, but more than an order of magnitude faster because it doesn't waste time backtracking in search of something that isn't there.

[13] Actually, it's more on the order of septillions and septillions. We don't know exactly how long it would take. We didn't care to wait around watching it not fail. In any event, your computer is likely to crash before the heat death of the universe, and this regular expression takes longer than either of those.

You'll never get a success with (?>...) that you wouldn't get with (?:...) or even a simple (...). But if you're going to fail, it's best to fail quickly and get on with your life.

5.10.3. Programmatic Patterns

Most Perl programs tend to follow an imperative (also called procedural) programming style, like a series of discrete commands laid out in a readily observable order: "Preheat oven, mix, glaze, heat, cool, serve to aliens." Sometimes into this mix you toss a few dollops of functional programming ("Use a little more glaze than you think you need, even after taking this into account, recursively"), or sprinkle it with bits of object-oriented techniques ("but please hold the anchovy objects"). Often it's a combination of all of these.

But the regular expression Engine takes a completely different approach to problem solving, more of a declarative approach. You describe goals in the language of regular expressions, and the Engine implements whatever logic is needed to solve your goals. Logic programming languages (such as Prolog) don't always get as much exposure as the other three styles, but they're more common than you'd think. Perl couldn't even be built without make(1) or yacc(1), both of which could be considered, if not purely declarative languages, at least hybrids that blend imperative and logic programming together.

You can do this sort of thing in Perl, too, by blending goal declarations and imperative code together more miscibly than we've done so far, drawing upon the strengths of both. You can programmatically build up the string you'll eventually present to the regex Engine, in a sense creating a program that writes a new program on the fly.

You can also supply ordinary Perl expressions as the replacement part of s/// via the /e modifier. This allows you to dynamically generate the replacement string by executing a bit of code every time the pattern matches.

Even more elaborately, you can interject bits of code wherever you'd like in a middle of a pattern using the (?{CODE}) extension, and that code will be executed every time the Engine encounters that code as it advances and recedes in its intricate backtracking dance.

Finally, you can use s///ee or (??{CODE}) to add another level of indirection: the results of executing those code snippets will themselves be re-evaluated for further use, creating bits of program and pattern on the fly, just in time.

5.10.3.1. Generated patterns

It has been said[14] that programs that write programs are the happiest programs in the world. In Jeffrey Friedl's book, Mastering Regular Expressions, the final tour de force demonstrates how to write a program that produces a regular expression to determine whether a string conforms to the RFC 822 standard; that is, whether it contains a standards-compliant, valid mail header. The pattern produced is several thousand characters long, and about as easy to read as a crash dump in pure binary. But Perl's pattern matcher doesn't care about that; it just compiles up the pattern without a hitch and, even more interestingly, executes the match very quickly--much more quickly, in fact, than many short patterns with complex backtracking requirements.

[14] By Andrew Hume, the famous Unix philosopher.

That's a very complicated example. Earlier we showed you a very simple example of the same technique when we built up a $number pattern out of its components (see the section Section 5.9.2, "Variable Interpolation"). But to show you the power of this programmatic approach to producing a pattern, let's work out a problem of medium complexity.

Suppose you wanted to pull out all the words with a certain vowel-consonant sequence; for example, "audio" and "eerie" both follow a VVCVV pattern. Although describing what counts as a consonant or a vowel is easy, you wouldn't ever want to type that in more than once. Even for our simple VVCVV case, you'd need to type in a pattern that looked something like this:

^[aeiouy][aeiouy][cbdfghjklmnpqrstvwxzy][aeiouy][aeiouy]$
A more general-purpose program would accept a string like "VVCVV" and programmatically generate that pattern for you. For even more flexibility, it could accept a word like "audio" as input and use that as a template to infer "VVCVV", and from that, the long pattern above. It sounds complicated, but really isn't, because we'll let the program generate the pattern for us. Here's a simple cvmap program that does all of that:
#!/usr/bin/perl
$vowels = 'aeiouy';
$cons   = 'cbdfghjklmnpqrstvwxzy';
%map = (C => $cons, V => $vowels);  # init map for C and V

for $class ($vowels, $cons) {       # now for each type
    for (split //, $class) {        # get each letter of that type
        $map{$_} .= $class;         # and map the letter back to the type
    }
}

for $char (split //, shift) {       # for each letter in template word
    $pat .= "[$map{$char}]";        # add appropriate character class
}

$re = qr/^${pat}$/i;                # compile the pattern
print "REGEX is $re\n";             # debugging output
@ARGV = ('/usr/dict/words')         # pick a default dictionary
    if -t && !@ARGV;

while (<>) {                        # and now blaze through the input
    print if /$re/;                 # printing any line that matches
}
The %map variable holds all the interesting bits. Its keys are each letter of the alphabet, and the corresponding value is all the letters of its type. We throw in C and V, too, so you can specify either "VVCVV" or "audio", and still get out "eerie". Each character in the argument supplied to the program is used to pull out the right character class to add to the pattern. Once the pattern is created and compiled up with qr//, the match (even a very long one) will run quickly. Here's why you might get if you run this program on "fortuitously":
% cvmap fortuitously /usr/dict/wordses
REGEX is (?i-xsm:^[cbdfghjklmnpqrstvwxzy][aeiouy][cbdfghjklmnpqrstvwxzy][cbd
fghjklmnpqrstvwxzy][aeiouy][aeiouy][cbdfghjklmnpqrstvwxzy][aeiouy][aeiouy][c
bdfghjklmnpqrstvwxzy][cbdfghjklmnpqrstvwxzy][aeiouycbdfghjklmnpqrstvwxzy]$)
carriageable
circuitously
fortuitously
languorously
marriageable
milquetoasts
sesquiquarta
sesquiquinta
villainously
Looking at that REGEX, you can see just how much villainous typing you saved by programming languorously, albeit circuitously.

5.10.3.2. Substitution evaluations

When the /e modifier ("e" is for expression evaluation) is used on an s/PATTERN/CODE/e expression, the replacement portion is interpreted as a Perl expression, not just as a double-quoted string. It's like an embedded do {CODE}. Even though it looks like a string, it's really just a code block that gets compiled up at the same time as rest of your program, long before the substitution actually happens.

You can use the /e modifier to build replacement strings with fancier logic than double-quote interpolation allows. This shows the difference:

s/(\d+)/$1 * 2/;     # Replaces "42" with "42 * 2"
s/(\d+)/$1 * 2/e;    # Replaces "42" with "84"
And this converts Celsius temperatures into Fahrenheit:
$_ = "Preheat oven to 233C.\n";
s/\b(\d+\.?\d*)C\b/int($1 * 1.8 + 32) . "F"/e;   # convert to 451F
Applications of this technique are limitless. Here's a filter that modifies its files in place (like an editor) by adding 100 to every number that starts a line (and that is followed by a colon, which we only peek at, but don't actually match, or replace):
% perl -pi -e 's/^(\d+)(?=:)/100 + $1/e' filename
Now and then, you want to do more than just use the string you matched in another computation. Sometimes you want that string to be a computation, whose own evaluation you'll use for the replacement value. Each additional /e modifier after the first wraps an eval around the code to execute. The following two lines do the same thing, but the first one is easier to read:
s/PATTERN/CODE/ee
s/PATTERN/eval(CODE)/e
You could use this technique to replace mentions of simple scalar variables with their values:
s/(\$\w+)/$1/eeg;       # Interpolate most scalars' values
Because it's really an eval, the /ee even finds lexical variables. A slightly more elaborate example calculates a replacement for simple arithmetical expressions on (nonnegative) integers:
$_ = "I have 4 + 19 dollars and 8/2 cents.\n";
s{ (
        \d+ \s*         # find an integer
        [+*/-]          # and an arithmetical operator
        \s* \d+         # and another integer
   )
}{ $1 }eegx;            # then expand $1 and run that code
print;                  # "I have 23 dollars and 4 cents."
Like any other evalSTRING, compile-time errors (like syntax problems) and run-time exceptions (like dividing by zero) are trapped. If so, the $@ ($EVAL_ERROR) variable says what went wrong.

5.10.3.3. Match-time code evaluation

In most programs that use regular expressions, the surrounding program's run-time control structure drives the logical execution flow. You write if or while loops, or make function or method calls, that wind up calling a pattern-matching operation now and then. Even with s///e, it's the substitution operator that is in control, executing the replacement code only after a successful match.

With code subpatterns, the normal relationship between regular expression and program code is inverted. As the Engine is applying its Rules to your pattern at match time, it may come across a regex extension of the form (?{CODE}). When triggered, this subpattern doesn't do any matching or any looking about. It's a zero-width assertion that always "succeeds", evaluated only for its side effects. Whenever the Engine needs to progress over the code subpattern as it executes the pattern, it runs that code.

"glyph" =~ /.+ (?{ print "hi" }) ./x;  # Prints "hi" twice.
As the Engine tries to match glyph against this pattern, it first lets the .+ eat up all five letters. Then it prints "hi". When it finds that final dot, all five letters have been eaten, so it needs to backtrack back to the .+ and make it give up one of the letters. Then it moves forward through the pattern again, stopping to print "hi" again before assigning h to the final dot and completing the match successfully.

The braces around the CODE fragment are intended to remind you that it is a block of Perl code, and it certainly behaves like a block in the lexical sense. That is, if you use my to declare a lexically scoped variable in it, it is private to the block. But if you use local to localize a dynamically scoped variable, it may not do what you expect. A (?{ CODE}) subpattern creates an implicit dynamic scope that is valid throughout the rest of the pattern, until it either succeeds or backtracks through the code subpattern. One way to think of it is that the block doesn't actually return when it gets to the end. Instead, it makes an invisible recursive call to the Engine to try to match the rest of the pattern. Only when that recursive call is finished does it return from the block, delocalizing the localized variables.[15]

[15] People who are familiar with recursive descent parsers may find this behavior confusing because such compilers return from a recursive function call whenever they figure something out. The Engine doesn't do that--when it figures something out, it goes deeper into recursion (even when exiting a parenthetical group!). A recursive descent parser is at a minimum of recursion when it succeeds at the end, but the Engine is at a local maximum of recursion when it succeeds at the end of the pattern. You might find it helpful to dangle the pattern from its left end and think of it as a skinny representation of a call graph tree. If you can get that picture into your head, the dynamic scoping of local variables will make more sense. (And if you can't, you're no worse off than before.)

In the next example, we initialize $i to 0 by including a code subpattern at the beginning of the pattern. Then we match any number of characters with .*--but we place another code subpattern in between the . and the * so we can count how many times . matches.

$_ = 'lothlorien';
m/  (?{ $i = 0 })                    # Set $i to 0
    (.    (?{ $i++ })    )*          # Update $i, even after backtracking
    lori                             # Forces a backtrack
 /x;
The Engine merrily goes along, setting $i to 0 and letting the .* gobble up all 10 characters in the string. When it encounters the literal lori in the pattern, it backtracks and gives up those four characters from the .*. After the match, $i will still be 10.

If you wanted $i to reflect how many characters the .* actually ended up with, you could make use of the dynamic scope within the pattern:

$_ = 'lothlorien';
m/  (?{ $i = 0 })
    (. (?{ local $i = $i + 1; }) )*  # Update $i, backtracking-safe.
    lori
    (?{ $result = $i })              # Copy to non-localized location.
 /x;
Here, we use local to ensure that $i contains the number of characters matched by .*, regardless of backtracking. $i will be forgotten after the regular expression ends, so the code subpattern, (?{ $result = $i }), ensures that the count will live on in $result.

The special variable $^R (described in Chapter 28, "Special Names") holds the result of the last (?{CODE}) that was executed as part of a successful match.

You can use a (?{CODE}) extension as the COND of a (?(COND)IFTRUE|IFFALSE). If you do this, $^R will not be set, and you may omit the parentheses around the conditional:

"glyph" =~ /.+(?(?{ $foo{bar} gt "symbol" }).|signet)./;
Here, we test whether $foo{bar} is greater than symbol. If so, we include . in the pattern, and if not, we include signet in the pattern. Stretched out a bit, it might be construed as more readable:
"glyph" =~ m{
    .+                              # some anythings
    (?(?{                           # if 
            $foo{bar} gt "symbol"   # this is true
        })
            .                       # match another anything
        |                           # else
            signet                  # match signet
    )
    .                               # and one more anything
}x;
When use re 'eval' is in effect, a regex is allowed to contain (?{CODE}) subpatterns even if the regular expression interpolates variables:
/(.*?) (?{length($1) < 3 && warn}) $suffix/;  # Error without use re 'eval'
This is normally disallowed since it is a potential security risk. Even though the pattern above may be innocuous because $suffix is innocuous, the regex parser can't tell which parts of the string were interpolated and which ones weren't, so it just disallows code subpatterns entirely if there were any interpolations.

If the pattern is obtained from tainted data, even use re 'eval' won't allow the pattern match to proceed.

When use re 'taint' is in effect and a tainted string is the target of a regex, the captured subpatterns (either in the numbered variables or in the list of values returned by m// in list context) are tainted. This is useful when regex operations on tainted data are meant not to extract safe substrings, but merely to perform other transformations. See Chapter 23, "Security", for more on tainting. For the purpose of this pragma, precompiled regular expressions (usually obtained from qr//) are not considered to be interpolated:

/foo${pat}bar/
This is allowed if $pat is a precompiled regular expression, even if $pat contains (?{CODE}) subpatterns.

Earlier we showed you a bit of what usere'debug' prints out. A more primitive debugging solution is to use (?{CODE}) subpatterns to print out what's been matched so far during the match:

"abcdef" =~ / .+ (?{print "Matched so far: $&\n"}) bcdef $/x;
This prints:
Matched so far: abcdef
Matched so far: abcde
Matched so far: abcd
Matched so far: abc
Matched so far: ab
Matched so far: a
showing the .+ grabbing all the letters and giving them up one by one as the Engine backtracks.

5.10.3.4. Match-time pattern interpolation

You can build parts of your pattern from within the pattern itself. The (??{ CODE }) extension allows you to insert code that evaluates to a valid pattern. It's like saying /$pattern/, except that you can generate $pattern at run time--more specifically, at match time. For instance:

/\w (??{ if ($threshold > 1) { "red" } else { "blue" } }) \d/x;
This is equivalent to /\wred\d/ if $threshold is greater than 1, and /\wblue\d/ otherwise.

You can include backreferences inside the evaluated code to derive patterns from just-matched substrings (even if they will later become unmatched through backtracking). For instance, this matches all strings that read the same backward as forward (known as palindromedaries, phrases with a hump in the middle):

/^ (.+) .? (??{quotemeta reverse $1}) $/xi;
You can balance parentheses like so:
$text =~ /( \(+ ) (.*?) (??{ '\)' x length $1 })/x;
This matches strings of the form (shazam!) and (((shazam!))), sticking shazam! into $2. Unfortunately, it doesn't notice whether the parentheses in the middle are balanced. For that we need recursion.

Fortunately, you can do recursive patterns too. You can have a compiled pattern that uses (??{CODE}) to refer to itself. Recursive matching is pretty irregular, as regular expressions go. Any text on regular expressions will tell you that a standard regex can't match nested parentheses correctly. And that's correct. It's also correct that Perl's regexes aren't standard. The following pattern[16] matches a set of nested parentheses, however deep they go:

$np = qr{
           \(
           (?:
              (?> [^()]+ )    # Non-parens without backtracking
            |
              (??{ $np })     # Group with matching parens
           )*
           \)
        }x;
You could use it like this to match a function call:
$funpat = qr/\w+$np/;
'myfunfun(1,(2*(3+4)),5)' =~ /^$funpat$/;   # Matches!

[16]Note that you can't declare the variable in the same statement in which you're going to use it. You can always declare it earlier, of course.

5.10.3.5. Conditional interpolation

The (?(COND)IFTRUE|IFFALSE) regex extension is similar to Perl's ?: operator. If COND is true, the IFTRUE pattern is used; otherwise, the IFFALSE pattern is used. The COND can be a backreference (expressed as a bare integer, without the \ or $), a lookaround assertion, or a code subpattern. (See Section 5.10.1, "Lookaround Assertions" and Section 5.10.3.3, "Match-time code evaluation" earlier in this chapter.)

If the COND is an integer, it is treated as a backreference. For instance, consider:

#!/usr/bin/perl
$x = 'Perl is free.';
$y = 'ManagerWare costs $99.95.';

foreach ($x, $y) {
    /^(\w+) (?:is|(costs)) (?(2)(\$\d+)|\w+)/;  # Either (\$\d+) or \w+
    if ($3) {
        print "$1 costs money.\n";         # ManagerWare costs money.
    } else {
        print "$1 doesn't cost money.\n";  # Perl doesn't cost money.
    }
}
Here, the COND is (2), which is true if a second backreference exists. If that's the case, (\$\d+) is included in the pattern at that point (creating the $3 backreference); otherwise, \w+ is used.

If the COND is a lookaround or code subpattern, the truth of the assertion is used to determine whether to include IFTRUE or IFFALSE:

/[ATGC]+(?(?<=AA)G|C)$/;
This uses a lookbehind assertion as the COND to match a DNA sequence that ends in either AAG, or some other base combination and C.

You can omit the |IFFALSE alternative. If you do, the IFTRUE pattern will be included in the pattern as usual if the COND is true, but if the condition isn't true, the Engine will move on to the next portion of the pattern.

5.10.4. Defining Your Own Assertions

You can't change how Perl's Engine works, but if you're sufficiently warped, you can change how it sees your pattern. Since Perl interprets your pattern similarly to double-quoted strings, you can use the wonder of overloaded string constants to see to it that text sequences of your choosing are automatically translated into other text sequences.

In the example below, we specify two transformations to occur when Perl encounters a pattern. First, we define \tag so that when it appears in a pattern, it's automatically translated to (?:<.*?>), which matches most HTML and XML tags. Second, we "redefine" the \w metasymbol so that it handles only English letters.

We'll define a package called Tagger that hides the overloading from our main program. Once we do that, we'll be able to say:

use Tagger;
$_ = '<I>camel</I>';
print "Tagged camel found" if /\tag\w+\tag/;
Here's Tagger.pm, couched in the form of a Perl module (see Chapter 11, "Modules"):
package Tagger;
use overload;

sub import { overload::constant 'qr' => \&convert }

sub convert {
    my $re = shift;
    $re =~ s/ \\tag  /<.*?>/xg;
    $re =~ s/ \\w    /[A-Za-z]/xg;
    return $re;
}

1;
The Tagger module is handed the pattern immediately before interpolation, so you can bypass the overloading by bypassing interpolation, as follows:
$re = '\tag\w+\tag';   # This string begins with \t, a tab
print if /$re/;        # Matches a tab, followed by an "a"...
If you wanted the interpolated variable to be customized, call the convert function directly:
$re = '\tag\w+\tag';         # This string begins with \t, a tab
$re = Tagger::convert $re;   # expand \tag and \w
print if /$re/;              # $re becomes <.*?>[A-Za-z]+<.*?>
Now if you're still wondering what those sub thingies are there in the Tagger module, you'll find out soon enough because that's what our next chapter is all about.



Library Navigation Links

Copyright © 2002 O'Reilly & Associates. All rights reserved.