6.15. Greedy and Non-Greedy Matches
Problem
You have a pattern with a greedy quantifier like *, +, ?, or {}, and you want to stop it from being greedy.
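A classic case of this is the naïve substitution to remove tags from HTML. Although it looks appealing, a greedy substitution such as s/<.*>//gs deletes everything from the first < through the last > in the string, including all the text between the tags.
Solution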
Replace the offending greedy quantifier with the corresponding non-greedy version. That is, change *, +, ?, and {} into *?, +?, ??, and {}?, respectively.
Discussion
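Perl has two sets of quantifiers: the maximal ones *, +, ?, and {} (also called greedy) and the minimal ones *?, +?, ??, and {}? (also called non-greedy).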
With maximal quantifiers, when you ask to match a variable number of times, such as zero or more times for *, the matching engine prefers to match as many times as it can. A minimal quantifier prefers to match as few times as it can while still letting the overall pattern succeed.

    # greedy pattern
    s/<.*>//gs;     # try to remove tags, very badly

    # non-greedy pattern
    s/<.*?>//gs;    # try to remove tags, still rather badly

This approach doesn't remove tags from all possible HTML correctly, because a single regular expression is not an acceptable replacement for a real parser. See Recipe 20.6 for the right way to do this.
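To see the difference concretely, here is a small sketch that runs both substitutions on one sample sentence. The sentence and the variable names are assumptions chosen for illustration; only the two substitutions come from the text above.

```perl
use strict;
use warnings;

# Sample input (an assumption made for this illustration).
my $text = "Even <TT>vi</TT> can edit <TT>troff</TT> effectively.";

# Greedy: .* runs from the first '<' to the last '>' in the string,
# so the words between the tags are deleted along with the tags.
(my $greedy = $text) =~ s/<.*>//gs;

# Non-greedy: .*? stops at the first '>' after each '<',
# so only the tags themselves are deleted.
(my $minimal = $text) =~ s/<.*?>//gs;

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
```

The greedy version leaves only the text before the first tag and after the last one, while the non-greedy version keeps the words between the tags. Neither is a substitute for a real HTML parser.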
Minimal matching isn't all it's cracked up to be. Don't fall into the trap of thinking that including the partial pattern BEGIN.*?END in a pattern always yields the shortest possible stretch of text between BEGIN and END. Imagine if we were trying to pull out everything between bold-italic pairs:

    <b><i>this</i> and <i>that</i> are important</b> Oh, <b><i>me too!</i></b>

A pattern to find only text between bold-italic HTML pairs, that is, text that doesn't include them, might appear to be this one:

    m{ <b><i>(.*?)</i></b> }sx
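You might be surprised to learn that the pattern doesn't do that. Many people incorrectly understand this as matching a "<b><i>" sequence, then text containing no further bold-italic tags, and then "</i></b>". In fact, the pattern simply matches the leftmost substring that satisfies it, with .*? kept as short as possible from that starting point. For the sample above, that is the entire string.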
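A short sketch makes it easy to check what the minimal pattern actually captures on the sample text. The variable names $html and $captured are assumptions introduced for this illustration.

```perl
use strict;
use warnings;

# The sample text from the discussion above.
my $html = '<b><i>this</i> and <i>that</i> are important</b> Oh, <b><i>me too!</i></b>';

# The match starts at the first <b><i> and extends only as far as needed
# to reach a </i></b>. The first such closing pair is at the very end of
# the string, so the capture spans almost everything.
my ($captured) = $html =~ m{ <b><i>(.*?)</i></b> }sx;

print "captured: $captured\n";
```

The capture still contains `</i>`, `</b>`, and a second `<b><i>`, which shows that a minimal quantifier limits how far the match extends from its starting point, not which tags may appear inside it.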
If the string in question is just one character, a negated class is remarkably more efficient than a minimal match, as in /X([^X]*)X/. The general way to say "match BEGIN, then not BEGIN, then END" for arbitrary values of BEGIN and END is as follows:

    /BEGIN((?:(?!BEGIN).)*)END/

Applying this to the HTML-matching code, we end up with something like:

    m{ <b><i>(  (?: (?!</b>|</i>). )*  ) </i></b> }sx

or perhaps:

    m{ <b><i>(  (?: (?!</[ib]>). )*  ) </i></b> }sx

Jeffrey Friedl points out that this quick-and-dirty method isn't particularly efficient. He suggests crafting a more elaborate pattern when speed really matters, such as:

    m{
        <b><i>
        [^<]*  # stuff not possibly bad, and not possibly the end.
        (?:
            # at this point, we can have '<' if not part of something bad
            (?!  </?[ib]>  )   # what we can't have
            <                  # okay, so match the '<'
            [^<]*              # and continue with more safe stuff
        ) *
        </i></b>
    }sx

This is a variation on Jeffrey's unrolling-the-loop technique, described in Chapter 5 of Mastering Regular Expressions.

See Also
The non-greedy quantifiers in the "Regular Expressions" section of perlre(1), and in the "the rules of regular expression matching" section of Chapter 2 of Programming Perl