PHP Cookbook

13.5. Choosing Greedy or Nongreedy Matches

13.5.3. Discussion

By default, quantifiers in PHP regular expressions are what's known as greedy. This means a quantifier tries to match as many characters as possible while still letting the rest of the pattern match.

For example, take the pattern p.*, which matches a p and then 0 or more characters, and match it against the string php. A greedy regular expression finds one match, because after it grabs the opening p, it continues on and also matches the hp. A nongreedy regular expression, on the other hand, finds a pair of matches. As before, it matches the opening p, but because the quantifier now takes as few characters as it can (here, none), the match stops there and leaves the h and the final p unconsumed. No match can begin at the h, so a second match then goes ahead and takes the closing p.

The following code shows that the greedy match finds only one hit; the nongreedy ones find two:

print preg_match_all('/p.*/', "php") . "\n";  // greedy: prints 1
print preg_match_all('/p.*?/', "php") . "\n"; // nongreedy: prints 2
print preg_match_all('/p.*/U', "php") . "\n"; // nongreedy: prints 2

Greedy matching is also known as maximal matching and nongreedy matching can be called minimal matching, because these options match either the maximum or minimum number of characters possible.
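The counts alone don't show what each variant actually grabs. The following sketch captures the matched text as well; the variable names $greedy and $nongreedy are illustrative and not part of the original recipe.

```php
<?php
// Capture the full text of every match so the difference is visible.
preg_match_all('/p.*/', 'php', $greedy);      // greedy: .* takes everything it can
preg_match_all('/p.*?/', 'php', $nongreedy);  // nongreedy: .*? takes as little as it can

print_r($greedy[0]);    // one match: "php"
print_r($nongreedy[0]); // two matches: "p" and "p"
```

The maximal match swallows the whole string, while each minimal match stops after a single p.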

Initially, all regular expressions were strictly greedy. Therefore, you can't use this syntax with ereg( ) or ereg_replace( ). Nongreedy matching isn't supported by the older engine that powers these functions; instead, you must use Perl-compatible functions.

Nongreedy matching is frequently useful when trying to perform simplistic HTML parsing. Let's say you want to find all text between bold tags. With greedy matching, you get this:

$html = '<b>I am bold.</b> <i>I am italic.</i> <b>I am also bold.</b>';
preg_match_all('#<b>(.+)</b>#', $html, $bolds);
print_r($bolds[1]); // text captured by the parenthesized group
Array
(
    [0] => I am bold.</b> <i>I am italic.</i> <b>I am also bold.
)

Because there's a second set of bold tags, the pattern extends past the first </b>, which makes it impossible to correctly break up the HTML. If you use minimal matching, each set of tags is self-contained:

$html = '<b>I am bold.</b> <i>I am italic.</i> <b>I am also bold.</b>';
preg_match_all('#<b>(.+?)</b>#', $html, $bolds);
print_r($bolds[1]); // text captured by the parenthesized group
Array
(
    [0] => I am bold.
    [1] => I am also bold.
)

Of course, this can break down if your markup isn't 100% valid, and there are stray bold tags lying around.[12] If your goal is just to remove all (or some) HTML tags from a block of text, you're better off not using a regular expression. Instead, use the built-in function strip_tags( ); it's faster and it works correctly. See Recipe 11.12 for more details.
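As a rough illustration of the strip_tags( ) alternative, here is a minimal sketch that reuses the sample markup from above. The variable names $plain and $boldOnly are made up for this example.

```php
<?php
$html = '<b>I am bold.</b> <i>I am italic.</i> <b>I am also bold.</b>';

// Remove every tag.
$plain = strip_tags($html);
// Remove every tag except <b>; the second argument lists the tags to keep.
$boldOnly = strip_tags($html, '<b>');

print $plain . "\n";    // I am bold. I am italic. I am also bold.
print $boldOnly . "\n"; // <b>I am bold.</b> I am italic. <b>I am also bold.</b>
```

Note that strip_tags( ) removes tags rather than extracting text between them, so it replaces the regular expression only when stripping is the goal.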

[12] It's possible to have valid HTML and still get into trouble, for instance, if you have bold tags inside a comment. A true HTML parser ignores the commented-out section, but our pattern won't.
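The comment problem is easy to reproduce. In this sketch the markup is hypothetical, invented only to show the pattern picking up text that a browser would never display:

```php
<?php
// Hypothetical markup: the first pair of bold tags is commented out.
$html = '<!-- <b>old text</b> --> <b>new text</b>';
preg_match_all('#<b>(.+?)</b>#', $html, $bolds);

print_r($bolds[1]); // contains both "old text" and "new text"
```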

Finally, even though the idea of nongreedy matching comes from Perl, the U modifier is incompatible with Perl and is unique to PHP's Perl-compatible regular expressions. It inverts all quantifiers, turning them from greedy to nongreedy and also the reverse. So, to get a greedy quantifier inside a pattern operating under a trailing /U, just add a ? after the quantifier, the same way you would normally turn a greedy quantifier into a nongreedy one.
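The inversion can be checked against the same php string used earlier. This is a small sketch; $bare and $suffixed are illustrative names only.

```php
<?php
// Under /U the default flips: a bare quantifier is nongreedy...
preg_match_all('/p.*/U', 'php', $bare);
// ...and a trailing ? makes it greedy again.
preg_match_all('/p.*?/U', 'php', $suffixed);

print_r($bare[0]);     // two matches: "p" and "p"
print_r($suffixed[0]); // one match: "php"
```

Because /U reverses the meaning of ? for every quantifier in the pattern, mixing it with explicit ? suffixes can be hard to read; choosing one style per pattern keeps the intent clear.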

13.5.4. See Also

Recipe 13.9 for more on capturing text inside HTML tags; Recipe 11.12 for more on stripping HTML tags; documentation on preg_match_all( ) at http://www.php.net/preg-match-all.

Copyright © 2003 O'Reilly & Associates. All rights reserved.