13.5. Choosing Greedy or Nongreedy Matches
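Place a ? after a quantifier to make that portion of the pattern nongreedy:

// find all bolded sections
preg_match_all('#<b>.+?</b>#', $html, $matches);

Or use the U pattern modifier to invert all quantifiers from greedy to nongreedy:

// find all bolded sections
preg_match_all('#<b>.+</b>#U', $html, $matches);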
By default, all regular expressions in PHP are what's known as greedy. This means a quantifier always tries to match as many characters as possible.
For example, take the pattern p.*, which matches a p and then zero or more characters, and match it against the string php. A greedy regular expression finds one match, because after it grabs the opening p, it continues on and also matches the hp. A nongreedy regular expression, on the other hand, finds a pair of matches. It matches the opening p and then stops, because a nongreedy .* is satisfied with zero characters. The h can't begin a match, so it is skipped, and a second match then takes the closing p.
The following code shows that the greedy match finds only one hit; the nongreedy ones find two:
print preg_match_all('/p.*/', "php") . "\n";   // greedy
print preg_match_all('/p.*?/', "php") . "\n";  // nongreedy
print preg_match_all('/p.*/U', "php") . "\n";  // nongreedy

1
2
2
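To see which substrings are matched rather than only how many, the same patterns can be run with a third argument that collects the matches. This is a minimal sketch; the variable names $greedy and $nongreedy are chosen here for illustration.

```php
<?php
// Collect the matched substrings instead of only counting them.
preg_match_all('/p.*/', 'php', $greedy);      // greedy quantifier
preg_match_all('/p.*?/', 'php', $nongreedy);  // nongreedy quantifier

// Index 0 holds the full-pattern matches.
var_export($greedy[0]);     // one match: 'php'
print "\n";
var_export($nongreedy[0]);  // two matches: 'p' and 'p'
print "\n";
```

The nongreedy version never consumes the h: each match is the single letter p, because .*? stops as soon as the rest of the pattern (here, nothing) can succeed.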
Initially, all regular expressions were strictly greedy. Therefore, you can't use this syntax with ereg( ) or ereg_replace( ). Nongreedy matching isn't supported by the older engine that powers these functions; instead, you must use Perl-compatible functions.
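Nongreedy matching is frequently useful for simple HTML parsing. Suppose you want to find all the text between bold tags. With greedy matching:

$html = '<b>I am bold.</b> <i>I am italic.</i> <b>I am also bold.</b>';
preg_match_all('#<b>(.+)</b>#', $html, $bolds);
print_r($bolds[1]);

Array
(
    [0] => I am bold.</b> <i>I am italic.</i> <b>I am also bold.
)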
Because there's a second set of bold tags, the pattern extends past the first </b>, which makes it impossible to correctly break up the HTML. If you use minimal matching, each set of tags is self-contained:
$html = '<b>I am bold.</b> <i>I am italic.</i> <b>I am also bold.</b>';
preg_match_all('#<b>(.+?)</b>#', $html, $bolds);
print_r($bolds[1]);

Array
(
    [0] => I am bold.
    [1] => I am also bold.
)
Of course, this can break down if your markup isn't 100% valid, and there are stray bold tags lying around. If your goal is just to remove all (or some) HTML tags from a block of text, you're better off not using a regular expression. Instead, use the built-in function strip_tags( ); it's faster and it works correctly. See Recipe 11.12 for more details.
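Both points can be illustrated with a short sketch. The $broken string is a made-up example of invalid markup (an unclosed bold tag), and the variable names are assumptions for illustration only.

```php
<?php
// Hypothetical malformed markup: the first <b> is never closed.
$broken = '<b>one <b>two</b>';
preg_match_all('#<b>(.+?)</b>#', $broken, $m);
// Even a nongreedy match starts at the first <b>,
// so the capture swallows the stray tag.
var_export($m[1]);  // a single capture: 'one <b>two'
print "\n";

// Removing tags with strip_tags() instead of a regular expression.
$html = '<b>I am bold.</b> <i>I am italic.</i> <b>I am also bold.</b>';
$plain = strip_tags($html);            // remove every tag
$bold_only = strip_tags($html, '<b>'); // keep only <b> tags
print $plain . "\n";
print $bold_only . "\n";
```

The optional second argument to strip_tags( ) lists the tags to keep, which covers the "remove some tags" case without any pattern matching.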
Finally, even though the idea of nongreedy matching comes from Perl, the U modifier is incompatible with Perl and is unique to PHP's Perl-compatible regular expressions. It inverts all quantifiers, turning them from greedy to nongreedy and also the reverse. So, to get a greedy quantifier inside a pattern operating under a trailing /U, just add a ? after it, the same way you would normally turn a greedy quantifier into a nongreedy one.
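The inversion can be checked with a small sketch that runs the same two quantifiers under a trailing /U. The variable names $under_u and $under_u_q are illustrative assumptions.

```php
<?php
// Under /U a plain quantifier is nongreedy...
preg_match('/p.*/U', 'php', $under_u);
// ...and appending ? flips it back to greedy.
preg_match('/p.*?/U', 'php', $under_u_q);

print $under_u[0] . "\n";    // p
print $under_u_q[0] . "\n";  // php
```

This is the mirror image of a pattern without the modifier, where p.* matches php and p.*? matches only p.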
13.5.4. See Also
Copyright © 2003 O'Reilly & Associates. All rights reserved.