home | O'Reilly's CD bookshelfs | FreeBSD | Linux | Cisco | Cisco Exam  


Programming PHPProgramming PHPSearch this book

4.10. Perl-Compatible Regular Expressions

Perl has long been considered the benchmark for powerful regular expressions. PHP uses a C library called pcre to provide almost complete support for Perl's arsenal of regular expression features. Perl regular expressions include the POSIX classes and anchors described earlier. A POSIX-style character class in a Perl regular expression works and understands non-English characters using the Unix locale system. Perl regular expressions act on arbitrary binary data, so you can safely match with patterns or strings that contain the NUL-byte (\x00).

4.10.5. Quantifiers and Greed

The POSIX quantifiers, which Perl also supports, are always greedy. That is, when faced with a quantifier, the engine matches as much as it can while still satisfying the rest of the pattern. For instance:

preg_match('/(<.*>)/', 'do <b>not</b> press the button', $match);
// $match[1] is '<b>not</b>'

The regular expression matches from the first less-than sign to the last greater-than sign. In effect, the .* matches everything after the first less-than sign, and the engine backtracks to make it match less and less until finally there's a greater-than sign to be matched.

This greediness can be a problem. Sometimes you need minimal (non-greedy) matching—that is, quantifiers that match as few times as possible to satisfy the rest of the pattern. Perl provides a parallel set of quantifiers that match minimally. They're easy to remember, because they're the same as the greedy quantifiers, but with a question mark (?) appended. Table 4-11 shows the corresponding greedy and non-greedy quantifiers supported by Perl-style regular expressions.

Table 4-11. Greedy and non-greedy quantifiers in Perl-compatible regular expressions

Greedy quantifier

Non-greedy quantifier

?
??
*
*?
+
+?
{m}
{m}?
{m,}
{m,}?
{m,n}
{m,n}?

Here's how to match a tag using a non-greedy quantifier:

preg_match('/(<.*?>)/', 'do <b>not</b> press the button', $match);
// $match[1] is '<b>'

Another, faster way is to use a character class to match every non-greater-than character up to the next greater-than sign:

preg_match('/(<[^>]*>)/', 'do <b>not</b> press the button', $match);
// $match[1] is '<b>'

4.10.8. Trailing Options

Perl-style regular expressions let you put single-letter options (flags) after the regular expression pattern to modify the interpretation, or behavior, of the match. For instance, to match case-insensitively, simply use the i flag:

preg_match('/cat/i', 'Stop, Catherine!');         // returns true

Table 4-12 shows the modifiers from Perl that are supported in Perl-compatible regular expressions.

Table 4-12. Perl flags

Modifier

Meaning

/regexp/i

Match case-insensitively.

/regexp/s

Make period (.) match any character, including newline (\n).

/regexp/x

Remove whitespace and comments from the pattern.

/regexp/m

Make caret (^) match after, and dollar sign ($) match before, internal newlines (\n).

/regexp/e

If the replacement string is PHP code, eval( ) it to get the actual replacement string.

PHP's Perl-compatible regular expression functions also support other modifiers that aren't supported by Perl, as listed in Table 4-13.

Table 4-13. Additional PHP flags

Modifier

Meaning

/regexp/U

Reverses the greediness of the subpattern; * and + now match as little as possible, instead of as much as possible

/regexp/u

Causes pattern strings to be treated as UTF-8

/regexp/X

Causes a backslash followed by a character with no special meaning to emit an error

/regexp/A

Causes the beginning of the string to be anchored as if the first character of the pattern were ^

/regexp/D

Causes the $ character to match only at the end of a line

/regexp/S

Causes the expression parser to more carefully examine the structure of the pattern, so it may run slightly faster the next time (such as in a loop)

It's possible to use more than one option in a single pattern, as demonstrated in the following example:

$message = <<< END
To: you@youcorp
From: me@mecorp
Subject: pay up

Pay me or else!
END;
preg_match('/^subject: (.*)/im', $message, $match);
// $match[1] is 'pay up'

4.10.10. Lookahead and Lookbehind

It's sometimes useful in patterns to be able to say "match here if this is next." This is particularly common when you are splitting a string. The regular expression describes the separator, which is not returned. You can use lookahead to make sure (without matching it, thus preventing it from being returned) that there's more data after the separator. Similarly, lookbehind checks the preceding text.

Lookahead and lookbehind come in two forms: positive and negative. A positive lookahead or lookbehind says "the next/preceding text must be like this." A negative lookahead or lookbehind says "the next/preceding text must not be like this." Table 4-14 shows the four constructs you can use in Perl-compatible patterns. None of the constructs captures text.

Table 4-14. Lookahead and lookbehind assertions

Construct

Meaning

(?=subpattern)

Positive lookahead

(?!subpattern)

Negative lookahead

(?<=subpattern)

Positive lookbehind

(?<!subpattern)

Negative lookbehind

A simple use of positive lookahead is splitting a Unix mbox mail file into individual messages. The word "From" starting a line by itself indicates the start of a new message, so you can split the mailbox into messages by specifying the separator as the point where the next text is "From" at the start of a line:

$messages = preg_split('/(?=^From )/m', $mailbox);

A simple use of negative lookbehind is to extract quoted strings that contain quoted delimiters. For instance, here's how to extract a single-quoted string (note that the regular expression is commented using the x modifier):

$input = <<< END
name = 'Tim O\'Reilly';
END;

$pattern = <<< END
'                   # opening quote
(                   # begin capturing
  .*?               #   the string
  (?<! \\\\ )       #   skip escaped quotes
)                   # end capturing
'                   # closing quote
END;
preg_match( "($pattern)x", $input, $match);
echo $match[1];
Tim O\'Reilly

The only tricky part is that, to get a pattern that looks behind to see if the last character was a backslash, we need to escape the backslash to prevent the regular expression engine from seeing "\)", which would mean a literal close parenthesis. In other words, we have to backslash that backslash: "\\)". But PHP's string-quoting rules say that \\ produces a literal single backslash, so we end up requiring four backslashes to get one through the regular expression! This is why regular expressions have a reputation for being hard to read.

Perl limits lookbehind to constant-width expressions. That is, the expressions cannot contain quantifiers, and if you use alternation, all the choices must be the same length. The Perl-compatible regular expression engine also forbids quantifiers in lookbehind, but does permit alternatives of different lengths.

4.10.13. Functions

There are five classes of functions that work with Perl-compatible regular expressions: matching, replacing, splitting, filtering, and a utility function for quoting text.

4.10.13.1. Matching

The preg_match( ) function performs Perl-style pattern matching on a string. It's the equivalent of the m// operator in Perl. The preg_match( ) function takes the same arguments and gives the same return value as the ereg( ) function, except that it takes a Perl-style pattern instead of a standard pattern:

$found = preg_match(pattern, string [, captured ]);

For example:

preg_match('/y.*e$/', 'Sylvie');        // returns true
preg_match('/y(.*)e$/', Sylvie', $m);   // $m is array('Sylvie', 'lvi')

While there's an eregi( ) function to match case-insensitively, there's no preg_matchi( ) function. Instead, use the i flag on the pattern:

preg_match('y.*e$/i', 'SyLvIe');        // returns true

The preg_match_all( ) function repeatedly matches from where the last match ended, until no more matches can be made:

$found = preg_match_all(pattern, string, matches [, order ]);

The order value, either PREG_PATTERN_ORDER or PREG_SET_ORDER, determines the layout of matches. We'll look at both, using this code as a guide:

$string = <<< END
13 dogs
12 rabbits
8 cows
1 goat
END;
preg_match_all('/(\d+) (\S+)/', $string, $m1, PREG_PATTERN_ORDER);
preg_match_all('/(\d+) (\S+)/', $string, $m2, PREG_SET_ORDER);

With PREG_PATTERN_ORDER (the default), each element of the array corresponds to a particular capturing subpattern. So $m1[0] is an array of all the substrings that matched the pattern, $m1[1] is an array of all the substrings that matched the first subpattern (the numbers), and $m1[2] is an array of all the substrings that matched the second subpattern (the words). The array $m1 has one more elements than subpatterns.

With PREG_SET_ORDER, each element of the array corresponds to the next attempt to match the whole pattern. So $m2[0] is an array of the first set of matches ('13 dogs', '13', 'dogs'), $m2[1] is an array of the second set of matches ('12 rabbits', '12', 'rabbits'), and so on. The array $m2 has as many elements as there were successful matches of the entire pattern.

Example 4-2 fetches the HTML at a particular web address into a string and extracts the URLs from that HTML. For each URL, it generates a link back to the program that will display the URLs at that address.

4.10.13.2. Replacing

The preg_replace( ) function behaves like the search and replace operation in your text editor. It finds all occurrences of a pattern in a string and changes those occurrences to something else:

$new = preg_replace(pattern, replacement, subject [, limit ]);

The most common usage has all the argument strings, except for the integer limit. The limit is the maximum number of occurrences of the pattern to replace (the default, and the behavior when a limit of -1 is passed, is all occurrences).

$better = preg_replace('/<.*?>/', '!', 'do <b>not</b> press the button');
// $better is 'do !not! press the button'

Pass an array of strings as subject to make the substitution on all of them. The new strings are returned from preg_replace( ):

$names = array('Fred Flintstone',
               'Barney Rubble',
               'Wilma Flintstone',
               'Betty Rubble');
$tidy  = preg_replace('/(\w)\w* (\w+)/', '\1 \2', $names);
// $tidy is array ('F Flintstone', 'B Rubble', 'W Flintstone', 'B Rubble')

To perform multiple substitutions on the same string or array of strings with one call to preg_replace( ), pass arrays of patterns and replacements:

$contractions = array("/don't/i", "/won't/i", "/can't/i");
$expansions = array('do not', 'will not', 'can not');
$string = "Please don't yell--I can't jump while you won't speak";
$longer = preg_replace($contractions, $expansions, $string);
// $longer is 'Please do not yell--I can not jump while you will not speak';

If you give fewer replacements than patterns, text matching the extra patterns is deleted. This is a handy way to delete a lot of things at once:

$html_gunk = array('/<.*?>/', '/&.*?;/');
$html = '&eacute; : <b>very</b> cute';
$stripped  = preg_replace($html_gunk, array( ), $html);
// $stripped is ' : very cute'

If you give an array of patterns but a single string replacement, the same replacement is used for every pattern:

$stripped = preg_replace($html_gunk, '', $html);

The replacement can use backreferences. Unlike backreferences in patterns, though, the preferred syntax for backreferences in replacements is $1, $2, $3, etc. For example:

echo preg_replace('/(\w)\w+\s+(\w+)/', '$2, $1.', 'Fred Flintstone')
Flintstone, F.

The /e modifier makes preg_replace( ) treat the replacement string as PHP code that returns the actual string to use in the replacement. For example, this converts every Celsius temperature to Fahrenheit:

$string  = 'It was 5C outside, 20C inside';
echo preg_replace('/(\d+)C\b/e', '$1*9/5+32', $string);
It was 41 outside, 68 inside

This more complex example expands variables in a string:

$name = 'Fred';
$age  = 35;
$string = '$name is $age';
preg_replace('/\$(\w+)/e', '$$1', $string);

Each match isolates the name of a variable ($name, $age). The $1 in the replacement refers to those names, so the PHP code actually executed is $name and $age. That code evaluates to the value of the variable, which is what's used as the replacement. Whew!

4.10.13.3. Splitting

Whereas you use preg_match_all( ) to extract chunks of a string when you know what those chunks are, use preg_split( ) to extract chunks when you know what separates the chunks from each other:

$chunks = preg_split(pattern, string [, limit [, flags ]]);

The pattern matches a separator between two chunks. By default, the separators are not returned. The optional limit specifies the maximum number of chunks to return (-1 is the default, which means all chunks). The flags argument is a bitwise OR combination of the flags PREG_SPLIT_NO_EMPTY (empty chunks are not returned) and PREG_SPLIT_DELIM_CAPTURE (parts of the string captured in the pattern are returned).

For example, to extract just the operands from a simple numeric expression, use:

$ops = preg_split('{[+*/-]}', '3+5*9/2');
// $ops is array('3', '5', '9', '2')

To extract the operands and the operators, use:

$ops = preg_split('{([+*/-])}', '3+5*9/2', -1, PREG_SPLIT_DELIM_CAPTURE);
// $ops is array('3', '+', '5', '*', '9', '/', '2')

An empty pattern matches at every boundary between characters in the string. This lets you split a string into an array of characters:

$array = preg_split('//', $string);

A variation on preg_replace( ) is preg_replace_callback( ). This calls a function to get the replacement string. The function is passed an array of matches (the zeroth element is all the text that matched the pattern, the first is the contents of the first captured subpattern, and so on). For example:

function titlecase ($s) {
  return ucfirst(strtolower($s[0]));
}

$string = 'goodbye cruel world';
$new = preg_replace_callback('/\w+/', 'titlecase', $string);
echo $new;
Goodbye Cruel World

4.10.13.4. Filtering an array with a regular expression

The preg_grep( ) function returns those elements of an array that match a given pattern:

$matching = preg_grep(pattern, array);

For instance, to get only the filenames that end in .txt, use:

$textfiles = preg_grep('/\.txt$/', $filenames);


Library Navigation Links

Copyright © 2003 O'Reilly & Associates. All rights reserved.