Regular Expressions (Programming PHP)

4.8. Regular Expressions

If you need more complex searching functionality than the previous methods provide, you can use regular expressions. A regular expression is a string that represents a pattern. The regular expression functions compare that pattern to another string and see whether any part of the string matches the pattern. Some functions tell you whether there was a match, while others make changes to the string.

PHP provides support for two different types of regular expressions: POSIX and Perl-compatible. POSIX regular expressions are less powerful, and sometimes slower, than Perl-compatible regular expressions, but can be easier to read. Note that the POSIX functions (ereg( ) and its relatives) were deprecated in PHP 5.3 and removed in PHP 7, so code for current PHP versions must use the Perl-compatible functions. There are three uses for regular expressions: matching, which can also be used to extract information from a string; substituting new text for matching text; and splitting a string into an array of smaller chunks. PHP has functions for all three behaviors for both Perl and POSIX regular expressions. For instance, ereg( ) does a POSIX match, while preg_match( ) does a Perl match. Fortunately, there are a number of similarities between basic POSIX and Perl regular expressions, so we'll cover those before delving into the details of each library.

4.8.1. The Basics

Most characters in a regular expression are literal characters, meaning that they match only themselves. For instance, if you search for the regular expression "cow" in the string "Dave was a cowhand", you get a match because "cow" occurs in that string.

Some characters, though, have special meanings in regular expressions. For instance, a caret (^) at the beginning of a regular expression indicates that it must match the beginning of the string (or, more precisely, anchors the regular expression to the beginning of the string):

ereg('^cow', 'Dave was a cowhand');     // returns false
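ereg('^cow', 'cowabunga!');             // returns true

Similarly, a dollar sign ($) at the end of a regular expression indicates that it must match the end of the string (that is, it anchors the regular expression to the end of the string):
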
ereg('cow$', 'Dave was a cowhand');     // returns false
ereg('cow$', "Don't have a cow");       // returns true
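
The same anchors carry over to Perl-compatible patterns. The following is a minimal sketch that assumes preg_match( ) with slash delimiters; the helper names startsWithCow and endsWithCow are hypothetical and exist only for this illustration:

```php
<?php
// Hypothetical helpers for illustration only.
// preg_match() returns 1 when the pattern matches and 0 when it does not.
function startsWithCow(string $s): bool {
    return preg_match('/^cow/', $s) === 1;   // ^ anchors to the start
}

function endsWithCow(string $s): bool {
    return preg_match('/cow$/', $s) === 1;   // $ anchors to the end
}

$a = startsWithCow('Dave was a cowhand');    // false
$b = startsWithCow('cowabunga!');            // true
$c = endsWithCow("Don't have a cow");        // true
```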

A period (.) in a regular expression matches any single character:

ereg('c.t', 'cat');                     // returns true
ereg('c.t', 'cut');                     // returns true
ereg('c.t', 'c t');                     // returns true
ereg('c.t', 'bat');                     // returns false
ereg('c.t', 'ct');                      // returns false

If you want to match one of these special characters (called a metacharacter), you have to escape it with a backslash:

ereg('\$5\.00', 'Your bill is $5.00 exactly');     // returns true
ereg('$5.00', 'Your bill is $5.00 exactly');       // returns false
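
When the text to search for comes from a variable, escaping each metacharacter by hand is error-prone. One possible approach, sketched here with the Perl-compatible functions, is to let preg_quote( ) add the backslashes. The variable names are assumptions made for this example:

```php
<?php
// preg_quote() backslash-escapes regex metacharacters in a literal string.
// The second argument also escapes the delimiter used around the pattern.
$literal = '$5.00';
$pattern = '/' . preg_quote($literal, '/') . '/';   // becomes /\$5\.00/

$escaped   = preg_match($pattern, 'Your bill is $5.00 exactly');   // 1
// Unescaped, the $ is an end-of-string anchor, so nothing can follow it.
$unescaped = preg_match('/$5.00/', 'Your bill is $5.00 exactly');  // 0
```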

Regular expressions are case-sensitive by default, so the regular expression "cow" doesn't match the string "COW". If you want to perform a case-insensitive POSIX-style match, you can use the eregi( ) function. With Perl-style regular expressions, you still use preg_match( ), but specify a flag to indicate a case-insensitive match (as you'll see when we discuss Perl-style regular expressions in detail later in this chapter).
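
As a brief preview, and assuming the Perl-compatible functions, the case-insensitive flag is the letter i placed after the closing delimiter:

```php
<?php
// Without a flag the match is case-sensitive.
$sensitive   = preg_match('/cow/', 'COW');    // 0
// The trailing i flag makes the whole pattern case-insensitive.
$insensitive = preg_match('/cow/i', 'COW');   // 1
```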

So far, we haven't done anything we couldn't have done with the string functions we've already seen, like strstr( ). The real power of regular expressions comes from their ability to specify abstract patterns that can match many different character sequences. You can specify three basic types of abstract patterns in a regular expression:

A set of acceptable characters that can appear in the string (e.g., alphabetic characters, numeric characters, specific punctuation characters)
A set of alternatives for the string (e.g., "com", "edu", "net", or "org")
A repeating sequence in the string (e.g., at least one but no more than five numeric characters)

These three kinds of patterns can be combined in countless ways, to create regular expressions that match such things as valid phone numbers and URLs.

4.8.2. Character Classes

To specify a set of acceptable characters in your pattern, you can either build a character class yourself or use a predefined one. You can build your own character class by enclosing the acceptable characters in square brackets:

ereg('c[aeiou]t', 'I cut my hand');     // returns true
ereg('c[aeiou]t', 'This crusty cat');   // returns true
ereg('c[aeiou]t', 'What cart?');        // returns false
ereg('c[aeiou]t', '14ct gold');         // returns false

The regular expression engine finds a "c", then checks that the next character is one of "a", "e", "i", "o", or "u". If it isn't a vowel, the match fails and the engine goes back to looking for another "c". If a vowel is found, though, the engine then checks that the next character is a "t". If it is, the engine is at the end of the match and so returns true. If the next character isn't a "t", the engine goes back to looking for another "c".
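
The steps just described can be sketched as an ordinary loop. This is not how a real regular expression engine is implemented; it is a hand-rolled illustration of the same search for the one pattern c[aeiou]t, and the function name matchesCVowelT is hypothetical:

```php
<?php
// Hypothetical hand-written equivalent of the pattern c[aeiou]t.
function matchesCVowelT(string $subject): bool {
    $length = strlen($subject);
    // Stop while there is still room for three characters.
    for ($i = 0; $i + 2 < $length; $i++) {
        if ($subject[$i] !== 'c') {
            continue;                 // keep looking for a "c"
        }
        if (strpos('aeiou', $subject[$i + 1]) === false) {
            continue;                 // not a vowel: look for another "c"
        }
        if ($subject[$i + 2] === 't') {
            return true;              // "c", vowel, "t": the match is complete
        }
        // Not a "t": fall through and look for another "c".
    }
    return false;
}

$cut  = matchesCVowelT('I cut my hand');   // true
$cart = matchesCVowelT('What cart?');      // false
```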

You can negate a character class with a caret (^) at the start:

ereg('c[^aeiou]t', 'I cut my hand');    // returns false
ereg('c[^aeiou]t', 'Reboot chthon');    // returns true
ereg('c[^aeiou]t', '14ct gold');        // returns false

In this case, the regular expression engine is looking for a "c", followed by a character that isn't a vowel, followed by a "t".

You can define a range of characters with a hyphen (-). This simplifies character classes like "all letters" and "all digits":

ereg('[0-9]%', 'we are 25% complete');            // returns true
ereg('[0123456789]%', 'we are 25% complete');     // returns true
ereg('[a-z]t', '11th');                           // returns false
ereg('[a-z]t', 'cat');                            // returns true
ereg('[a-z]t', 'PIT');                            // returns false
ereg('[a-zA-Z]!', '11!');                         // returns false
ereg('[a-zA-Z]!', 'stop!');                       // returns true

When you are specifying a character class, some special characters lose their meaning, while others take on new meaning. In particular, the $ anchor and the period lose their meaning in a character class, while the ^ character is no longer an anchor but negates the character class if it is the first character after the open bracket. For instance, [$.^] matches any dollar sign, period, or caret, while the Perl-style class [^\]] matches any character that is not a closing bracket. (POSIX bracket expressions treat a backslash as a literal character, so the POSIX equivalent places the closing bracket first: [^]].)
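
A short sketch of these rules in Perl-compatible syntax, assuming preg_match( ); the sample strings are invented for the illustration:

```php
<?php
// Inside a class, $ and . are literal, and ^ is literal unless it comes first.
$dollar = preg_match('/[$.^]/', 'cost: $');    // 1, matches the dollar sign
$caret  = preg_match('/[$.^]/', 'x^2');        // 1, matches the caret
$none   = preg_match('/[$.^]/', 'plain');      // 0, none of the three present

// A leading ^ negates the class; \] is an escaped closing bracket.
$onlyBracket = preg_match('/[^\]]/', ']');     // 0, the only character is "]"
$other       = preg_match('/[^\]]/', 'a]');    // 1, "a" is not a closing bracket
```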

The various regular expression libraries define shortcuts for character classes, including digits, alphabetic characters, and whitespace. The actual syntax for these shortcuts differs between POSIX-style and Perl-style regular expressions. For instance, with POSIX, the whitespace character class is "[[:space:]]", while with Perl it is "\s".
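
Both spellings of the whitespace class can be tried with preg_match( ), because the Perl-compatible library also accepts the POSIX bracketed names inside a character class. A minimal sketch with an invented sample string:

```php
<?php
$text = "tab\there";   // double quotes, so \t is a real tab character

$posixStyle = preg_match('/[[:space:]]/', $text);   // 1
$perlStyle  = preg_match('/\s/', $text);            // 1
$noSpace    = preg_match('/\s/', 'nospace');        // 0
```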

4.8.3. Alternatives

You can use the vertical pipe (|) character to specify alternatives in a regular expression:

ereg('cat|dog', 'the cat rubbed my legs');        // returns true
ereg('cat|dog', 'the dog rubbed my legs');        // returns true
ereg('cat|dog', 'the rabbit rubbed my legs');     // returns false

The precedence of alternation can be a surprise: '^cat|dog$' selects from '^cat' and 'dog$', meaning that it matches a line that either starts with "cat" or ends with "dog". If you want a line that contains just "cat" or "dog", you need to use the regular expression '^(cat|dog)$'.
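
The difference is easy to see side by side. This sketch assumes preg_match( ), and the test words are chosen only for the illustration:

```php
<?php
$loose  = '/^cat|dog$/';      // "starts with cat" OR "ends with dog"
$strict = '/^(cat|dog)$/';    // exactly "cat" or exactly "dog"

$looseCatalog  = preg_match($loose, 'catalog');    // 1, starts with "cat"
$looseHotdog   = preg_match($loose, 'hotdog');     // 1, ends with "dog"
$strictCatalog = preg_match($strict, 'catalog');   // 0
$strictDog     = preg_match($strict, 'dog');       // 1
```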

You can combine character classes and alternation to, for example, check for strings that don't start with a capital letter:

ereg('^([a-z]|[0-9])', 'The quick brown fox');  // returns false
ereg('^([a-z]|[0-9])', 'jumped over');           // returns true
ereg('^([a-z]|[0-9])', '10 lazy dogs');          // returns true
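
When every alternative is a single character class, the alternation can usually be folded into one class. The sketch below, which assumes preg_match( ), runs both forms over the three strings above and collects the results for comparison:

```php
<?php
$alternation = '/^([a-z]|[0-9])/';
$singleClass = '/^[a-z0-9]/';       // same set of first characters, one class

$samples = ['The quick brown fox', 'jumped over', '10 lazy dogs'];
$results = [];
foreach ($samples as $sample) {
    // Store the result of each pattern so the two can be compared.
    $results[$sample] = [
        preg_match($alternation, $sample),
        preg_match($singleClass, $sample),
    ];
}
```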

4.8.4. Repeating Sequences

To specify a repeating pattern, you use something called a quantifier. The quantifier goes after the pattern that's repeated and says how many times to repeat that pattern. Table 4-6 shows the quantifiers that are supported by both POSIX and Perl regular expressions.

Table 4-6. Regular expression quantifiers

Quantifier	Meaning
?	0 or 1
*	0 or more
+	1 or more
{n}	Exactly n times
{n,m}	At least n, no more than m times
{n,}	At least n times

To repeat a single character, simply put the quantifier after the character:

ereg('ca+t', 'caaaaaaat');                        // returns true
ereg('ca+t', 'ct');                               // returns false
ereg('ca?t', 'caaaaaaat');                        // returns false
ereg('ca*t', 'ct');                               // returns true
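
The brace forms work the same way. As a sketch using preg_match( ), here is the "at least one but no more than five numeric characters" pattern mentioned earlier, anchored at both ends so that the count applies to the whole string:

```php
<?php
$oneToFiveDigits = '/^[0-9]{1,5}$/';

$five  = preg_match($oneToFiveDigits, '12345');    // 1
$six   = preg_match($oneToFiveDigits, '123456');   // 0, one digit too many
$empty = preg_match($oneToFiveDigits, '');         // 0, at least one is required

// ? makes the preceding character optional.
$color = preg_match('/^colou?r$/', 'color');       // 1
```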

With quantifiers and character classes, we can actually do something useful, like matching valid U.S. telephone numbers:

ereg('[0-9]{3}-[0-9]{3}-[0-9]{4}', '303-555-1212');      // returns true
ereg('[0-9]{3}-[0-9]{3}-[0-9]{4}', '64-9-555-1234');     // returns false
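
Because the pattern above is not anchored, it also succeeds when a phone number is merely embedded in a longer run of text or digits. If the whole string should be a phone number and nothing else, one option is to add the ^ and $ anchors. A sketch assuming preg_match( ), with an invented test string:

```php
<?php
$loose  = '/[0-9]{3}-[0-9]{3}-[0-9]{4}/';
$strict = '/^[0-9]{3}-[0-9]{3}-[0-9]{4}$/';

$embedded = 'ext 9303-555-12125';   // contains 303-555-1212 inside extra digits

$looseHit  = preg_match($loose, $embedded);          // 1
$strictHit = preg_match($strict, $embedded);         // 0
$strictOk  = preg_match($strict, '303-555-1212');    // 1
```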

4.8.5. Subpatterns

You can use parentheses to group bits of a regular expression together to be treated as a single unit called a subpattern:

ereg('a (very )+big dog', 'it was a very very big dog'); // returns true
ereg('^(cat|dog)$', 'cat');                              // returns true
ereg('^(cat|dog)$', 'dog');                              // returns true

The parentheses also cause the substring that matches the subpattern to be captured. If you pass an array as the third argument to a match function, the array is populated with any captured substrings:

ereg('([0-9]+)', 'You have 42 magic beans', $captured);
// returns true and populates $captured

The zeroth element of the array is set to the portion of the string that matched the entire pattern ("42" in this example), not to the whole string being searched. The first element is the substring that matched the first subpattern (if there is one), the second element is the substring that matched the second subpattern, and so on.
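
The Perl-compatible preg_match( ) fills its third argument in the same way. The pattern below is extended with a second subpattern, an assumption made purely to show how the elements are numbered:

```php
<?php
$matched = preg_match(
    '/([0-9]+) magic (beans|peas)/',
    'You have 42 magic beans',
    $captured
);

// $captured[0] is the text matched by the whole pattern: '42 magic beans'
// $captured[1] is the first subpattern:                  '42'
// $captured[2] is the second subpattern:                 'beans'
```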