home | O'Reilly's CD bookshelfs | FreeBSD | Linux | Cisco | Cisco Exam  


Programming PHPProgramming PHPSearch this book

4.8. Regular Expressions

If you need more complex searching functionality than the previous methods provide, you can use regular expressions. A regular expression is a string that represents a pattern. The regular expression functions compare that pattern to another string and see if any of the string matches the pattern. Some functions tell you whether there was a match, while others make changes to the string.

PHP provides support for two different types of regular expressions: POSIX and Perl-compatible. POSIX regular expressions are less powerful, and sometimes slower, than the Perl-compatible functions, but can be easier to read. There are three uses for regular expressions: matching, which can also be used to extract information from a string; substituting new text for matching text; and splitting a string into an array of smaller chunks. PHP has functions for all three behaviors for both Perl and POSIX regular expressions. For instance, ereg( ) does a POSIX match, while preg_match( ) does a Perl match. Fortunately, there are a number of similarities between basic POSIX and Perl regular expressions, so we'll cover those before delving into the details of each library.

4.8.1. The Basics

Most characters in a regular expression are literal characters, meaning that they match only themselves. For instance, if you search for the regular expression "cow" in the string "Dave was a cowhand", you get a match because "cow" occurs in that string.

Some characters, though, have special meanings in regular expressions. For instance, a caret (^) at the beginning of a regular expression indicates that it must match the beginning of the string (or, more precisely, anchors the regular expression to the beginning of the string):

ereg('^cow', 'Dave was a cowhand');     // returns false
ereg('^cow', 'cowabunga!');             // returns true

Similarly, a dollar sign ($) at the end of a regular expression means that it must match the end of the string (i.e., anchors the regular expression to the end of the string):

ereg('cow$', 'Dave was a cowhand');     // returns false
ereg('cow$', "Don't have a cow");       // returns true

A period (.) in a regular expression matches any single character:

ereg('c.t', 'cat');                     // returns true
ereg('c.t', 'cut');                     // returns true
ereg('c.t', 'c t');                     // returns true
ereg('c.t', 'bat');                     // returns false
ereg('c.t', 'ct');                      // returns false

If you want to match one of these special characters (called a metacharacter), you have to escape it with a backslash:

ereg('\$5\.00', 'Your bill is $5.00 exactly');     // returns true
ereg('$5.00', 'Your bill is $5.00 exactly');       // returns false

Regular expressions are case-sensitive by default, so the regular expression "cow" doesn't match the string "COW". If you want to perform a case-insensitive POSIX-style match, you can use the eregi( ) function. With Perl-style regular expressions, you still use preg_match( ), but specify a flag to indicate a case-insensitive match (as you'll see when we discuss Perl-style regular expressions in detail later in this chapter).

So far, we haven't done anything we couldn't have done with the string functions we've already seen, like strstr( ). The real power of regular expressions comes from their ability to specify abstract patterns that can match many different character sequences. You can specify three basic types of abstract patterns in a regular expression:

  • A set of acceptable characters that can appear in the string (e.g., alphabetic characters, numeric characters, specific punctuation characters)

  • A set of alternatives for the string (e.g., "com", "edu", "net", or "org")

  • A repeating sequence in the string (e.g., at least one but no more than five numeric characters)

These three kinds of patterns can be combined in countless ways, to create regular expressions that match such things as valid phone numbers and URLs.

4.8.2. Character Classes

To specify a set of acceptable characters in your pattern, you can either build a character class yourself or use a predefined one. You can build your own character class by enclosing the acceptable characters in square brackets:

ereg('c[aeiou]t', 'I cut my hand');     // returns true
ereg('c[aeiou]t', 'This crusty cat');   // returns true
ereg('c[aeiou]t', 'What cart?');        // returns false
ereg('c[aeiou]t', '14ct gold');         // returns false

The regular expression engine finds a "c", then checks that the next character is one of "a", "e", "i", "o", or "u". If it isn't a vowel, the match fails and the engine goes back to looking for another "c". If a vowel is found, though, the engine then checks that the next character is a "t". If it is, the engine is at the end of the match and so returns true. If the next character isn't a "t", the engine goes back to looking for another "c".

You can negate a character class with a caret (^) at the start:

ereg('c[^aeiou]t', 'I cut my hand');    // returns false
ereg('c[^aeiou]t', 'Reboot chthon');    // returns true
ereg('c[^aeiou]t', '14ct gold');        // returns false

In this case, the regular expression engine is looking for a "c", followed by a character that isn't a vowel, followed by a "t".

You can define a range of characters with a hyphen (-). This simplifies character classes like "all letters" and "all digits":

ereg('[0-9]%', 'we are 25% complete');            // returns true
ereg('[0123456789]%', 'we are 25% complete');     // returns true
ereg('[a-z]t', '11th');                           // returns false
ereg('[a-z]t', 'cat');                            // returns true
ereg('[a-z]t', 'PIT');                            // returns false
ereg('[a-zA-Z]!', '11!');                         // returns false
ereg('[a-zA-Z]!', 'stop!');                       // returns true

When you are specifying a character class, some special characters lose their meaning, while others take on new meaning. In particular, the $ anchor and the period lose their meaning in a character class, while the ^ character is no longer an anchor but negates the character class if it is the first character after the open bracket. For instance, [^\]] matches any character that is not a closing bracket, while [$.^] matches any dollar sign, period, or caret.

The various regular expression libraries define shortcuts for character classes, including digits, alphabetic characters, and whitespace. The actual syntax for these shortcuts differs between POSIX-style and Perl-style regular expressions. For instance, with POSIX, the whitespace character class is "[[:space:]]", while with Perl it is "\s".



Library Navigation Links

Copyright © 2003 O'Reilly & Associates. All rights reserved.