Programming PHP

4.9. POSIX-Style Regular Expressions

Now that you understand the basics of regular expressions, we can explore the details. POSIX-style regular expressions use the Unix locale system. The locale system provides functions for sorting and identifying characters that let you intelligently work with text from languages other than English. In particular, what constitutes a "letter" varies from language to language (think of à and ç), and there are character classes in POSIX regular expressions that take this into account.

However, POSIX regular expressions are designed for use with only textual data. If your data has a NUL-byte (\x00) in it, the regular expression functions will interpret it as the end of the string, and matching will not take place beyond that point. To do matches against arbitrary binary data, you'll need to use Perl-compatible regular expressions, which are discussed later in this chapter. Also, as we already mentioned, the Perl-style regular expression functions are often faster than the equivalent POSIX-style ones.
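The ereg family was removed in PHP 7.0, so the truncation described above cannot be reproduced on a current interpreter. As a minimal sketch of the binary-safe alternative, assuming the Perl-compatible preg_match() function is available, the following shows a match that continues past an embedded NUL byte. The data and variable names are illustrative only.

```php
<?php
// Binary data with an embedded NUL byte between "head" and "tail".
$data = "head\x00tail";

// preg_match() is binary-safe: the subject is not cut off at the NUL byte,
// so a pattern anchored at the end of the string can still reach "tail".
$matchedPastNul = preg_match('/tail$/', $data);

// The NUL byte itself can be matched with the \x00 escape inside the pattern.
$matchedNul = preg_match('/head\x00tail/', $data);

var_dump($matchedPastNul === 1, $matchedNul === 1);
```

Both calls return 1 here because the whole subject, including the bytes after the NUL, takes part in the match.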

4.9.1. Character Classes

As shown in Table 4-7, POSIX defines a number of named sets of characters that you can use in character classes. The expansions given in Table 4-7 are for English. The actual letters vary from locale to locale.

Table 4-7. POSIX character classes

Class       Description                                                              Expansion
[:alnum:]   Alphanumeric characters                                                  [0-9a-zA-Z]
[:alpha:]   Alphabetic characters (letters)                                          [a-zA-Z]
[:ascii:]   7-bit ASCII                                                              [\x01-\x7F]
[:blank:]   Horizontal whitespace (space, tab)                                       [ \t]
[:cntrl:]   Control characters                                                       [\x01-\x1F]
[:digit:]   Digits                                                                   [0-9]
[:graph:]   Characters that use ink to print (non-space, non-control)                [^\x01-\x20]
[:lower:]   Lowercase letter                                                         [a-z]
[:print:]   Printable character (graph class plus space and tab)                     [\t\x20-\xFF]
[:punct:]   Any punctuation character, such as the period (.) and the semicolon (;)  [-!"#$%&'()*+,./:;<=>?@[\\]^_`{|}~]
[:space:]   Whitespace (newline, carriage return, tab, space, vertical tab)          [\n\r\t \x0B]
[:upper:]   Uppercase letter                                                         [A-Z]
[:xdigit:]  Hexadecimal digit                                                        [0-9a-fA-F]

Each [:something:] class can be used in place of a character in a character class. For instance, to find any character that's a digit, an uppercase letter, or an at sign (@), use the following regular expression:

[@[:digit:][:upper:]]
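To try this class on a current PHP version, where ereg() is no longer available, one option is preg_match(), since PCRE accepts the same [:name:] classes inside a bracket expression. The following is a small sketch under that assumption. The sample strings are made up for illustration.

```php
<?php
// One character that is an at sign, a digit, or an uppercase letter.
$pattern = '/[@[:digit:][:upper:]]/';

// Illustrative inputs; the last one contains none of the three kinds.
$samples = ['user@example', 'room 7', 'Paris', 'lowercase only'];

$results = [];
foreach ($samples as $sample) {
    // preg_match() returns 1 when the pattern matches somewhere in the subject.
    $results[$sample] = preg_match($pattern, $sample) === 1;
    printf("%-16s %s\n", $sample, $results[$sample] ? 'match' : 'no match');
}
```

The first three strings match on the @, the 7, and the P respectively. The last string has only lowercase letters and a space, so it does not match.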

However, you can't use a character class as the endpoint of a range:

ereg('[A-[:lower:]]', 'string');        // invalid regular expression

Some locales consider certain character sequences as if they were a single character—these are called collating sequences. To match one of these multicharacter sequences in a character class, enclose it with [. and .]. For example, if your locale has the collating sequence ch, you can match s, t, or ch with this character class:

[st[.ch.]]

The final POSIX extension to character classes is the equivalence class, specified by enclosing the character in [= and =]. Equivalence classes match characters that have the same collating order, as defined in the current locale. For example, a locale may define a, á, and ä as having the same sorting precedence. To match any one of them, the equivalence class is [=a=].

4.9.3. Functions

There are three categories of functions for POSIX-style regular expressions: matching, replacing, and splitting.

4.9.3.1. Matching

The ereg() function takes a pattern, a string, and an optional array. It populates the array, if given, and returns true or false depending on whether a match for the pattern was found in the string:

$found = ereg(pattern, string [, captured ]);

For example:

ereg('y.*e$', 'Sylvie');       // returns true
ereg('y(.*)e$', 'Sylvie', $a); // returns true, $a is array('ylvie', 'lvi')

The zeroth element of the array is set to the portion of the string that matched the entire pattern (here ylvie, because the match begins at the lowercase y, not at the capital S). The first element is the substring that matched the first subpattern, the second element is the substring that matched the second subpattern, and so on.

The eregi() function is a case-insensitive form of ereg(). Its arguments and return values are the same as those for ereg().
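Because ereg() and eregi() were removed in PHP 7.0, the capture behavior is easier to experiment with through preg_match(), which fills its third argument in the same order. The sketch below assumes that substitution. The input string and variable names are hypothetical.

```php
<?php
// Hypothetical input: a sentence containing a year-month stamp.
$text = 'Released 2003-04 by the publisher';

// Two parenthesized subpatterns capture the year and the month.
$captured = [];
if (preg_match('/([0-9]{4})-([0-9]{2})/', $text, $captured)) {
    // $captured[0] is the text matched by the whole pattern;
    // $captured[1] and $captured[2] hold the first and second subpatterns.
    print_r($captured);
}

// The i modifier gives case-insensitive matching, the role eregi() played.
$caseInsensitive = preg_match('/y.*e$/i', 'SYLVIE');
```

Here $captured[0] holds only the matched portion, 2003-04, not the whole input string.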

Example 4-1 uses pattern matching to determine whether a credit-card number passes the Luhn checksum and whether the digits are appropriate for a card of a specific type.

Example 4-1. Credit-card validator

// The Luhn checksum determines whether a credit-card number is syntactically
// correct; it cannot, however, tell if a card with the number has been issued,
// is currently active, or has enough available credit to accept a charge.

function IsValidCreditCard($inCardNumber, $inCardType) {
  // Assume it's okay
  $isValid = true;

  // Strip all non-numbers from the string
  $inCardNumber = ereg_replace('[^[:digit:]]','', $inCardNumber); 

  // Make sure the card number and type match
  switch($inCardType) { 
    case 'mastercard':
      $isValid = ereg('^5[1-5].{14}$', $inCardNumber); 
      break; 

    case 'visa':
      $isValid = ereg('^4.{15}$|^4.{12}$', $inCardNumber); 
      break; 

    case 'amex':
      $isValid = ereg('^3[47].{13}$', $inCardNumber); 
      break; 

    case 'discover':
      $isValid = ereg('^6011.{12}$', $inCardNumber); 
      break; 

    case 'diners':
      $isValid = ereg('^30[0-5].{11}$|^3[68].{12}$', $inCardNumber); 
      break; 

    case 'jcb':
      // Group the two prefixes so that both anchors apply to each of them
      $isValid = ereg('^3.{15}$|^(2131|1800).{11}$', $inCardNumber);
      break; 
  }

  // It passed the rudimentary test; let's check it against the Luhn this time
  if($isValid) {
    // Work in reverse
    $inCardNumber = strrev($inCardNumber);

    // Total the digits, doubling those at odd indexes of the reversed string
    // (that is, every second digit counting from the right)
    $theTotal = 0;
    for ($i = 0; $i < strlen($inCardNumber); $i++) {
      $theAdder = (int) $inCardNumber[$i];

      // Double the digit; if the result has two digits, add them together
      if($i % 2) {
        $theAdder <<= 1;
        if($theAdder > 9) { $theAdder -= 9; }
      }

      $theTotal += $theAdder;
    }

    // Valid cards will divide evenly by 10
    $isValid = (($theTotal % 10) == 0);
  }

  return $isValid;
}
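The checksum step of Example 4-1 can also be tried on its own. The following is a minimal sketch, not the book's listing. It assumes PHP 7 or later and uses preg_replace() in place of ereg_replace(). luhnIsValid is a hypothetical helper name. The two card numbers are widely used test values, not real accounts.

```php
<?php
// Returns true when the digits in $number satisfy the Luhn checksum.
// luhnIsValid is a hypothetical helper name, not part of Example 4-1.
function luhnIsValid(string $number): bool
{
    // Keep only the digits, as Example 4-1 does with ereg_replace().
    $digits = preg_replace('/[^[:digit:]]/', '', $number);
    if ($digits === '') {
        return false;
    }

    $total = 0;
    $reversed = strrev($digits);
    for ($i = 0; $i < strlen($reversed); $i++) {
        $value = (int) $reversed[$i];
        // Double every second digit, counting from the right.
        if ($i % 2 === 1) {
            $value *= 2;
            // A two-digit result contributes the sum of its digits.
            if ($value > 9) {
                $value -= 9;
            }
        }
        $total += $value;
    }

    // Valid numbers produce a total that divides evenly by 10.
    return $total % 10 === 0;
}

var_dump(luhnIsValid('4111 1111 1111 1111'));
var_dump(luhnIsValid('4111 1111 1111 1112'));
```

The first call yields true, because the digit total is 30. Changing the last digit to 2 raises the total to 31, so the second call yields false. As with the original example, passing the checksum says nothing about whether the card has been issued.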



Copyright © 2003 O'Reilly & Associates. All rights reserved.