
13.6. Matching a Valid Email Address

13.6.3. Discussion

The pattern in the Solution accepts any email address whose name is a sequence of characters other than @ or whitespace. After the @, you need at least one domain name consisting of the letters a-z, the numbers 0-9, and the hyphen; you can precede it with as many subdomains as you want, separated by periods. Finally, you end with either a two-letter country code or another top-level domain, such as .com or .edu.

The Solution pattern is handy because it still works if ICANN adds new top-level domains. However, it does let a few false positives through. This stricter pattern explicitly enumerates the current noncountry top-level domains:

/
    ^               # anchor at the beginning
    [^@\s]+         # name is all characters except @ and whitespace
    @               # the @ divides name and domain
    (
        [-a-z0-9]+  # (sub)domains are letters, numbers, and hyphens
        \.          # separated by a period
    )+              # and we can have one or more of them
    (
        [a-z]{2}    # TLDs can be a two-letter alphabetical country code
        |com|net    # or one of 
        |edu|org    # many 
        |gov|mil    # possible
        |int|biz    # three-letter
        |pro        # combinations
        |info|arpa  # or even
        |aero|coop  # a few 
        |name       # four-letter ones
        |museum     # plus one that's six letters long!
    )
    $               # anchor at the end
/ix                 # and everything is case-insensitive
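As a rough sketch of how the stricter pattern might be applied, the function below wraps it in preg_match(). The name is_plausible_email() is hypothetical and not part of the recipe; the pattern is the one shown above with shortened comments.

```php
<?php
// Hypothetical helper (not from the recipe): applies the stricter pattern.
function is_plausible_email(string $address): bool
{
    $pattern = '/
        ^               # anchor at the beginning
        [^@\s]+         # name: anything except @ and whitespace
        @               # the @ divides name and domain
        (
            [-a-z0-9]+  # (sub)domains: letters, numbers, and hyphens
            \.          # separated by a period
        )+
        (
            [a-z]{2}    # two-letter country code
            |com|net|edu|org|gov|mil|int|biz|pro
            |info|arpa|aero|coop|name|museum
        )
        $               # anchor at the end
    /ix';

    return preg_match($pattern, $address) === 1;
}

var_dump(is_plausible_email('tim@oreilly.com'));  // bool(true)
var_dump(is_plausible_email('tim@oreilly'));      // bool(false): no top-level domain
var_dump(is_plausible_email('not telling'));      // bool(false): no @
```

Because the list of top-level domains is fixed, this version rejects any address in a top-level domain created after the list was written, which is the trade-off against the more liberal Solution pattern.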

Both patterns are intentionally liberal in what they accept, because we assume you're only trying to make sure someone doesn't accidentally leave off their top-level domain or type in something fake such as "not telling." For instance, there's no domain "-.com", but "foo@-.com" flies through without a blip. (It wouldn't be hard to modify the pattern to correct this, but that's left as an exercise for you.) On the other hand, it is legal to have an address of "Tim O'Reilly@oreilly.com", and our pattern won't accept this. However, spaces in email addresses are rare; because a space almost always represents a mistake, we flag that address as bad.
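One possible answer to that exercise, offered only as a sketch: require every (sub)domain to begin and end with a letter or digit, so hyphens can appear only in the middle. The function name is hypothetical, and the rest of the pattern is copied from the stricter pattern above.

```php
<?php
// Hypothetical variant (one possible answer to the exercise, not the authors' own):
// each (sub)domain must start and end with a letter or digit.
function is_plausible_email_no_edge_hyphens(string $address): bool
{
    $pattern = '/
        ^
        [^@\s]+                 # name, unchanged
        @
        (
            [a-z0-9]            # label starts with a letter or digit
            (?:
                [-a-z0-9]*      # hyphens are allowed only in the middle
                [a-z0-9]        # label ends with a letter or digit
            )?
            \.
        )+
        (
            [a-z]{2}
            |com|net|edu|org|gov|mil|int|biz|pro
            |info|arpa|aero|coop|name|museum
        )
        $
    /ix';

    return preg_match($pattern, $address) === 1;
}

var_dump(is_plausible_email_no_edge_hyphens('foo@-.com'));            // bool(false)
var_dump(is_plausible_email_no_edge_hyphens('foo@my-host.example.com')); // bool(true)
```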

The canonical definition of a valid address is documented in RFC 822 (since superseded by RFC 2822 and RFC 5322); however, writing code to handle all cases isn't a pretty task. Here's one example of what you need to consider: people are allowed to embed comments inside addresses! Comments are set inside parentheses, so it's valid to write:

Tim (is the man @ computer books) @ oreilly.com

That's equivalent to "tim@oreilly.com". (So, again, the pattern fails on that address.)
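To illustrate why the two forms are equivalent, here is a deliberately simplified sketch that removes parenthesized comments and whitespace before any pattern is applied. The function name is hypothetical, and the sketch assumes the address contains no quoted strings or backslash escapes, both of which a real RFC 822 parser has to handle.

```php
<?php
// Hypothetical helper: strips RFC 822 style comments and whitespace.
// Simplification (assumption): quoted strings and backslash escapes are ignored.
function strip_address_comments(string $address): string
{
    // Remove innermost parenthesized comments repeatedly so nesting is handled.
    do {
        $address = preg_replace('/\([^()]*\)/', '', $address, -1, $count);
    } while ($count > 0);

    // Drop the whitespace left around the @ and elsewhere.
    return preg_replace('/\s+/', '', $address);
}

echo strip_address_comments('Tim (is the man @ computer books) @ oreilly.com'), "\n";
// prints Tim@oreilly.com
```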

Alternatively, the IMAP extension has an RFC 822-compliant address parser. This parser correctly navigates through whitespace, comments, and other oddities, but it allows obvious mistakes because it assumes that addresses without hostnames are local:

$email = 'stephen(his account)@ example(his host)';
$parsed = imap_rfc822_parse_adrlist($email,'');
print_r($parsed);

Array
(
    [0] => stdClass Object
        (
            [mailbox] => stephen
            [host] => example
            [personal] => his host
        )

)

Reassembling the mailbox and host, you get "stephen@example", which probably isn't what you want. The empty string you must pass in as the second argument defeats your ability to check for valid hostnames.
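One way to compensate, sketched here under assumptions, is to inspect the parsed result afterward and reject any entry whose host has no period in it. The function name is hypothetical; the stand-in array mimics the print_r() output above so the example runs without the IMAP extension installed.

```php
<?php
// Hypothetical post-check: accept the list only if every host contains a dot.
// $parsed is assumed to be the array returned by imap_rfc822_parse_adrlist().
function has_qualified_host(array $parsed): bool
{
    foreach ($parsed as $entry) {
        $host = $entry->host ?? '';
        if (strpos($host, '.') === false) {
            return false;   // e.g. "example" with no top-level domain
        }
    }
    return count($parsed) > 0;  // an empty list is not a valid address either
}

// Stand-in for the parser output shown above.
$parsed = [(object) ['mailbox' => 'stephen', 'host' => 'example', 'personal' => 'his host']];
var_dump(has_qualified_host($parsed));  // bool(false)
```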

Some people like behind-the-scenes processing, such as DNS lookups, to check whether the address is valid. This doesn't make much sense because that technique won't always work, and you may end up rejecting perfectly valid users from your site through no fault of their own. (Also, it's unlikely a mail administrator would fix his mail handling just to work around one web site's email validation scheme.)

Another consideration when validating email addresses is that it doesn't take too much work for a user to enter a completely legal and working address that isn't his. For instance, one of the authors used to have a bad habit of entering "billg@microsoft.com" when signing up for Microsoft's web sites because "Hey! Maybe Bill doesn't know about that new version of Internet Explorer?"

If the primary concern is to avoid typos, make people enter their address twice, and compare the two. If they match, it's probably correct. Also, filter out popular bogus addresses, such as "president@whitehouse.gov" and the previously mentioned "billg@microsoft.com". (This does have the downside of not letting The President of the United States of America or Bill Gates sign up for your site.)
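A minimal sketch of those two checks follows. The function name, the return values, and the idea that the two entries arrive as separate strings (for example, from two form fields) are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: compares the two entries, then filters known bogus addresses.
function check_entered_address(string $first, string $second): string
{
    // Illustrative blocklist taken from the examples in the text.
    $bogus = ['president@whitehouse.gov', 'billg@microsoft.com'];

    $first  = trim($first);
    $second = trim($second);

    if ($first === '' || $first !== $second) {
        return 'mismatch';   // probably a typo in one of the two fields
    }
    if (in_array(strtolower($first), $bogus, true)) {
        return 'bogus';      // well-known placeholder address
    }
    return 'ok';
}

echo check_entered_address('tim@oreilly.com', 'tim@oreilly.com'), "\n";  // ok
echo check_entered_address('tim@oreilly.com', 'tim@oreily.com'), "\n";   // mismatch
```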

However, if you need to ensure people actually have access to the email address they provide, one technique is to send a message to their address and require them to either reply to the message or go to a page on your site and type in a special code printed in the body of the message to confirm their sign-up. If you do choose the special code route, we suggest that you don't generate a random string of letters, such as HSD5nbADl8. Since it looks like garbage, it's hard to retype it correctly. Instead, use a word list and create code words such as television4coatrack. While, on occasion, it's possible to divine hidden meanings in these combos, you can cut the error rate and your support costs.
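Generating such a code word could look roughly like the sketch below. The function name and the five-word list are hypothetical stand-ins; random_int() requires PHP 7 or later, and the word list is assumed to be a plain zero-indexed array.

```php
<?php
// Hypothetical generator: joins two random words with a random digit.
// Assumes $words is a non-empty, zero-indexed array of lowercase words.
function make_confirmation_code(array $words): string
{
    $first  = $words[random_int(0, count($words) - 1)];
    $second = $words[random_int(0, count($words) - 1)];
    $digit  = random_int(0, 9);

    return $first . $digit . $second;
}

// A tiny stand-in list; a real one would hold far more words.
$words = ['television', 'coatrack', 'window', 'pencil', 'garden'];
echo make_confirmation_code($words), "\n";  // a code shaped like television4coatrack
```

With only a handful of words the number of possible codes is small enough to guess, so a much longer list is advisable if the code is the only thing protecting the confirmation step.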

13.6.4. See Also

Recipe 8.6 for information about generating good passwords; Recipe 8.27 for a web site account deactivation program; documentation on imap_rfc822_parse_adrlist( ) at http://www.php.net/imap-rfc822-parse-adrlist.




Copyright © 2003 O'Reilly & Associates. All rights reserved.