Email Addresses (CGI Programming with Perl)

9.2.1. Validating Syntax

A common question that new CGI developers ask is what the regular expression for matching email addresses looks like. If you ask around, some people will refer you to a book called Mastering Regular Expressions by Jeffrey Friedl (O'Reilly & Associates, Inc.). Others might give you a simple expression that checks for "@" and that checks that the domain name ends in a dot and two or three letters. In fact, neither of these answers is fully accurate.

To understand why, let's review a little history. The standard document for defining email address names is RFC 822. It was published in 1982. Does that seem like a long time ago to you? It should. The Internet was radically different then. In fact, it wasn't called the Internet then -- it was a collection of many different networks, including ARPAnet, Bitnet, and CSNET, each with its own naming conventions. TCP/IP was being introduced as a new networking protocol, and hosts numbered only in the hundreds. It wasn't until 1983 that serious work began on implementing domain name servers. The hierarchical names that we recognize today, like www.oreilly.com, did not exist back then.

So that is half of the story. The other half of the story is that Jeffrey Friedl, in his book Mastering Regular Expressions, tackled creating a regular expression to handle the parsing of RFC 822 email addresses. The book is the best reference for understanding regular expressions in Perl or any other context. Many people cite the regular expression he constructs as the only definitive test of whether an Internet email address is valid. But unfortunately these people have misunderstood what it does; it tests for compliance with RFC 822. According to RFC 822, these are all syntactically valid email addresses:

Alfred Neuman <Neuman@BBN-TENEXA>
":sysmail"@  Some-Group. Some-Org
Muhammed.(I am  the greatest) Ali @(the)Vegas.WBA

Do any of them look like the type of email address you'd want to capture in an HTML form? It is true that RFC 822 has not been superseded by another RFC and is still a standard, but it is equally true that the problem we are trying to solve is radically different in time and context from the problem that it solved in 1982.

We want an expression to recognize a syntactically valid email address as required on the Internet today. We are interested only in today's standard Internet domain-naming convention. That would actually rule out all of the above addresses, since none of them ends in one of our current top-level domains (.com, .net, .edu, .uk, etc.). There are other important distinctions as well.

The first example is a full email address including a name and what RFC 822 refers to as the address specification in angle brackets. You may have seen this expanded syntax in your email software. We do not need, and probably don't want, this additional information in an email address captured in a form. In all likelihood, the user's name is being captured separately in other fields. When we need to validate an email address that a user has entered, we are generally only interested in the address specification itself. So henceforth when we refer to an email address, we are simply referring to this address specification, the user@hostname part.

The second example contains a quoted element (we will refer to any group of characters delimited by a "." or an "@" as an element [18]). Quoted elements are completely acceptable and still work fine on today's Internet. If you want to accept valid email addresses, you should accept quoted elements. Only elements on the left side of the "@" may be quoted, but any ASCII character is allowed within quotes (some have to be escaped with a backslash). This is why any check in our code for "invalid characters" in an email address would be flawed, and this is why it is very dangerous to pass email addresses through a shell as an argument to a command.

[18]RFC 822 more technically refers to this as an "atom."
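To make the shell warning above concrete, here is a minimal sketch. It assumes a Unix-like system and a Perl recent enough to support the list form of a pipe open. The sample address is hypothetical, and the child process is perl itself ($^X), standing in for whatever mail program you would really call. The point is only that the list form bypasses the shell, so the address arrives as a single, unmodified argument.

```perl
use strict;
use warnings;

# Hypothetical input: the backquote is not one of RFC 822's special
# characters, so this passes a syntax check, yet a shell would treat
# the backquoted part as a command to run.
my $addr = q{`id`@example.com};

# Dangerous (deliberately left commented out): the single-string form is
# handed to /bin/sh, which would perform the command substitution.
#   system("some_mailer $addr");

# Safer: the list form of a pipe open starts the child directly, with no
# shell in between. The child just echoes its first argument back to us.
open( my $child, '-|', $^X, '-e', 'print $ARGV[0]', $addr )
    or die "Cannot start child process: $!";
my $seen = do { local $/; <$child> };
close $child;

print "Child received the address unchanged\n" if $seen eq $addr;
```

The same reasoning applies to system and exec: passing the command and its arguments as a list avoids the shell, whereas interpolating the address into one command string does not.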

The second email address also includes spaces. Spaces (and tabs) are legal between any two elements and at the beginning and end of the email address. However, removing them does not change the meaning, and that is exactly what mailers generally do when you send a message to an email address containing spaces. Note, however, that you cannot simply remove every space in an email address, since spaces appearing within quotes do carry meaning and must be left intact. Only those appearing outside of quotes can be removed. We will strip them in our example. We probably don't have to; it is not unreasonable to expect your users to enter the email address without extra spaces.
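The quote-aware stripping just described can be sketched in isolation with a single substitution. The helper name strip_unquoted_space is hypothetical. The pattern either captures a whole quoted string or a run of characters containing no space, tab, or quote, and then discards any spaces and tabs that follow it.

```perl
use strict;
use warnings;

# Hypothetical helper: remove spaces and tabs that lie outside quotes.
sub strip_unquoted_space {
    my $addr = shift;

    # $1 is either a complete quoted string (backslash escapes allowed)
    # or a run of characters with no space, tab, or quote. Whitespace
    # after it is dropped; whitespace inside quotes is part of $1.
    $addr =~ s/("(?:[^"\\]|\\.)*"|[^\t "]*)[ \t]*/$1/g;
    return $addr;
}

print strip_unquoted_space( q{":sysmail"@  Some-Group. Some-Org} ), "\n";
# prints ":sysmail"@Some-Group.Some-Org

print strip_unquoted_space( q{"John Doe" @ example.com } ), "\n";
# prints "John Doe"@example.com
```

Note that the space inside "John Doe" survives, while the spaces around the "@" and at the end of the string are removed.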

The last example contains comments. It is perfectly legal to include comments, which are enclosed within parentheses, anywhere that spaces are allowed. Comments are only intended to pass additional information to humans, and machines can ignore them. Thus, it is rather silly to enter them into an automated web form. We will simplify our code by not accepting comments in the email addresses we are checking.

So here is the code that we will use to validate email addresses. It is considerably shorter than the example given by Mr. Friedl, but it is not nearly so flexible. It does not support comments, it removes spaces before validating, and it limits hosts to modern domain names and IP addresses. Nonetheless, it is quite complicated, and the regular expression to perform the check would be too difficult to type out. Instead, we build it through a number of intermediate variables. The process of doing this is too involved to explain here. If you want to understand how to build complex regular expressions like this, we highly recommend Mastering Regular Expressions.

One note, however: the variable $top_level contains the expression that matches valid top-level domains. Our current top-level domains have two letters (e.g., .us, .uk, .au) or three letters (e.g., .com, .org, .net). The number of top-level domains will certainly increase. Some of the proposed new names, such as .firm, have more than three characters. Thus, the regular expression below will allow anywhere from two to four characters:

my $top_level   = qq{ (?: $atom_char ){2,4} };

If you want to be more restrictive today, you can limit it to three. Likewise, if top-level domains with more than four characters are someday allowed, you would need to increase it.

Finally, here's the code:

sub validate_email_address {
    my $addr_to_check = shift;
    return "" unless defined $addr_to_check;

    # Strip spaces and tabs that are not inside a quoted string
    $addr_to_check =~ s/("(?:[^"\\]|\\.)*"|[^\t "]*)[ \t]*/$1/g;
    
    # Building blocks for character classes; the escapes are kept as
    # text here and interpreted later by the regex engine
    my $esc         = '\\\\';
    my $space       = '\040';
    my $ctrl        = '\000-\037';
    my $dot         = '\.';
    my $nonASCII    = '\x80-\xff';
    my $CRlist      = '\012\015';
    my $letter      = 'a-zA-Z';
    my $digit       = '\d';
    
    # An unquoted element, and one byte (0-255) of a numeric IP address
    my $atom_char   = qq{ [^$space<>\@,;:".\\[\\]$esc$ctrl$nonASCII] };
    my $atom        = qq{ $atom_char+ };
    my $byte        = qq{ (?: 1?$digit?$digit | 
                              2[0-4]$digit    | 
                              25[0-5]         ) };
    
    # A quoted element: anything but a bare quote, backslash, or CR/LF,
    # plus backslash-escaped characters
    my $qtext       = qq{ [^$esc$nonASCII$CRlist"] };
    my $quoted_pair = qq{ $esc [^$nonASCII] };
    my $quoted_str  = qq{ " (?: $qtext | $quoted_pair )* " };
    
    # Assemble the two sides of the address
    my $word        = qq{ (?: $atom | $quoted_str ) };
    my $ip_address  = qq{ \\[ $byte (?: $dot $byte ){3} \\] };
    my $sub_domain  = qq{ [$letter$digit]
                          [$letter$digit-]{0,61} [$letter$digit]};
    my $top_level   = qq{ (?: $atom_char ){2,4} };
    my $domain_name = qq{ (?: $sub_domain $dot )+ $top_level };
    my $domain      = qq{ (?: $domain_name | $ip_address ) };
    my $local_part  = qq{ $word (?: $dot $word )* };
    my $address     = qq{ $local_part \@ $domain };
    
    # Return the stripped address on success, or an empty string
    return $addr_to_check =~ /^$address$/ox ? $addr_to_check : "";
}

If you supply an email address to validate_email_address, it will strip out any spaces or tabs that are not within quotes. We're being a little lenient here since spaces within elements (as opposed to spaces around elements) are actually illegal, but we'll just strip them in this step along with the legal spaces. We then check the address against our regular expression. If it matches, the email address is valid and is returned (without spaces). Otherwise, an empty string is returned, which evaluates to false in Perl. You can use the subroutine like so:

use strict;
use CGI;
use CGIBook::Error;

my $q     = CGI->new;
my $email = validate_email_address( scalar $q->param( "email" ) );

unless ( $email ) {
    error( $q, "The email address you entered is invalid. " .
               "Please use your browser's Back button to " .
               "return to the form and try again." );
}
.
.

If you were planning to check multiple email addresses or intended to use this in an environment where your Perl code is precompiled (like mod_perl or FastCGI), then you could optimize this code by building the regular expression once and caching this expression. However, this example is intended more to demonstrate why validating email addresses is a challenge than to be used in production (it does not resolve the issue that an email address can be syntactically valid and still not work).
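One way to do that caching is sketched below. It is an illustration under stated assumptions rather than a production routine: the helper address_regex and the file-scoped variable $address_re are hypothetical names, and the pattern is compiled with qr// in place of the /o flag used above. The pieces of the expression are the same ones shown earlier.

```perl
use strict;
use warnings;

my $address_re;    # compiled pattern, cached after the first call

# Hypothetical helper: build the pattern once, then return the cached copy.
sub address_regex {
    return $address_re if $address_re;

    my $esc         = '\\\\';
    my $space       = '\040';
    my $ctrl        = '\000-\037';
    my $dot         = '\.';
    my $nonASCII    = '\x80-\xff';
    my $CRlist      = '\012\015';
    my $letter      = 'a-zA-Z';
    my $digit       = '\d';

    my $atom_char   = qq{ [^$space<>\@,;:".\\[\\]$esc$ctrl$nonASCII] };
    my $atom        = qq{ $atom_char+ };
    my $byte        = qq{ (?: 1?$digit?$digit |
                              2[0-4]$digit    |
                              25[0-5]         ) };

    my $qtext       = qq{ [^$esc$nonASCII$CRlist"] };
    my $quoted_pair = qq{ $esc [^$nonASCII] };
    my $quoted_str  = qq{ " (?: $qtext | $quoted_pair )* " };

    my $word        = qq{ (?: $atom | $quoted_str ) };
    my $ip_address  = qq{ \\[ $byte (?: $dot $byte ){3} \\] };
    my $sub_domain  = qq{ [$letter$digit]
                          [$letter$digit-]{0,61} [$letter$digit]};
    my $top_level   = qq{ (?: $atom_char ){2,4} };
    my $domain_name = qq{ (?: $sub_domain $dot )+ $top_level };
    my $domain      = qq{ (?: $domain_name | $ip_address ) };
    my $local_part  = qq{ $word (?: $dot $word )* };
    my $address     = qq{ $local_part \@ $domain };

    # qr// compiles the pattern once; later calls reuse the stored object
    return $address_re = qr/^$address$/x;
}

sub validate_email_address {
    my $addr_to_check = shift;
    return "" unless defined $addr_to_check;

    # Strip spaces and tabs that are not inside a quoted string
    $addr_to_check =~ s/("(?:[^"\\]|\\.)*"|[^\t "]*)[ \t]*/$1/g;

    return $addr_to_check =~ address_regex() ? $addr_to_check : "";
}

# Check several addresses against the single compiled pattern
for my $candidate ( 'user@example.com', ' user @ example.com ',
                    'Neuman@BBN-TENEXA' ) {
    my $result = validate_email_address( $candidate );
    print "[$candidate] => ", ( $result ne "" ? $result : "(rejected)" ), "\n";
}
```

The first two candidates come back as user@example.com, and the third is rejected because its host has no top-level domain. In a persistent environment the file-scoped variable would normally survive between requests, so the pattern would be built only once per process, though that depends on how your code is loaded.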
