Chapter 10.
Pattern Matching with Regular Expressions
A regular expression is an object that describes a pattern of
characters. The JavaScript RegExp class represents regular
expressions, and both String and RegExp define methods that use
regular expressions to perform powerful pattern-matching and
search-and-replace functions on text.[36]
JavaScript regular expressions were standardized in ECMAScript v3.
JavaScript 1.2 implements a subset of the regular expression features
required by ECMAScript v3, and JavaScript 1.5 implements the full
standard. JavaScript regular expressions are strongly based on the
regular expression facilities of the Perl programming language.
Roughly speaking, we can say that JavaScript 1.2 implements Perl 4
regular expressions, and JavaScript 1.5 implements a large subset of
Perl 5 regular expressions.
This chapter begins by defining the syntax that regular expressions
use to describe textual patterns. Then it moves on to describe the
String and RegExp methods that use regular expressions.
10.1. Defining Regular Expressions
In JavaScript, regular expressions are represented by RegExp objects.
RegExp objects may be created with the RegExp( ) constructor, of
course, but they are more often created using a special literal
syntax. Just as string literals are specified as characters within
quotation marks, regular expression literals are specified as
characters within a pair of slash (/) characters. Thus, your
JavaScript code may contain lines like this:
var pattern = /s$/;
This line creates a new RegExp object and assigns it to the variable
pattern. This particular RegExp object matches any
string that ends with the letter "s". (We'll talk
about the grammar for defining patterns shortly.) This regular
expression could have equivalently been defined with the
RegExp( ) constructor like this:
var pattern = new RegExp("s$");
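As a quick check of this equivalence, the following sketch builds the pattern both ways and compares the results. It relies on the source property and the test( ) method of RegExp objects (test( ) is one of the RegExp methods for performing matches); the variable names are only illustrative.

```javascript
// Build the same pattern with a literal and with the RegExp() constructor.
const literalPattern = /s$/;
const constructedPattern = new RegExp("s$");

// The source property holds the text of the pattern, without the slashes.
console.log(literalPattern.source === constructedPattern.source);  // true

// Both objects match strings that end with the letter "s".
console.log(literalPattern.test("expressions"));      // true
console.log(constructedPattern.test("expressions"));  // true
console.log(constructedPattern.test("expression"));   // false
```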
Creating a RegExp object, either literally or with the
RegExp( ) constructor, is the easy part. The more
difficult task is describing the desired pattern of characters using
regular expression syntax. JavaScript adopts a fairly complete subset
of the regular expression syntax used by Perl, so if you are an
experienced Perl programmer, you already know how to describe
patterns in JavaScript.
Regular expression pattern specifications consist of a series of
characters. Most characters,
including all alphanumeric characters, simply describe characters to
be matched literally. Thus, the regular expression
/java/ matches any string that contains the
substring "java". Other characters in regular expressions
are not matched literally, but have special significance. For
example, the regular expression /s$/ contains two
characters. The first, "s", matches itself literally. The
second, "$", is a special
metacharacter that matches the end of a string. Thus, this regular
expression matches any string that contains the letter
"s" as its last character.
The following sections describe the various characters and
metacharacters used in JavaScript regular expressions. Note, however,
that a complete tutorial on regular expression grammar is beyond the
scope of this book. For complete details of the syntax, consult a
book on Perl, such as Programming Perl, by Larry
Wall, Tom Christiansen, and Jon Orwant (O'Reilly).
Mastering Regular Expressions, by Jeffrey E.F.
Friedl (O'Reilly), is another excellent source of information
on regular expressions.
10.1.1. Literal Characters
As we've seen, all alphabetic
characters and digits match themselves literally in regular
expressions. JavaScript regular expression syntax also supports
certain nonalphabetic characters through escape sequences that begin
with a backslash (\). For
example, the sequence \n matches a literal newline
character in a string. Table 10-1 lists these
characters.
Table 10-1. Regular expression literal characters
Character | Matches
Alphanumeric character | Itself
\0 | The NUL character (\u0000)
\t | Tab (\u0009)
\n | Newline (\u000A)
\v | Vertical tab (\u000B)
\f | Form feed (\u000C)
\r | Carriage return (\u000D)
\xnn | The Latin character specified by the hexadecimal number nn; for example, \x0A is the same as \n
\uxxxx | The Unicode character specified by the hexadecimal number xxxx; for example, \u0009 is the same as \t
\cX | The control character ^X; for example, \cJ is equivalent to the newline character \n
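The equivalences listed in Table 10-1 can be checked directly. This is a minimal sketch that uses the test( ) method of RegExp objects; the variable names are illustrative.

```javascript
// Several escape sequences can describe the same single character.
const newline = "\n";
const tab = "\t";

console.log(/\x0A/.test(newline));    // true: \x0A is the same as \n
console.log(/\u000A/.test(newline));  // true
console.log(/\cJ/.test(newline));     // true: control-J is the newline character
console.log(/\u0009/.test(tab));      // true: \u0009 is the same as \t
console.log(/\0/.test("\u0000"));     // true: \0 matches the NUL character
console.log(/\t/.test(newline));      // false: a tab is not a newline
```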
A number of punctuation characters have special meanings in regular
expressions. They are:
^ $ . * + ? = ! : | \ / ( ) [ ] { }
We'll learn the meanings of these characters in the sections
that follow. Some of these characters have special meaning only
within certain contexts of a regular expression and are treated
literally in other contexts. As a general rule, however, if you want
to include any of these punctuation characters literally in a regular
expression, you must precede them with a \. Other
punctuation characters, such as quotation marks and
@, do not have special meaning and simply match
themselves literally in a regular expression.
If you can't remember exactly which punctuation characters need
to be escaped with a backslash, you may safely place a backslash
before any punctuation character. On the other hand, note that many
letters and numbers have special meaning when preceded by a
backslash, so any letters or numbers that you want to match literally
should not be escaped with a backslash. To include a backslash
character literally in a regular expression, you must escape it with
a backslash, of course. For example, the following regular expression
matches any string that includes a backslash:
/\\/.
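The effect of escaping can be seen in a short sketch. The sample strings and variable names below are made up for illustration, and the test( ) method of RegExp objects is used to try each pattern.

```javascript
// "C:\\temp" is the 7-character string C:\temp (the string literal needs its own escape).
const path = "C:\\temp";
console.log(/\\/.test(path));       // true: the string contains a backslash
console.log(/\\/.test("C:/temp"));  // false

// Escaped, + is a literal plus sign; unescaped, it is a repetition character.
console.log(/1\+1/.test("1+1"));    // true
console.log(/1+1/.test("1+1"));     // false: this pattern means "one or more 1s, then a 1"
console.log(/1+1/.test("11"));      // true

// $ and . are escaped so that they match themselves.
const price = /\$5\.00/;
console.log(price.test("$5.00"));   // true
console.log(price.test("$5x00"));   // false: the escaped dot matches only a literal dot
```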
10.1.2. Character Classes
Individual literal characters can be combined into character classes
by placing them within square brackets. A character class matches any
one character that is contained within
it. Thus, the regular expression /[abc]/ matches
any one of the letters a, b, or c. Negated character classes can also
be defined -- these match any character except those contained
within the brackets. A negated character class is specified by placing a caret
(^) as the first character inside the left
bracket. The regexp /[^abc]/ matches any one
character other than a, b, or c. Character classes can use a
hyphen to indicate a range of
characters. To match any one lowercase character from the Latin
alphabet, use /[a-z]/, and to match any letter or
digit from the Latin alphabet, use /[a-zA-Z0-9]/.
Because certain character classes are commonly used, the JavaScript
regular expression syntax includes special characters and escape
sequences to represent these common classes. For example,
\s matches the space character, the tab character, and any other
Unicode whitespace character, and \S matches any character that is
not Unicode whitespace. Table 10-2 lists these characters and summarizes
character class syntax. (Note that several of these character class
escape sequences match only ASCII characters and have not been
extended to work with Unicode characters. You can explicitly define
your own Unicode character classes; for example,
/[\u0400-\u04FF]/ matches any one Cyrillic
character.)
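A few of these character classes are tried out below. This is a sketch only: the sample strings are invented, the Cyrillic range is written with a \u escape on both ends of the hyphen, and test( ) is a RegExp method for performing matches.

```javascript
console.log(/[abc]/.test("xbz"));     // true: the string contains a "b"
console.log(/[abc]/.test("xyz"));     // false

console.log(/[^abc]/.test("abc"));    // false: every character is a, b, or c
console.log(/[^abc]/.test("abcd"));   // true: "d" is outside the class

console.log(/[a-z]/.test("ABC"));           // false: no lowercase letter
console.log(/[a-zA-Z0-9]/.test("--7--"));   // true: "7" is a digit

// "\u041F" is a Cyrillic capital letter, inside the range U+0400 to U+04FF.
console.log(/[\u0400-\u04FF]/.test("\u041F"));  // true
console.log(/[\u0400-\u04FF]/.test("P"));       // false: Latin letter

console.log(/\s/.test("a\tb"));    // true: a tab is whitespace
console.log(/\S/.test(" \t\n"));   // false: the string is all whitespace
```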
Table 10-2. Regular expression character classes
Character | Matches
[...] | Any one character between the brackets.
[^...] | Any one character not between the brackets.
. | Any character except newline or another Unicode line terminator.
\w | Any ASCII word character. Equivalent to [a-zA-Z0-9_].
\W | Any character that is not an ASCII word character. Equivalent to [^a-zA-Z0-9_].
\s | Any Unicode whitespace character.
\S | Any character that is not Unicode whitespace. Note that \w and \S are not the same thing.
\d | Any ASCII digit. Equivalent to [0-9].
\D | Any character other than an ASCII digit. Equivalent to [^0-9].
[\b] | A literal backspace (special case).
Note that the special character class escapes can be used within
square brackets. \s matches any whitespace
character and \d matches any digit, so
/[\s\d]/ matches any one whitespace character or
digit. Note that there is one special case. As we'll see later,
the \b escape has a special meaning.
When used within a character class,
however, it represents the backspace character. Thus, to represent a
backspace character literally in a regular expression, use the
character class with one element: /[\b]/.
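The following sketch illustrates both points: class escapes used inside square brackets, and the special meaning of \b within a character class. The sample strings are invented for illustration.

```javascript
// \s and \d combined inside one character class.
const spaceOrDigit = /[\s\d]/;
console.log(spaceOrDigit.test("x7"));    // true: contains a digit
console.log(spaceOrDigit.test("a b"));   // true: contains a space
console.log(spaceOrDigit.test("xyz"));   // false

// Inside brackets, \b is the backspace character (U+0008).
// In a JavaScript string literal, "\b" is also a backspace.
const backspace = /[\b]/;
console.log(backspace.test("abc\bdef")); // true
console.log(backspace.test("abc def"));  // false

// Outside brackets, \b is a word boundary, so it does not match a backspace.
console.log(/\b/.test("\b"));            // false: the string has no word characters
```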
10.1.3. Repetition
With the regular expression syntax we have learned so far, we can describe
a two-digit number as /\d\d/ and a four-digit
number as /\d\d\d\d/. But we don't have any
way to describe, for example, a number that can have any number of
digits or a string of three letters followed by an optional digit.
These more complex patterns use regular expression syntax that
specifies how many times an element of a regular expression may be
repeated.
The characters that specify repetition always follow the pattern to
which they are being applied. Because certain types of repetition are
quite commonly used, there are special characters to represent these
cases. For example, + matches one or more
occurrences of the previous pattern. Table 10-3
summarizes the repetition syntax. The following lines show some
examples:
/\d{2,4}/ // Match between two and four digits
/\w{3}\d?/ // Match exactly three word characters and an optional digit
/\s+java\s+/ // Match "java" with one or more spaces before and after
/[^"]*/ // Match zero or more non-quote characters
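To see what these four patterns actually match, the sketch below applies them to a few invented strings with the match( ) method of String (which returns an array whose first element is the matched text) and the test( ) method of RegExp.

```javascript
// Between two and four digits: repetition takes as many as it is allowed.
console.log("order 12345".match(/\d{2,4}/)[0]);   // 1234

// Exactly three word characters and an optional digit.
console.log("abc7".match(/\w{3}\d?/)[0]);         // abc7
console.log("abcd".match(/\w{3}\d?/)[0]);         // abc

// "java" with one or more whitespace characters before and after.
console.log(/\s+java\s+/.test("I love java today"));  // true
console.log(/\s+java\s+/.test("java"));               // false: no surrounding whitespace

// Zero or more non-quote characters, starting at the beginning of the string.
console.log('abc"def'.match(/[^"]*/)[0]);         // abc
```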
Table 10-3. Regular expression repetition characters
Be careful when using the * and
? repetition characters. Since these characters
may match zero instances of whatever precedes them, they are allowed
to match nothing. For example, the regular expression
/a*/ actually matches the string
"bbbb", because the string contains zero occurrences of
the letter a!
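This empty match is easy to observe. The sketch below uses the exec( ) method of RegExp, which returns an array holding the matched text along with an index property giving the position of the match.

```javascript
const emptyMatch = /a*/.exec("bbbb");

// The pattern succeeds, but the text it matched is the empty string at position 0.
console.log(emptyMatch !== null);    // true
console.log(emptyMatch[0].length);   // 0
console.log(emptyMatch.index);       // 0

// With + (one or more) instead of * (zero or more), there is no match at all.
console.log(/a+/.test("bbbb"));      // false
```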
10.1.3.1. Non-greedy repetition
The repetition characters listed in Table 10-3 match as many times
as possible while still allowing any following parts of the regular
expression to match. We say that the repetition is
"greedy." It is also possible (in JavaScript 1.5 and
later -- this is one of the Perl 5 features not implemented in
JavaScript 1.2) to specify that repetition should be done in a
non-greedy way. Simply follow the repetition character or characters
with a question mark: ??, +?,
*?, or even {1,5}?. For
example, the regular expression /a+/ matches one
or more occurrences of the letter a. When applied to the string
"aaa", it matches all three letters. But
/a+?/ matches one or more occurrences of the
letter a, matching as few characters as necessary. When applied to
the same string, this pattern matches only the first letter a.
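The difference between greedy and non-greedy repetition shows up in the text that each pattern matches. The strings below are invented for illustration, and the last pair of patterns is an additional example that does not come from the text above.

```javascript
const text = "aaa";
console.log(text.match(/a+/)[0]);    // aaa  (greedy: as many as possible)
console.log(text.match(/a+?/)[0]);   // a    (non-greedy: as few as necessary)

console.log("aaaaaaa".match(/a{1,5}/)[0]);   // aaaaa
console.log("aaaaaaa".match(/a{1,5}?/)[0]);  // a

// Greedy .* runs to the last closing tag; non-greedy .*? stops at the first.
const markup = "<b>one</b><b>two</b>";
console.log(markup.match(/<b>.*<\/b>/)[0]);   // <b>one</b><b>two</b>
console.log(markup.match(/<b>.*?<\/b>/)[0]);  // <b>one</b>
```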
Using non-greedy repetition may not always produce the results you
expect. Consider the pattern /a*b/, which matches
zero or more letters a followed by the letter b. When applied to the
string "aaab", it matches the entire string. Now
let's use the non-greedy version: /a*?b/.
This should match the letter b preceded by the fewest number of
a's possible. When applied to the same string
"aaab", you might expect it to match only the last letter
b. In fact, however, this pattern matches the entire string as well,
just like the greedy version of the pattern. This is because regular
expression pattern matching is done by finding the first position in
the string at which a match is possible. The non-greedy version of
our pattern does match at the first character of the string, so this
match is returned; matches at subsequent characters are never even
considered.
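The behavior described here can be confirmed with the exec( ) method of RegExp, whose result records both the matched text and the position at which the match begins.

```javascript
const greedy = /a*b/.exec("aaab");
const nonGreedy = /a*?b/.exec("aaab");

console.log(greedy[0], greedy.index);        // aaab 0
console.log(nonGreedy[0], nonGreedy.index);  // aaab 0

// Both matches start at position 0, so the non-greedy pattern
// still has to consume all three a's to reach the b.
```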
10.1.4. Alternation, Grouping, and References
The regular expression grammar includes special characters for
specifying alternatives, grouping subexpressions, and referring to
previous subexpressions. The | character separates
alternatives. For example, /ab|cd|ef/ matches the
string "ab" or the string "cd" or the string
"ef". And /\d{3}|[a-z]{4}/ matches
either three digits or four lowercase letters.
Note that alternatives are considered left to right until a match is
found. If the left alternative matches, the right alternative is
ignored, even if it would have produced a "better" match.
Thus, when the pattern /a|ab/ is applied to the
string "ab", it matches only the first letter.
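A short sketch of alternation follows. The sample strings are invented, and the reordered pattern /ab|a/ is included only to show that the order of the alternatives matters.

```javascript
// The first alternative that matches wins, even if a later one would match more.
console.log(/a|ab/.exec("ab")[0]);   // a
console.log(/ab|a/.exec("ab")[0]);   // ab

console.log(/ab|cd|ef/.test("xxcdxx"));   // true
console.log(/ab|cd|ef/.test("ace"));      // false

// Either three digits or four lowercase letters.
const digitsOrLetters = /\d{3}|[a-z]{4}/;
console.log("x123".match(digitsOrLetters)[0]);   // 123
console.log("word".match(digitsOrLetters)[0]);   // word
```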
Parentheses have several purposes
in regular expressions. One purpose is to group separate items into a
single subexpression, so that the items can be treated as a single
unit by |, *,
+, ?, and so on. For example,
/java(script)?/ matches "java"
followed by the optional "script". And
/((ab|cd)+|ef)/ matches either the string
"ef" or one or more repetitions of either of the strings
"ab" or "cd".
Another purpose of parentheses in
regular expressions is to define subpatterns within the complete
pattern. When a regular expression is successfully matched against a
target string, it is possible to extract the portions of the target
string that matched any particular parenthesized subpattern.
(We'll see how these matching substrings are obtained later in
the chapter.) For example, suppose we are looking for one or more
lowercase letters followed by one or more digits. We might use the
pattern /[a-z]+\d+/. But suppose we only really
care about the digits at the end of each match. If we put that part
of the pattern in parentheses (/[a-z]+(\d+)/), we
can extract the digits from any matches we find, as explained later.
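Both uses of parentheses described so far, grouping and defining subpatterns, can be sketched as follows. The grouping pattern is written with balanced parentheses as /((ab|cd)+|ef)/, the sample strings are invented, and the exec( ) method of RegExp is assumed here as the way to reach the text matched by a subpattern (element 1 of its result holds the text of the first parenthesized subexpression).

```javascript
// Grouping: ? applies to "script" as a single unit.
const javaOrJavaScript = /java(script)?/;
console.log("java".match(javaOrJavaScript)[0]);         // java
console.log("javascript".match(javaOrJavaScript)[0]);   // javascript

// Grouping: + applies to the alternation (ab|cd).
const repeated = /((ab|cd)+|ef)/;
console.log("abcdab".match(repeated)[0]);   // abcdab
console.log("ef".match(repeated)[0]);       // ef

// Subpatterns: the parentheses around \d+ let us pull out just the digits.
const lettersThenDigits = /[a-z]+(\d+)/;
const result = lettersThenDigits.exec("part abc123");
console.log(result[0]);   // abc123  (the whole match)
console.log(result[1]);   // 123     (the parenthesized subpattern)
```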
A related use of parenthesized subexpressions is to allow us to refer
back to a subexpression later in the same regular expression. This is
done by following a \ character by a digit or
digits. The digits refer to the position of the parenthesized
subexpression within the regular expression. For example,
\1 refers back to the first subexpression and
\3 refers to the third. Note that, because
subexpressions can be nested within others, it is the position of the
left parenthesis that is counted. In the following regular
expression, for example, the nested subexpression
([Ss]cript) is referred to as
\2:
/([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/
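The numbering of these subexpressions can be observed through the array returned by the exec( ) method of RegExp, in which element n holds the text matched by subexpression n. The sample sentences are invented for illustration.

```javascript
const pattern = /([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/;

const full = pattern.exec("JavaScript is funny");
console.log(full[1]);   // JavaScript  (first left parenthesis)
console.log(full[2]);   // Script      (second left parenthesis, nested in the first)
console.log(full[3]);   // funny       (third left parenthesis)

// When the optional nested group does not take part in the match, its slot is undefined.
const partial = pattern.exec("Java is fun");
console.log(partial[1]);   // Java
console.log(partial[2]);   // undefined
console.log(partial[3]);   // fun
```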
A reference to a previous subexpression of a regular expression does
not refer to the pattern for that subexpression,
but rather to the text that matched the pattern. Thus, references can
be used to enforce a constraint that separate portions of a string
contain exactly the same characters. For example, the following
regular expression matches zero or more characters within single or
double quotes. However, it does not require the opening and closing
quotes to match (i.e., both single quotes or both double quotes):
/['"][^'"]*['"]/
To require the quotes to match, we can use a reference:
/(['"])[^'"]*\1/
The \1 matches whatever the first parenthesized
subexpression matched. In this example, it enforces the constraint
that the closing quote match the opening quote. This regular
expression does not allow single quotes within double-quoted strings
or vice versa. It is not legal to use a reference within a character
class, so we cannot write:
/(['"])[^\1]*\1/
Later in this chapter, we'll see that this kind of reference to
a parenthesized sub-expression is a powerful feature of regular
expression search-and-replace operations.
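The two quote-matching patterns above can be compared in a short sketch. The variable names and sample strings are invented for illustration.

```javascript
const anyQuotes = /['"][^'"]*['"]/;       // opening and closing quotes may differ
const matchingQuotes = /(['"])[^'"]*\1/;  // \1 must repeat the opening quote

const mismatched = "'hello\"";            // opens with ' but closes with "
console.log(anyQuotes.test(mismatched));        // true
console.log(matchingQuotes.test(mismatched));   // false

console.log(matchingQuotes.test("'hello'"));    // true
console.log(matchingQuotes.test('"hello"'));    // true

// The reference stands for the matched text (here a double quote), not the pattern ['"].
const quoted = matchingQuotes.exec('say "hi" now');
console.log(quoted[0]);   // "hi"
console.log(quoted[1]);   // "
```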
In JavaScript 1.5 (but not JavaScript 1.2), it is possible to group
items in a regular expression without creating a numbered reference
to those items. Instead of simply grouping the items within
( and ), begin the group with
(?: and end it with ). Consider
the following pattern, for example:
/([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/
Here, the subexpression (?:[Ss]cript) is used
simply for grouping, so the ? repetition character
can be applied to the group. These modified parentheses do not
produce a reference, so in this regular expression,
\2 refers to the text matched by
(fun\w*).
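The effect on numbering can be checked with the exec( ) method of RegExp, whose result array has one slot for the whole match plus one slot per numbered subexpression. The sample sentence is invented for illustration.

```javascript
const pattern = /([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/;
const result = pattern.exec("JavaScript is fun");

console.log(result[1]);       // JavaScript
console.log(result[2]);       // fun  (the (?:...) group was not numbered)
console.log(result.length);   // 3: the whole match plus two numbered subexpressions
```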
Table 10-4 summarizes the regular expression
alternation, grouping, and referencing operators.
Table 10-4. Regular expression alternation, grouping, and reference characters
10.1.5. Specifying Match Position
We've seen that many elements
of a regular expression match a single character in a string. For
example, \s matches a single character of
whitespace. Other regular expression elements match the positions
between characters, instead of actual characters.
\b, for example, matches
a word boundary -- the boundary between a \w
(ASCII word character) and a \W (non-word
character), or the boundary between an ASCII word character and the
beginning or end of a string.[37] Elements like
\b do not specify any characters to be used in a
matched string; what they do specify, however, is legal positions at
which a match can occur. Sometimes these elements are called regular
expression anchors, because they anchor the
pattern to a specific position in the search string. The most
commonly used anchor elements are
^, which ties the pattern to the
beginning of the string, and $, which anchors the
pattern to the end of the string.
For example, to match the word "JavaScript" on a line by
itself, we could use the regular expression
/^JavaScript$/. If we wanted to search for
"Java" used as a word by itself (not as a prefix, as it
is in "JavaScript"), we might try the pattern
/\sJava\s/, which requires a space before and
after the word. But there are two problems with this solution. First,
it does not match "Java" if that word appears at the
beginning or the end of a string, but only if it appears with space
on either side. Second, when this pattern does find a match, the
matched string it returns has leading and trailing spaces, which is
not quite what we want. So instead of matching actual space
characters with \s, we instead match (or anchor
to) word boundaries with \b. The resulting
expression is /\bJava\b/. The element
\B anchors the match to a location that is not a
word boundary. Thus, the pattern /\B[Ss]cript/
matches "JavaScript" and "postscript", but
not "script" or "Scripting".
In JavaScript 1.5 (but not JavaScript 1.2), you can also use
arbitrary regular expressions as anchor conditions. If you include an
expression between (?= and ), it is a look-ahead assertion: the
characters that follow must match the enclosed expression, but they
are not included in the match. For example,
to match the name of a common programming language, but only if it is
followed by a colon, you could use
/[Jj]ava([Ss]cript)?(?=\:)/. This pattern matches
the word "JavaScript" in "JavaScript: The
Definitive Guide", but it does not match "Java" in
"Java in a Nutshell" because it is not followed by a
colon.
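The look-ahead example can be run directly. Note in the output that the colon is required but is not part of the matched text; the variable name is illustrative.

```javascript
const beforeColon = /[Jj]ava([Ss]cript)?(?=\:)/;

const title = "JavaScript: The Definitive Guide";
console.log(title.match(beforeColon)[0]);             // JavaScript  (no colon in the match)
console.log(beforeColon.test("Java in a Nutshell"));  // false: no colon follows "Java"
```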
If you instead introduce an assertion with (?!, it is a negative
look-ahead assertion, which specifies that the following characters
must not
match. For example, /Java(?!Script)([A-Z]\w*)/
matches "Java" followed by a capital letter and any
number of additional ASCII word characters, as long as
"Java" is not followed by "Script". It
matches "JavaBeans" but not "Javanese", and
it matches "JavaScrip" but not "JavaScript"
or "JavaScripter".
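The five strings named above can be checked against the pattern with the test( ) method of RegExp; the variable name is illustrative.

```javascript
const notScript = /Java(?!Script)([A-Z]\w*)/;

console.log(notScript.test("JavaBeans"));      // true
console.log(notScript.test("Javanese"));       // false: "n" is not a capital letter
console.log(notScript.test("JavaScrip"));      // true: "Scrip" is not "Script"
console.log(notScript.test("JavaScript"));     // false
console.log(notScript.test("JavaScripter"));   // false

// The parenthesized part captures what followed "Java".
console.log(notScript.exec("JavaBeans")[1]);   // Beans
```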
Table 10-5 summarizes regular expression
anchors.
Table 10-5. Regular expression anchor characters
10.1.6. Flags
There is one final element of regular
expression grammar. Regular expression flags specify high-level
pattern-matching rules. Unlike the rest of regular expression syntax,
flags are specified outside of the / characters;
instead of appearing within the slashes, they appear following the
second slash. JavaScript 1.2 supports two flags. The
i flag specifies that pattern matching should be
case-insensitive. The g flag specifies
that pattern matching should be global -- that is, all matches
within the searched string should be found. Both flags may be
combined to perform a global case-insensitive match.
For example, to do a case-insensitive search for the first occurrence
of the word "java" (or "Java",
"JAVA", etc.), we could use the case-insensitive regular
expression /\bjava\b/i. And to find all
occurrences of the word in a string, we would add the
g flag: /\bjava\b/gi.
JavaScript 1.5 supports an additional flag: m. The
m flag performs pattern matching in
multiline mode. In this mode, if the string to be searched contains
newlines, the ^ and $ anchors
match the beginning and end of a line in addition to matching the
beginning and end of a string. For example, the pattern
/Java$/im matches "java" as well as
"Java\nis fun".
Table 10-6 summarizes these regular expression
flags. Note that we'll see more about the g
flag later in this chapter, when we consider the String and RegExp
methods used to actually perform matches.
Table 10-6. Regular expression flags
Character | Meaning
i | Perform case-insensitive matching.
g | Perform a global match. That is, find all matches rather than stopping after the first match.
m | Multiline mode. ^ matches beginning of line or beginning of string, and $ matches end of line or end of string.
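The three flags can be seen in action in the sketch below. It assumes the match( ) method of String, which returns only the first match without the g flag and an array of all matches with it; the sample strings are invented.

```javascript
const text = "Java, java, JAVA";

// i alone: case-insensitive, but only the first match is reported.
const first = text.match(/\bjava\b/i);
console.log(first[0]);        // Java

// g and i together: every occurrence, regardless of case.
const all = text.match(/\bjava\b/gi);
console.log(all.length);      // 3
console.log(all.join(" "));   // Java java JAVA

// m: $ also matches at the end of each line, not just the end of the string.
const twoLines = "Java\nis fun";
console.log(/Java$/im.test("java"));     // true
console.log(/Java$/im.test(twoLines));   // true
console.log(/Java$/i.test(twoLines));    // false: without m, $ matches only at the end of the string
```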
Copyright © 2003 O'Reilly & Associates. All rights reserved.