Metacharacters and Metasymbols (Programming Perl)

5.3. Metacharacters and Metasymbols

Now that we've admired all the fancy cages, we can go back to looking at the critters in the cages, those funny-looking symbols you put inside the patterns. By now you'll have cottoned to the fact that these symbols aren't regular Perl code like function calls or arithmetic operators. Regular expressions are their own little language nestled inside of Perl. (There's a bit of the jungle in all of us.)

For all their power and expressivity, patterns in Perl recognize the same 12 traditional metacharacters (the Dirty Dozen, as it were) found in many other regular expression packages:

\ | ( ) [ { ^ $ * + ? .

Some of those bend the rules, making otherwise normal characters that follow them special. We don't like to call the longer sequences "characters", so when they make longer sequences, we call them metasymbols (or sometimes just "symbols"). But at the top level, those twelve metacharacters are all you (and Perl) need to think about. Everything else proceeds from there.

Some simple metacharacters stand by themselves, like . and ^ and $. They don't directly affect anything around them. Some metacharacters work like prefix operators, governing what follows them, like \. Others work like postfix operators, governing what immediately precedes them, like *, +, and ?. One metacharacter, |, acts like an infix operator, standing between the operands it governs. There are even bracketing metacharacters that work like circumfix operators, governing something contained inside them, like (...) and [...]. Parentheses are particularly important, because they specify the bounds of | on the inside, and of *, +, and ? on the outside.
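As a rough illustration of how parentheses set the bounds of | on the inside and of a quantifier on the outside, here is a small sketch. The sample strings and variable names are ours, made up for the purpose.

```perl
use strict;
use warnings;

# Ungrouped, | splits the entire pattern: "starts with cat" OR "ends with dog".
my $loose = "catalog" =~ /^cat|dog$/ ? 1 : 0;     # 1: "catalog" starts with "cat"

# Grouped, the alternation stops at the parentheses: exactly "cat" or "dog".
my $tight = "catalog" =~ /^(cat|dog)$/ ? 1 : 0;   # 0

# A quantifier governs only the atom immediately before it.
my ($b_run)  = "abbbab" =~ /^(ab+)/;     # "abbb": + repeats only the "b"
my ($ab_run) = "ababab" =~ /^((ab)+)/;   # "ababab": + repeats the whole group

print "$loose $tight $b_run $ab_run\n";
```

In the last match the list assignment keeps only the first capture, which is the outer pair of parentheses.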

If you learn only one of the twelve metacharacters, choose the backslash. (Er . . . and the parentheses.) That's because backslash disables the others. When a backslash precedes a nonalphanumeric character in a Perl pattern, it always makes that next character a literal. If you need to match one of the twelve metacharacters in a pattern literally, you write it with a backslash in front. Thus, \. matches a real dot, \$ a real dollar sign, \\ a real backslash, and so on. This is known as "escaping" the metacharacter, or "quoting it", or sometimes just "backslashing" it. (Of course, you already know that backslash is used to suppress variable interpolation in double-quoted strings.)
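Here is a minimal sketch of backslashing in action. The price string is an invented example, not something from the text.

```perl
use strict;
use warnings;

my $price = 'Total: $3.50 (approx.)';

# Each backslash turns the metacharacter after it into a literal.
my $literal = $price =~ /\$3\.50 \(approx\.\)/ ? 1 : 0;   # 1

# Left unescaped, the dot is a wildcard, so "3x50" sneaks through.
my $wild  = '3x50' =~ /3.50/  ? 1 : 0;   # 1
my $exact = '3x50' =~ /3\.50/ ? 1 : 0;   # 0

print "$literal $wild $exact\n";
```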

Although a backslash turns a metacharacter into a literal character, its effect upon a following alphanumeric character goes the other direction. It takes something that was regular and makes it special. That is, together they make a metasymbol. An alphabetical list of these metasymbols can be found below in Table 5-7.

5.3.1. Metasymbol Tables

In the following tables, the Atomic column says "yes" if the given metasymbol is quantifiable (if it can match something with width, more or less). Also, we've used "..." to represent "something else". (Please see the later discussion to find out what "..." means, if it is not clear from the one-line gloss in the table.)

Table 5-4 shows the basic traditional metasymbols. The first four of these are the structural metasymbols we mentioned earlier, while the last three are simple metacharacters. The . metacharacter is an example of an atom because it matches something with width (the width of a character, in this case); ^ and $ are examples of assertions, because they match something of zero width, and because they are only evaluated to see if they're true or not.

Table 5.4. General Regex Metacharacters

Symbol	Atomic	Meaning
`\...`	Varies	De-meta next nonalphanumeric character, meta next alphanumeric character (maybe).
`...|...`	No	Alternation (match one or the other).
`(...)`	Yes	Grouping (treat as a unit).
`[...]`	Yes	Character class (match one character from a set).
`^`	No	True at beginning of string (or after any newline, maybe).
`.`	Yes	Match one character (except newline, normally).
`$`	No	True at end of string (or before any newline, maybe).
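The "maybe" in the ^ and $ rows refers to the /m modifier. A small sketch, using a made-up two-line string, shows the difference and also shows that these assertions consume no characters:

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";

# Without /m, ^ is true only at the very start of the string.
my $plain = $text =~ /^two$/  ? 1 : 0;   # 0
# With /m, ^ is also true after any newline, and $ before any newline.
my $multi = $text =~ /^two$/m ? 1 : 0;   # 1

# Even without /m, $ is true before a newline that ends the string.
my $at_end = $text =~ /two$/ ? 1 : 0;    # 1

# Assertions have zero width: the newline is tested but not captured.
my ($first) = $text =~ /^(one)$/m;       # "one", with no trailing newline

print "$plain $multi $at_end $first\n";
```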

The quantifiers, which are further described in their own section, indicate how many times the preceding atom (that is, single character or grouping) should match. These are listed in Table 5-5.

Table 5.5. Regex Quantifiers

Quantifier	Atomic	Meaning
`*`	No	Match 0 or more times (maximal).
`+`	No	Match 1 or more times (maximal).
`?`	No	Match 1 or 0 times (maximal).
`{`COUNT`}`	No	Match exactly COUNT times.
`{`MIN`,}`	No	Match at least MIN times (maximal).
`{`MIN`,`MAX`}`	No	Match at least MIN but not more than MAX times (maximal).
`*?`	No	Match 0 or more times (minimal).
`+?`	No	Match 1 or more times (minimal).
`??`	No	Match 0 or 1 time (minimal).
`{`MIN`,}?`	No	Match at least MIN times (minimal).
`{`MIN`,`MAX`}?`	No	Match at least MIN but not more than MAX times (minimal).

A minimal quantifier tries to match as few characters as possible within its allowed range. A maximal quantifier tries to match as many characters as possible within its allowed range. For instance, .+ is guaranteed to match at least one character of the string, but will match all of them given the opportunity. The opportunities are discussed later in "The Little Engine That /Could(n't)?/".
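To make the maximal/minimal distinction concrete, here is a short sketch. The HTML-ish string is just an assumed sample.

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Maximal: .+ grabs as much as it can, backing off only as far as the last ">".
my ($maximal) = $html =~ /<(.+)>/;    # "b>bold</b> and <i>italic</i"

# Minimal: .+? stops at the first place the rest of the pattern can match.
my ($minimal) = $html =~ /<(.+?)>/;   # "b"

# The same choice applies inside a {MIN,MAX} range.
my ($max_range) = "aaaaaa" =~ /(a{2,4})/;    # "aaaa"
my ($min_range) = "aaaaaa" =~ /(a{2,4}?)/;   # "aa"

print "$maximal\n$minimal\n$max_range $min_range\n";
```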

You'll note that quantifiers may never be quantified.

We wanted to provide an extensible syntax for new kinds of metasymbols. Given that we only had a dozen metacharacters to work with, we chose a formerly illegal regex sequence to use for arbitrary syntactic extensions. These metasymbols are all of the form (?KEY...); that is, a (balanced) parenthesis followed by a question mark, followed by a KEY and the rest of the subpattern. The KEY character indicates which particular regex extension it is. See Table 5-6 for a list of these. Most of them behave structurally since they're based on parentheses, but they also have additional meanings. Again, only atoms may be quantified because they represent something that's really there (potentially).

Table 5.6. Extended Regex Sequences

Extension	Atomic	Meaning
`(?#...)`	No	Comment, discard.
`(?:...)`	Yes	Cluster-only parentheses, no capturing.
`(?imsx-imsx)`	No	Enable/disable pattern modifiers.
`(?imsx-imsx:...)`	Yes	Cluster-only parentheses plus modifiers.
`(?=...)`	No	True if lookahead assertion succeeds.
`(?!...)`	No	True if lookahead assertion fails.
`(?<=...)`	No	True if lookbehind assertion succeeds.
`(?<!...)`	No	True if lookbehind assertion fails.
`(?>...)`	Yes	Match nonbacktracking subpattern.
`(?{...})`	No	Execute embedded Perl code.
`(??{...})`	Yes	Match regex from embedded Perl code.
`(?(...)...|...)`	Yes	Match with if-then-else pattern.
`(?(...)...)`	Yes	Match with if-then pattern.
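A few of the more common extensions from Table 5-6 can be sketched as follows. The strings are invented for illustration, and only a handful of the extensions are shown.

```perl
use strict;
use warnings;

# (?#...) is a comment; the Engine discards it.
my $commented = "ab" =~ /a(?#nothing to see here)b/ ? 1 : 0;   # 1

# (?:...) clusters without capturing, so the \w+ group is $1.
my ($tld) = "example.org" =~ /(?:example|sample)\.(\w+)/;      # "org"

# (?i) switches on case insensitivity from inside the pattern.
my $ci = "HELLO" =~ /(?i)hello/ ? 1 : 0;                       # 1

# (?=...) and (?!...) test what follows without consuming it.
my ($before_bar) = "foobar" =~ /(foo)(?=bar)/;                 # "foo"
my $not_bar_1    = "foobaz" =~ /foo(?!bar)/ ? 1 : 0;           # 1
my $not_bar_2    = "foobar" =~ /foo(?!bar)/ ? 1 : 0;           # 0

# (?<=...) tests what came just before the current position.
my ($amount) = 'cost: $42' =~ /(?<=\$)(\d+)/;                  # "42"

print "$commented $tld $ci $before_bar $not_bar_1 $not_bar_2 $amount\n";
```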

And finally, Table 5-7 shows all of your favorite alphanumeric metasymbols. (Symbols that are processed by the variable interpolation pass are marked with a dash in the Atomic column, since the Engine never even sees them.)

Table 5.7. Alphanumeric Regex Metasymbols

Symbol	Atomic	Meaning
`\0`	Yes	Match the null character (ASCII NUL).
`\`NNN	Yes	Match the character given in octal, up to `\377`.
`\`n	Yes	Match nth previously captured string (decimal).
`\a`	Yes	Match the alarm character (BEL).
`\A`	No	True at the beginning of a string.
`\b`	Yes	Match the backspace character (BS).
`\b`	No	True at word boundary.
`\B`	No	True when not at word boundary.
`\c`X	Yes	Match the control character Control-X (`\cZ`, `\c[`, etc.).
`\C`	Yes	Match one byte (C `char`) even in utf8 (dangerous).
`\d`	Yes	Match any digit character.
`\D`	Yes	Match any nondigit character.
`\e`	Yes	Match the escape character (ASCII ESC, not backslash).
`\E`	--	End case (`\L`, `\U`) or metaquote (`\Q`) translation.
`\f`	Yes	Match the form feed character (FF).
`\G`	No	True at end-of-match position of prior `m//g`.
`\l`	--	Lowercase the next character only.
`\L`	--	Lowercase till `\E`.
`\n`	Yes	Match the newline character (usually NL, but CR on Macs).
`\N{`NAME`}`	Yes	Match the named char (`\N{greek:Sigma}`).
`\p{`PROP`}`	Yes	Match any character with the named property.
`\P{`PROP`}`	Yes	Match any character without the named property.
`\Q`	--	Quote (de-meta) metacharacters till `\E`.
`\r`	Yes	Match the return character (usually CR, but NL on Macs).
`\s`	Yes	Match any whitespace character.
`\S`	Yes	Match any nonwhitespace character.
`\t`	Yes	Match the tab character (HT).
`\u`	--	Titlecase next character only.
`\U`	--	Uppercase (not titlecase) till `\E`.
`\w`	Yes	Match any "word" character (alphanumerics plus "_").
`\W`	Yes	Match any nonword character.
`\x{`abcd`}`	Yes	Match the character given in hexadecimal.
`\X`	Yes	Match Unicode "combining character sequence" string.
`\z`	No	True at end of string only.
`\Z`	No	True at end of string or before optional newline.

The braces are optional on \p and \P if the property name is one character. The braces are optional on \x if the hexadecimal number is two digits or less. The braces are never optional on \N.

Only metasymbols with "Match the..." or "Match any..." descriptions may be used within character classes (square brackets). That is, character classes are limited to containing specific sets of characters, so within them you may only use metasymbols that describe other specific sets of characters, or that describe specific individual characters. Of course, these metasymbols may also be used outside character classes, along with all the other nonclassificatory metasymbols. Note however that \b is two entirely different beasties: it's a backspace character inside the character class, but a word boundary assertion outside.
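The two faces of \b are easy to see side by side. This is only a sketch, with strings of our own choosing:

```perl
use strict;
use warnings;

# Outside a character class, \b is a zero-width word boundary.
my $boundary = "a word here" =~ /\bword\b/ ? 1 : 0;   # 1

# Inside a character class, \b is the backspace character.
my $with_bs  = "x\by" =~ /x[\b]y/ ? 1 : 0;   # 1: the string really holds x, BS, y
my $no_class = "x\by" =~ /x\by/   ? 1 : 0;   # 0: boundary after x, but BS is not "y"

# "Match any..." metasymbols such as \d are welcome in a class.
my $dateish = "2024-01-15" =~ /^[\d-]+$/ ? 1 : 0;     # 1

print "$boundary $with_bs $no_class $dateish\n";
```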

There is some amount of overlap between the characters that a pattern can match and the characters an ordinary double-quoted string can interpolate. Since regexes undergo two passes, it is sometimes ambiguous which pass should process a given character. When there is ambiguity, the variable interpolation pass defers the interpretation of such characters to the regular expression parser.

But the variable interpolation pass can only defer to the regex parser when it knows it is parsing a regex. You can specify regular expressions as ordinary double-quoted strings, but then you must follow normal double-quote rules. Any of the previous metasymbols that happen to map to actual characters will still work, even though they're not being deferred to the regex parser. But you can't use any of the other metasymbols in ordinary double quotes (or in any similar constructs such as `...`, qq(...), qx(...), or the equivalent here documents). If you want your string to be parsed as a regular expression without doing any matching, you should be using the qr// (quote regex) operator.
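The following sketch contrasts a pattern kept in an ordinary double-quoted string with one built by qr//. The variable names are hypothetical.

```perl
use strict;
use warnings;

# In double quotes, \b means backspace, so this holds BS, "cat", BS.
my $oops = "\bcat\b";

# To smuggle a word boundary through a string, double the backslashes...
my $as_string = "\\bcat\\b";

# ...or let qr// parse the text as a regex in the first place.
my $as_regex = qr/\bcat\b/;

my $sentence = "a cat sat";
my $hit_oops   = $sentence =~ /$oops/      ? 1 : 0;   # 0: no backspaces in $sentence
my $hit_string = $sentence =~ /$as_string/ ? 1 : 0;   # 1
my $hit_regex  = $sentence =~ $as_regex    ? 1 : 0;   # 1

print "$hit_oops $hit_string $hit_regex\n";
```

The qr// form avoids the doubled backslashes, which is one practical reason to prefer it when a pattern is stored for later use.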

Note that the case and metaquote translation escapes (\U and friends) must be processed during the variable interpolation pass because the purpose of those metasymbols is to influence how variables are interpolated. If you suppress variable interpolation with single quotes, you don't get the translation escapes either. Neither variables nor translation escapes (\U, etc.) are expanded in any single quoted string, nor in single-quoted m'...' or qr'...' operators. Even when you do interpolation, these translation escapes are ignored if they show up as the result of variable interpolation, since by then it's too late to influence variable interpolation.
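Since the translation escapes act during interpolation, they are most useful wrapped around a variable. A minimal sketch, with made-up variable contents:

```perl
use strict;
use warnings;

my $name = "a.b";

# Interpolated raw, the dot in $name is still a wildcard.
my $raw_hit = "axb" =~ /^$name$/ ? 1 : 0;            # 1

# \Q...\E backslashes the metacharacters as $name is interpolated.
my $quoted_miss = "axb" =~ /^\Q$name\E$/ ? 1 : 0;    # 0
my $quoted_hit  = "a.b" =~ /^\Q$name\E$/ ? 1 : 0;    # 1

# \U...\E uppercases the interpolated text before the Engine sees it.
my $word = "hello";
my $upper_hit = "HELLO world" =~ /^\U$word\E world$/ ? 1 : 0;   # 1

print "$raw_hit $quoted_miss $quoted_hit $upper_hit\n";
```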

Although the transliteration operator doesn't take regular expressions, any metasymbol we've discussed that matches a single specific character also works in a tr/// operation. The rest do not (except for backslash, which continues to work in the backward way it always works).
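For instance, the single-character escapes \t and \n behave in tr/// just as they do in a pattern. This is a small sketch with sample data of our own:

```perl
use strict;
use warnings;

# Turn tabs and newlines into spaces.
(my $flat = "a\tb\nc") =~ tr/\t\n/  /;   # $flat is now "a b c"

# With an empty replacement list, tr/// just counts matching characters.
my $line = "x\ty\tz";
my $tabs = ($line =~ tr/\t//);           # 2

print "$flat has been flattened; $tabs tabs counted\n";
```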

5.3.2. Specific Characters

As mentioned before, everything that's not special in a pattern matches itself. That means an /a/ matches an "a", an /=/ matches an "=", and so on. Some characters, though, aren't very easy to type in from the keyboard or, even if you manage that, don't show up on a printout; control characters are notorious for this. In a regular expression, Perl recognizes the following double-quotish character aliases:

Escape	Meaning
`\0`	Null character (ASCII NUL)
`\a`	Alarm (BEL)
`\e`	Escape (ESC)
`\f`	Form feed (FF)
`\n`	Newline (NL, CR on Mac)
`\r`	Return (CR, NL on Mac)
`\t`	Tab (HT)

Just as in double-quoted strings, Perl also honors the following four metasymbols in patterns:

\cX

A named control character, like \cC for Control-C, \cZ for Control-Z, \c[ for ESC, and \c? for DEL.

\NNN

A character specified using its two- or three-digit octal code. The leading 0 is optional, except for values less than 010 (8 decimal) since (unlike in double-quoted strings) the single-digit versions are always considered to be backreferences to captured strings within a pattern. Multiple digits are interpreted as the nth backreference if you've captured at least n substrings earlier in the pattern (where n is considered as a decimal number). Otherwise, they are interpreted as a character specified in octal.

\x{LONGHEX}

\xHEX

A character number specified as one or two hex digits ([0-9a-fA-F]), as in \x1B. The one-digit form is usable only if the character following it is not a hex digit. If braces are used, you may use as many digits as you'd like, which may result in a Unicode character. For example, \x{262f} matches a Unicode YIN YANG.

\N{NAME}

A named character, such as \N{GREEK SMALL LETTER EPSILON}, \N{greek:epsilon}, or \N{epsilon}. This requires the use charnames pragma described in Chapter 31, "Pragmatic Modules", which also determines which flavors of those names you may use (":full", ":short", or a script name such as "greek", respectively, corresponding to the three styles just shown).

A list of all Unicode character names can be found in your closest Unicode standards document, or in PATH_TO_PERLLIB/unicode/Names.txt.
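The four notations above can be tried out with a sketch like the following. The test characters are our own picks, and the use charnames line is there on the assumption that your perl still needs it for \N{...}.

```perl
use strict;
use warnings;
use charnames ':full';   # enables full Unicode names inside \N{...}

# \cX: a control character, here Control-C (character number 3).
my $ctrl_c = chr(3) =~ /^\cC$/ ? 1 : 0;             # 1

# \NNN: with no captures in the pattern, \101 is octal 101, that is, "A".
my $octal = "A" =~ /^\101$/ ? 1 : 0;                # 1

# A single digit is always a backreference, never octal.
my $backref = "abab" =~ /^(ab)\1$/ ? 1 : 0;         # 1

# \xHEX and \x{LONGHEX}.
my $esc     = "\e"        =~ /^\x1B$/     ? 1 : 0;  # 1: ESC is hex 1B
my $yinyang = chr(0x262F) =~ /^\x{262f}$/ ? 1 : 0;  # 1

# \N{NAME}: GREEK SMALL LETTER EPSILON is U+03B5.
my $epsilon = chr(0x3B5) =~ /^\N{GREEK SMALL LETTER EPSILON}$/ ? 1 : 0;   # 1

print "$ctrl_c $octal $backref $esc $yinyang $epsilon\n";
```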

5.3.3. Wildcard Metasymbols

Three special metasymbols serve as generic wildcards, each of them matching "any" character (for certain values of "any"). These are the dot ("."), \C, and \X. None of these may be used in a character class. You can't use the dot there because it would match (nearly) any character in existence, so it's something of a universal character class in its own right. If you're going to include or exclude everything, there's not much point in having a character class. The special wildcards \C and \X have special structural meanings that don't map well to the notion of choosing a single Unicode character, which is the level at which character classes work.

The dot metacharacter matches any one character other than a newline. (And with the /s modifier, it matches that, too.) Like any of the dozen special characters in a pattern, to match a dot literally, you must escape it with a backslash. For example, this checks whether a filename ends with a dot followed by a one-character extension:

if ($pathname =~ /\.(.)\z/s) {
    print "Ends in $1\n";
}

The first dot, the escaped one, is the literal character, and the second says "match any character". The \z says to match only at the end of the string, and the /s modifier lets the dot match a newline as well. (Yes, using a newline as a file extension Isn't Very Nice, but that doesn't mean it can't happen.)

The dot metacharacter is most often used with a quantifier. A .* matches a maximal number of characters, while a .*? matches a minimal number of characters. But it's also sometimes used without a quantifier for its width: /(..):(..):(..)/ matches three colon-separated fields, each of which is two characters long.
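Here is a brief sketch of the dot used purely for its width, along with the effect of /s on newlines. The timestamp and strings are assumed examples.

```perl
use strict;
use warnings;

# Each .. matches exactly two characters, whatever they are.
my ($hours, $minutes, $seconds) = "12:34:56" =~ /(..):(..):(..)/;

# Normally the dot refuses to match a newline; under /s it obliges.
my $without_s = "a\nb" =~ /a.b/  ? 1 : 0;   # 0
my $with_s    = "a\nb" =~ /a.b/s ? 1 : 0;   # 1

print "$hours $minutes $seconds $without_s $with_s\n";
```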

If you use a dot in a pattern compiled under the lexically scoped use utf8 pragma, then it will match any Unicode character. (You're not supposed to need a use utf8 for that, but accidents will happen. The pragma may not be necessary by the time you read this.)

use utf8;
use charnames qw/:full/;
$BWV[887] = "G\N{MUSIC SHARP SIGN} minor";
($note, $black, $mode) = $BWV[887] =~ /^([A-G])(.)\s+(\S+)/;
print "That's lookin' sharp!\n" if $black eq chr(9839);

The \X metasymbol matches a character in a more extended sense. It really matches a string of one or more Unicode characters known as a "combining character sequence". Such a sequence consists of a base character followed by any "mark" characters (diacritical markings like cedillas or diereses) that combine with that base character to form one logical unit. \X is exactly equivalent to (?:\PM\pM*). This allows it to match one logical character, even when that really comprises several separate characters. The length of the match in /\X/ would exceed one character if it matched any combining characters. (And that's character length, which has little to do with byte length.)
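The difference between the dot and \X shows up as soon as a combining mark is involved. In this sketch the string is an "e" followed by U+0301 COMBINING ACUTE ACCENT, chosen by us as a simple case. Later perls define \X by the Unicode grapheme cluster rules rather than by the expansion given above, but as far as we know both give the same answer here.

```perl
use strict;
use warnings;

# "e" plus a combining acute accent, followed by a plain "x".
my $str = "e\x{301}x";

my ($logical) = $str =~ /^(\X)/;   # base character plus its mark
my ($single)  = $str =~ /^(.)/;    # just the base character

my $logical_len = length($logical);   # 2
my $single_len  = length($single);    # 1

print "$logical_len $single_len\n";
```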

If you are using Unicode and really want to get at a single byte instead of a single character, you can use the \C metasymbol. This will always match one byte (specifically, one C language char type), even if this gets you out of sync with your Unicode character stream. See the appropriate warnings about doing this in Chapter 15, "Unicode".