Character Classes (Programming Perl)

5.4. Character Classes

In a pattern match, you may match any character that has--or that does not have--a particular property. There are four ways to specify character classes. You may specify a character class in the traditional way using square brackets and enumerating the possible characters, or you may use any of three mnemonic shortcuts: the classic Perl classes, the new Perl Unicode properties, or the standard POSIX classes. Each of these shortcuts matches only one character from its set. Quantify them to match larger expanses, such as \d+ to match one or more digits. (An easy mistake is to think that \w matches a word. Use \w+ to match a word.)

5.4.1. Custom Character Classes

An enumerated list of characters in square brackets is called a character class and matches any one of the characters in the list. For example, [aeiouy] matches a letter that can be a vowel in English. (For Welsh add a "w", for Scottish an "r".) To match a right square bracket, either backslash it or place it first in the list.

Character ranges may be indicated using a hyphen and the a-z notation. Multiple ranges may be combined; for example, [0-9a-fA-F] matches one hex "digit". You may use a backslash to protect a hyphen that would otherwise be interpreted as a range delimiter, or just put it at the beginning or end of the class (a practice which is arguably less readable but more traditional).
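As a minimal sketch of these rules, the following tries a few classes against single-character strings. The matches helper and the sample strings are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the whole string matches the pattern, else 0.
sub matches {
    my ($str, $re) = @_;
    return $str =~ /\A(?:$re)\z/ ? 1 : 0;
}

# Three ranges combined into one class of hex "digits".
my $hex     = qr/[0-9a-fA-F]+/;
my $hex_ok  = matches("7fA0", $hex);   # every character is in one of the ranges
my $hex_bad = matches("7g",   $hex);   # "g" is outside all three ranges

# A hyphen at the end of the class, or a backslashed hyphen, is literal.
my $dash_last = matches("-", qr/[a-z-]/);
my $dash_esc  = matches("-", qr/[a\-z]/);
# With the hyphen escaped there is no range, so "m" is not in the class.
my $m_in_esc  = matches("m", qr/[a\-z]/);

# A right square bracket may be backslashed or placed first in the list.
my $brk_esc   = matches("]", qr/[\]x]/);
my $brk_first = matches("]", qr/[]x]/);

print "hex: $hex_ok $hex_bad\n";
print "hyphen: $dash_last $dash_esc $m_in_esc\n";
print "bracket: $brk_esc $brk_first\n";
```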

A caret (or circumflex, or hat, or up arrow) at the front of the character class inverts the class, causing it to match any single character not in the list. (To match a caret, either don't put it first, or better, escape it with a backslash.) For example, [^aeiouy] matches any character that isn't a vowel. Be careful with character class negation, though, because the universe of characters is expanding. For example, that character class matches consonants--and also matches spaces, newlines, and anything (including vowels) in Cyrillic, Greek, or nearly any other script, not to mention every ideograph in Chinese, Japanese, and Korean. And someday maybe even Cirth, Tengwar, and Klingon. (Linear B and Etruscan, for sure.) So it might be better to specify your consonants explicitly, such as [cbdfghjklmnpqrstvwxyz], or [b-df-hj-np-tv-z] for short. (This also solves the issue of "y" needing to be in two places at once, which a set complement would preclude.)
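To make the difference concrete, here is a small sketch comparing the negated class with the explicit consonant list. The sample characters are our own choice, and the Greek letter is written as a \x{...} escape so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

my $not_vowel = qr/\A[^aeiouy]\z/;
my $consonant = qr/\A[b-df-hj-np-tv-z]\z/;

# Samples: a consonant, a vowel, a space, a newline,
# and GREEK SMALL LETTER ALPHA (U+03B1).
my @samples = ("t", "e", " ", "\n", "\x{3B1}");

my @by_negation = map { ($_ =~ $not_vowel) ? 1 : 0 } @samples;
my @by_listing  = map { ($_ =~ $consonant) ? 1 : 0 } @samples;

print "negated:  @by_negation\n";   # 1 0 1 1 1
print "explicit: @by_listing\n";    # 1 0 0 0 0
```

The negated class accepts the space, the newline, and the Greek vowel; the explicit list accepts only the consonant.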

Normal character metasymbols, such as \n, \t, \cX, \NNN, and \N{NAME}, are supported inside a character class (see "Specific Characters"). Additionally, you may use \b within a character class to mean a backspace, just as it does in a double-quoted string. Normally, in a pattern match, it means a word boundary. But zero-width assertions don't make any sense in character classes, so here \b returns to its normal meaning in strings. You may also use any predefined character class described later in the chapter (classic, Unicode, or POSIX), but don't try to use them as endpoints of a range--that doesn't make sense, so the "-" will be interpreted literally.
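A short sketch of the two meanings of \b follows. The sample string is hypothetical: someone typed "k", then a backspace, while spelling "hello".

```perl
use strict;
use warnings;

# "\b" in a double-quoted string is a backspace character (chr 8).
my $typed = "helk\blo";

# Inside square brackets, \b matches that backspace character.
my $has_backspace = ($typed  =~ /[\b]/) ? 1 : 0;   # 1
my $plain_has_bs  = ("hello" =~ /[\b]/) ? 1 : 0;   # 0

# Outside square brackets, \b is the zero-width word boundary assertion,
# which succeeds at the start of "hello" even with no backspace present.
my $plain_has_wb  = ("hello" =~ /\b/) ? 1 : 0;     # 1

# Remove the backspace characters themselves (not the characters they "erase").
(my $stripped = $typed) =~ s/[\b]//g;              # "helklo"

print "$has_backspace $plain_has_bs $plain_has_wb $stripped\n";
```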

All other metasymbols lose their special meaning inside square brackets. In particular, you can't use any of the three generic wildcards: ".", \X, or \C. The first often surprises people, but it doesn't make much sense to use the universal character class within a restricted one, and you often want to match a literal dot as part of a character class--when you're matching filenames, for instance. It's also meaningless to specify quantifiers, assertions, or alternation inside a character class, since the characters are interpreted individually. For example, [fee|fie|foe|foo] means the same thing as [feio|].
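The following sketch checks both claims against a handful of hypothetical probe strings: the two bracketed classes accept exactly the same single characters, and a dot inside brackets matches only a literal dot.

```perl
use strict;
use warnings;

my $spelled_out = qr/\A[fee|fie|foe|foo]\z/;
my $minimal     = qr/\A[feio|]\z/;

# Single characters plus one whole word; a class never matches the word,
# because it matches exactly one character.
my @probes = ("f", "e", "i", "o", "|", "x", "fee");

my @long  = map { ($_ =~ $spelled_out) ? 1 : 0 } @probes;
my @short = map { ($_ =~ $minimal)     ? 1 : 0 } @probes;

print "spelled out: @long\n";    # 1 1 1 1 1 0 0
print "minimal:     @short\n";   # 1 1 1 1 1 0 0

# What was probably intended: alternation in a group, not in a class.
my $word_ok = ("fee" =~ /\A(?:fee|fie|foe|foo)\z/) ? 1 : 0;   # 1

# Inside brackets, "." is a literal dot, not a wildcard.
my $dot_matches_a   = ("a" =~ /\A[.]\z/) ? 1 : 0;   # 0
my $dot_matches_dot = ("." =~ /\A[.]\z/) ? 1 : 0;   # 1

print "$word_ok $dot_matches_a $dot_matches_dot\n";
```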

5.4.2. Classic Perl Character Class Shortcuts

Since the beginning, Perl has provided a number of character class shortcuts. These are listed in Table 5-8. All of them are backslashed alphabetic metasymbols, and in each case, the uppercase version is the negation of the lowercase version. The meanings of these are not quite as fixed as you might expect; the meanings can be influenced by locale settings. Even if you don't use locales, the meanings can change whenever a new Unicode standard comes out, adding scripts with new digits and letters. (To keep the old byte meanings, you can always say use bytes. For explanations of the utf8 meanings, see "Unicode Properties" later in this chapter. In any case, the utf8 meanings are a superset of the byte meanings.)

Table 5-8. Classic Character Classes

Symbol	Meaning	As Bytes	As utf8
`\d`	Digit	`[0-9]`	`\p{IsDigit}`
`\D`	Nondigit	`[^0-9]`	`\P{IsDigit}`
`\s`	Whitespace	`[ \t\n\r\f]`	`\p{IsSpace}`
`\S`	Nonwhitespace	`[^ \t\n\r\f]`	`\P{IsSpace}`
`\w`	Word character	`[a-zA-Z0-9_]`	`\p{IsWord}`
`\W`	Non-(word character)	`[^a-zA-Z0-9_]`	`\P{IsWord}`

(Yes, we know most words don't have numbers or underscores in them; \w is for matching "words" in the sense of tokens in a typical programming language. Or Perl, for that matter.)

These metasymbols may be used either outside or inside square brackets, that is, either standalone or as part of a constructed character class:

if ($var =~ /\D/)        { warn "contains non-digit" }
if ($var =~ /[^\w\s.]/)  { warn "contains non-(word, space, dot)" }
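Because each shortcut matches a single character, a quantifier is needed to pick up a whole token. A brief sketch, using made-up input strings:

```perl
use strict;
use warnings;

my $line = "foo_bar42 baz";

my ($one_char) = $line =~ /(\w)/;     # "f": \w alone is one character
my ($one_word) = $line =~ /(\w+)/;    # "foo_bar42": quantified, a whole token
my ($digits)   = $line =~ /(\d+)/;    # "42"

# A shortcut combined with a literal dot inside a constructed class.
my @tokens = "v1.2 beta-3" =~ /([\w.]+)/g;   # ("v1.2", "beta", "3")

print "$one_char $one_word $digits\n";
print join("|", @tokens), "\n";
```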

5.4.3. Unicode Properties

Unicode properties are available using \p{PROP} and its set complement, \P{PROP}. For the rare properties with one-character names, braces are optional, as in \pN to indicate a numeric character (not necessarily decimal--Roman numerals are numeric characters too). These property classes may be used by themselves or combined in a constructed character class:

if ($var =~ /^\p{IsAlpha}+$/)      { print "all alphabetic" }
if ($var =~ s/[\p{Zl}\p{Zp}]/\n/g) { print "fixed newline wannabes" }

Some properties are directly defined in the Unicode standard, and some properties are composites defined by Perl, based on the standard properties. Zl and Zp are standard Unicode properties representing line separators and paragraph separators, while IsAlpha is defined by Perl to be a property class combining the standard properties Ll, Lu, Lt, and Lo (that is, letters that are lowercase, uppercase, titlecase, or other). As of version 5.6.0 of Perl, you need to say use utf8 for these properties to work. This restriction will be relaxed in the future.
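Here is a small sketch of these properties at work. It assumes a Perl newer than 5.6, where the properties are available without the pragma, and it writes the non-ASCII characters as \x{...} escapes.

```perl
use strict;
use warnings;

# ROMAN NUMERAL FIVE (U+2164) is numeric but not a decimal digit.
my $roman_five = "\x{2164}";
my $is_numeric = ($roman_five =~ /\A\pN\z/)     ? 1 : 0;   # 1
my $is_decimal = ($roman_five =~ /\A\p{Nd}\z/)  ? 1 : 0;   # 0

# Two Greek lowercase letters (alpha, beta) are alphabetic.
my $all_alpha = ("\x{3B1}\x{3B2}" =~ /^\p{IsAlpha}+$/) ? 1 : 0;   # 1

# LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029) become newlines.
my $text = "one\x{2028}two\x{2029}three";
(my $fixed = $text) =~ s/[\p{Zl}\p{Zp}]/\n/g;

print "$is_numeric $is_decimal $all_alpha\n";
print $fixed, "\n";
```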

There are a great many properties. We'll list the ones we know about, but the list is necessarily incomplete. New properties are likely to be in new versions of Unicode, and you can even define your own properties. More about that later.

The Unicode Consortium produces the online resources that turn into the various files Perl uses in its Unicode implementation. For more about these files, see Chapter 15, "Unicode". You can get a nice overview of Unicode in the document PATH_TO_PERLLIB/unicode/Unicode3.html where PATH_TO_PERLLIB is what is printed out by:

perl -MConfig -le 'print $Config{privlib}'

Most Unicode properties are of the form \p{IsPROP}. The Is is optional, since it's so common, but you may prefer to leave it in for readability.
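A quick sketch of that equivalence, using the uppercase-letter property and a few sample characters of our own choosing:

```perl
use strict;
use warnings;

# "A", "a", GREEK CAPITAL LETTER LAMDA (U+039B), and a digit.
my @samples = ("A", "a", "\x{39B}", "5");

my @with_is    = map { ($_ =~ /\A\p{IsLu}\z/) ? 1 : 0 } @samples;
my @without_is = map { ($_ =~ /\A\p{Lu}\z/)   ? 1 : 0 } @samples;

print "IsLu: @with_is\n";      # 1 0 1 0
print "Lu:   @without_is\n";   # 1 0 1 0
```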

5.4.3.1. Perl's Unicode properties

First, Table 5-9 lists Perl's composite properties. They're defined to be reasonably close to the standard POSIX definitions for character classes.

Table 5-9. Composite Unicode Properties

Property	Equivalent
`IsASCII`	`[\x00-\x7f]`
`IsAlnum`	`[\p{IsLl}\p{IsLu}\p{IsLt}\p{IsLo}\p{IsNd}]`
`IsAlpha`	`[\p{IsLl}\p{IsLu}\p{IsLt}\p{IsLo}]`
`IsCntrl`	`\p{IsC}`
`IsDigit`	`\p{Nd}`
`IsGraph`	`[^\pC\p{IsSpace}]`
`IsLower`	`\p{IsLl}`
`IsPrint`	`\P{IsC}`
`IsPunct`	`\p{IsP}`
`IsSpace`	`[\t\n\f\r\p{IsZ}]`
`IsUpper`	`[\p{IsLu}\p{IsLt}]`
`IsWord`	`[_\p{IsLl}\p{IsLu}\p{IsLt}\p{IsLo}\p{IsNd}]`
`IsXDigit`	`[0-9a-fA-F]`

Perl also provides the following composites for each of the main categories of standard Unicode properties (see the next section):

Property	Meaning	Normative
`IsC`	Crazy control codes and such	Yes
`IsL`	Letters	Partly
`IsM`	Marks	Yes
`IsN`	Numbers	Yes
`IsP`	Punctuation	No
`IsS`	Symbols	No
`IsZ`	Separators (Zeparators?)	Yes

5.4.3.2. Standard Unicode properties

Table 5-10 lists the most basic standard Unicode properties, derived from each character's category. No character is a member of more than one category. Some properties are normative; others are merely informative. See the Unicode Standard for the standard spiel on just how normative the normative information is, and just how informative the informative information isn't.

Table 5-10. Standard Unicode Properties

Property	Meaning	Normative
`IsCc`	Other, Control	Yes
`IsCf`	Other, Format	Yes
`IsCn`	Other, Not assigned	Yes
`IsCo`	Other, Private Use	Yes
`IsCs`	Other, Surrogate	Yes
`IsLl`	Letter, Lowercase	Yes
`IsLm`	Letter, Modifier	No
`IsLo`	Letter, Other	No
`IsLt`	Letter, Titlecase	Yes
`IsLu`	Letter, Uppercase	Yes
`IsMc`	Mark, Combining	Yes
`IsMe`	Mark, Enclosing	Yes
`IsMn`	Mark, Nonspacing	Yes
`IsNd`	Number, Decimal digit	Yes
`IsNl`	Number, Letter	Yes
`IsNo`	Number, Other	Yes
`IsPc`	Punctuation, Connector	No
`IsPd`	Punctuation, Dash	No
`IsPe`	Punctuation, Close	No
`IsPf`	Punctuation, Final quote	No
`IsPi`	Punctuation, Initial quote	No
`IsPo`	Punctuation, Other	No
`IsPs`	Punctuation, Open	No
`IsSc`	Symbol, Currency	No
`IsSk`	Symbol, Modifier	No
`IsSm`	Symbol, Math	No
`IsSo`	Symbol, Other	No
`IsZl`	Separator, Line	Yes
`IsZp`	Separator, Paragraph	Yes
`IsZs`	Separator, Space	Yes
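Since no character belongs to more than one category, testing a character against every two-letter property in the table should yield exactly one hit. The sketch below does that for a few code points; the categories_of helper and the sample list are hypothetical, and the property names are interpolated without the optional Is prefix.

```perl
use strict;
use warnings;

my @categories = qw(Cc Cf Cn Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No
                    Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs);

# Hypothetical helper: every category from the table that the character matches.
sub categories_of {
    my ($char) = @_;
    return grep { $char =~ /\A\p{$_}\z/ } @categories;
}

# a, A, U+01C5 (a titlecase digraph), 5, _, $, +, space, newline, LINE SEPARATOR
my @samples = (0x61, 0x41, 0x1C5, 0x35, 0x5F, 0x24, 0x2B, 0x20, 0x0A, 0x2028);

my @found;
for my $cp (@samples) {
    my @cats = categories_of(chr $cp);
    push @found, join(",", @cats);
    printf "U+%04X: %s\n", $cp, join(",", @cats);
}
```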

Another useful set of properties has to do with whether a given character can be decomposed (either canonically or compatibly) into other simpler characters. Canonical decomposition doesn't lose any formatting information. Compatibility decomposition may lose formatting information such as whether a character is a superscript.

Property	Information Lost
`IsDecoCanon`	Nothing
`IsDecoCompat`	Something (one of the following)
`IsDCcircle`	Circle around character
`IsDCfinal`	Final position preference (Arabic)
`IsDCfont`	Variant font preference
`IsDCfraction`	Vulgar fraction characteristic
`IsDCinitial`	Initial position preference (Arabic)
`IsDCisolated`	Isolated position preference (Arabic)
`IsDCmedial`	Medial position preference (Arabic)
`IsDCnarrow`	Narrow characteristic
`IsDCnoBreak`	Nonbreaking preference on space or hyphen
`IsDCsmall`	Small characteristic
`IsDCsquare`	Square around CJK character
`IsDCsub`	Subscription
`IsDCsuper`	Superscription
`IsDCvertical`	Rotation (horizontal to vertical)
`IsDCwide`	Wide characteristic
`IsDCcompat`	Identity (miscellaneous)

Here are some properties of interest to people doing bidirectional rendering:

Property	Meaning
`IsBidiL`	Left-to-right
`IsBidiLRE`	Left-to-right embedding
`IsBidiLRO`	Left-to-right override
`IsBidiR`	Right-to-left
`IsBidiAL`	Right-to-left Arabic
`IsBidiRLE`	Right-to-left embedding
`IsBidiRLO`	Right-to-left override
`IsBidiPDF`	Pop directional format
`IsBidiEN`	European number
`IsBidiES`	European number separator
`IsBidiET`	European number terminator
`IsBidiAN`	Arabic number
`IsBidiCS`	Common number separator
`IsBidiNSM`	Nonspacing mark
`IsBidiBN`	Boundary neutral
`IsBidiB`	Paragraph separator
`IsBidiS`	Segment separator
`IsBidiWS`	Whitespace
`IsBidiON`	Other Neutrals
`IsMirrored`	Reverse when used right-to-left

The following properties classify various syllabaries according to vowel sounds:

IsSylA         IsSylE         IsSylO         IsSylWAA       IsSylWII
IsSylAA        IsSylEE        IsSylOO        IsSylWC        IsSylWO
IsSylAAI       IsSylI         IsSylU         IsSylWE        IsSylWOO
IsSylAI        IsSylII        IsSylV         IsSylWEE       IsSylWU
IsSylC         IsSylN         IsSylWA        IsSylWI        IsSylWV

For example, \p{IsSylA} would match \N{KATAKANA LETTER KA} but not \N{KATAKANA LETTER KU}.

Now that we've basically told you all these Unicode 3.0 properties, we should point out that a few of the more esoteric ones aren't implemented in version 5.6.0 of Perl because its implementation was based in part on Unicode 2.0, and things like the bidirectional algorithm were still being worked out. However, by the time you read this, the missing properties may well be implemented, so we listed them anyway.

5.4.3.3. Unicode block properties

Some Unicode properties are of the form \p{InSCRIPT}. (Note the distinction between Is and In.) The In properties are for testing block ranges of a particular SCRIPT. If you have a character, and you wonder whether it is written in Greek script, you could test with:

print "It's Greek to me!\n" if chr(931) =~ /\p{InGreek}/;

That works by checking whether a character is "in" the valid range of that script type. This may be negated with \P{InSCRIPT} to find out whether something isn't in a particular script's block, such as \P{InDingbats} to test whether a string contains a non-dingbat. Block properties include the following:

InArabic       InCyrillic     InHangulJamo   InMalayalam    InSyriac
InArmenian     InDevanagari   InHebrew       InMongolian    InTamil
InArrows       InDingbats     InHiragana     InMyanmar      InTelugu
InBasicLatin   InEthiopic     InKanbun       InOgham        InThaana
InBengali      InGeorgian     InKannada      InOriya        InThai
InBopomofo     InGreek        InKatakana     InRunic        InTibetan
InBoxDrawing   InGujarati     InKhmer        InSinhala      InYiRadicals
InCherokee     InGurmukhi     InLao          InSpecials     InYiSyllables

Not to mention jawbreakers like these:

InAlphabeticPresentationForms         InHalfwidthandFullwidthForms
InArabicPresentationForms-A           InHangulCompatibilityJamo
InArabicPresentationForms-B           InHangulSyllables
InBlockElements                       InHighPrivateUseSurrogates
InBopomofoExtended                    InHighSurrogates
InBraillePatterns                     InIdeographicDescriptionCharacters
InCJKCompatibility                    InIPAExtensions
InCJKCompatibilityForms               InKangxiRadicals
InCJKCompatibilityIdeographs          InLatin-1Supplement
InCJKRadicalsSupplement               InLatinExtended-A
InCJKSymbolsandPunctuation            InLatinExtended-B
InCJKUnifiedIdeographs                InLatinExtendedAdditional
InCJKUnifiedIdeographsExtensionA      InLetterlikeSymbols
InCombiningDiacriticalMarks           InLowSurrogates
InCombiningHalfMarks                  InMathematicalOperators
InCombiningMarksforSymbols            InMiscellaneousSymbols
InControlPictures                     InMiscellaneousTechnical
InCurrencySymbols                     InNumberForms
InEnclosedAlphanumerics               InOpticalCharacterRecognition
InEnclosedCJKLettersandMonths         InPrivateUse
InGeneralPunctuation                  InSuperscriptsandSubscripts
InGeometricShapes                     InSmallFormVariants
InGreekExtended                       InSpacingModifierLetters

And the winner is:

InUnifiedCanadianAboriginalSyllabics

See PATH_TO_PERLLIB/unicode/In/*.pl to get an up-to-date listing of all of these character block properties. Note that these In properties are only testing to see if the character is in the block of characters allocated for that script. There is no guarantee that all characters in that range are defined; you also need to test against one of the Is properties discussed earlier to see if the character is defined. There is also no guarantee that a particular language doesn't use characters outside its assigned block. In particular, many European languages mix extended Latin characters with Latin-1 characters.
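The distinction between "in the block" and "defined" can be sketched as follows. The example assumes that U+0378 is still an unassigned position inside the Greek block, which is true of the Unicode versions we are aware of but could change in a later one.

```perl
use strict;
use warnings;

my $sigma = chr(931);     # GREEK CAPITAL LETTER SIGMA (U+03A3)
my $gap   = chr(0x378);   # assumed unassigned, yet inside the Greek block

# Both code points fall in the block allocated to Greek.
my $sigma_in_block = ($sigma =~ /\p{InGreek}/) ? 1 : 0;   # 1
my $gap_in_block   = ($gap   =~ /\p{InGreek}/) ? 1 : 0;   # 1

# Only one of them is an assigned character (IsCn means "not assigned").
my $sigma_is_greek = ($sigma =~ /\p{InGreek}/ && $sigma !~ /\p{IsCn}/) ? 1 : 0;   # 1
my $gap_is_greek   = ($gap   =~ /\p{InGreek}/ && $gap   !~ /\p{IsCn}/) ? 1 : 0;   # 0

print "in block: $sigma_in_block $gap_in_block\n";
print "defined:  $sigma_is_greek $gap_is_greek\n";
```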

But hey, if you need a particular property that isn't provided, that's not a big problem. Read on.

5.4.3.4. Defining your own character properties

To define your own property, you need to write a subroutine with the name of the property you want (see Chapter 6, "Subroutines"). The subroutine should be defined in the package that needs the property (see Chapter 10, "Packages"), which means that if you want to use it in multiple packages, you'll either have to import it from a module (see Chapter 11, "Modules"), or inherit it as a class method from the package in which it is defined (see Chapter 12, "Objects").

Once you've got that all settled, the subroutine should return data in the same format as the files in the PATH_TO_PERLLIB/unicode/Is directory. That is, just return a list of characters or character ranges in hexadecimal, one per line. If there is a range, the two numbers are separated by a tab. Suppose you wanted a property that would be true if your character is in the range of either of the Japanese syllabaries, known as hiragana and katakana. (Together they're known as kana.) You can just put in the two ranges like this:

sub InKana {
    return <<'END';
3040    309F
30A0    30FF
END
}

Alternatively, you could define it in terms of existing property names:

sub InKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
END
}

You can also do set subtraction using a "-" prefix. Suppose you only wanted the actual characters, not just the block ranges of characters. You could weed out all the undefined ones like this:

sub IsKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

You can also start with a complemented character set using the "!" prefix:

sub IsNotKana {
    return <<'END';
!utf8::InHiragana
-utf8::InKatakana
+utf8::IsCn
END
}
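Once defined, such a property is used like any built-in one. The sketch below repeats two of the definitions so that it is self-contained (InKana is written with an explicit \t so the tab separator is visible) and then tests a few code points. It assumes U+3040 is an unassigned position in the hiragana block.

```perl
use strict;
use warnings;

# Block ranges only: two hex ranges, each pair separated by a tab.
sub InKana {
    return "3040\t309F\n30A0\t30FF\n";
}

# Block ranges minus the unassigned code points.
sub IsKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

my $hiragana_a  = "\x{3042}";   # HIRAGANA LETTER A
my $katakana_ka = "\x{30AB}";   # KATAKANA LETTER KA
my $reserved    = "\x{3040}";   # in the hiragana block, assumed unassigned

my @in_kana = map { ($_ =~ /\A\p{InKana}\z/) ? 1 : 0 }
              ($hiragana_a, $katakana_ka, $reserved, "a");
my @is_kana = map { ($_ =~ /\A\p{IsKana}\z/) ? 1 : 0 }
              ($hiragana_a, $katakana_ka, $reserved, "a");

print "InKana: @in_kana\n";   # 1 1 1 0
print "IsKana: @is_kana\n";   # 1 1 0 0
```

The subroutines are defined before the patterns that use them, which keeps the lookup straightforward.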

Perl itself uses exactly the same tricks to define the meanings of its "classic" character classes (like \w) when you include them in your own custom character classes (like [-.\w\s]). You might think that the more complicated you get with your rules, the slower they will run, but in fact, once Perl has calculated the bit pattern for a particular 64-bit swatch of your property, it caches it so it never has to recalculate the pattern again. (It does it in 64-bit swatches so that it doesn't even have to decode your utf8 to do its lookups.) Thus, all character classes, built-in or custom, run at essentially the same speed (fast) once they get going.

5.4.4. POSIX-Style Character Classes

Unlike Perl's other character class shortcuts, the POSIX-style character class notation, [:CLASS:], is available for use only when constructing other character classes, that is, inside an additional pair of square brackets. For example, /[.,[:alpha:][:digit:]]/ will search for one character that is either a literal dot (because it's in a character class), a comma, an alphabetic character, or a digit.
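As a small illustration, the same pattern can be applied globally to a made-up string to collect every character it accepts:

```perl
use strict;
use warnings;

my $input = "a,b.c!9 z";

# Each match is one character: a dot, a comma, a letter, or a digit.
my @hits = $input =~ /[.,[:alpha:][:digit:]]/g;

# The "!" and the space are the only characters left out.
print join("", @hits), "\n";   # a,b.c9z
```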

The POSIX classes available as of revision 5.6 of Perl are shown in Table 5-11.

Table 5-11. POSIX Character Classes

Class	Meaning
`alnum`	Any alphanumeric, that is, an `alpha` or a `digit`.
`alpha`	Any letter. (That's a lot more letters than you think, unless you're thinking Unicode, in which case it's still a lot.)
`ascii`	Any character with an ordinal value between 0 and 127.
`cntrl`	Any control character. Usually characters that don't produce output as such, but instead control the terminal somehow; for example, newline, form feed, and backspace are all control characters. Characters with an `ord` value less than 32 are most often classified as control characters.
`digit`	A character representing a decimal digit, such as `0` to `9`. (Includes other characters under Unicode.) Equivalent to `\d`.
`graph`	Any alphanumeric or punctuation character.
`lower`	A lowercase letter.
`print`	Any alphanumeric or punctuation character or space.
`punct`	Any punctuation character.
`space`	Any space character. Includes tab, newline, form feed, and carriage return (and a lot more under Unicode). Equivalent to `\s`.
`upper`	Any uppercase (or titlecase) letter.
`word`	Any identifier character, either an `alnum` or underline.
`xdigit`	Any hexadecimal digit. Though this may seem silly (`[0-9a-fA-F]` works just fine), it is included for completeness.

You can negate the POSIX character classes by prefixing the class name with a ^ following the [:. (This is a Perl extension.) For example:

POSIX	Classic
`[:^digit:]`	`\D`
`[:^space:]`	`\S`
`[:^word:]`	`\W`
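The correspondence can be spot-checked with a sketch like the following, which compares each negated POSIX class with its classic counterpart over a few hypothetical probe characters and counts disagreements.

```perl
use strict;
use warnings;

my @probes = ("7", "x", " ", "_", "-", "\t");

my $mismatches = 0;
my $nondigits  = 0;
for my $ch (@probes) {
    # Each pair should agree on every probe.
    $mismatches++ if ($ch =~ /[[:^digit:]]/ ? 1 : 0) != ($ch =~ /\D/ ? 1 : 0);
    $mismatches++ if ($ch =~ /[[:^space:]]/ ? 1 : 0) != ($ch =~ /\S/ ? 1 : 0);
    $mismatches++ if ($ch =~ /[[:^word:]]/  ? 1 : 0) != ($ch =~ /\W/ ? 1 : 0);
    $nondigits++  if $ch =~ /[[:^digit:]]/;
}

print "mismatches: $mismatches\n";   # 0
print "nondigits:  $nondigits\n";    # 5 (everything except "7")
```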

If the use utf8 pragma is not requested, but the use locale pragma is, the classes correlate directly with the equivalent functions in the C library's isalpha(3) interface (except for word, which is a Perl extension, mirroring \w).

If the utf8 pragma is used, POSIX character classes are exactly equivalent to the corresponding Is properties listed in Table 5-9. For example [:lower:] and \p{Lower} are equivalent, except that the POSIX classes may only be used within constructed character classes, whereas Unicode properties have no such restriction and may be used in patterns wherever Perl shortcuts like \s and \w may be used.

The brackets are part of the POSIX-style [::] construct, not part of the whole character class. This leads to writing patterns like /^[[:lower:][:digit:]]+$/, to match a string consisting entirely of lowercase letters or digits (plus an optional trailing newline). In particular, this does not work:

42 =~ /^[:digit:]$/         # WRONG

That's because it's not inside a character class. Rather, it is a character class, the one representing the characters ":", "i", "t", "g", and "d". Perl doesn't care that you specified ":" twice.

Here's what you need instead:

42 =~ /^[[:digit:]]+$/
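The difference is easy to see by running both forms. In this sketch the regexp warning category is switched off, on the assumption that a recent Perl would otherwise warn about the misplaced POSIX syntax; the sample strings are ours.

```perl
use strict;
use warnings;
no warnings 'regexp';   # the WRONG pattern below may otherwise draw a warning

# WRONG: an ordinary class containing ":", "d", "i", "g", and "t".
my $wrong_on_42 = ("42" =~ /^[:digit:]$/) ? 1 : 0;   # 0
my $wrong_on_d  = ("d"  =~ /^[:digit:]$/) ? 1 : 0;   # 1, which is not what was meant

# Right: the POSIX class sits inside its own pair of square brackets.
my $right_on_42 = ("42" =~ /^[[:digit:]]+$/) ? 1 : 0;   # 1
my $right_on_d  = ("d"  =~ /^[[:digit:]]+$/) ? 1 : 0;   # 0

# Two POSIX classes in one constructed class; $ tolerates a trailing newline.
my $lower_digit = ("abc1\n" =~ /^[[:lower:][:digit:]]+$/) ? 1 : 0;   # 1

print "$wrong_on_42 $wrong_on_d $right_on_42 $right_on_d $lower_digit\n";
```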

The POSIX character classes [.cc.] and [=cc=] are recognized but produce an error indicating they are not supported. Trying to use any POSIX character class in older versions of Perl is likely to fail miserably, and perhaps even silently. If you're going to use POSIX character classes, it's best to require a new version of Perl by saying:

use 5.6.0;