Matching Letters (Perl Cookbook, 2nd Edition)

6.2.3. Discussion

Apart from Unicode properties or POSIX character classes, Perl can't directly express "something alphabetic" independent of locale, so we have to be more clever. The \w regular expression notation matches one alphabetic, numeric, or underscore character (hereafter known as an "alphanumunder" for short). Therefore, \W is one character that is not one of those. The negated character class [^\W\d_] matches one character that is not a non-alphanumunder (so it must be an alphanumunder), not a digit, and not an underscore. That leaves nothing but alphabetics, which is what we were looking for.
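The elimination argument above can be checked one character at a time. The following is only a minimal sketch using plain ASCII input and no locale; the sub name classify is hypothetical and the sample characters are chosen purely for illustration.

```perl
use strict;
use warnings;

# Report, for a single character, the three tests that the negated
# class combines, plus the verdict of [^\W\d_] itself.
# The sub name "classify" is hypothetical.
sub classify {
    my ($ch) = @_;
    return sprintf "%s: \\w=%d \\d=%d _=%d letter=%d", $ch,
        ($ch =~ /^\w$/       ? 1 : 0),   # alphanumunder?
        ($ch =~ /^\d$/       ? 1 : 0),   # digit?
        ($ch eq '_'          ? 1 : 0),   # underscore?
        ($ch =~ /^[^\W\d_]$/ ? 1 : 0);   # what is left: a letter
}

print classify($_), "\n" for qw(a Z 7 _ !);
```

Only the characters that pass \w while failing both the digit and underscore tests come out with letter=1.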

Here's how you'd use this in a program:

use locale;
use POSIX 'locale_h';

# the following locale string might be different on your system
unless (setlocale(LC_ALL, "fr_CA.ISO8859-1")) {
    die "couldn't set locale to French Canadian\n";
}

while (<DATA>) {
    chomp;
    if (/^[^\W\d_]+$/) {
        print "$_: alphabetic\n";
    } else {
        print "$_: line noise\n";
    }
}

__END__
silly
façade
coöperate
niño
Renée
Molière
hæmoglobin
naïve
tschüß
random!stuff#here

POSIX character classes help a little here; available ones are alpha, alnum, ascii, blank, cntrl, digit, graph, lower, print, punct, space, upper, word, and xdigit. These are valid only within a square-bracketed character class specification:

$phone =~ /\b[:digit:]{3}[[:space:][:punct:]]?[:digit:]{4}\b/;     # WRONG
$phone =~ /\b[[:digit:]]{3}[[:space:][:punct:]]?[[:digit:]]{4}\b/; # RIGHT
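To see why the first form is wrong, here is a minimal sketch with made-up sample strings. Outside an enclosing pair of square brackets, [:digit:] is parsed as an ordinary character class containing the characters ':', 'd', 'i', 'g', and 't'. Perl warns about this under use warnings, so the sketch silences that one warning category to keep the output quiet. The patterns are anchored with ^ and $ instead of \b to keep the comparison simple.

```perl
use strict;
use warnings;

# Hypothetical sample inputs, chosen only for illustration.
my @samples = ("555-1212", "555 1212", "5551212", "dig:tidi");

# POSIX classes nested inside an outer bracket pair, as intended.
my $right = qr/^[[:digit:]]{3}[[:space:][:punct:]]?[[:digit:]]{4}$/;

# The mistaken form: [:digit:] on its own is just a set of five
# literal characters. The regexp warning is disabled on purpose here.
my $wrong = do {
    no warnings 'regexp';
    qr/^[:digit:]{3}[[:space:][:punct:]]?[:digit:]{4}$/;
};

for my $s (@samples) {
    printf "%-10s right=%s wrong=%s\n", $s,
        ($s =~ $right ? "yes" : "no"),
        ($s =~ $wrong ? "yes" : "no");
}
```

The three numeric samples match only the correct pattern, while the string "dig:tidi" matches only the mistaken one.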

It would be easier to use properties instead, because they don't have to occur only within other square brackets:

$phone =~ /\b\p{Number}{3}[\p{Space}\p{Punctuation}]?\p{Number}{4}\b/;
$phone =~ /\b\pN{3}[\pZ\pP]?\pN{4}\b/;   # abbreviated form; \pZ (Separator) only approximates Space

Match any one character with Unicode property prop using \p{prop}; to match any character lacking that property, use \P{prop} or [^\p{prop}]. The relevant property when looking for alphabetics is Letter, which can be abbreviated as just L. (The Alphabetic property, short form Alpha, is a somewhat broader set that also includes some marks and letter-like numbers.) Other relevant properties include UppercaseLetter, LowercaseLetter, and TitlecaseLetter; their short forms are Lu, Ll, and Lt, respectively.
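As a rough illustration of these properties, the following sketch tests whole words against \p{L} and counts the cased letters in each. It assumes a Perl recent enough to apply Unicode rules to \p{} on every string and a terminal that displays UTF-8; the sub name is_alphabetic and the sample words are hypothetical.

```perl
use strict;
use warnings;

binmode(STDOUT, ":encoding(UTF-8)");   # print wide characters cleanly

# True if every character of the string has the Letter property.
# The sub name "is_alphabetic" is hypothetical.
sub is_alphabetic {
    my ($word) = @_;
    return $word =~ /^\p{L}+$/ ? 1 : 0;
}

# Sample words written with escapes so the file encoding does not matter.
# U+01C5 is a single titlecase letter (general category Lt).
my @words = ("fa\x{e7}ade", "Ren\x{e9}e", "R2D2", "snake_case", "\x{01C5}");

for my $word (@words) {
    my $upper = () = $word =~ /\p{Lu}/g;   # count uppercase letters
    my $lower = () = $word =~ /\p{Ll}/g;   # count lowercase letters
    my $title = () = $word =~ /\p{Lt}/g;   # count titlecase letters
    printf "%s: %s (Lu=%d Ll=%d Lt=%d)\n", $word,
        (is_alphabetic($word) ? "alphabetic" : "line noise"),
        $upper, $lower, $title;
}
```

Unlike the [^\W\d_] approach, this version does not depend on the current locale, because the properties are defined by Unicode itself.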
