Honoring Locale Settings in Regular Expressions (Perl Cookbook, 2nd Edition)

6.12. Honoring Locale Settings in Regular Expressions

6.12.1. Problem

You want to translate case when in a different locale, or you want to make \w match letters with diacritics, such as José or déjà vu.

For example, let's say you're given half a gigabyte of text written in German and told to index it. You want to extract words (with \w+) and convert them to lowercase (with lc or \L), but the normal versions of \w and lc neither match the German words nor change the case of accented letters.

6.12.2. Solution

Perl's regular-expression and text-manipulation routines have hooks to the POSIX locale setting. Under the use locale pragma, \w and the case-mapping functions (lc, uc, ucfirst, \L, \U) treat accented characters according to the current locale, assuming a reasonable LC_CTYPE specification and system support for the same.

use locale;
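Whether a particular LC_CTYPE locale is installed varies from system to system, so a cautious script might probe for one before relying on the pragma. The following is only a sketch: pick_ctype_locale is a hypothetical helper, and the German locale names passed to it are guesses that may not exist on your platform. Only "C" can be counted on.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: try each candidate locale name in order and
# return the first one this system accepts for LC_CTYPE, or undef.
# Side effect: LC_CTYPE is left set to the accepted locale.
sub pick_ctype_locale {
    my @candidates = @_;
    for my $name (@candidates) {
        # setlocale returns undef when the locale is unavailable
        return $name if defined setlocale(LC_CTYPE, $name);
    }
    return;
}

# The German names are assumptions; "C" is the fallback of last resort.
my $chosen = pick_ctype_locale(qw(de_DE.ISO8859-1 de_DE.iso88591 de_DE C));
print "Using LC_CTYPE locale: $chosen\n";

{
    use locale;    # lexically scoped: \w, lc, and ucfirst consult LC_CTYPE here
    my @words = ("andreas k\xF6nig" =~ /\b(\w+)\b/g);
    print scalar(@words), " words found\n";
}
```

Because use locale is lexically scoped, placing it inside a block confines the locale-sensitive behavior to the code that needs it. How many words are found depends on which locale was accepted.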

6.12.3. Discussion

By default, \w+ and case-mapping functions operate on upper- and lowercase letters, digits, and underscores. This works only for the simplest of English words, failing even on many common imports. The use locale directive redefines what a "word character" means.
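The default behavior can be seen in a short sketch. The helper name words_default is made up for illustration, and the example assumes a byte string with neither use locale nor the unicode_strings feature in effect.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # keep the legacy byte-string rules

# Hypothetical helper: extract words with the default \w and capitalize each.
sub words_default {
    my ($text) = @_;
    my @words;
    while ($text =~ /\b(\w+)\b/g) {
        push @words, ucfirst $1;
    }
    return @words;
}

# "\xF6" is o-umlaut in ISO 8859-1; without a locale it is not a word character
my @names = words_default("andreas k\xF6nig");
print "@names\n";    # prints "Andreas K Nig"
```

The accented letter acts as a word boundary, so the surname is split into two separate "words."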

In Example 6-7 you see the difference in output between having selected the English ("en") locale and the German ("de") one.

Example 6-7. localeg

  #!/usr/bin/perl -w
  # localeg - demonstrate locale effects
  use locale;
  use POSIX 'locale_h';
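  $name = "andreas k\xF6nig";     # \xF6 is o-umlaut in ISO 8859-1
  # locale names vary by system; adjust these to ones your platform provides
  @locale{qw(German English)} = qw(de_DE.ISO_8859-1 us-ascii);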
  setlocale(LC_CTYPE, $locale{English})
    or die "Invalid locale $locale{English}";
  @english_names = ( );
  while ($name =~ /\b(\w+)\b/g) {
          push(@english_names, ucfirst($1));
  }
  setlocale(LC_CTYPE, $locale{German})
    or die "Invalid locale $locale{German}";
  @german_names = ( );
  while ($name =~ /\b(\w+)\b/g) {
          push(@german_names, ucfirst($1));
  }
  print "English names: @english_names\n";
  print "German names:  @german_names\n";

  English names: Andreas K Nig
  German names:  Andreas König

This approach relies on POSIX locale support for 8-bit character encodings, which your system may or may not provide. Even if your system does claim to provide POSIX locale support, the standard does not specify the locale names. As you might guess, portability of this approach is not assured. If your data is already in Unicode, you don't need POSIX locales for this to work.
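As a rough illustration of the Unicode route, the sketch below decodes the bytes into a character string with the standard Encode module, after which \w and ucfirst follow Unicode rules without any locale. The helper name words_unicode is hypothetical, and the example assumes you know the encoding of the input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode bytes from the given encoding, then
# extract and capitalize words using Perl's Unicode semantics.
sub words_unicode {
    my ($bytes, $encoding) = @_;
    my $text = decode($encoding, $bytes);   # now a character string
    my @words;
    while ($text =~ /\b(\w+)\b/g) {
        push @words, ucfirst $1;
    }
    return @words;
}

my @names = words_unicode("andreas k\xF6nig", 'ISO-8859-1');
binmode STDOUT, ':encoding(UTF-8)';   # emit the result as UTF-8
print "@names\n";
```

No setlocale call is involved, so the result does not depend on which locales happen to be installed.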

6.12.4. See Also

The treatment of \b, \w, and \s in perlre(1) and in the "Classic Perl Character Class Shortcuts" section of Chapter 5 of Programming Perl; the treatment of locales in Perl in perllocale(1); your system's locale(3) manpage; the more detailed discussion of locales in Recipe 6.2; the "POSIX—An Attempt at Standardization" section of Chapter 3 of Mastering Regular Expressions.