16.1. Introduction
While
everyone who programs in PHP has to learn some English eventually to
get a handle on its function names and language constructs, PHP can
create applications that speak just about any language. Some
applications need to be used by speakers of many different languages.
PHP's support for internationalization and localization makes it
easier to take an application written for French speakers and make it
useful for German speakers.
Internationalization (often abbreviated I18N[14]) is the process of
taking an application designed for just one locale and restructuring
it so that it can be used in many different locales.
Localization (often abbreviated L10N[15]) is the process of adding
support for a new locale to an internationalized application.
A locale is a group
of settings that describe text formatting and language customs in a
particular area of the world. The settings are divided into six
categories:
- LC_COLLATE: These settings control text sorting: which letters go
  before and after others in alphabetical order.
- LC_CTYPE: These settings control mapping between uppercase and
  lowercase letters as well as which characters fall into the
  different character classes, such as alphanumeric characters.
- LC_MONETARY: These settings describe the preferred format of
  currency information, such as what character to use as a decimal
  point and how to indicate negative amounts.
- LC_NUMERIC: These settings describe the preferred format of numeric
  information, such as how to group numbers and what character is
  used as a thousands separator.
- LC_TIME: These settings describe the preferred format of time and
  date information, such as names of months and days and whether to
  use 24- or 12-hour time.
- LC_MESSAGES: This category contains text messages used by
  applications that need to display information in multiple
  languages.
There is also a metacategory, LC_ALL, that
encompasses all the categories.
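As a rough illustration of how the categories map onto code, the sketch below sets every category at once through LC_ALL and then changes a single category on its own. It assumes PHP 7 or later. The "C" locale should be available everywhere, but "fr_FR.UTF-8" is only an assumed name that may not be installed on your system, so the code checks the return value instead of relying on it.

```php
<?php
// Switch every category to the portable "C" locale.
setlocale(LC_ALL, 'C');

// Passing '0' as the locale name asks for the current setting
// without changing it.
$numeric = setlocale(LC_NUMERIC, '0');
echo "LC_NUMERIC is $numeric\n";

// localeconv() reports the LC_NUMERIC and LC_MONETARY settings in effect.
$conv = localeconv();
echo "Decimal point: {$conv['decimal_point']}\n";

// Change just one category. setlocale() returns false when the
// requested locale is unknown. 'fr_FR.UTF-8' is an assumed name.
$result = setlocale(LC_TIME, 'fr_FR.UTF-8');
if ($result === false) {
    echo "fr_FR.UTF-8 is not installed here\n";
} else {
    echo "LC_TIME is now $result\n";
}
```

Because only LC_TIME is changed in the last step, number and currency formatting keep following the "C" locale even when the French locale is available.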
A locale name generally has three components. The first, an
abbreviation that indicates a language, is mandatory: for example,
"en" for English or "pt" for Portuguese. Next, after an underscore,
comes an optional country specifier, which distinguishes between
countries that speak different versions of the same language: for
example, "en_US" for U.S. English and "en_GB" for British English, or
"pt_BR" for Brazilian Portuguese and "pt_PT" for European Portuguese.
Last, after a period, comes an optional character-set specifier: for
example, "zh_TW.Big5" for Chinese as used in Taiwan, encoded in the
Big5 character set. While most locale names follow these conventions,
some don't. One difficulty in using locales is that they can be
arbitrarily named. Finding and setting a locale is discussed in
Section 16.2 through Section 16.4.
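The naming convention can be sketched as a small parser. The function parse_locale_name() below is a hypothetical helper written for this illustration, not part of PHP, and it assumes PHP 7 or later. It handles only the language, country, and character-set form described above and returns null for any name that doesn't follow it.

```php
<?php
// Hypothetical helper: split a conventional locale name into its parts.
function parse_locale_name(string $name): ?array
{
    // language, then optional _COUNTRY, then optional .charset
    $pattern = '/^([a-z]{2,3})(?:_([A-Z]{2}))?(?:\.([A-Za-z0-9-]+))?$/';
    if (!preg_match($pattern, $name, $m)) {
        return null; // the name does not follow the convention
    }
    return [
        'language' => $m[1],
        // Optional groups that did not match are empty or absent.
        'country'  => ($m[2] ?? '') !== '' ? $m[2] : null,
        'charset'  => ($m[3] ?? '') !== '' ? $m[3] : null,
    ];
}

foreach (['en', 'pt_BR', 'zh_TW.Big5', 'French'] as $name) {
    $parts = parse_locale_name($name);
    echo $name, ': ', $parts === null ? 'nonstandard name' : json_encode($parts), "\n";
}
```

A name such as "French" falls through to the nonstandard case, which is why a program can't depend on parsing alone to find out which locales a system offers.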
Different techniques are necessary for correct localization of plain
text, dates and times, and currency. Localization can also be applied
to external entities your program uses, such as images and included
files. Localizing these kinds of content is covered in Section 16.5 through Section 16.9.
Systems for dealing with large amounts of localization data are
discussed in Section 16.10 and Section 16.11. Section 16.10 shows
some simple ways to manage the data, and Section 16.11 introduces GNU gettext, a
full-featured set of tools that provide localization support.
PHP also has limited
support for Unicode. Converting data to and from the Unicode UTF-8
encoding is addressed in Section 16.12.
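As a small preview of what such a conversion looks like, here is a minimal sketch that uses iconv(). It assumes the iconv extension is enabled and that the input is ISO-8859-1. This is one possible approach and not necessarily the one a later section uses.

```php
<?php
// "café" in ISO-8859-1: the é is the single byte 0xE9.
$latin1 = "caf\xE9";

// Convert to UTF-8, where é becomes the two bytes 0xC3 0xA9.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
echo bin2hex($utf8), "\n"; // prints 636166c3a9

// Convert back and confirm that the round trip is lossless.
$back = iconv('UTF-8', 'ISO-8859-1', $utf8);
var_dump($back === $latin1); // prints bool(true)
```

The UTF-8 string is one byte longer than the original, a reminder that byte counts and character counts differ once text leaves a single-byte encoding.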