localedef(4)

NAME

localedef — format and semantics of locale definition file

DESCRIPTION

This is a description of the syntax and meaning of the locale definition that is provided as input to the localedef command to create a locale (see localedef(1M)).

The following is a list of category tags, keywords and subsequent expressions which are recognized by localedef. The order of keywords within a category is irrelevant, with the exception of the copy keyword and other exceptions noted under the LC_COLLATE description. (Note that, as a convention, the category tags are composed of uppercase characters, while the keywords are composed of lowercase characters.)

Category Tags and Keywords

The following keywords do not belong to any category and should appear in the beginning of the locale definition file:

comment_char: Single character indicating the character to be interpreted as starting a comment line within the locale definition file. This character should be in the first column of a comment line. The default comment_char is #. All lines with a comment_char in the first column are ignored.
escape_char: A single character indicating the character to be interpreted as an escape character within the script. The default escape_char is \. escape_char is used to escape localedef metacharacters to remove special meaning and in the character constant decimal, octal, and hexadecimal formats. It is also used to continue a line onto the next, if escape_char is the last character on the line (before the new-line character).
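
As an illustrative sketch (not taken from any delivered locale source), a definition file that uses % for comments and / as the escape character might begin as shown below. The choice of % and / is an assumption made for the example; any single characters can be chosen.

```
comment_char  %
escape_char   /
% From here on, a line with % in the first column is a comment,
% and / replaces \ as the escape and line-continuation character.
```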

The following keywords can be used in any category:

copy: A string naming another valid locale available on the system. This causes the category in the locale being created to be a copy of the same category in the named locale. Since the copy keyword defines the entire category, if used, it must be the only keyword in the category.
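
For instance, a locale whose numeric formatting should match an existing locale could define the category with the copy keyword alone. The locale name "C" is an assumption for illustration; any valid locale installed on the system can be named.

```
LC_NUMERIC
copy "C"
END LC_NUMERIC
```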

The following six categories are recognized:

LC_CTYPE:

This category defines character classification, case conversion and other character attributes. The following predefined character classifications are recognized:

upper: Character codes classified as uppercase letters. Characters specified in the cntrl, digit, punct or space classifications cannot be specified in this classification.
lower: Character codes classified as lowercase letters. The same restrictions that apply to the upper classification apply to this classification.
digit: Character codes classified as numeric. Only ten characters in contiguous ascending sequence by numerical value can be specified. Alternative digits cannot be specified here.
space: Character codes classified as white-space. No character specified for the upper, lower, alpha, digit, graph or xdigit classifications can be included in this classification.
punct: Character codes classified as punctuation characters. No character included in the upper, lower, alpha, digit, cntrl, xdigit or space classifications can be specified.
cntrl: Character codes classified as control characters. No character included in the upper, lower, alpha, digit, punct, graph, print or xdigit classifications can be included here.
blank: Character codes classified as blank characters. The <space> and <tab> characters are automatically included.
xdigit: Character codes classified as hexadecimal digits. Only the characters defined for the digit class can be specified, followed by one or more sets of six characters, with each set in ascending order.
alpha: Character codes classified as letters. Characters classified as cntrl, digit, punct or space cannot be specified. Characters specified as upper and lower classes are automatically included in this class.
print: Character codes classified as printable characters. Characters specified for upper, lower, alpha, digit, xdigit, and punct classes and the <space> character are automatically included. No character from the cntrl classification can be specified.
graph: Character codes classified as printable characters, except the <space> character. In all other respects this classification is similar to the print classification.

The following two are special classifications, used to designate valid first-of-two and second-of-two bytes. Note that these are byte classifications and not character classifications; hence, they cannot be used with the iswctype interface (see wctype(3C)), in the same manner as the other classifications can be used.

first: Valid first bytes of two-byte characters.
second: Valid second bytes of two-byte characters.

Character case conversion definitions:

toupper: Lowercase to uppercase character relationships.
tolower: Uppercase to lowercase character relationships.

Miscellaneous character attribute and classifications:

alt_punct: String mapped into the ASCII equivalent string ``b!"#$%&'()*+,-./:;<=>?@[\]^_`{}~'', where b is a blank (a langinfo(5) item).
charclass: Defines one or more locale-specific character class names as strings separated by semicolons. Each named character class can then be defined subsequently in the LC_CTYPE definition. The first character of a character class name must be a letter and the class name cannot match any of the predefined classifications (for example, space, letter, cntrl).
direction: String operand indicates text direction (a langinfo(5) item). String operand "1" indicates right-to-left text direction.
context: String operand indicates character context analysis. String "1" indicates Arabic context analysis is required.
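
The fragment below sketches how a few of these keywords might appear together. It is deliberately incomplete (only three letters are classified), and it assumes an accompanying charmap file that defines the symbolic names used, such as <A>, <a>, <zero> and <space>.

```
LC_CTYPE
# Only three letters are shown; a real locale lists every character.
upper    <A>;<B>;<C>
lower    <a>;<b>;<c>
digit    <zero>;<one>;<two>;<three>;<four>;\
         <five>;<six>;<seven>;<eight>;<nine>
space    <space>;<tab>
# toupper pairs are (lowercase,uppercase); tolower pairs are the reverse.
toupper  (<a>,<A>);(<b>,<B>);(<c>,<C>)
tolower  (<A>,<a>);(<B>,<b>);(<C>,<c>)
END LC_CTYPE
```

The backslash at the end of the first digit line is the default escape character acting as a line continuation.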

LC_COLLATE:

The LC_COLLATE category provides collation sequence definition for relative ordering between collating elements (single and multi-character collating elements) in the locale. The following keywords belong to this category and should come between the category tag LC_COLLATE and END LC_COLLATE. The first two keywords can be in any order, but must come before the order_start keyword. Any number of the first two keywords can be specified.

collating-element <symbol> from string

Defines a multi-character collating element, symbol, composed of the characters in string. String is limited to two characters.

collating-symbol <symbol>

Makes symbol a collating symbol which can be used to define a place in the collating sequence. Symbol does not represent any actual character.

order_start

Denotes the start of the collation sequence. The directives have an effect on string collation.

The lines following the order_start keyword and before the order_end keyword contain collating element entries, one per line.

Operands can optionally appear after the order_start keyword to define rules for string comparison using a multiple-weight scheme (if no operands are specified, a single forward operand is assumed). The possible operands are:

forward: Specifies that comparison operations proceed from start of string towards the end of it.
backward: Specifies that comparison operations proceed from end of string towards the beginning of it.

order_end

Marks the end of the list of collating element entries.
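
The overall shape of the category can be sketched as follows. This is a minimal, incomplete example with a single forward pass and no explicit weights, so each element collates by its position in the list. The symbolic names <A> through <D> are assumed to come from a charmap file, and <CH> is defined locally.

```
LC_COLLATE
# Definitions come before order_start.
collating-element <CH> from "CH"
order_start forward
<A>
<B>
<C>
<CH>
<D>
order_end
END LC_COLLATE
```

In this sketch the two-character sequence CH collates as a single element between C and D.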

LC_MONETARY:

The LC_MONETARY category defines the rules and symbols used to format monetary numeric information. The following keywords belong to this category and should come between the category tag LC_MONETARY and END LC_MONETARY:

int_curr_symbol

The operand is a four-character string used to designate the international currency symbol. The first three characters should contain the alphabetic international currency symbol in accordance with those specified in the ISO 4217 standard. The fourth character is the character used to separate the international currency symbol from the monetary quantity.

currency_symbol

The operand is a string used as the local currency symbol.

mon_decimal_point

The operand is a string containing the symbol used as the decimal delimiter (radix character).

mon_thousands_sep

The operand is a string containing the symbol used as a separator for groups of digits to the left of decimal delimiter.

mon_grouping

The operand is a semicolon-separated list of integers. The initial integer defines the size of the group immediately preceding the decimal delimiter, and the following integers define the preceding groups. If the last integer is not -1, then the size of the previous group (if any) will be repeatedly used for the remainder of the digits. If the last integer is -1, then no further grouping will be performed.

positive_sign

The operand is a string to indicate a non-negative monetary quantity.

negative_sign

The operand is a string to indicate a negative monetary quantity.

int_frac_digits

The operand is an integer representing the number of fractional digits used in formatted monetary values using int_curr_symbol.

frac_digits

The operand is an integer representing the number of fractional digits used in formatted monetary values using currency_symbol.

p_cs_precedes

The operand is an integer which if set to 1 indicates the currency_symbol precedes a monetary quantity, and if set to 0 the symbol succeeds the value.

p_sep_by_space

The operand is an integer which indicates the separation of the currency_symbol, the sign string, and the value for a non-negative formatted monetary quantity.

The values of p_sep_by_space, n_sep_by_space, int_p_sep_by_space, and int_n_sep_by_space are interpreted according to the following:

0: No space separates the currency symbol and value.
1: If the currency symbol and sign string are adjacent, a space separates them from the value; otherwise, a space separates the currency symbol from the value.
2: If the currency symbol and sign string are adjacent, a space separates them; otherwise, a space separates the sign string from the value.

n_cs_precedes

The operand is an integer which if set to 1 indicates the currency_symbol precedes a negative monetary quantity, and if set to 0 the symbol succeeds the negative value.

n_sep_by_space

The operand is an integer which indicates the separation of the currency_symbol, the sign string, and the value for a negative formatted monetary quantity.

p_sign_posn

The operand is an integer which indicates the positioning of the positive_sign for a positive monetary quantity. The possible values are:

0: Parentheses surround the quantity and the currency_symbol or int_curr_symbol.
1: The sign string precedes the quantity and the currency_symbol or int_curr_symbol.
2: The sign string succeeds the quantity and the currency_symbol or int_curr_symbol.
3: The sign string precedes the currency_symbol or int_curr_symbol.
4: The sign string succeeds the currency_symbol or int_curr_symbol.

n_sign_posn

The operand is an integer set to a value indicating the positioning of the negative_sign for a negative formatted monetary quantity.

int_p_cs_precedes

The operand is an integer which if set to 1 indicates the int_curr_symbol precedes a monetary quantity, and if set to 0 the symbol succeeds the value.

int_p_sep_by_space

The operand is an integer which indicates the separation of the int_curr_symbol, the sign string, and the value for a non-negative internationally formatted monetary quantity.

int_n_cs_precedes

The operand is an integer which if set to 1 indicates the int_curr_symbol precedes a negative monetary quantity, and if set to 0 the symbol succeeds the negative value.

int_n_sep_by_space

The operand is an integer which indicates the separation of the int_curr_symbol, the sign string, and the value for a negative internationally formatted monetary quantity.

int_p_sign_posn

The operand is an integer which indicates the positioning of the positive_sign for a positive monetary quantity formatted with the international format.

int_n_sign_posn

The operand is an integer which indicates the positioning of the negative_sign for a negative monetary quantity formatted with the international format.
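
Putting the keywords together, a U.S.-dollar style category could look like the sketch below. The values are illustrative assumptions rather than the contents of any delivered locale, and the int_* positioning keywords are omitted for brevity.

```
LC_MONETARY
int_curr_symbol    "USD "
currency_symbol    "$"
mon_decimal_point  "."
mon_thousands_sep  ","
mon_grouping       3
positive_sign      ""
negative_sign      "-"
int_frac_digits    2
frac_digits        2
p_cs_precedes      1
p_sep_by_space     0
n_cs_precedes      1
n_sep_by_space     0
p_sign_posn        1
n_sign_posn        1
END LC_MONETARY
```

Under these settings a negative amount of 1234.56 would be expected to format locally as -$1,234.56: the sign string precedes the quantity and currency symbol (n_sign_posn 1), the symbol precedes the value (n_cs_precedes 1), and no space separates them (n_sep_by_space 0).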

LC_NUMERIC:

The LC_NUMERIC category defines rules and symbols used to format non-monetary numeric information. The following keywords belong to this category and should come between the category tag LC_NUMERIC and END LC_NUMERIC:

decimal_point: The operand is a string containing the symbol used as the decimal delimiter (radix character) in numeric, non-monetary formatted quantities. This keyword cannot be omitted and cannot be set to the empty string.
thousands_sep: The operand is a string containing the symbol used as a separator for groups of digits to the left of the decimal delimiter.
grouping: The operand is a semicolon-separated list of integers. The initial integer defines the size of the group immediately preceding the decimal delimiter, and the following integers define the preceding groups. If the last integer is not -1, then the size of the previous group (if any) will be repeatedly used for the remainder of the digits. If the last integer is -1, then no further grouping will be performed.
alt_digit: String mapped into the ASCII equivalent string "0123456789b+-.,eE ", where b is a blank (a langinfo(5) item). The alt_digit keyword is an HP extension to the localedef POSIX standards and it has a different meaning than the alt_digits defined in POSIX standards.
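
A minimal category using the three standard keywords might look like the following sketch; the values shown are assumptions chosen for illustration.

```
LC_NUMERIC
decimal_point  "."
thousands_sep  ","
grouping       3
END LC_NUMERIC
```

Following the grouping rule described above, the digit string 1234567 would be grouped as 1,234,567 with grouping 3, as 12,34,567 with grouping 3;2, and as 1234,567 with grouping 3;-1.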

LC_TIME:

The LC_TIME category defines the rules for generating locale-specific formatted date strings. The following mandatory keywords belong to this category and should come between the category tag LC_TIME and END LC_TIME:

abday

Seven semicolon-separated strings giving abbreviated names for the days of the week beginning with Sunday.

day

Seven semicolon-separated strings giving full names for the days of the week beginning with Sunday.

abmon

Twelve semicolon-separated strings giving abbreviated names for the months, beginning with January.

mon

Twelve semicolon-separated strings giving full names for the months, beginning with January.

d_t_fmt

The operand is a string defining the appropriate date and time representation.

d_fmt

The operand is a string defining the appropriate date representation.

t_fmt

The operand is a string defining the appropriate time representation.

am_pm

The operand is two semicolon-separated strings giving the representations for AM and PM.

t_fmt_ampm

The operand is a string defining the appropriate time representation in the 12-hour clock format with am_pm.

era

The operand is a semicolon-separated list of strings. Each string defines the name and date of an era or emperor for a locale. Each string should conform to the following format:

direction:offset:start_date:end_date:name:format

where:

direction: Either a + or - character. The + character indicates the time axis should be such that the years count in the positive direction when moving from the starting date towards the ending date. The - character indicates the time axis should be such that the years count in the negative direction when moving from the starting date towards the ending date.
offset: A number in the range [SHRT_MIN,SHRT_MAX] indicating the number of the first year of the era.
start_date: A date in the form yyyy/mm/dd where yyyy, mm, and dd are the year, month and day numbers, respectively, of the start of the era. Years prior to the year 0 A.D. are represented as negative numbers. For example, an era beginning March 5th in the year 100 B.C. would be represented as -100/3/5. Years in the range [SHRT_MIN+1,SHRT_MAX-1] are supported.
end_date: The ending date of the era in the same form as the start_date above or one of the two special values -* or +*. A value of -* indicates the ending date of the era extends to the beginning of time while +* indicates it extends to the end of time. The ending date can be chronologically either before or after the starting date of an era. For example, the expressions for the Christian eras A.D. and B.C. would be:
+:0:0000/01/01:+*:A.D.:%o %N
+:1:-0001/12/31:-*:B.C.:%o %N
name: A string representing the name of the era which is substituted for the %N directive of date and strftime() (see date(1) and strftime(3C)).
format: A string for formatting the %E directive of date and strftime(). This string is usually a function of the %o and %N directives. If format is not specified, the string specified for the LC_TIME category keyword era_d_fmt (see below) is used as a default.

era_d_fmt

The operand is a string defining the format of date in era notation.

era_t_fmt

The operand is a string defining the format of time in era notation.

era_d_t_fmt

The operand is a string defining the format of date and time in era notation.

alt_digits

The operand is a semicolon-separated list of strings. The first string is the alternative symbol corresponding to zero, the second string is the alternative symbol corresponding to one, and so on. Note that if the HP-UX-proprietary alt_digit keyword has been specified in the same locale, the first ten symbols should be identical for these two keywords.

In addition to the above, the following HP-UX-proprietary keywords are recognized (these are provided for backward compatibility and their use is otherwise not recommended): year_unit, mon_unit, day_unit, hour_unit, min_unit, sec_unit.
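
As a sketch of how the keywords above might be combined, consider the following English-language fragment. The format strings assume the directives documented in strftime(3C), the era entries reuse the A.D./B.C. example given earlier, and the remaining era_* and alt_digits keywords are omitted. None of the values are taken from a delivered locale.

```
LC_TIME
abday       "Sun";"Mon";"Tue";"Wed";"Thu";"Fri";"Sat"
day         "Sunday";"Monday";"Tuesday";"Wednesday";\
            "Thursday";"Friday";"Saturday"
abmon       "Jan";"Feb";"Mar";"Apr";"May";"Jun";\
            "Jul";"Aug";"Sep";"Oct";"Nov";"Dec"
mon         "January";"February";"March";"April";"May";"June";\
            "July";"August";"September";"October";"November";"December"
d_t_fmt     "%a %b %e %H:%M:%S %Y"
d_fmt       "%m/%d/%y"
t_fmt       "%H:%M:%S"
am_pm       "AM";"PM"
t_fmt_ampm  "%I:%M:%S %p"
era         "+:0:0000/01/01:+*:A.D.:%o %N";\
            "+:1:-0001/12/31:-*:B.C.:%o %N"
END LC_TIME
```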

LC_MESSAGES:

The LC_MESSAGES category defines the format and values for affirmative and negative responses. The following keywords belong to this category and should come between the category tag LC_MESSAGES and END LC_MESSAGES:

yesexpr: The string operand is an Extended Regular Expression matching acceptable affirmative responses to yes/no queries.
noexpr: The string operand is an Extended Regular Expression matching acceptable negative responses to yes/no queries.
yesstr: The string operand identifies the affirmative response for yes/no questions. This keyword is now obsolete and yesexpr should be used instead.
nostr: The string operand identifies the negative response for yes/no questions. This keyword is now obsolete and noexpr should be used instead.
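
A typical English-language definition might resemble the sketch below, in which any response beginning with y or Y is affirmative and any response beginning with n or N is negative. The expressions are illustrative assumptions.

```
LC_MESSAGES
yesexpr  "^[yY]"
noexpr   "^[nN]"
END LC_MESSAGES
```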

Keyword Operands

Keyword operands consist of character-code constants and symbols, strings, and metacharacters. The types of legal expressions are: character lists, string lists, integer lists, shift, collating element entries, regular expression, character constants and string:

character lists

character list operands consist of single character-code constants or symbolic names separated by semicolons, or a character-code range consisting of a constant or symbolic name followed by an ellipsis followed by another constant or symbolic name. The constant preceding the ellipsis must have a smaller code value than the constant following the ellipsis. A range represents a set of consecutive character codes. If the list is longer than a single line, the escape character must be used at the end of each line as a continuation character. It is an error to use any symbolic name that is not defined in an accompanying charmap file (see charmap(4)).

string lists

string list operands consist of strings separated by semicolons. If longer than one line, the escape character must be used for continuation.

string

string operands consist of a sequence of zero or more characters surrounded by double quotes ("). Within a string, the double-quote character must be preceded by an escape character. The following escape sequences also can be used:

\n: newline
\t: horizontal tab
\b: backspace
\r: carriage return
\f: form feed
\\: backslash
\': single quote
\ddd: bit pattern

The escape \ddd consists of the escape character followed by 1, 2, or 3 octal digits specifying the value of the desired character (for other possible bit pattern specification, see character constants below). Also, an escape character (\) and an immediately-following newline are ignored.

Although the backslash (\) has been used for illustration, another escape character can be substituted by the escape_char keyword.

character constants

Constants represent character codes in the operands. They can be used in the following forms:

decimal constants

An escape character followed by a 'd' followed by up to three decimal digits.

octal constants

An escape character followed by up to three octal digits.

hexadecimal constants

An escape character followed by an 'x' followed by two hexadecimal digits.

Unicode constants

An escape character followed by a 'u' followed by four to eight hexadecimal digits that specify a Unicode scalar value in a charmap file to be used with the -u option of the localedef command.

character constants

A single character (for example, A) having the numerical value of the character in the machine's character set.

symbolic names

A string enclosed between < and > is a symbolic name. localedef input files are recommended to be written entirely in symbolic names, utilizing a user defined or system-supplied charmap file. This aids portability of localedef input files between different encoded character sets (see charmap(4)).

Symbolic names can be defined within a locale definition file by the collating-element and collating-symbol keywords. These are not character constants. It is an error if such an internally defined symbolic name collides with one defined in a charmap file.
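
As a worked illustration of the numeric forms, the three alternative lines below would all name the same characters, assuming an ASCII-compatible coded character set in which A, B and C have the decimal codes 65, 66 and 67. Only one of the lines would appear in an actual definition.

```
# Decimal constants: \d followed by up to three decimal digits
upper  \d065;\d066;\d067
# Octal constants: 101, 102, 103 octal are 65, 66, 67 decimal
upper  \101;\102;\103
# Hexadecimal constants: 41, 42, 43 hexadecimal are 65, 66, 67 decimal
upper  \x41;\x42;\x43
```

Because these forms depend on the encoding, the symbolic form <A>;<B>;<C> remains the more portable choice when a charmap file is available.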

integer lists

Integer list operands consist of one or more decimal integers separated by semicolons.

shift

Shift operands follow the keywords toupper and tolower, and must consist of two character-code constants enclosed by left and right parentheses and separated by a comma. Each such character pair is separated from the next by a semicolon. For tolower, the first constant represents an uppercase character and the second the corresponding lowercase character. For toupper, the first constant represents a lowercase character and the second the corresponding uppercase character.

collating element entry

The order_start keyword is followed by collating element entries, one per line, in ascending order by collating position. The collating element entries have the form:

collation_element[weight[;weight]]

collation_element can be a character, a collating symbol enclosed in angle brackets representing a character or collating element, the special symbol UNDEFINED or an ellipsis (...).

A character stands for itself; a collating symbol can be a symbolic name for a character that is interpreted by the charmap file, a multi-character collating element defined by a collating-element keyword, or a collating symbol defined by the collating-symbol keyword.

The special symbol UNDEFINED specifies the collating position of any characters not explicitly defined by collating element entries. For example, if some group of characters is to be omitted from the collation sequence and just collate after all defined characters, a collating symbol might be defined before the order_start keyword:

collating-symbol  <HIGH> 

Then somewhere in the list of collating element entries:

UNDEFINED  <HIGH> 

Notice that there is no second weight. This means that on a second pass all characters collate by their encoded value.

An ellipsis is interpreted as a list of characters with an encoded value higher than that of the character on the preceding line and lower than that on the following line. Because it is tied to encoded value of characters, the ellipsis is inherently non-portable. If it is used, a warning is issued and no output generated unless the -c option was given.

The weight operands provide information about how the collating element is to be collated on first and subsequent passes. Weight can be a two-character string, the special symbol IGNORE, or a collating element of any of the forms specified for collation_element except UNDEFINED. If there are no weights, the character collates strictly by its position in the list. If there is only one weight given, the character sorts by its relative position in the list on the second collation pass.

An equivalence class is defined by a series of collating element entries all having the same character or symbol in the first weight position. For example, in many locales all forms of the character 'A' collate equal on the first pass. This is represented in the collating element entries as:

'A'    'A';'A' # first element of equivalence class 
'a'    'A';'a' # next element of class 

Two-to-one collating elements are specified by collating-element keywords that appear before the order_start keyword. For example, the two-to-one collating element CH in Spanish would be defined before the order_start keyword as:

collating-element <CH> from "CH"

It would then be used in a collating element entry as <CH>.

A one-to-two collating element is defined by having a two-character string in one of the weight positions. For example, if the character 'X' collates equal to the pair "AE", the collating element entry would be:

'X' "AE";'X' 

A don't-care character is defined by the special symbol IGNORE. For example, the dash character, '-', may be a don't-care character on the first collation pass. The collating element entry is:

'-'   IGNORE;'-' 

Symbols defined by the collating-symbol keyword can be used to indicate that a given character collates higher or lower than some position in the sequence. For example, if all characters with an encoded value less than that of '0' are to collate lower than all other characters on the first pass, and in relative order on the second pass, define a collating symbol before the order_start keyword:

collating-symbol    <LOW> 

The first two collating element entries are then:

...    <LOW>;... 
'0'    '0';'0' 

This also illustrates the use of the ellipsis to indicate a range. The first ellipsis is interpreted as "all characters in the encoded character set with a value lower than '0'"; the second ellipsis means that all characters in the range defined by the first collate in relative order.

regular expression

regular expression operands conform to the Extended Regular Expressions specifications as described in regexp(5).

Metacharacters

Metacharacters are characters having a special meaning to localedef in operands. To escape the special meaning of these characters, surround them with single quotes or precede them by an escape character. localedef metacharacters include:

<: Indicates the beginning of a symbolic name.
>: Indicates the end of a symbolic name.
(: Indicates the beginning of a character shift pair following the toupper and tolower keywords.
): Indicates the end of a character shift pair.
,: Used to separate the characters of a character shift pair.
": Used to quote strings.
;: Used as a separator in list operands.
escape character: Used to escape special meaning from other metacharacters and itself. It is backslash (\) by default, but can be redefined by the escape_char keyword.
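
Based solely on the escaping rule stated above, a character list that needs the metacharacters themselves as literal members might be written in either of the two alternative ways sketched below. The use of the punct keyword here is an assumption for illustration, and only one of the two lines would appear in an actual definition.

```
# Escape-character form (default escape character)
punct  \<;\>;\(;\);\,;\;
# Single-quote form of the same list
punct  '<';'>';'(';')';',';';'
```

In both forms the unescaped semicolons still act as list separators.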

Comments

Comments are lines beginning with a comment character. The comment character is pound sign (#) by default, but can be redefined by the comment_char keyword. Comments and blank lines are ignored.

Separators

Separator characters include blanks and tabs. Any number of separators can be used to delimit the keywords, metacharacters, constants and strings that comprise a localedef script, except that all characters between < and > are considered to be part of the symbolic name even if they are <blank>s.

EXAMPLES

Please see the files under /usr/lib/nls/loc/src for examples of locale description files. These files were used to create the various locales which are delivered with HP-UX.