4.3. String Datatypes
This section
discusses datatypes derived from the xs:string
primitive datatype as well as other datatypes that have a similar
behavior (namely, xs:hexBinary, xs:base64Binary, xs:anyURI, xs:QName, and xs:NOTATION). These
types are not expected to carry any quantifiable value (W3C XML
Schema doesn't even expect to be able to sort them)
and their value space is identical to their lexical space except when
explicitly described otherwise. One should note that even though they
are grouped in this section because they have a similar behavior,
these primitive datatypes are considered quite different by the
Recommendation.
The datatypes covered in this section are shown in Figure 4-2.
Figure 4-2. Strings and similar datatypes
The two exceptions in whitespace processing (xs:string and xs:normalizedString) are string
datatypes. One of the main differences between these types is the
applied whitespace processing. To stress this difference, we will
classify these types by their whitespace processing.
4.3.1. No Whitespace Replacement
- xs:string
-
This string datatype is the only predefined
datatype for which no whitespace replacement is performed. As we will
see in the next chapter, the whitespace replacement performed on
user-defined datatypes derived from this type can be defined without
restriction. On the other hand, a user datatype cannot be defined as
having no whitespace replacement if it is derived from any predefined
datatype other than xs:string.
As expected, a string is a set of characters matching the definition
given by XML 1.0, namely, "legal characters are tab,
carriage return, line feed, and the legal characters of Unicode and
ISO/IEC 10646."
The value of the following element:
<title lang="en">
Being a Dog Is
a Full-Time Job
</title>
is the full string:
Being a Dog Is
a Full-Time Job
with all its tabs and CR/LF characters, if the title element is of type
xs:string.
4.3.2. Normalized Strings
- xs:normalizedString
-
The normalized string is the only predefined datatype in which
whitespace replacement is performed without collapsing.
The lexical space of xs:normalizedString is the same as the
lexical space of xs:string from which it is
derived--except that since any occurrence of #x9 (tab), #xA
(linefeed), and #xD (carriage return) is replaced by a #x20 (space),
these three characters cannot be found in its lexical and value
spaces.
The value of the same element:
<title lang="en">
Being a Dog Is
a Full-Time Job
</title>
is now the string:
Being a Dog Is a Full-Time Job
in which all the whitespaces have been replaced by spaces, if the
title element is of type xs:normalizedString.
TIP:
There is no additional constraint on normalized strings. Any value
that is a valid xs:string is also a valid xs:normalizedString. The difference is the whitespace processing
that is applied when the lexical value is calculated.
4.3.3. Collapsed Strings
Whitespace collapsing is performed after
whitespace replacement by trimming the leading and trailing spaces
and replacing all the contiguous occurrences of spaces with a single
space. All the predefined datatypes (except, as we have seen, xs:string and xs:normalizedString) are
whitespace collapsed.
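The three whitespace behaviors can be sketched in a few lines of Python. This is only an illustration, not the processing code of any schema validator: the function names are hypothetical, and the sample title assumes that the element content is indented with tabs and uses LF line endings.

```python
import re

# The three characters that whitespace replacement turns into spaces.
_REPLACED = str.maketrans({"\t": " ", "\n": " ", "\r": " "})


def preserve(text: str) -> str:
    """xs:string behavior: the text is left untouched."""
    return text


def replace(text: str) -> str:
    """xs:normalizedString behavior: tab, LF, and CR each become one space."""
    return text.translate(_REPLACED)


def collapse(text: str) -> str:
    """xs:token behavior: replace, trim both ends, squeeze runs of spaces."""
    return re.sub(r" +", " ", replace(text).strip(" "))


# Assumed content of the title element (tab indentation, LF line endings).
title = "\n\tBeing a Dog Is\n\ta Full-Time Job\n"

print(repr(preserve(title)))  # unchanged
print(repr(replace(title)))   # same length, only spaces left as whitespace
print(repr(collapse(title)))  # 'Being a Dog Is a Full-Time Job'
```

Replacement never changes the length of the string, while collapsing usually shortens it. That is why many different parsed strings end up as the same collapsed value.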
We will classify tokens, binary formats, URIs, qualified names,
notations, and all their derived types under this category. Although
these datatypes share a number of properties, we must stress again
that this categorization is done for the purpose of explanation and
does not directly appear in the Recommendation.
4.3.3.1. Tokens
- xs:token
-
xs:token is xs:normalizedString on which the
whitespaces have been collapsed. Since whitespaces are accepted in
the lexical space of xs:token, this type is
better described as a "tokenized" string than as a "token"!
The same element:
<title lang="en">
Being a Dog Is
a Full-Time Job
</title>
is still a valid xs:token, and its value is now
the string:
Being a Dog Is a Full-Time Job
in which all the whitespaces have been replaced by spaces, leading and
trailing spaces are removed, and contiguous sequences of spaces are
replaced by single spaces.
TIP:
As is the case with xs:normalizedString, there is no
constraint on xs:token, and any value that is a
valid xs:string is also a valid xs:token. The difference is the whitespace processing
that is applied when the lexical value is calculated. This is not
true of derived datatypes that have additional constraints on their
lexical and value space. The restriction on the lexical spaces of
xs:normalizedString and xs:token is, therefore, a restriction by
projection of their parsed space (different values of their parsed
space are transformed into a single value of their lexical space),
and not a restriction by invalidating values of their lexical space,
as is the case for all the other predefined datatypes.
The predefined datatypes derived from xs:token
are xs:language, xs:NMTOKEN, and
xs:Name.
- xs:language
-
This was
created to accept all the language codes standardized by RFC 1766.
Some valid values for this datatype are en,
en-US, fr, or
fr-FR.
- xs:NMTOKEN
-
This
corresponds to the XML 1.0
"Nmtoken" (Name token) production,
which is a single token (a set of characters without spaces) composed
of characters allowed in an XML name. Some valid values for this
datatype are
"Snoopy",
"CMS",
"1950-10-04", or
"0836217462".
Invalid values include "brought classical
music to the Peanuts strip" (spaces are
forbidden) or
"bold,brash"
(commas are forbidden).
- xs:Name
-
This is
similar to xs:NMTOKEN with the additional
restriction that the values must start with a letter or the
characters ":" or
"_". This datatype conforms to the
XML 1.0 definition of a "Name."
Some valid values for this datatype are Snoopy,
CMS, or _1950-10-04-10:00.
Invalid values include 0836217462 (xs:Name cannot start with a number) or
bold,brash (commas are forbidden). This datatype
should not be used for names that may be
"qualified" by a namespace prefix,
since we will see another datatype (xs:QName)
that has a specific semantic for these values. The datatype xs:NCName is derived from xs:Name.
- xs:NCName
-
This is the
"noncolonized name" defined
by Namespaces in XML 1.0, i.e., an xs:Name without
any colons (":"). As such, this
datatype is probably the predefined datatype that is closest to the
notion of a "name" in most
programming languages, even though some characters such as
"-" or
"." may still be a problem in many
cases. Some valid values for this datatype are
Snoopy, CMS,
_1950-10-04-10-00, or
_1950-10-04. Invalid values include
_1950-10-04:10-00 or bold:brash
(colons are forbidden). xs:ID, xs:IDREF, and xs:ENTITY are derived
from xs:NCName.
- xs:ID
-
This is derived
from xs:NCName. The one constraint added to
its value space is that there must not be any duplicate values in a
document. In other words, the values of attributes or simple type
elements having this datatype can be used as unique identifiers, and
this datatype emulates the XML 1.0 ID attribute type. We will see
this feature in more detail in Chapter 9, "Defining Uniqueness, Keys, and Key References".
- xs:IDREF
-
This is
derived from xs:NCName. The constraint added to
its value space is that it must match an ID defined in the same document.
I will explain this feature in more detail in Chapter 9, "Defining Uniqueness, Keys, and Key References".
- xs:ENTITY
-
Also
provided for compatibility with XML 1.0 DTDs, this is derived from
xs:NCName and must match an
unparsed entity defined in a DTD.
TIP:
XML 1.0 gives the following definition of unparsed entities:
"an unparsed entity is a resource whose contents may
or may not be text, and if text, may be other than XML. Each unparsed
entity has an associated notation, identified by name. Beyond a
requirement that an XML processor make the identifiers for the entity
and notation available to the application, XML places no constraints
on the contents of unparsed entities." In practice,
this mechanism has seldom been used, as the general usage is to
define links to the resources that could be defined as unparsed
entities.
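The lexical rules for the name datatypes described above, and the document-wide constraints of xs:ID and xs:IDREF, can be sketched as follows. This is a simplified illustration: the patterns cover ASCII characters only (the XML 1.0 productions also admit many non-ASCII letters and digits), they assume the XML 1.0 rule that a Name starts with a letter, an underscore, or a colon, and all function names are hypothetical.

```python
import re

# ASCII-only approximations of the XML 1.0 productions (an assumption made
# for brevity; the real productions accept far more Unicode characters).
NMTOKEN = re.compile(r"[A-Za-z0-9._:\-]+")
NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._\-]*")


def is_nmtoken(value: str) -> bool:
    return NMTOKEN.fullmatch(value) is not None


def is_name(value: str) -> bool:
    return NAME.fullmatch(value) is not None


def is_ncname(value: str) -> bool:
    return NCNAME.fullmatch(value) is not None


def check_ids(ids, idrefs):
    """Return (duplicated IDs, IDREFs that match no ID) for one document."""
    seen, duplicates = set(), set()
    for value in ids:
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    dangling = {ref for ref in idrefs if ref not in seen}
    return duplicates, dangling


print(is_nmtoken("1950-10-04"), is_name("1950-10-04"))  # True False
print(is_name("bold:brash"), is_ncname("bold:brash"))   # True False
print(check_ids(["a", "b", "a"], ["b", "c"]))           # ({'a'}, {'c'})
```

Each pattern is a restriction of the previous one, which mirrors the derivation chain from xs:NMTOKEN down to xs:NCName. The ID checks, by contrast, cannot be expressed on a single value and need the whole document.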
4.3.3.2. Qualified names
- xs:QName
-
Following
Namespaces in XML 1.0, xs:QName supports the use
of namespace-prefixed names. xs:QName treats a namespace prefix as a
shortcut to identify a URI. The value of each xs:QName is effectively
a tuple {namespace name, local part}, in which the namespace name
is the URI associated to the prefix through a namespace declaration.
Even though the lexical space of xs:QName is very
close to the lexical space of xs:Name (the only
constraint on the lexical space is that there is a maximum of one
colon allowed in an xs:QName, which cannot be the
first character), the value spaces of these datatypes are completely
different (a scalar for xs:Name and a tuple for
xs:QName) and xs:QName is
defined as a primitive datatype. The constraint added by this
datatype over an xs:Name is that the prefix must be
defined as a namespace prefix in the scope of the element in which
this datatype is used.
W3C XML Schema itself has already given us some examples of QNames.
When we write <xs:attribute name="lang"
type="xs:language"/>, the type attribute is an
xs:QName and its value is the tuple:
{"http://www.w3.org/2001/XMLSchema", "language"}
because the URI:
"http://www.w3.org/2001/XMLSchema"
was assigned to the prefix
"xs:". If there
is no namespace declaration for this prefix, the type attribute is
considered invalid.
The prefix of an xs:QName is optional. We are
also able to write:
<xs:element ref="book" maxOccurs="unbounded"/>
in which the ref attribute is also an xs:QName and
its value is the tuple:
{NULL, "book"}
because we haven't defined any default namespace.
xs:QName does support default namespaces; if a
default namespace is defined in the scope of this element, the value
of its URI is used for this tuple.
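The mapping from a lexical QName to its {namespace name, local part} tuple can be sketched as below. The function name and the dictionary of in-scope namespace declarations are hypothetical. The empty-string key stands for the default namespace, and Python's None plays the role of NULL.

```python
from typing import Dict, Optional, Tuple


def resolve_qname(lexical: str, in_scope: Dict[str, str]) -> Tuple[Optional[str], str]:
    """Map a lexical QName to a (namespace name, local part) tuple.

    in_scope maps prefixes to URIs; the key "" holds the default
    namespace when one is declared.
    """
    if ":" in lexical:
        prefix, local = lexical.split(":", 1)
        # At most one colon, and it can be neither first nor last.
        if not prefix or not local or ":" in local:
            raise ValueError(f"not a lexical QName: {lexical!r}")
        if prefix not in in_scope:
            raise ValueError(f"undeclared namespace prefix: {prefix!r}")
        return in_scope[prefix], local
    # No prefix: use the default namespace if there is one, else None.
    return in_scope.get(""), lexical


declarations = {"xs": "http://www.w3.org/2001/XMLSchema"}
print(resolve_qname("xs:language", declarations))
# ('http://www.w3.org/2001/XMLSchema', 'language')
print(resolve_qname("book", declarations))
# (None, 'book')
```

Because the value is a tuple, two QNames written with different prefixes are equal whenever both prefixes are bound to the same URI and the local parts match.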
4.3.3.3. URIs
- xs:anyURI
-
This is another string datatype in
which lexical and value spaces are different. This datatype tries to
compensate for the differences of format between XML and URIs as
specified in the RFCs 2396 and 2732. These RFCs are not very friendly
toward non-ASCII characters and require many character escapings that
are not necessary in XML. The W3C XML Schema Recommendation
doesn't describe the transformation to perform,
noting only that it is similar to what is described for XLink link
locators.
As an example of this transformation, the href
attribute of an XHTML link written as:
<a href="http://dmoz.org/World/Français/">
World/Français
</a>
would be converted to the value:
http://dmoz.org/World/Fran%C3%A7ais/
in the value space (the character ç is converted to its two UTF-8
bytes, each of which is then escaped).
The xs:anyURI datatype doesn't
pay any attention to xml:base attributes that may
have been defined in the document.
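A minimal sketch of this kind of escaping follows, assuming the procedure described for XLink link locators: each disallowed character is converted to its UTF-8 bytes, and each byte is written as %HH. The function name and the exact set of disallowed ASCII characters are assumptions made for illustration.

```python
# ASCII characters treated as disallowed in this sketch (assumption: the
# characters excluded by RFC 2396, with "#", "%", "[" and "]" kept literal).
_DISALLOWED_ASCII = set(' <>"{}|\\^`')


def escape_uri(lexical: str) -> str:
    """Escape non-ASCII and disallowed characters as %HH over UTF-8 bytes."""
    out = []
    for ch in lexical:
        code = ord(ch)
        if code < 0x21 or code > 0x7E or ch in _DISALLOWED_ASCII:
            out.append("".join(f"%{byte:02X}" for byte in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


print(escape_uri("http://dmoz.org/World/Fran\u00e7ais/"))
# http://dmoz.org/World/Fran%C3%A7ais/
```

Characters that are already legal in a URI, including existing %HH escapes, pass through unchanged, so applying the function twice gives the same result as applying it once.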
4.3.3.4. Notations
- xs:NOTATION
-
This is probably
the most obscure of these string datatypes. This datatype was created
to implement the XML 1.0 notations. It cannot be used directly in a
schema; it must be used through user-defined derived datatypes. We
will see more of it in the next chapter.
4.3.3.5. Binary string-encoded datatypes
XML 1.0 is unable to hold binary content,
which must be string-encoded before it can be included in an XML
document. W3C XML Schema has defined two primitive datatypes to support
two encodings that are commonly used (hexadecimal and base64). These
encodings may be used to include any binary content, including text
formats whose content may be incompatible with the XML markup. Other
binary text encodings may also be used (such as uuXXcode, Quoted-Printable,
BinHex, aencode, or base85, to name a few), but their
value would not be recognized by W3C XML Schema.
- xs:hexBinary
-
This defines a simple way to code binary
content as a character string by translating the value of each binary
octet into two hexadecimal digits. This encoding is different from
the encoding method called BinHex, which was introduced by Apple, is
described by RFC 1741, and includes a mechanism to compress repetitive
characters.
A UTF-8 XML header such as:
<?xml version="1.0" encoding="UTF-8"?>
that is encoded as xs:hexBinary would be:
3c3f786d6c2076657273696f6e3d22312e302220656e636f64696e673d225554462d38223f3e
- xs:base64Binary
-
This matches the encoding known as
"base64" and is described in RFC
2045. It maps groups of 6 bits into an array of 64 printable
characters.
The same header, followed by its CR/LF line ending, encoded as xs:base64Binary would be:
PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4NCg==
The W3C XML Schema Recommendation missed the fact that RFC 2045
requires a line break at least every 76 characters. This should be
clarified in an erratum. The consequence of these line breaks being
treated as optional by W3C XML Schema is that the lexical and value
spaces of xs:base64Binary cannot be considered identical.
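Both encodings can be reproduced with the Python standard library, as in the sketch below. Two details are assumptions drawn from the examples rather than from the Recommendation: the hexadecimal form encodes the octets in document order, and the base64 string shown earlier decodes to the header followed by a CR/LF pair, so one is appended before encoding.

```python
import base64
import binascii

header = b'<?xml version="1.0" encoding="UTF-8"?>'

# xs:hexBinary: two hexadecimal digits per octet, in document order.
hex_form = binascii.hexlify(header).decode("ascii")

# xs:base64Binary: groups of 6 bits mapped onto 64 printable characters.
b64_form = base64.b64encode(header + b"\r\n").decode("ascii")

print(hex_form)
print(b64_form)

# Both encodings are reversible.
assert binascii.unhexlify(hex_form) == header
assert base64.b64decode(b64_form) == header + b"\r\n"

# A line break inside the lexical form does not change the decoded value:
# several lexical representations map to a single value.
assert base64.b64decode("PD94bWwg\ndmVy") == base64.b64decode("PD94bWwgdmVy")
```

The last assertion relies on `base64.b64decode` discarding characters outside the base64 alphabet by default. It illustrates why the lexical and value spaces of xs:base64Binary differ.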
Copyright © 2002 O'Reilly & Associates. All rights reserved.