String Datatypes (XML Schema)

This section discusses datatypes derived from the xs:string primitive datatype as well as other datatypes that have a similar behavior (namely, xs:hexBinary, xs:base64Binary, xs:anyURI, xs:QName, and xs:NOTATION). These types are not expected to carry any quantifiable value (W3C XML Schema doesn't even expect to be able to sort them) and their value space is identical to their lexical space except when explicitly described otherwise. One should note that even though they are grouped in this section because they have a similar behavior, these primitive datatypes are considered quite different by the Recommendation.

4.3.3. Collapsed Strings

Whitespace collapsing is performed after whitespace replacement by trimming the leading and trailing spaces and replacing all the contiguous occurrences of spaces with a single space. All the predefined datatypes (except, as we have seen, xs:string and xs:normalizedString) are whitespace collapsed.

We will classify tokens, binary formats, URIs, qualified names, notations, and all their derived types under this category. Although these datatypes share a number of properties, we must stress again that this categorization is done for the purpose of explanation and does not directly appear in the Recommendation.

4.3.3.1. Tokens

xs:token

xs:token is xs:normalizedString on which whitespace has been collapsed. Since whitespace is accepted in the lexical space of xs:token, this type is better described as a "tokenized" string than as a "token"!

The same element:

<title lang="en">
   Being a Dog Is 
   a Full-Time Job
</title>

is still a valid xs:token, and its value is now the string:

Being a Dog Is a Full-Time Job

in which all the whitespace characters have been replaced by spaces, any leading and trailing spaces are removed, and contiguous sequences of spaces are replaced by single spaces.
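The two processing steps can be sketched in a few lines of Python. This is only an illustration of the replace-then-collapse behavior described above, not a schema processor; the function names are hypothetical.

```python
import re

def replace_whitespace(text: str) -> str:
    """Whitespace replacement: each tab, line feed, and carriage return becomes a space."""
    return re.sub(r"[\t\n\r]", " ", text)

def collapse_whitespace(text: str) -> str:
    """Whitespace collapsing: replace first, then squeeze runs of spaces and trim both ends."""
    replaced = replace_whitespace(text)
    return re.sub(r" +", " ", replaced).strip(" ")

# Text content of the <title> element, line breaks and indentation included.
raw_title = "\n   Being a Dog Is \n   a Full-Time Job\n"

print(repr(replace_whitespace(raw_title)))   # what xs:normalizedString would keep
print(repr(collapse_whitespace(raw_title)))  # 'Being a Dog Is a Full-Time Job'
```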

TIP: As is the case with xs:normalizedString, there is no constraint on xs:token, and any value that is a valid xs:string is also a valid xs:token. The difference is the whitespace processing that is applied when the lexical value is calculated. This is not true of derived datatypes that have additional constraints on their lexical and value space. The restriction on the lexical spaces of xs:normalizedString and xs:token is, therefore, a restriction by projection of their parsed space (different values of their parsed space are transformed into a single value of their lexical space), and not a restriction by invalidating values of their lexical space, as is the case for all the other predefined datatypes.

The predefined datatypes derived from xs:token are xs:language, xs:NMTOKEN, and xs:Name.

xs:language

This was created to accept all the language codes standardized by RFC 1766. Some valid values for this datatype are en, en-US, fr, or fr-FR.

xs:NMTOKEN

This corresponds to the XML 1.0 "Nmtoken" (Name token) production, which is a single token (a set of characters without spaces) composed of characters allowed in an XML name. Some valid values for this datatype are "Snoopy", "CMS", "1950-10-04", or "0836217462". Invalid values include "brought classical music to the Peanuts strip" (spaces are forbidden) or "bold,brash" (commas are forbidden).

xs:Name

This is similar to xs:NMTOKEN with the additional restriction that the values must start with a letter or the characters ":" or "_". This datatype conforms to the XML 1.0 definition of a "Name." Some valid values for this datatype are Snoopy, CMS, or _1950-10-04-10:00. Invalid values include 0836217462 (xs:Name cannot start with a number) or bold,brash (commas are forbidden). This datatype should not be used for names that may be "qualified" by a namespace prefix, since we will see another datatype (xs:QName) that has a specific semantic for these values. The datatype xs:NCName is derived from xs:Name.

xs:NCName

This is the "noncolonized name" defined by Namespaces in XML 1.0, i.e., an xs:Name without any colons (":"). As such, this datatype is probably the predefined datatype that is closest to the notion of a "name" in most programming languages, even though some characters such as "-" or "." may still be a problem in many cases. Some valid values for this datatype are Snoopy, CMS, _1950-10-04-10-00, or _1950-10-04. Invalid values include _1950-10-04:10-00 or bold:brash (colons are forbidden). xs:ID, xs:IDREF, and xs:ENTITY are derived from xs:NCName.
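To compare the three lexical spaces side by side, here is a rough sketch using regular expressions. It assumes an ASCII-only approximation of the XML 1.0 productions (the real productions also admit many non-ASCII letters and digits), and it assumes whitespace has already been collapsed; the pattern and function names are hypothetical.

```python
import re

# ASCII-only approximations (assumption): the XML 1.0 productions are wider.
NMTOKEN_RE = re.compile(r"[A-Za-z0-9._:-]+")          # any mix of name characters
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9._:-]*")   # restricted first character
NCNAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")   # same as Name, without colons

def matches(pattern: re.Pattern, value: str) -> bool:
    """True if the whole value belongs to the approximated lexical space."""
    return pattern.fullmatch(value) is not None

# Print one row per value: NMTOKEN, Name, NCName.
for value in ["Snoopy", "0836217462", "bold,brash", "bold:brash"]:
    print(value,
          matches(NMTOKEN_RE, value),
          matches(NAME_RE, value),
          matches(NCNAME_RE, value))
```

Each pattern is a restriction of the previous one, which mirrors the derivation chain xs:NMTOKEN, xs:Name, xs:NCName only loosely: in the Recommendation xs:Name derives from xs:token, not from xs:NMTOKEN.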

xs:ID

This is derived from xs:NCName. The one constraint added to its value space is that there must not be any duplicate values in a document. In other words, the values of attributes or simple type elements having this datatype can be used as unique identifiers, and this datatype emulates the XML 1.0 ID attribute type. We will see this feature in more detail in Chapter 9, "Defining Uniqueness, Keys, and Key References".

xs:IDREF

This is derived from xs:NCName. The constraint added to its value space is that it must match an ID defined in the same document. We will see this feature in more detail in Chapter 9, "Defining Uniqueness, Keys, and Key References".
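The document-wide nature of these two constraints can be illustrated with a small check. This sketch assumes, purely for illustration, that attributes named id are typed xs:ID and attributes named ref are typed xs:IDREF; a real processor learns this from the schema, and the sample document and function name are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical instance document: "id" plays the xs:ID role, "ref" the xs:IDREF role.
DOC = """<library>
  <book id="b1"/>
  <book id="b2"/>
  <loan ref="b1"/>
  <loan ref="b9"/>
</library>"""

def check_ids(root, id_attr="id", ref_attr="ref"):
    """Return (duplicate ID values, IDREF values that match no ID)."""
    ids = Counter(el.get(id_attr) for el in root.iter()
                  if el.get(id_attr) is not None)
    duplicates = sorted(value for value, count in ids.items() if count > 1)
    dangling = sorted({el.get(ref_attr) for el in root.iter()
                       if el.get(ref_attr) is not None
                       and el.get(ref_attr) not in ids})
    return duplicates, dangling

duplicates, dangling = check_ids(ET.fromstring(DOC))
print(duplicates, dangling)  # [] ['b9']
```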

xs:ENTITY

Also provided for compatibility with XML 1.0 DTDs, this is derived from xs:NCName and must match an unparsed entity defined in a DTD.

TIP: XML 1.0 gives the following definition of unparsed entities: "an unparsed entity is a resource whose contents may or may not be text, and if text, may be other than XML. Each unparsed entity has an associated notation, identified by name. Beyond a requirement that an XML processor make the identifiers for the entity and notation available to the application, XML places no constraints on the contents of unparsed entities." In practice, this mechanism has seldom been used, as the general usage is to define links to the resources that could be defined as unparsed entities.

4.3.3.2. Qualified names

xs:QName

Following Namespaces in XML 1.0, xs:QName supports the use of namespace-prefixed names. xs:QName treats a namespace prefix as a shortcut to identify a URI. Each xs:QName value is effectively a tuple {namespace name, local part}, in which the namespace name is the URI associated to the prefix through a namespace declaration. Even though the lexical space of xs:QName is very close to the lexical space of xs:Name (the only constraint on the lexical space is that there is a maximum of one colon allowed in an xs:QName, which cannot be the first character), the value spaces of these datatypes are completely different (a scalar for xs:Name and a tuple for xs:QName) and xs:QName is defined as a primitive datatype. The constraint added by this datatype over an xs:Name is that the prefix must be defined as a namespace prefix in the scope of the element in which this datatype is used.

W3C XML Schema itself has already given us some examples of QNames. When we write <xs:attribute name="lang" type="xs:language"/>, the type attribute is an xs:QName and its value is the tuple:

{"http://www.w3.org/2001/XMLSchema", "language"}

because the URI:

"http://www.w3.org/2001/XMLSchema"

was assigned to the prefix "xs". If there is no namespace declaration for this prefix, the type attribute is considered invalid.

The prefix of an xs:QName is optional. We are also able to write:

<xs:element ref="book" maxOccurs="unbounded"/>

in which the ref attribute is also an xs:QName and its value is the tuple:

{NULL, "book"}

because we haven't defined any default namespace. xs:QName does support default namespaces; if a default namespace is defined in the scope of this element, the value of its URI is used for this tuple.
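The mapping from lexical form to tuple can be sketched as follows. This is a simplified illustration, assuming the in-scope namespace declarations are handed over as a dictionary in which the key None stands for the default namespace; the function name and that representation are hypothetical, and the local part is not checked against the NCName rules.

```python
from typing import Dict, Optional, Tuple

XS_NS = "http://www.w3.org/2001/XMLSchema"

def resolve_qname(lexical: str,
                  in_scope: Dict[Optional[str], str]) -> Tuple[Optional[str], str]:
    """Map a lexical QName to the tuple (namespace name, local part)."""
    prefix, sep, local = lexical.partition(":")
    if not sep:
        # No prefix: use the default namespace if one is in scope, else no namespace.
        return (in_scope.get(None), lexical)
    if not prefix or not local or ":" in local:
        # At most one colon, and it can be neither the first nor the last character.
        raise ValueError(f"not a lexically valid QName: {lexical!r}")
    if prefix not in in_scope:
        raise ValueError(f"undeclared namespace prefix: {prefix!r}")
    return (in_scope[prefix], local)

declarations = {"xs": XS_NS}
print(resolve_qname("xs:language", declarations))  # (XS_NS, 'language')
print(resolve_qname("book", declarations))         # (None, 'book')
```

The same lexical value therefore yields different tuples depending on the declarations in scope, which is why the lexical and value spaces of xs:QName cannot be treated as identical.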

4.3.3.3. URIs

xs:anyURI

This is another string datatype in which lexical and value spaces are different. This datatype tries to compensate for the differences of format between XML and URIs as specified in the RFCs 2396 and 2732. These RFCs are not very friendly toward non-ASCII characters and require many character escapings that are not necessary in XML. The W3C XML Schema Recommendation doesn't describe the transformation to perform, noting only that it is similar to what is described for XLink link locators.

As an example of this transformation, the href attribute of an XHTML link written as:

<a href="http://dmoz.org/World/Français/">
  World/Français
</a>

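would be converted to the value:

http://dmoz.org/World/Fran%C3%A7ais/

in the value space, since the XLink procedure converts each disallowed character to UTF-8 and escapes each resulting byte (the "ç" becomes the two bytes C3 and A7).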
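The escaping step can be approximated with the Python standard library. This is a sketch under assumptions: the set of characters left untouched is a simplification of what the XLink specification lists, and the function name is hypothetical. The key point it shows is that non-ASCII characters are first converted to UTF-8 and each resulting byte is then percent-escaped.

```python
from urllib.parse import quote

def escape_any_uri(lexical: str) -> str:
    """Percent-escape the characters that a URI cannot carry as-is."""
    # Characters listed in `safe` are kept; anything else outside the unreserved
    # ASCII set is UTF-8 encoded and escaped byte by byte. Keeping "%" means that
    # escapes already present in the lexical form are not escaped a second time.
    return quote(lexical, safe=":/?#[]@!$&'()*+,;=%")

print(escape_any_uri("http://dmoz.org/World/Français/"))
```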

The xs:anyURI datatype doesn't pay any attention to xml:base attributes that may have been defined in the document.

4.3.3.4. Notations

xs:NOTATION

This is probably the most obscure of these string datatypes. This datatype was created to implement the XML 1.0 notations. It cannot be used directly in a schema; it must be used through user-defined derived datatypes. We will see more of it in the next chapter.

4.3.3.5. Binary string-encoded datatypes

XML 1.0 is unable to hold binary content, which must be string-encoded before it can be included in an XML document. W3C XML Schema has defined two primitive datatypes to support two encodings that are commonly used (hexadecimal and base64). These encodings may be used to include any binary content, including text formats whose content may be incompatible with the XML markup. Other binary text encodings may also be used (such as uuencode, Quoted-Printable, BinHex, aencode, or base85, to name a few), but their value would not be recognized by W3C XML Schema.

xs:hexBinary

This defines a simple way to code binary content as a character string by translating the value of each binary octet into two hexadecimal digits. This encoding is different from the encoding method called BinHex (introduced by Apple and described by RFC 1741), which includes a mechanism to compress repetitive characters.

A UTF-8 XML header such as:

<?xml version="1.0" encoding="UTF-8"?>

that is encoded as xs:hexBinary would be:

3c3f786d6c2076657273696f6e3d22312e302220656e636f64696e673d225554462d38223f3e
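The octet-by-octet translation is simple enough to sketch directly. The helper names below are hypothetical, and the decoder relies on bytes.fromhex, which is more lenient about embedded whitespace than a schema validator would be.

```python
header = b'<?xml version="1.0" encoding="UTF-8"?>'

def to_hex_binary(data: bytes) -> str:
    """Translate each octet, in document order, into two hexadecimal digits."""
    return "".join(f"{octet:02x}" for octet in data)

def from_hex_binary(lexical: str) -> bytes:
    """Inverse mapping: every pair of hexadecimal digits becomes one octet."""
    return bytes.fromhex(lexical)

encoded = to_hex_binary(header)
print(encoded)
print(from_hex_binary(encoded) == header)  # True: the round trip preserves the octets
```

Note that the digits follow the byte order of the content: the first two digits encode "<" and the next two encode "?". Dump tools that print 16-bit words on a little-endian machine show each pair of bytes swapped, which is an easy way to get a wrong value.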

xs:base64Binary

This matches the encoding known as "base64" and is described in RFC 2045. It maps groups of 6 bits into an array of 64 printable characters.

The same header encoded as xs:base64Binary would be:

PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4=

The W3C XML Schema Recommendation missed the fact that RFC 2045 requires encoded lines to be no more than 76 characters long. This should be clarified in an erratum. The consequence of W3C XML Schema treating these line breaks as optional is that the lexical and value spaces of xs:base64Binary cannot be considered identical.
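The effect of optional line breaks can be seen in a short sketch: two different lexical forms decode to the same octets. The decoder below is an assumption about how a lenient processor might behave (it simply drops all whitespace before decoding), not a statement of what the Recommendation mandates; the names are hypothetical.

```python
import base64

header = b'<?xml version="1.0" encoding="UTF-8"?>'

# One lexical form: the whole encoding on a single line.
single_line = base64.b64encode(header).decode("ascii")

# Another lexical form for the same octets, with a line break inside.
wrapped = single_line[:20] + "\n" + single_line[20:]

def decode_base64_binary(lexical: str) -> bytes:
    """Drop all whitespace, then decode, so line breaks do not affect the value."""
    return base64.b64decode("".join(lexical.split()))

print(single_line)
print(wrapped != single_line)                              # True: distinct lexical forms
print(decode_base64_binary(wrapped) == header)             # True: same value
```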

Figure 4-2. Strings and similar datatypes