Using Predefined Simple Datatypes (XML Schema)

W3C XML Schema provides an extensive set of predefined datatypes. Many of them are derived from a smaller set of "primitive" datatypes, which have their own specific semantics and cannot be derived from other types. In the next chapter, we will see how to use these types to define our own datatypes by derivation to meet more specific needs.

Figure 4-1 provides a map of predefined datatypes and the relationships between them.

Figure 4-1. W3C XML Schema type hierarchy

4.1. Lexical and Value Spaces

W3C XML Schema decouples the data as it can be read from instance documents (the "lexical space") from the value as interpreted according to the datatype (the "value space").

Before defining these two spaces, we must examine the processing model and the transformations that a value written in an XML document undergoes before it is validated. Element and attribute content goes through the following steps during processing:

Serialization space: The series of bytes that is actually stored in a document (either as the value of an attribute or as a text node) may be seen as belonging to a first space, which we may call the "serialization space."
Parsed space: The XML 1.0 Recommendation makes it clear that the serialization space is not directly meaningful to applications. Conforming XML parsers perform a first transformation on the value before it reaches an application: characters are converted into Unicode, and line ends (for text nodes and attributes) and whitespace (for attributes only) are normalized. The result of this transformation is what reaches applications, including schema processors, and belongs to what we may call the "parsed space."
Lexical space: Before doing any validation, W3C XML Schema performs a second round of whitespace processing on the value reported by the XML parser. This processing depends on the value's datatype and may leave whitespace untouched ("preserve"), replace each tab, line feed, and carriage return with a space ("replace"), or additionally collapse runs of spaces into one and trim leading and trailing spaces ("collapse"). The value after this whitespace processing belongs to the "lexical space" defined in the W3C XML Schema Recommendation.
Value space: W3C XML Schema considers an item from the lexical space to be a representation of an abstract value whose semantics are defined by its datatype. The value can be a piece of text, but also a number, a date, or a qualified name. The set of abstract values is defined as the "value space."
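The step from the parsed space to the lexical space can be sketched in a few lines of Python. This is only an illustration, not how any particular schema processor is implemented: the function name `apply_whitespace` is hypothetical, and the three mode names are the values of the whiteSpace facet in the W3C XML Schema Recommendation (preserve, replace, and collapse).

```python
import re


def apply_whitespace(parsed: str, mode: str) -> str:
    """Map a value from the parsed space to the lexical space.

    Hypothetical helper: `mode` is one of the whiteSpace facet values
    "preserve", "replace", or "collapse".
    """
    if mode == "preserve":
        # The value is left exactly as the XML parser reported it.
        return parsed
    # "replace": each tab, line feed, and carriage return becomes a space.
    replaced = re.sub(r"[\t\n\r]", " ", parsed)
    if mode == "replace":
        return replaced
    if mode == "collapse":
        # "collapse": after replacement, runs of spaces shrink to one space
        # and leading and trailing spaces are removed.
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError(f"unknown whitespace mode: {mode!r}")


if __name__ == "__main__":
    parsed_value = "\t Hello \n  world  "
    for mode in ("preserve", "replace", "collapse"):
        print(mode, repr(apply_whitespace(parsed_value, mode)))
```

With "preserve" the value reaches the lexical space unchanged; the other two modes differ only in whether the spaces left after replacement are also collapsed and trimmed.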

Each datatype has its own lexical and value spaces and its own rules to associate a lexical representation with a value. For many datatypes, a single value can have multiple lexical representations: for instance, the xs:float value "3.14116" can equivalently be written as "03.14116", "3.141160", or ".314116E1". This distinction is important because the basic operations performed on values (such as equality testing or sorting) are done in the value space. "3.14116" is considered equal to "03.14116" when the type is xs:float and different when the type is xs:string. The same applies to sort orders: some datatypes have a full order relation (every pair of values can be compared), others have no order relation at all, and the remaining types have a partial order relation (values cannot always be compared).
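The difference between comparing in the lexical space and in the value space can be illustrated with a short Python sketch. The helpers `float_value` and `equal_as` are hypothetical names, and Python's `float()` is used only as a stand-in for the xs:float lexical-to-value mapping: it accepts some strings that xs:float does not, so this is not a validator.

```python
import struct


def float_value(lexical: str) -> float:
    """Approximate the xs:float value of a lexical representation.

    Assumption: Python's float() parsing followed by rounding to IEEE 754
    single precision stands in for the xs:float mapping.
    """
    return struct.unpack("f", struct.pack("f", float(lexical)))[0]


def equal_as(datatype: str, a: str, b: str) -> bool:
    """Compare two lexical representations according to a datatype."""
    if datatype == "xs:float":
        # Equality is tested on the values, not on the characters.
        return float_value(a) == float_value(b)
    if datatype == "xs:string":
        # For strings, the value is the sequence of characters itself.
        return a == b
    raise ValueError(f"unsupported datatype in this sketch: {datatype}")


if __name__ == "__main__":
    forms = ["3.14116", "03.14116", "3.141160", ".314116E1"]
    print([equal_as("xs:float", forms[0], f) for f in forms])
    print([equal_as("xs:string", forms[0], f) for f in forms])
    # Sorting also depends on the datatype.
    data = ["10", "9", "2.5"]
    print(sorted(data, key=float_value))  # numeric order
    print(sorted(data))                   # character order
```

All four forms compare equal as xs:float, while only the first compares equal to itself as xs:string. The two sorts likewise give different orders for the same lexical representations.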

TIP: Although future versions of APIs might pass these values to applications, the transformations between the parsed, lexical, and value spaces are currently performed for validation only and do not affect the values that a validating parser reports.
