XPath-Based Identity Checks (XML Schema)

9.2. XPath-Based Identity Checks

The IDs and IDREFs are stored in the PSVI in a table (called the "ID/IDREF table") that applications can use to locate the corresponding nodes. We can expect XPath applications (including XPointer) to provide shortcuts and fast access to the nodes identified by W3C XML Schema, as is already the case with the DTD IDs.

Simple and easy to use within their domain, IDs and IDREFs keep the limitations of their DTD ancestors. W3C XML Schema provides a more flexible feature for defining identity constraints, one that places no restriction on the lexical space and allows local keys and references as well as multinode keys.

Another important difference is that the ID/IDREF checks are done on datatypes derived from xs:NCName, while the checks described below can be performed on other datatypes, and the comparisons are done in the value space rather than on the string representations from the lexical space. These checks are based on a set of XPath expressions and are defined through three different (but similar) constructs: one to test the uniqueness of a value, one to define a key, and one to define a key reference.
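
As a sketch of what comparing in the value space means, assume a hypothetical items element whose item children carry a code attribute typed as xs:decimal (none of these names come from the library example; the xs:unique syntax is explained in the next section). Because values rather than strings are compared, code="1.0" and code="1.00" denote the same decimal and would clash, which a plain string comparison would not detect:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="items">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="code" type="xs:decimal"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <!-- Compared as decimals: "1.0" and "1.00" are one and the same value -->
    <xs:unique name="item-code">
      <xs:selector xpath="item"/>
      <xs:field xpath="@code"/>
    </xs:unique>
  </xs:element>
</xs:schema>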

9.2.1. Uniqueness

The first of these constructs defines a simple check for uniqueness. We will spend some time explaining this in detail, since the two other constructs are based on the same pattern.

The definition of these constraints is done using two consecutive relative XPath expressions evaluated against the position of the element under which they are defined. We need a clear picture of the structure of the instance documents to define them. The starting point is the location of the element under which the check is defined. This location determines the scope of the test and must be carefully chosen, since it is the basis from which all the checks will be performed for this constraint.

For instance, in our library, we can choose to define a check for the uniqueness of the ISBN number of our books under the library element, since we need to check it within the scope of the whole library. However, within a book, we may also test that the reference to a character is unique within the scope of this book. We can define this second check inside the book element.

Once we have chosen the location of the test, we can start writing it at the end of the definition of the element:

<xs:element name="library">
  <xs:complexType>
    .../...
  </xs:complexType>
  <xs:unique name="book">
    .../...
  </xs:unique>
</xs:element>

The name attribute used here will be useful if we want to refer to this constraint through a keyref.

Now that we have defined the name and the root of the test, we will define the selector that is the relative path of the node being identified. In our example, the relative path to access a book element from library is book, so we write:

<xs:element name="library">
  <xs:complexType>
    .../...
  </xs:complexType>
  <xs:unique name="book">
    <xs:selector xpath="book"/>
    .../...
  </xs:unique>
</xs:element>

We have expressed the fact that a book must be unique within a library. To complete the description of this check, we need to define how a book is identified through field elements.

In our case, the identifier is the isbn subelement, and the complete definition is:

<xs:element name="library">
  <xs:complexType>
    .../...
  </xs:complexType>
  <xs:unique name="book">
    <xs:selector xpath="book"/>
    <xs:field xpath="isbn"/>
  </xs:unique>
</xs:element>

Translated into plain English, this definition can be read as "for each library, each book identified by its ISBN should be unique."
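
The second check mentioned earlier, the one scoped to a single book, follows the same pattern. The fragment below is only a sketch: it assumes that each book holds character-ref child elements carrying a ref attribute, and both of these names, as well as the constraint name, are hypothetical. It is meant to sit inside the content model of library.

<xs:element name="book" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="isbn" type="xs:string"/>
      <xs:element name="character-ref" minOccurs="0" maxOccurs="unbounded">
        <xs:complexType>
          <xs:attribute name="ref" type="xs:token"/>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
  <!-- Scope: one book. The same ref may appear again in another book. -->
  <xs:unique name="character-in-book">
    <xs:selector xpath="character-ref"/>
    <xs:field xpath="@ref"/>
  </xs:unique>
</xs:element>

Because the constraint is attached to book, the check restarts for every book: a character may be referenced in several books, but only once in each.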

TIP: A unique condition doesn't require the node used as an identifier (the field) to be present. Selected nodes whose field is missing are simply ignored. To define the same check when the field is required, a "key" should be defined instead of "unique."

9.2.2. Composite Fields

If the names of our authors were split in our library into first, middle, and last names, we might find it convenient to define a composite field to identify our authors. W3C XML Schema provides this feature by allowing several fields to be defined within a single constraint, for instance:

<xs:element name="library">
  <xs:complexType>
    .../...
  </xs:complexType>
  <xs:unique name="author">
    <xs:selector xpath="author"/>
    <xs:field xpath="first-name"/>
    <xs:field xpath="middle-name"/>
    <xs:field xpath="last-name"/>
  </xs:unique>
</xs:element>

The check is then done on the triple that is composed of the values of the three fields (first-name, middle-name, last-name) that need to be unique as a combination.
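
To make the combination rule concrete, here is a hypothetical instance fragment checked against the constraint above (the names and values are invented for illustration). Only the complete triple is compared, so two authors may share a first and last name as long as the middle name differs:

<library>
  <!-- Distinct triples: accepted -->
  <author>
    <first-name>John</first-name>
    <middle-name>A</middle-name>
    <last-name>Smith</last-name>
  </author>
  <author>
    <first-name>John</first-name>
    <middle-name>B</middle-name>
    <last-name>Smith</last-name>
  </author>
  <!-- Same triple as the first author: rejected by xs:unique -->
  <author>
    <first-name>John</first-name>
    <middle-name>A</middle-name>
    <last-name>Smith</last-name>
  </author>
</library>

An author lacking one of the three elements is left out of the check altogether, because xs:unique only considers the nodes for which every field is present.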

9.2.3. Keys

A key is a unique constraint with the additional restriction that all the nodes corresponding to all the fields are required.

The syntax for defining a key is the same as the syntax for defining a unique condition, except the unique element is replaced by a key element:

<xs:element name="library">
  <xs:complexType>
    .../...
  </xs:complexType>
  <xs:key name="book">
    <xs:selector xpath="book"/>
    <xs:field xpath="isbn"/>
  </xs:key>
</xs:element>

TIP: There is clearly an overlap between the additional existence check done by a key constraint and the other ways to control the number of occurrences of an element or attribute. In our example, if the minimum number of occurrences of the isbn element is set to one, using xs:unique or xs:key is equivalent, except when isbn can have a "nil" value. (We will discuss the "nil" value in Chapter 11, "Referencing Schemas and Schema Datatypes in XML Documents".)
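
The difference between the two constructs shows up only when a field is missing. The following hypothetical instance fragment assumes a content model in which isbn is optional (minOccurs="0"); the ISBN value and the title elements are invented for illustration:

<library>
  <book>
    <isbn>1111111111</isbn>
    <title>First book</title>
  </book>
  <!-- No isbn: ignored by xs:unique, reported as an error by xs:key -->
  <book>
    <title>Second book</title>
  </book>
</library>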

9.2.4. Key References

Despite its name, xs:keyref can be used not only to define a reference to xs:key, but also to xs:unique.

The usage of xs:keyref is straightforward and similar to the usage of xs:key or xs:unique, with one important point worth mentioning: the refer attribute of xs:keyref must name an xs:key or xs:unique defined under the same element as the xs:keyref or under one of that element's descendants. In other words, the xs:keyref must sit in the element that holds the key or in one of its ancestors.

TIP: The reason for this rule is that the "identity-constraint tables" where the keys and references are stored are local to an element and its ancestors.

The fact that the xs:keyref must be defined within the same element as the matching xs:unique or xs:key, or within one of its ancestors, has an impact on the choice of this location. Suppose, for instance, that our books and authors are kept in separate sections of our document:

<library>
  <books>
    <book>
      .../...
      <author-ref ref="Charles M. Schulz"/>
      .../...
    </book>
    .../...
  </books>
  <authors>
    <author>
      <name>
        Charles M. Schulz
      </name>
      .../...
    </author>
    .../...
  </authors>
</library>

It's good practice to define a modular schema by locating the constraints as near as possible to the elements they control. A natural fit would be to locate a key in the authors element and the matching keyref in the books element. However, since an xs:keyref needs to be in the same element as the matching xs:key or in one of its ancestors, and books isn't an ancestor of authors, the xs:keyref definition can only be done in the library element. (The xs:key can be defined either in the library or in the authors element.)
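
A minimal sketch of this arrangement, with both constraints placed in the library element, might look as follows. It assumes a schema without a target namespace (so that refer needs no prefix) and assumes that name and the ref attribute share a whitespace-collapsing type such as xs:token, so that the indented name in the instance compares equal to ref="Charles M. Schulz". The constraint names author and author-ref are illustrative choices, and the content model is elided as in the examples above.

<xs:element name="library">
  <xs:complexType>
    .../...
  </xs:complexType>
  <!-- Each author is identified by a mandatory, unique name -->
  <xs:key name="author">
    <xs:selector xpath="authors/author"/>
    <xs:field xpath="name"/>
  </xs:key>
  <!-- Every author-ref/@ref must match the name of one author -->
  <xs:keyref name="author-ref" refer="author">
    <xs:selector xpath="books/book/author-ref"/>
    <xs:field xpath="@ref"/>
  </xs:keyref>
</xs:element>

If name were typed as xs:string instead, the leading and trailing whitespace of the instance would be part of the value and the reference would not resolve.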

In the previous example, locating the xs:key definition within library or authors was only a matter of style, since the authors are unique both within a library and within the authors elements. However, W3C XML Schema allows for situations in which this isn't the case and in which a key is unique within the scope of a subelement without being unique within the whole document.

Let's modify the previous example to define several categories of authors:

<library>
  <books>
    <book>
      .../...
      <author-ref ref="Charles M. Schulz"/>
      .../...
    </book>
    .../...
  </books>
  <authors>
    <category id="comics">
      <author>
        <name>
          Charles M. Schulz
        </name>
        .../...
      </author>
      .../...
    </category>
    <category id="novels">
      .../...
    </category>
    .../...
  </authors>
</library>

Defining an xs:key (or xs:unique) within library or authors specifies uniqueness within the scope of the entire library. Defining it within category specifies uniqueness within this category only, and allows authors with the same name to be defined under several categories.

It is perfectly valid, per W3C XML Schema, to define an xs:key under category and a matching xs:keyref under library (since library is an ancestor of category). By doing so, a new constraint is added to authors' names. When an author is referenced within a book, the author's name has to be unique within the scope of the xs:keyref. Applied to our instance document, this means that if "Charles M. Schulz" were not referenced in any of the books, he could be defined in several categories; since he is referenced in one book, his name must be defined only once.

TIP: While this behavior is described in the Recommendation, the results may be surprising for schema designers. It is probably good practice to keep the definitions of the xs:key (or xs:unique) and their matching xs:keyref in the same elements.
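
The arrangement discussed above, with the key scoped to category and the reference checked from library, can be sketched as two fragments. As before, this is written under assumptions: no target namespace, illustrative constraint names, and elided content models.

<!-- Inside the definition of authors: names are unique per category -->
<xs:element name="category" maxOccurs="unbounded">
  <xs:complexType>
    .../...
  </xs:complexType>
  <xs:key name="author-in-category">
    <xs:selector xpath="author"/>
    <xs:field xpath="name"/>
  </xs:key>
</xs:element>

<!-- At the end of the definition of library: references are resolved library-wide -->
<xs:keyref name="author-ref" refer="author-in-category">
  <xs:selector xpath="books/book/author-ref"/>
  <xs:field xpath="@ref"/>
</xs:keyref>

Following the tip instead, both constraints could be placed in library, with the key's selector written as xpath="authors/category/author". Names would then have to be unique across the whole library, which is stricter but easier to reason about.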

9.2.5. Permitted XPath Expressions

The W3C XML Schema Recommendation states that "to reduce the burden on implementers, in particular implementers of streaming processors, only restricted subsets of XPath expressions are allowed" in xs:selector and xs:field. The result of this statement is a limited subset of XPath that allows only the selection of nodes that are descendants of the current location, or that location itself.

The XPath expressions allowed in xs:selector must go exclusively deeper into the hierarchy of the XML element nodes, may not contain any tests in their steps, and must match a set of elements. The XPath expressions allowed in xs:field follow the same rules but can, in addition, select attributes.

The full BNF for this subset is given in the reference guide. Rather than giving a verbose explanation, let's see some examples of what is possible and what is not.

The following are allowed:

xpath="author": Selects the child elements named author that do not belong to any namespace.
xpath="author|character": Selects the child elements named author or character that do not belong to any namespace.
xpath="lib:author": Selects the child elements named author that belong to the namespace whose prefix is "lib".
xpath="*": Selects all the child elements.
xpath="lib:*": Selects all the child elements that belong to the namespace whose prefix is "lib".
xpath="authors/author": Selects all the authors/author child elements.
xpath=".//author": Selects all the elements that are descendants of the current node, named author, and don't belong to any namespace.
xpath="author/@id": Selects the id attribute of the author child element (allowed only for xs:field, and not for xs:selector).
xpath="@id|@name": Selects @id or @name (valid only in xs:field, since attributes are forbidden in xs:selector).

The following are forbidden:

xpath="/library/author": Absolute paths are not allowed.
xpath="../author": The parent axis is not allowed.
xpath=".//*[@id]": Tests are not allowed.
xpath="author[@type='comics']": Tests are not allowed.
xpath="substring-after(@xlink:href, '#')": Function calls are not allowed.
xpath="//author": Absolute paths are not allowed.
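
Since absolute and parent paths are ruled out, one workaround is to attach the constraint to the element from which a purely downward path reaches the nodes. As a sketch, the intent behind the forbidden xpath="/library/author" can be expressed by defining the constraint under library itself (the constraint name and the name field are illustrative, and the content model is elided):

<xs:element name="library">
  <xs:complexType>
    .../...
  </xs:complexType>
  <xs:unique name="author-name">
    <!-- Relative to library, so no leading slash is needed -->
    <xs:selector xpath="author"/>
    <xs:field xpath="name"/>
  </xs:unique>
</xs:element>

Expressions that rely on tests or function calls have no direct equivalent within the subset.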

TIP: Default namespaces do not apply within XPath expressions, and elements and attributes must always be qualified by a prefix if they belong to a namespace.
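
A sketch of what this means in practice, assuming a hypothetical target namespace http://example.org/library bound to the prefix lib, and elementFormDefault="qualified" so that local elements belong to that namespace:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:lib="http://example.org/library"
           targetNamespace="http://example.org/library"
           elementFormDefault="qualified">
  <xs:element name="library">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="isbn" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <xs:unique name="book">
      <!-- xpath="book" would select nothing here: an unprefixed name
           matches only elements that belong to no namespace -->
      <xs:selector xpath="lib:book"/>
      <xs:field xpath="lib:isbn"/>
    </xs:unique>
  </xs:element>
</xs:schema>

In such a schema, an xs:keyref pointing at this constraint would also need the prefix, as in refer="lib:book", since the constraint name belongs to the target namespace.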