Our First Schema (XML Schema)

2.2. Our First Schema

We will see, in the course of this book, that there are many different styles for writing a schema, and there are even more approaches to deriving a schema from an instance document. For our first schema, we will adopt a style that is familiar to those of you who have already worked with DTDs. We'll start by creating a classified list of the elements and attributes found in the instance document.

The elements existing in our instance document are author, book, born, character, dead, isbn, library, name, qualification, and title, and the attributes are available, id, and lang.

We will build our first schema by defining each element in turn under our schema document element (named, unsurprisingly, schema), which belongs to the W3C XML Schema namespace (http://www.w3.org/2001/XMLSchema) and is usually prefixed as "xs."

Before we start, we need to classify the elements and, for this exercise, give some key definitions for understanding how W3C XML Schema does this classification. (We will see these definitions in more detail in the chapters about simple and complex types.)

The content model characterizes the types of child elements and text nodes that can be included in an element (without paying any attention to the attributes).

The content model is said to be "empty" when neither child elements nor text nodes are expected, "simple" when only text nodes are accepted, "complex" when only child elements are expected, and "mixed" when both text nodes and child elements can be present. Note that to determine the content model, we pay attention only to the element and text nodes and ignore any attributes, comments, or processing instructions that could be included. For instance, an element with some attributes, a comment, and a couple of processing instructions would have an "empty" content model if it has no text or element children.
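Our instance document contains no empty or mixed content, so the following sketch may help to picture these two cases. The cover and note elements, the file attribute, and their content are hypothetical and are not part of our library vocabulary:

```xml
<!-- "empty" content model: an attribute and a comment,
     but no text node and no child element -->
<cover file="cover.png"><!-- hypothetical element --></cover>

<!-- "mixed" content model: text nodes interleaved with a child element -->
<note>The character <name>Lucy</name> appears in this book.</note>
```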

Elements such as name, born, and title have simple content models:

.../...
        
  <title lang="en">
    Being a Dog Is a Full-Time Job
  </title>
.../...
        
  <name>
    Charles M Schulz
  </name>
        
  <born>
    1922-11-26
  </born>
.../...

Elements such as library or character have complex content models:

<library>
  <book id="b0836217462" available="true">
    .../...
  </book>
</library>

              
<character id="Lucy">
  <name>
    Lucy
  </name>
  <born>
    1952-03-03
  </born>
  <qualification>
    bossy, crabby and selfish
  </qualification>
</character>

Among the elements that have a simple content model, we can distinguish those that have attributes from those that have none. Later chapters discuss how W3C XML Schema can also represent empty and mixed content models.

W3C XML Schema considers the elements that have a simple content model and no attributes "simple types," while all the other elements (such as simple content with attributes and other content models) are "complex types." In other words, when an element can only have text nodes and doesn't accept any child elements or attributes, it is considered a simple type; in all the other cases, it is a complex type.

Attributes always have a simple type since they have no children and contain only a text value.

In our example, elements such as author or title have a complex type:

  <author id="CMS">
    <name>
      Charles M Schulz
    </name>
    <born>
      1922-11-26
    </born>
    <dead>
      2000-02-12
    </dead>
  </author>
.../...
              
  <title lang="en">
    Being a Dog Is a Full-Time Job
  </title>

Elements such as born or qualification (and, of course, all the attributes), on the other hand, have a simple type:

  <born>
    1922-11-26
  </born>
.../...
                        
  <qualification>
    brought classical music to the Peanuts strip
  </qualification>
.../... 

  <book available="true"/>

Now that we have criteria to classify our components, we can define each of them. Let's start with the simplest case by taking a simple type element, such as the name element that can be found in author or character:

<name>
  Charles M Schulz
</name>

To define such an element, we use an xs:element (global definition), included directly under the xs:schema document element:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="name" type="xs:string"/>
  .../...
</xs:schema>

The value used to reference the datatype (xs:string) is prefixed by xs, the prefix associated with W3C XML Schema. This means that xs:string is a predefined W3C XML Schema datatype.

The same can be done for all the other simple types as well as for the attributes:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="name" type="xs:string"/>
  <xs:element name="qualification" type="xs:string"/>
  <xs:element name="born" type="xs:date"/>
  <xs:element name="dead" type="xs:date"/>
  <xs:element name="isbn" type="xs:string"/>
  <xs:attribute name="id" type="xs:ID"/>
  <xs:attribute name="available" type="xs:boolean"/>
  <xs:attribute name="lang" type="xs:language"/>
  .../...
</xs:schema>

While we said that this design style would be familiar to DTD users, we must note that it is flatter than a DTD since the declaration of the attributes is done outside of the declaration of the elements. This results in a schema in which elements and attributes get fairly equal treatment. We will see, though, that when a schema describes an XML vocabulary that uses a namespace, this simple flat style is impossible to use most of the time.
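To give a first idea of why namespaces get in the way, here is a minimal sketch. It assumes a hypothetical target namespace (http://example.org/ns/library) and a hypothetical lib prefix, neither of which is used in our example. A globally declared attribute belongs to the target namespace of its schema, so an instance document has to write it with a prefix:

```xml
<!-- Schema sketch: the global id attribute is in the target namespace -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:lib="http://example.org/ns/library"
           targetNamespace="http://example.org/ns/library">
  <xs:attribute name="id" type="xs:ID"/>
  <xs:element name="character">
    <xs:complexType>
      <xs:attribute ref="lib:id"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

<!-- Instance sketch: an unprefixed id="Lucy" would not match the declaration -->
<lib:character xmlns:lib="http://example.org/ns/library" lib:id="Lucy"/>
```

Since most vocabularies leave their attributes unprefixed, declaring every attribute globally is rarely what we want in that situation.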

NOTE: The assimilation of simple type elements and attributes is a simplification compared to the XPath, DOM, and Infoset data models. These consider a simple type element to be an item having a single child item of type "character," and an attribute to be an item having a normalized value. The benefit of this simplification is that we can use simple datatypes to define simple type elements and attributes interchangeably and write in a consistent fashion:
  <xs:element name="isbn" type="xs:string"/>
                or
  <xs:attribute name="isbn" type="xs:string"/>

The order of the definitions in a schema isn't significant; we can now take the next step in terms of type complexity and define the title element that appears in the instance document as:

<title lang="en">
  Being a Dog Is a Full-Time Job
</title>

Since this element has an attribute, it has a complex type. Since it contains only a text node, it is considered to have simple content. We will, therefore, write its definition as:

<xs:element name="title">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <xs:attribute ref="lang"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>

The XML syntax makes it verbose, but this can almost be read as plain English as "the element named title has a complex type which is a simple content obtained by extending the predefined datatype xs:string by adding the attribute defined in this schema and having the name lang."
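As a sketch of what this definition accepts, the fragments below show how a validator would be expected to treat three candidate title elements. The second and third variants are made up for illustration and do not appear in our instance document:

```xml
<!-- valid: text content plus the lang attribute -->
<title lang="en">Being a Dog Is a Full-Time Job</title>

<!-- also valid: a referenced attribute is optional
     unless use="required" is specified on the reference -->
<title>Being a Dog Is a Full-Time Job</title>

<!-- not valid: simple content does not allow child elements -->
<title lang="en">Being a <name>Dog</name> Is a Full-Time Job</title>
```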

The remaining elements (library, book, author, and character) are all complex types with complex content. Each is defined by declaring the sequence of elements, and then the attributes, that compose it.

The library element, the most straightforward of them, is defined as:

<xs:element name="library">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="book" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

This definition can be read as "the element named library is a complex type composed of a sequence of one or more occurrences (note the maxOccurs attribute) of elements defined as having a name book."

The element author, which has an attribute and for which we may consider the date of death as optional, could be:

<xs:element name="author">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="name"/>
      <xs:element ref="born"/>
      <xs:element ref="dead" minOccurs="0"/>
    </xs:sequence>
    <xs:attribute ref="id"/>
  </xs:complexType>
</xs:element>

This means the element named author is a complex type composed of a sequence of three elements (name, born, and dead) and an id attribute. The dead element is optional: it may occur zero times.

The minOccurs and maxOccurs attributes, which we have seen in a couple of previous elements, allow us to define the minimum and maximum number of occurrences. Their default value is 1, which means that when they are both missing, the element must appear exactly one time in the sequence. The special value "unbounded" may be used for maxOccurs when the maximum number of occurrences is unlimited.
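The common combinations can be summarized in a single illustrative sequence. This fragment is only a sketch: it is not part of our schema, and the last particle uses arbitrary bounds chosen for the example. The first four cases correspond to what DTD users write with no suffix, "?", "*", and "+":

```xml
<xs:sequence>
  <!-- exactly one: minOccurs and maxOccurs both default to 1 -->
  <xs:element ref="name"/>
  <!-- zero or one (optional) -->
  <xs:element ref="dead" minOccurs="0"/>
  <!-- zero or more -->
  <xs:element ref="character" minOccurs="0" maxOccurs="unbounded"/>
  <!-- one or more -->
  <xs:element ref="book" maxOccurs="unbounded"/>
  <!-- between two and five: arbitrary bounds, for illustration only -->
  <xs:element ref="qualification" minOccurs="2" maxOccurs="5"/>
</xs:sequence>
```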

Within xs:complexType, the attributes need to be defined after the sequence. The remaining elements (book and character) can be defined in the same way, which leads us to the following full schema:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="name" type="xs:string"/>
  <xs:element name="qualification" type="xs:string"/>
  <xs:element name="born" type="xs:date"/>
  <xs:element name="dead" type="xs:date"/>
  <xs:element name="isbn" type="xs:string"/>
  <xs:attribute name="id" type="xs:ID"/>
  <xs:attribute name="available" type="xs:boolean"/>
  <xs:attribute name="lang" type="xs:language"/>
  <xs:element name="title">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute ref="lang"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
  <xs:element name="library">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="book" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="author">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="name"/>
        <xs:element ref="born"/>
        <xs:element ref="dead" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute ref="id"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="isbn"/>
        <xs:element ref="title"/>
        <xs:element ref="author" minOccurs="0" maxOccurs="unbounded"/> 
        <xs:element ref="character" minOccurs="0"
          maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute ref="id"/>
      <xs:attribute ref="available"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="character">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="name"/>
        <xs:element ref="born"/>
        <xs:element ref="qualification"/>
      </xs:sequence>
      <xs:attribute ref="id"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
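
To check our reading of this schema, we can assemble the fragments shown earlier into a small instance document that should validate against it. This is a sketch rather than the complete example document: the isbn value is an assumption inferred from the book's id attribute, and only one author and one character are included:

```xml
<?xml version="1.0"?>
<!-- Sketch of a conforming instance; the isbn value is assumed -->
<library>
  <book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Being a Dog Is a Full-Time Job</title>
    <author id="CMS">
      <name>Charles M Schulz</name>
      <born>1922-11-26</born>
      <dead>2000-02-12</dead>
    </author>
    <character id="Lucy">
      <name>Lucy</name>
      <born>1952-03-03</born>
      <qualification>bossy, crabby and selfish</qualification>
    </character>
  </book>
</library>
```

Note that the three id values are all different: since the id attribute has the type xs:ID, its values must be unique within the document.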