Webmaster in a Nutshell, 3rd Edition

10.4. Document Type Definitions

A DTD specifies how elements inside an XML document should relate to each other. It also provides grammar rules for the document and each of its elements. A document adhering to the XML specifications and the rules outlined by its DTD is considered to be valid. (Don't confuse this with a well-formed document, which adheres only to the XML syntax rules outlined earlier.)

10.4.1. Element Declarations

You must declare each of the elements that appear inside your XML document within your DTD. You can do so with the <!ELEMENT> declaration, which uses this format:

<!ELEMENT elementname rule>

This declares an XML element and an associated rule called a content model, which relates the element logically to the rest of the XML document. The element name should not include the <> characters. An element name must start with a letter or an underscore; after that, it can contain any number of letters, digits, hyphens, periods, or underscores. Element names may not start with the string xml in any combination of upper- or lowercase, because such names are reserved. A colon should appear in an element name only when you use namespaces, where it separates the prefix from the local name.

10.4.2. ANY and PCDATA

The simplest element declaration states that between the opening and closing tags of the element, anything can appear:

<!ELEMENT library ANY>

The ANY keyword allows you to include other valid tags and general character data within the element. However, you may want to specify a situation where you want only general characters to appear. This type of data is better known as parsed character data, or PCDATA. You can specify that an element contain only PCDATA with a declaration such as the following:

<!ELEMENT title (#PCDATA)>

Remember, this declaration means that any character data that is not an element can appear between the element tags. Therefore, it's legal to write the following in your XML document:

<title></title>
<title>Webmaster in a Nutshell</title>
<title>Java Network Programming</title>

However, the following is illegal with the previous PCDATA declaration:

<title>
Webmaster <emphasis>in a Nutshell</emphasis>
</title>

On the other hand, you may want to specify that another element must appear between the two tags specified. You can do this by placing the name of the element in the parentheses. The following two rules state that a <books> element must contain a <title> element, and a <title> element must contain parsed character data (or null content) but not another element:

<!ELEMENT books (title)>
<!ELEMENT title (#PCDATA)>

10.4.2.1. Multiple sequences

If you wish to dictate that multiple elements must appear in a specific order between the opening and closing tags of a specific element, you can use a comma (,) to separate the element names:

<!ELEMENT books (title, authors)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT authors (#PCDATA)>

In the preceding declaration, the DTD states that within the opening <books> and closing </books> tags, there must first appear a <title> element consisting of parsed character data. It must be immediately followed by an <authors> element containing parsed character data. The <authors> element cannot precede the <title> element.

Here is a valid XML document for the DTD excerpt defined previously:

<books>
   <title>Webmaster in a Nutshell, Third Edition</title>
   <authors>Stephen Spainhour and Robert Eckstein</authors>
</books>

The previous example showed how to specify both elements in a declaration. You can just as easily specify that one or the other appear (but not both) by using the vertical bar (|):

<!ELEMENT books (title|authors)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT authors (#PCDATA)>

This declaration states that either a <title> element or an <authors> element can appear inside the <books> element. Note that it must have one or the other. If you omit both elements or include both elements, the XML document is not considered valid. You can, however, use a recurrence operator to allow such an element to appear more than once. Let's talk about that now.

10.4.2.2. Grouping and recurrence

You can nest parentheses inside your declarations to give finer granularity to the syntax you're specifying. For example, the following DTD states that inside the <books> element, the XML document must contain either a <description> element or a <title> element immediately followed by an <author> element. All three elements must consist of parsed character data:

<!ELEMENT books ((title, author)|description)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT description (#PCDATA)>
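
As an illustration, each of the following two fragments would satisfy the declarations above; the text content is invented for the example. The first takes the (title, author) branch, and the second takes the description branch:

<books>
   <title>Webmaster in a Nutshell</title>
   <author>Stephen Spainhour</author>
</books>

<books>
   <description>A desktop quick reference for web authors</description>
</books>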

Now for the fun part: you are allowed to dictate inside an element declaration whether a single element (or a grouping of elements contained inside parentheses) must appear zero or one times, one or more times, or zero or more times. The characters used for this appear immediately after the target element (or element grouping) that they refer to and should be familiar to Unix shell programmers. Occurrence operators are shown in the following table:

Operator   Description
?          Must appear once or not at all (zero or one times)
+          Must appear at least once (one or more times)
*          May appear any number of times or not at all (zero or more times)

If you want to provide finer granularity to the <author> element, you can redefine the following in the DTD:

<!ELEMENT author (authorname+)>
<!ELEMENT authorname (#PCDATA)>

This indicates that the <author> element must have at least one <authorname> element under it. It is allowed to have more than one as well. You can define more complex relationships with parentheses:

<!ELEMENT reviews (rating, synopsis?, comments+)*>
<!ELEMENT rating ((tutorial|reference)*, overall)>
<!ELEMENT synopsis (#PCDATA)>
<!ELEMENT comments (#PCDATA)>
<!ELEMENT tutorial (#PCDATA)>
<!ELEMENT reference (#PCDATA)>
<!ELEMENT overall (#PCDATA)>
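
To see how these rules combine, here is a sketch of a fragment that should validate against the declarations above. The ratings and comments are made up. The outer group repeats twice: the first pass uses every optional piece, while the second omits the <synopsis> element and the <tutorial> and <reference> ratings:

<reviews>
   <rating>
      <tutorial>4</tutorial>
      <reference>5</reference>
      <overall>5</overall>
   </rating>
   <synopsis>A compact desktop reference.</synopsis>
   <comments>Clear examples.</comments>
   <comments>Good index.</comments>
   <rating>
      <overall>3</overall>
   </rating>
   <comments>Too terse in places.</comments>
</reviews>

Because the whole group carries the * operator, an empty <reviews/> element would also be valid.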

10.4.2.3. Mixed content

Using the rules of grouping and recurrence to their fullest allows you to create very useful elements that contain mixed content. Elements with mixed content contain child elements that can intermingle with PCDATA. The most obvious example of this is a paragraph:

<para>
This is a <emphasis>paragraph</emphasis> element. It
contains this <link ref="http://www.w3.org">link</link>
to the W3C. Their website is <emphasis>very</emphasis>
helpful.
</para>

Mixed content declarations look like this:

<!ELEMENT quote (#PCDATA|name|joke|soundbite)*>

This declaration allows a <quote> element to contain text (#PCDATA), <name> elements, <joke> elements, and/or <soundbite> elements in any order. You can't specify things such as:

<!ELEMENT memo (#PCDATA, from, #PCDATA, to, content)>

Once you include #PCDATA in a declaration, it must come first, any following elements must be separated by vertical bars (|), and the grouping must be optional and repeatable (*).
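
A declaration for the <memo> element that a validating parser would accept could look like the following sketch, which assumes that <from>, <to>, and <content> each hold only text. The trade-off is that the DTD can no longer enforce the order or the number of the child elements:

<!ELEMENT memo (#PCDATA|from|to|content)*>
<!ELEMENT from (#PCDATA)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT content (#PCDATA)>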

10.4.2.4. Empty elements

You must also declare each of the empty elements that can be used inside a valid XML document. This can be done with the EMPTY keyword:

<!ELEMENT elementname EMPTY>

For example, the following declaration defines an element in the XML document that can be used as <statuscode/> or <statuscode></statuscode>:

<!ELEMENT statuscode EMPTY>

10.4.3. Entities

Inside a DTD, you can declare an entity, which allows you to use an entity reference in an XML document as shorthand for a character or a series of characters, much like a macro.

10.4.3.1. General entities

A general entity is an entity that can substitute other characters inside the XML document. The declaration for a general entity uses the following format:

<!ENTITY name "replacement_characters">

We have already seen five general entity references, one for each of the characters <, >, &, ', and ". Each of these can be used inside an XML document to prevent the XML processor from interpreting the characters as markup. (Incidentally, you do not need to declare these in your DTD; they are always provided for you.)

Earlier, we provided an entity reference for the copyright character. We could declare such an entity in the DTD with the following:

<!ENTITY copyright "&#xA9;">

Again, we have tied the &copyright; entity to Unicode value 169 (or hexadecimal 0xA9), which is the circled-C (©) copyright character. You can then use the following in your XML document:

<copyright>
&copyright; 2001 by MyCompany, Inc.
</copyright>

There are a couple of restrictions to declaring entities:

  • You cannot make circular references in the declarations. For example, the following is invalid:

    <!ENTITY entitya "&entityb; is really neat!">
    <!ENTITY entityb "&entitya; is also really neat!">
  • You cannot substitute nondocument text in a DTD with a general entity reference. The general entity reference is resolved only in an XML document, not a DTD document. (If you wish to have an entity reference resolved in the DTD, you must instead use a parameter entity reference.)
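
Chained references are fine as long as they never loop back. In the following sketch, the company and notice entity names are hypothetical, and copyright is the entity declared earlier in this section:

<!ENTITY copyright "&#xA9;">
<!ENTITY company "MyCompany, Inc.">
<!ENTITY notice "&copyright; 2001 by &company;">

A document that uses &notice; would then receive the fully expanded text, with both inner references replaced.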

10.4.3.2. Parameter entities

Parameter entity references appear only in DTDs and are replaced by their entity definitions in the DTD. All parameter entity references begin with a percent sign, which denotes that they cannot be used in an XML document—only in the DTD in which they are defined. Here is how to define a parameter entity:

<!ENTITY % name "replacement_characters">

Here are some examples using parameter entity references:

<!ENTITY % pcdata "(#PCDATA)">
<!ELEMENT authortitle %pcdata;>

As with general entity references, you can't make circular references in declarations. In addition, a parameter entity must be declared before any reference to it appears.
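
Parameter entities are most useful when several declarations share the same piece of a content model. The following sketch is meant for an external DTD file; the inline entity and the note element are hypothetical names, while para, emphasis, and link are borrowed from the mixed content example shown earlier:

<!ENTITY % inline "emphasis | link">
<!ELEMENT para (#PCDATA | %inline;)*>
<!ELEMENT note (#PCDATA | %inline;)*>

Adding a new inline element then means changing only the entity declaration.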

10.4.3.4. Unparsed entities

By the same token, you can use an unparsed entity to declare non-XML content in an XML document. For example, if you want to declare an outside image to be used inside an XML document, you can specify the following in the DTD:

<!ENTITY image1 SYSTEM
      "http://www.oreilly.com/ora.gif" NDATA GIF89a>

Note that we also specify the NDATA (notation data) keyword, which tells exactly what type of unparsed entity the XML processor is dealing with. You typically use an unparsed entity reference as the value of an element's attribute, one defined in the DTD with the type ENTITY or ENTITIES. Here is how you should use the unparsed entity declared previously:

<image src="image1"/>

Note that we did not use an ampersand (&) or a semicolon (;). These are only used with parsed entities.

10.4.3.5. Notations

Finally, notations are used in conjunction with unparsed entities. A notation declaration simply matches the value of an NDATA keyword (GIF89a in our example) with more specific information. Applications are free to use or ignore this information as they see fit:

<!NOTATION GIF89a PUBLIC "-//CompuServe//NOTATION
      Graphics Interchange Format 89a//EN">

10.4.4. Attribute Declarations in the DTD

Attributes for various XML elements must be specified in the DTD. You can specify each of the attributes with the <!ATTLIST> declaration, which uses the following form:

<!ATTLIST target_element attr_name attr_type default>

The <!ATTLIST> declaration consists of the target element name, the name of the attribute, its datatype, and any default value you want to give it.

Here are some examples of legal <!ATTLIST> declarations:

<!ATTLIST box length CDATA "0">
<!ATTLIST box width CDATA "0">
<!ATTLIST frame visible (true|false) "true">
<!ATTLIST person marital
     (single | married | divorced | widowed) #IMPLIED>

In these examples, the first keyword after ATTLIST declares the name of the target element (i.e., <box>, <frame>, <person>). This is followed by the name of the attribute (i.e., length, width, visible, marital). This, in turn, is generally followed by the datatype of the attribute and its default value.
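
Assuming for illustration that <box>, <frame>, and <person> are all declared as EMPTY elements, the declarations above would permit fragments such as these:

<box length="10" width="20"/>
<box/>
<frame visible="false"/>
<person marital="married"/>
<person/>

In the second line, a validating parser supplies the default value of 0 for both length and width. In the last line, the marital attribute is simply absent, since #IMPLIED makes it optional.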

10.4.4.2. Datatypes

The following table lists legal datatypes to use in a DTD.

Type         Description
CDATA        Character data
enumerated   A series of values from which only one can be chosen
ENTITY       An entity declared in the DTD
ENTITIES     Multiple whitespace-separated entities declared in the DTD
ID           A unique element identifier
IDREF        The value of a unique ID type attribute
IDREFS       Multiple whitespace-separated IDREFs of elements
NMTOKEN      An XML name token
NMTOKENS     Multiple whitespace-separated XML name tokens
NOTATION     A notation declared in the DTD

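The CDATA keyword simply declares that any character data can appear in the attribute value, although the value must follow the same escaping rules as parsed character data (for example, < and & must be written as entity references). Here are some examples of attribute declarations that use CDATA: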

<!ATTLIST person name CDATA #REQUIRED>
<!ATTLIST person email CDATA #REQUIRED>
<!ATTLIST person company CDATA #FIXED "OReilly">

Here are two examples of enumerated datatypes where no keywords are specified. Instead, the possible values are simply listed:

<!ATTLIST person marital
   (single | married | divorced | widowed) #IMPLIED>
<!ATTLIST person sex (male | female) #REQUIRED>

The ID, IDREF, and IDREFS datatypes allow you to define attributes as IDs and ID references. An ID is simply an attribute whose value distinguishes the current element from all others in the current XML document. IDs are useful for applications to link to various sections of a document that contain an element with a uniquely tagged ID. IDREFs are attributes that reference other IDs. Consider the following XML document:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE sector SYSTEM "sector.dtd">
<sector>
   <employee empid="e1013">Jack Russell</employee>
   <employee empid="e1014">Samuel Tessen</employee>
   <employee empid="e1015" boss="e1013">
      Terri White</employee>
   <employee empid="e1016" boss="e1014">
      Steve McAlister</employee>
</sector>

and its DTD:

<!ELEMENT sector (employee*)>
<!ELEMENT employee (#PCDATA)>
<!ATTLIST employee empid ID #REQUIRED>
<!ATTLIST employee boss IDREF #IMPLIED>

Here, all employees have their own identification numbers (e1013, e1014, etc.), which we define in the DTD with the ID keyword using the empid attribute. This attribute then forms an ID for each <employee> element; no two <employee> elements can have the same ID.

Attributes that only reference other elements use the IDREF datatype. In this case, the boss attribute is an IDREF because it uses only the values of other ID attributes as its values. IDs will come into play when we discuss XLink and XPointer.

The IDREFS datatype is used if you want the attribute to refer to more than one ID in its value. The IDs must be separated by whitespace. For example, adding this to the DTD:

<!ATTLIST employee managers IDREFS #REQUIRED>

allows you to legally use the XML:

<employee empid="e1016" boss="e1014"
          managers="e1014 e1013">
    Steve McAlister
</employee>

The NMTOKEN and NMTOKENS attribute types declare XML name tokens. An XML name token is built from the same characters as an XML name: letters, digits, underscores, hyphens, and periods. It can contain a colon if it is part of a namespace. It may not contain whitespace. Unlike an XML name, however, a name token may begin with any of the permitted characters (e.g., .profile is a legal XML name token, but not a legal XML name). These datatypes are useful if you enumerate tokens of languages or other keyword sets that match these restrictions in the DTD.
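
As a sketch, the following declarations use both types. The article element and its lang and keywords attributes are hypothetical names chosen for this example:

<!ELEMENT article EMPTY>
<!ATTLIST article lang NMTOKEN #IMPLIED>
<!ATTLIST article keywords NMTOKENS #IMPLIED>

A matching element could then be written as follows. Note that .profile is acceptable here because it only needs to be a name token, not a name:

<article lang="en-US" keywords="xml dtd .profile"/>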

The attribute types ENTITY and ENTITIES allow you to exploit an entity declared in the DTD. This includes unparsed entities. For example, you can link to an image as follows:

<!ELEMENT image EMPTY>
<!ATTLIST image src ENTITY #REQUIRED>
<!ENTITY chapterimage SYSTEM "chapimage.jpg" NDATA jpg>

You can use the image as follows:

<image src="chapterimage"/>

The ENTITIES datatype allows multiple whitespace-separated references to entities, much like IDREFS and NMTOKENS allow multiple references to their datatypes.
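
For example, a DTD might declare two unparsed entities and an attribute that can point to both. This is only a sketch: the gallery element, the pictures attribute, the entity names, the filenames, and the notation's system identifier are all made up for illustration:

<!NOTATION jpg SYSTEM "image/jpeg">
<!ENTITY coverfront SYSTEM "front.jpg" NDATA jpg>
<!ENTITY coverback SYSTEM "back.jpg" NDATA jpg>
<!ELEMENT gallery EMPTY>
<!ATTLIST gallery pictures ENTITIES #REQUIRED>

The attribute value then lists the entity names separated by whitespace:

<gallery pictures="coverfront coverback"/>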

The NOTATION keyword simply expects a notation that appears in the DTD with a <!NOTATION> declaration. Here, the player attribute of the <media> element can be either mpeg or jpeg:

<!NOTATION mpeg SYSTEM "mpegplay.exe">
<!NOTATION jpeg SYSTEM "netscape.exe">
<!ATTLIST media player
      NOTATION (mpeg | jpeg) #REQUIRED>

Note that you must enumerate each of the notations allowed in the attribute. For example, to dictate the possible values of the player attribute of the <media> element, use the following:

<!NOTATION mpeg SYSTEM "mpegplay.exe">
<!NOTATION jpeg SYSTEM "netscape.exe">
<!NOTATION mov SYSTEM "mplayer.exe">
<!NOTATION avi SYSTEM "mplayer.exe">
<!ATTLIST media player
      NOTATION (mpeg | jpeg | mov) #REQUIRED>

Note that according to the rules of this DTD, the <media> element is not allowed to play AVI files. The NOTATION keyword is rarely used.

Finally, you can place all the ATTLIST entries for an element inside a single ATTLIST declaration, as long as you follow the rules of each datatype:

<!ATTLIST person
          name CDATA #REQUIRED
          number IDREF #REQUIRED
          company CDATA #FIXED "OReilly">

10.4.5. Included and Ignored Sections

Within a DTD, you can bundle together a group of declarations that should be ignored using the IGNORE directive:

<![IGNORE[
   DTD content to be ignored
]]>

Conversely, if you wish to ensure that certain declarations are included in your DTD, use the INCLUDE directive, which has a similar syntax:

<![INCLUDE[
   DTD content to be included
]]>

Why you would want to use either of these declarations is not obvious until you consider replacing the INCLUDE or IGNORE directives with a parameter entity reference that can be changed easily on the spot. For example, consider the following DTD:

<?xml version="1.0" encoding="iso-8859-1"?>
<![%book;[
   <!ELEMENT text (chapter+)>
]]>
<![%article;[
   <!ELEMENT text (section+)>
]]>
<!ELEMENT chapter (section+)>
<!ELEMENT section (p+)>
<!ELEMENT p (#PCDATA)>

Depending on the values of the entities book and article, the definition of the text element will be different:

  • If book has the value INCLUDE and article has the value IGNORE, then the text element must include chapters (which in turn may contain sections that themselves include paragraphs).

  • But if book has the value IGNORE and article has the value INCLUDE, then the text element must include sections.

When writing an XML document based on this DTD, you may write either a book or an article simply by properly defining book and article entities in the document's internal subset.
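
As a sketch, a document that wants the book structure could begin as follows, assuming the DTD above is saved in a file named text.dtd (the filename and the sample text are hypothetical):

<!DOCTYPE text SYSTEM "text.dtd" [
   <!ENTITY % book "INCLUDE">
   <!ENTITY % article "IGNORE">
]>
<text>
   <chapter>
      <section>
         <p>The first paragraph of the first chapter.</p>
      </section>
   </chapter>
</text>

Swapping the two entity values would switch the same DTD over to the article structure.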

10.4.5.1. Internal subsets

You can place parts of your DTD declarations inside the DOCTYPE declaration of the XML document, as shown:

<!DOCTYPE boilerplate SYSTEM "generic-inc.dtd" [
   <!ENTITY corpname "Acme, Inc.">
]>

The region between brackets is called the DTD's internal subset. When a parser reads the DTD, the internal subset is read first, followed by the external subset, which is the file referenced by the DOCTYPE declaration.

There are restrictions on the complexity of the internal subset, as well as processing expectations that affect how you should structure it:

  • Conditional sections (such as INCLUDE or IGNORE) are not permitted in an internal subset.

  • Any parameter entity reference in the internal subset must expand to zero or more declarations. For example, specifying the following parameter entity reference is legal:

    %paradecl;

    as long as %paradecl; expands to the following:

    <!ELEMENT para (#PCDATA)>

    However, if you simply write the following in the internal subset, it is considered illegal because it does not expand to a whole declaration:

    <!ELEMENT para (%paracont;)>

Nonvalidating parsers aren't required to read the external subset and process its contents, but they are required to process any attribute defaults and entity declarations in the internal subset. However, a parameter entity can change the meaning of those declarations in a way the parser cannot resolve without reading it. Therefore, a parser must stop processing the attribute list and entity declarations in the internal subset when it reaches the first external parameter entity reference that it does not process. If the reference is to an internal entity, the parser can expand it; if the parser chooses to fetch an external entity, it can also continue processing. If it does not process the entity's replacement text, it must not process the attribute list or entity declarations that follow in the internal subset.

Why use this? Since some entity declarations are often relevant only to a single document (for example, declarations of chapter entities or other content files), the internal subset is a good place to put them. Similarly, if a particular document needs to override or alter the DTD values it uses, you can place a new definition in the internal subset. Finally, in the event that an XML processor is nonvalidating (as we mentioned previously), the internal subset is the best place to put certain DTD-related information, such as the identification of ID and IDREF attributes, attribute defaults, and entity declarations.
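
As a sketch of the override case, suppose generic-inc.dtd also declares a corpname entity with some placeholder value, and that the <boilerplate> element is allowed to contain text (both are assumptions made for this example). Because the internal subset is read first and the first declaration of an entity is the one that counts, the document's own value is used:

<!DOCTYPE boilerplate SYSTEM "generic-inc.dtd" [
   <!ENTITY corpname "Acme, Inc.">
]>
<boilerplate>This document is the property of &corpname;.</boilerplate>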



Copyright © 2003 O'Reilly & Associates. All rights reserved.






