9.3 The Document Type Definition (DTD)

If we're going to use XML to exchange documents electronically, we must be able to judge whether a document meets a certain set of necessary requirements. For example, an electronic invoice must, at minimum, include an invoice number, a date, and at least one item. Our systems should be smart enough to reject an invoice if it doesn't contain the required information. Additionally, we should be able to create these requirements ourselves.

You can associate a document type definition (DTD) with an XML document to enforce these sorts of rules. You can either create a DTD or use one that already exists. A major goal of XML is to encourage various groups (industry, community, academic, etc.) to form standards bodies to define collective DTDs. Eventually, these DTDs will form the basis for a variety of electronic data exchange systems.

A DTD is a lot like a database schema.[3] Just as you would define the columns in a database table, you can use a DTD to define the name and datatype of every element that can appear in an XML document. Just as you define a column constraint, you can require that particular elements appear within the document. Just as you would normalize a set of database tables into one-to-many or one-to-one relationships, you can create the same relationships by defining how the elements can be hierarchically nested.

[3] Oracle Corporation is an active participant in the World Wide Web Consortium's (W3C) "XML Schema" working group. The W3C oversees the development of many of the major Web standards.

Let's revisit the invoice example from the beginning of this chapter. If we were to model a basic invoice using an entity-relationship diagram (ERD), we might wind up with something like Figure 9.2.

Figure 9.2: An ERD for a simple invoice

We can use this diagram as a guide to constructing a corresponding DTD. For clarity, though, we'll start with the finished DTD and work backwards:

<!ELEMENT INVOICE (INVOICE_NUMBER, DATE, CUSTOMER+, INVOICE_ITEMS, TOTAL?)>
   <!ELEMENT INVOICE_NUMBER (#PCDATA)>
   <!ELEMENT DATE (#PCDATA)>
   <!ELEMENT CUSTOMER (#PCDATA)>
   <!ELEMENT INVOICE_ITEMS (ITEM+)>
      <!ELEMENT ITEM (ITEM_NAME, QUANTITY, PRICE)>
         <!ELEMENT ITEM_NAME (#PCDATA)>
         <!ATTLIST ITEM_NAME
            ITEM_NUM CDATA #REQUIRED>
         <!ELEMENT QUANTITY (#PCDATA)>
         <!ELEMENT PRICE (#PCDATA)>
   <!ELEMENT TOTAL (#PCDATA)>

As the example shows, most of the DTD consists of declarations that define the elements that can appear within an invoice. The first line declares the root element, INVOICE, the highest element in the nesting tree, along with the names of all the elements that INVOICE can contain. Each element name in that list can be followed by a single character that indicates how often the element can appear. Table 9.1 summarizes the function of each character.

Table 9.1: Characters Used to Define Element Occurrences
Character	Translation	Rough Database Equivalent
Blank	Element must appear exactly once.	Non-NULL column constraint
`?`	Element can appear 0 or 1 times.	Constraint/one-to-one relationship
`*`	Element can appear 0 or more times.	Constraint/one-to-many relationship
`+`	Element can appear 1 or more times.	Constraint/one-to-many relationship

As we can see from the preceding code example, the INVOICE must include an INVOICE_NUMBER, an invoice DATE, at least one CUSTOMER (the + character leaves open our double-billing options), and an INVOICE_ITEMS section. Finally, it can include an optional invoice TOTAL (why should you have to do all the work?).
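The occurrence characters behave much like regular-expression quantifiers. The following is a minimal sketch of that idea, not a real DTD validator: it assumes a flat, comma-separated content model (no choice groups or nested parentheses), and the function names content_model_to_regex and children_match are hypothetical helpers invented for this illustration.

```python
import re
from typing import List


def content_model_to_regex(model: str) -> re.Pattern:
    """Translate a flat content model such as "A, B+, C?" into a regex.

    Choice groups (|) and nested parentheses are out of scope here.
    """
    parts = []
    for token in model.split(","):
        token = token.strip()
        # A trailing ?, * or + is the occurrence character;
        # a blank means the element must appear exactly once.
        indicator = token[-1] if token[-1] in "?*+" else ""
        name = token.rstrip("?*+")
        # Each child name is followed by one space that acts as a separator.
        parts.append(f"(?:{re.escape(name)} ){indicator}")
    return re.compile("".join(parts))


def children_match(model: str, children: List[str]) -> bool:
    """Return True if the ordered child element names satisfy the model."""
    sequence = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None


INVOICE_MODEL = "INVOICE_NUMBER, DATE, CUSTOMER+, INVOICE_ITEMS, TOTAL?"

# TOTAL is optional, so this sequence is accepted.
print(children_match(
    INVOICE_MODEL, ["INVOICE_NUMBER", "DATE", "CUSTOMER", "INVOICE_ITEMS"]))

# CUSTOMER must appear at least once, so this sequence is rejected.
print(children_match(
    INVOICE_MODEL, ["INVOICE_NUMBER", "DATE", "INVOICE_ITEMS", "TOTAL"]))
```

The first call prints True and the second prints False. Note that order matters as well as count: the comma in a content model means the elements must appear in the sequence given.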

Declarations for each of these elements follow the root declaration. The first three (INVOICE_NUMBER, DATE, and CUSTOMER) are the simplest kind of declaration, consisting of a name and a datatype. XML datatypes are much more limited than the standard NUMBER, VARCHAR2, and RAW types used to define table columns. The datatype used here (#PCDATA) stands for parsed character data; it tells the XML parser that the element contains plain text rather than child elements.

The next declaration, INVOICE_ITEMS, is an example of a nested element (notice how similar it is to the declaration for the root element). The INVOICE_ITEMS section must contain at least one ITEM, which is itself a nested structure consisting of an ITEM_NAME, a QUANTITY, and a PRICE. As a final wrinkle, the ATTLIST declaration further refines the <ITEM_NAME> tag by defining a required attribute called ITEM_NUM.
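To make the nesting concrete, here is a small sketch of what an invoice following this structure might look like, read with Python's standard xml.etree.ElementTree module. The invoice values (number, date, customer, item) are invented for illustration, and ElementTree only parses the document; it does not check it against a DTD.

```python
import xml.etree.ElementTree as ET

# A hypothetical invoice instance; all values are made up for this example.
SAMPLE_INVOICE = """\
<INVOICE>
  <INVOICE_NUMBER>1001</INVOICE_NUMBER>
  <DATE>1999-06-01</DATE>
  <CUSTOMER>Acme Corp.</CUSTOMER>
  <INVOICE_ITEMS>
    <ITEM>
      <ITEM_NAME ITEM_NUM="A-17">Widget</ITEM_NAME>
      <QUANTITY>3</QUANTITY>
      <PRICE>9.95</PRICE>
    </ITEM>
  </INVOICE_ITEMS>
</INVOICE>
"""

root = ET.fromstring(SAMPLE_INVOICE)

# Walk the nested structure: INVOICE -> INVOICE_ITEMS -> ITEM -> children.
for item in root.find("INVOICE_ITEMS").findall("ITEM"):
    name = item.find("ITEM_NAME")
    # ITEM_NUM is an attribute of the ITEM_NAME tag, not a child element.
    print(name.get("ITEM_NUM"), name.text,
          item.findtext("QUANTITY"), item.findtext("PRICE"))
```

Running this prints "A-17 Widget 3 9.95". The optional TOTAL element is omitted here, which the root declaration allows.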

That's it -- we've defined everything we need for our simple example: the name of each element, the number of times each element can appear, and the allowable nesting arrangements they can follow. All that remains now is to make sure our XML documents are valid, which means that they are both well-formed and comply with the associated DTD. This is the job of a validating XML parser.
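The difference between well-formed and valid is easy to see in a short sketch. The example below uses Python's standard xml.etree.ElementTree module, which checks well-formedness only and does not validate against a DTD; the helper name is_well_formed and both sample strings are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML; no DTD rules are checked."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Well-formed, but it would not be valid against the invoice DTD:
# DATE, CUSTOMER, and INVOICE_ITEMS are all missing.
missing_elements = "<INVOICE><INVOICE_NUMBER>1001</INVOICE_NUMBER></INVOICE>"

# Not even well-formed: INVOICE_NUMBER is never closed.
unclosed_tag = "<INVOICE><INVOICE_NUMBER>1001</INVOICE>"

print(is_well_formed(missing_elements))
print(is_well_formed(unclosed_tag))
```

The first call prints True and the second prints False. Rejecting the first document as well requires a parser that reads the DTD and enforces it.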