Document Type Definition (CGI Programming with Perl)

14.3. Document Type Definition

A document type definition (DTD) tells us how the XML document is structured and how the tags relate to one another. Notice that the second line in the quiz XML example is a <!DOCTYPE> declaration. This declaration references a file that contains the DTD for this XML structure. Generally, the <!DOCTYPE> declaration is used when an XML parser needs to validate the XML against a stricter definition than well-formedness alone.

For example, the XML shown above could easily be parsed without the DTD. However, the DTD may offer additional hints to the XML parser to further validate the document. Here's a sample quiz.dtd file:

<?xml version="1.0"?>
<!ELEMENT QUIZ (QUESTION*)>
<!ELEMENT QUESTION (ASK+,CHOICE*,ANSWER+,RESPONSE+)>
<!ATTLIST QUESTION
  TYPE CDATA #REQUIRED>

<!ELEMENT ASK (#PCDATA)>
<!ELEMENT CHOICE EMPTY>
    <!ATTLIST CHOICE
         VALUE CDATA #REQUIRED
         TEXT CDATA #REQUIRED>
<!ELEMENT ANSWER (#PCDATA)>
<!ELEMENT RESPONSE (#PCDATA)>
    <!-- VALUE and STATUS are assumed to be optional (#IMPLIED) -->
    <!ATTLIST RESPONSE
         VALUE CDATA #IMPLIED
         STATUS CDATA #IMPLIED>

The <!ELEMENT> tags describe the actual tags that are valid in the XML document. In this case, <QUIZ>, <QUESTION>, <ASK>, <CHOICE>, <ANSWER>, and <RESPONSE> tags are available for use in an XML document compliant with the quiz.dtd file.

The parentheses after the name of the element show what it can contain; when the child elements are separated by commas, they must appear in that order. The * symbol is an occurrence indicator. It follows the same basic rules as the quantifiers in regular expressions. For example, a * symbol indicates that zero or more of that element may be contained. If we wanted to indicate zero or one, we would have placed a ? in place of the *. Likewise, if we wanted to indicate that one or more of that element has to be contained inside the tag, then we would have used +. An element name with no indicator must appear exactly once. #PCDATA is used to indicate that the element contains parsed character data, that is, text rather than child elements.
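
As a sketch of how these occurrence indicators combine, the fragment below declares a hypothetical NOTE element. None of these element names appear in quiz.dtd; they are made up purely for illustration.

```xml
<!-- Hypothetical elements for illustration only; not part of quiz.dtd -->

<!-- TITLE exactly once, then an optional AUTHOR, then one or more PARA
     elements, then zero or more FOOTNOTE elements, in that order -->
<!ELEMENT NOTE (TITLE, AUTHOR?, PARA+, FOOTNOTE*)>

<!-- Each child element holds only character data -->
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT AUTHOR (#PCDATA)>
<!ELEMENT PARA (#PCDATA)>
<!ELEMENT FOOTNOTE (#PCDATA)>
```

Under this declaration, a NOTE holding only a TITLE and one PARA would be valid, while a NOTE with two TITLE elements or with no PARA would not.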

For this example, the QUIZ element expects to contain zero or more QUESTION elements, while each QUESTION element expects to contain at least one each of the ASK, ANSWER, and RESPONSE elements, in that order. A question can also have zero or more CHOICE elements, which fall between the ASK and ANSWER elements. Furthermore, the CHOICE element definition later in the DTD uses the EMPTY keyword to indicate that it is a single tag that appears by itself; it does not enclose anything. The ASK element contains character data only.
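
To make these rules concrete, here is a small document that should satisfy the content models in quiz.dtd. The question text, the TYPE value, and the attribute values are invented for this sketch and are not taken from the quiz example earlier in the chapter.

```xml
<?xml version="1.0"?>
<!DOCTYPE QUIZ SYSTEM "quiz.dtd">
<QUIZ>
  <!-- TYPE is required; the value "Multiple" is an invented example -->
  <QUESTION TYPE="Multiple">
    <!-- ASK+ : at least one, and it must come first -->
    <ASK>Which function returns the length of a string?</ASK>
    <!-- CHOICE* : zero or more empty elements carrying only attributes -->
    <CHOICE VALUE="a" TEXT="length"/>
    <CHOICE VALUE="b" TEXT="size"/>
    <!-- ANSWER+ : at least one -->
    <ANSWER>a</ANSWER>
    <!-- RESPONSE+ : at least one -->
    <RESPONSE VALUE="a" STATUS="CORRECT">Yes, length returns the number of characters.</RESPONSE>
    <RESPONSE VALUE="b" STATUS="INCORRECT">No, Perl has no built-in size function.</RESPONSE>
  </QUESTION>
</QUIZ>
```

Moving the ANSWER element above the CHOICE elements would make the document invalid, because the content model fixes the order of the children.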

After each element is defined, its attributes need to be laid out in an <!ATTLIST> declaration. Questions have a TYPE attribute of type CDATA, which means it takes a string of character data. Furthermore, the #REQUIRED keyword indicates that this attribute must be present in the XML document. The other attribute definitions in the quiz.dtd file follow a similar pattern.
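
Attribute declarations can express more than required strings. The following sketch shows three common forms side by side. It is a hypothetical variation on the QUESTION declaration: the LEVEL and SCORE attributes, as well as the enumerated TYPE values, are assumptions that do not appear in quiz.dtd.

```xml
<!-- Hypothetical variation; quiz.dtd itself declares only TYPE CDATA #REQUIRED -->
<!ATTLIST QUESTION
  TYPE  (MULTIPLE | TEXT) #REQUIRED
  LEVEL CDATA             #IMPLIED
  SCORE CDATA             "1">
<!-- TYPE:  must be present, and must be one of the two listed values -->
<!-- LEVEL: optional; no default is supplied when it is omitted -->
<!-- SCORE: optional; a parser that reads this declaration supplies "1"
            when the attribute is omitted -->
```

An enumerated type such as the one sketched for TYPE lets a validating parser reject misspelled values, which a plain CDATA attribute cannot do.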

The DTD file is optional. You can still parse an XML document without a document type definition. However, with the DTD, the XML parser has a set of rules on which to base its validation. Maintaining these validation rules centrally allows the XML format to change without requiring as many changes to the parser code. Parsers that do not check a document against its DTD are called non-validating XML parsers; the standard Perl module for parsing XML documents, XML::Parser, is a non-validating XML parser.

Presumably, anybody writing a quiz will use an editor that checks their XML against the DTD, or will run the document through a validating program. Thus, we assume that our program will never encounter a question that does not contain an answer, or some other violation of the DTD.
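
For instance, the following document is well-formed but would fail validation against quiz.dtd, because its only QUESTION has no ANSWER element. The content is made up for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE QUIZ SYSTEM "quiz.dtd">
<QUIZ>
  <QUESTION TYPE="Text">
    <ASK>What does CGI stand for?</ASK>
    <!-- Invalid: the content model requires at least one ANSWER here -->
    <RESPONSE VALUE="Common Gateway Interface" STATUS="CORRECT">That is right.</RESPONSE>
  </QUESTION>
</QUIZ>
```

A non-validating parser would accept this document without complaint, which is why the program depends on validation having happened before the document reaches it.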

When a program knows the structure of an XML document from a DTD, it can make other assumptions about how to display that data. For example, a browser could be programmed so that when a quiz document is encountered, it displays the available questions in a list even if only one question is present in the document itself. Because the DTD tells us that it is possible for many questions to appear in the file, the browser can determine the context in which to display the data in the XML document.

The ability to decouple validation rules from the parser is especially important on the Web. With the potential for many people to write code that draws information from an XML data source, any type of mechanism that prevents changes in the XML definition from breaking those parsers will make for a more robust network.