Document Validation (Perl and XML)

3.7.1. DTDs

Document type descriptions (DTDs) are documents written in a special markup language defined in the XML specification, though they themselves are not XML. Everything within these documents is a declaration starting with a <! delimiter and comes in four flavors: elements, attributes, entities, and notations.

Example 3-8 is a very simple DTD.

Example 3-8. A wee little DTD

<!ELEMENT memo (to, from, message)>
<!ATTLIST memo priority (urgent|normal|info) 'normal'>
<!ENTITY % text-only "(#PCDATA)*">
<!ELEMENT to %text-only;>
<!ELEMENT from %text-only;>
<!ELEMENT message (#PCDATA | emphasis)*>
<!ELEMENT emphasis %text-only;>
<!ENTITY myname "Bartholomus Chiggin McNugget">

This DTD declares five elements, an attribute for the <memo> element, a parameter entity to make other declarations cleaner, and an entity that can be used inside a document instance. Based on this information, a validating parser can reject or approve a document. The following document would pass muster:

<!DOCTYPE memo SYSTEM "/dtdstuff/memo.dtd">
<memo priority="info">
  <to>Sara Bellum</to>
  <from>&myname;</from>
  <message>Stop reading memos and get back to work!</message>
</memo>

If you removed the <to> element from the document, it would suddenly become invalid. A well-formedness checker wouldn't give a hoot about missing elements. Thus, you see the value of validation.

Because DTDs are so easy to parse, some general XML processors include the ability to validate the documents they parse against DTDs. XML::LibXML is one such parser. A very simple validating parser is shown in Example 3-9.

Example 3-9. A validating parser

use XML::LibXML;
use IO::Handle;

# initialize the parser
my $parser = new XML::LibXML;

# open a filehandle and parse
my $fh = new IO::Handle;
if( $fh->fdopen( fileno( STDIN ), "r" )) {
    my $doc = $parser->parse_fh( $fh );
    if( $doc and $doc->is_valid ) {
        print "Yup, it's valid.\n";
    } else {
        print "Yikes! Validity error.\n";
    }
    $fh->close;
}

This parser would be simple to add to any program that requires valid input documents. Unfortunately, it doesn't give any information about what specific problem makes it invalid (e.g., an element in an improper place), so you wouldn't want to use it as a general-purpose validity checking tool.[19] T. J. Mather's XML::Checker is a better module for reporting specific validation errors.

[19]The authors prefer to use a command-line tool called nsgmls available from http://www.jclark.com/. Public web sites, such as http://www.stg.brown.edu/service/xmlvalid/, can also validate arbitrary documents. Note that, in these cases, the XML document must have a DOCTYPE declaration, whose system identifier (if it has one) must contain a resolvable URL and not a path on your local system.

3.7.2. Schemas

DTDs have limitations; they aren't able to check what kind of character data is in an element and if it matches a particular pattern. What if you wanted a parser to tell you if a <date> element has the wrong format for a date, or if it contains a street address by mistake? For that, you need a solution such as XML Schema. XML Schema is a second generation of DTD and brings more power and flexibility to validation.

As noted in Chapter 2, "An XML Recap", XML Schema enjoys the dubious distinction among the XML-related W3C specification family for being the most controversial schema (at least among hackers). Many people like the concept of schemas, but many don't approve of the XML Schema implementation, which is seen as too cumbersome or constraining to be used effectively.

Alternatives to XML Schema include OASIS-Open's RelaxNG (http://www.oasis-open.org/committees/relax-ng/) and Rick Jelliffe's Schematron (http://www.ascc.net/xml/resource/schematron/schematron.html). Like XML Schema, these specifications detail XML-based languages used to describe other XML-based languages and let a program that knows how to speak that schema use it to validate other XML documents. We find Schematron particularly interesting because it has had a Perl module attached to it for a while (in the form of Kip Hampton's XML::Schematron family).

Schematron is especially interesting to many Perl and XML hackers because it builds on existing popular XML technologies that already have venerable Perl implementations. Schematron defines a very simple language with which you list and group together assertions of what things should look like based on XPath expressions. Instead of a forward-looking grammar that must list and define everything that can possibly appear in the document, you can choose to validate a fraction of it. You can also choose to have elements and attributes validate based on conditions involving anything anywhere else in the document (wherever an XPath expression can reach). In practice, a Schematron document looks and feels like an XSLT stylesheet, and with good reason: it's intended to be fully implementable by way of XSLT. In fact, two of the XML::Schematron Perl modules work by first transforming the user-specified schema document into an XSLT sheet, which it then simply passes through an XSLT processor.

Schematron lacks any kind of built-in data typing, so you can't, for example, do a one-word check to insist that an attribute conforms to the W3C date format. You can, however, have your Perl program make a separate step using any method you'd like (perhaps through the XML::XPath module) to come through date attributes and run a good old Perl regular expression on them. Also note that no schema language will ever provide a way to query an element's content against a database, or perform any other action outside the realm of the document. This is where mixing Perl and schemas can come in very handy.

3.7. Document Validation

3.7.1. DTDs

Example 3-8. A wee little DTD

Example 3-9. A validating parser

3.7.2. Schemas