Document Validation (Perl and XML)

<!ELEMENT memo (to, from, message)> <!ATTLIST memo priority (urgent|normal|info) 'normal'> <!ENTITY % text-only "(#PCDATA)*"> <!ELEMENT to %text-only;> <!ELEMENT from %text-only;> <!ELEMENT message (#PCDATA | emphasis)*> <!ELEMENT emphasis %text-only;> <!ENTITY myname "Bartholomus Chiggin McNugget">

use XML::LibXML; use IO::Handle; # initialize the parser my $parser = new XML::LibXML; # open a filehandle and parse my $fh = new IO::Handle; if( $fh->fdopen( fileno( STDIN ), "r" )) { my $doc = $parser->parse_fh( $fh ); if( $doc and $doc->is_valid ) { print "Yup, it's valid.\n"; } else { print "Yikes! Validity error.\n"; } $fh->close; }

3.7.2. Schemas

DTDs have limitations; they aren't able to check what kind of character data is in an element and if it matches a particular pattern. What if you wanted a parser to tell you if a <date> element has the wrong format for a date, or if it contains a street address by mistake? For that, you need a solution such as XML Schema. XML Schema is a second generation of DTD and brings more power and flexibility to validation.

As noted in Chapter 2, "An XML Recap", XML Schema enjoys the dubious distinction among the XML-related W3C specification family for being the most controversial schema (at least among hackers). Many people like the concept of schemas, but many don't approve of the XML Schema implementation, which is seen as too cumbersome or constraining to be used effectively.

Alternatives to XML Schema include OASIS-Open's RelaxNG (http://www.oasis-open.org/committees/relax-ng/) and Rick Jelliffe's Schematron (http://www.ascc.net/xml/resource/schematron/schematron.html). Like XML Schema, these specifications detail XML-based languages used to describe other XML-based languages and let a program that knows how to speak that schema use it to validate other XML documents. We find Schematron particularly interesting because it has had a Perl module attached to it for a while (in the form of Kip Hampton's XML::Schematron family).

Schematron is especially interesting to many Perl and XML hackers because it builds on existing popular XML technologies that already have venerable Perl implementations. Schematron defines a very simple language with which you list and group together assertions of what things should look like based on XPath expressions. Instead of a forward-looking grammar that must list and define everything that can possibly appear in the document, you can choose to validate a fraction of it. You can also choose to have elements and attributes validate based on conditions involving anything anywhere else in the document (wherever an XPath expression can reach). In practice, a Schematron document looks and feels like an XSLT stylesheet, and with good reason: it's intended to be fully implementable by way of XSLT. In fact, two of the XML::Schematron Perl modules work by first transforming the user-specified schema document into an XSLT sheet, which it then simply passes through an XSLT processor.

Schematron lacks any kind of built-in data typing, so you can't, for example, do a one-word check to insist that an attribute conforms to the W3C date format. You can, however, have your Perl program make a separate step using any method you'd like (perhaps through the XML::XPath module) to come through date attributes and run a good old Perl regular expression on them. Also note that no schema language will ever provide a way to query an element's content against a database, or perform any other action outside the realm of the document. This is where mixing Perl and schemas can come in very handy.

3.7. Document Validation

3.7.1. DTDs

Example 3-8. A wee little DTD

Example 3-9. A validating parser

3.7.2. Schemas