Remember that you don't need to validate every XML
document that passes over your desk. DTDs and other validation
schemes shine when working with specific XML-based markup languages
(such as XHTML for web pages, MathML for equations, or CaveML for
spelunking), which have strict rules about which elements and
attributes go where. With such languages, an automated way to draw
attention to something fishy in the document structure becomes a
feature.
However, validation usually isn't crucial when you
use Perl and XML to perform a less specific task, such as tossing
together XML documents on the fly based on some other, less sane data
format, or when ripping apart and analyzing existing XML documents.
Basically, if you feel that validation is a needless step for the job
at hand, you're probably right. However, if you
knowingly generate or modify some flavor of XML that needs to stick
to a defined standard, then taking the extra step or three necessary
to perform document validation is probably wise. Your toolbox,
naturally, gives you lots of ways to do this. Read on.
3.7.2. Schemas
DTDs have
limitations; they can't check what kind of
character data an element contains or whether it matches a particular
pattern. What if you wanted a parser to tell you if a
<date> element has the wrong format for a
date, or if it contains a street address by mistake? For that, you
need a solution such as XML Schema. XML Schema is a second-generation
successor to the DTD and brings more power and flexibility to validation.
As noted in Chapter 2, "An XML Recap", XML Schema enjoys the
dubious distinction of being the most controversial member of the
XML-related W3C specification family (at least among hackers).
Many people like the concept of schemas, but many
don't approve of the XML Schema implementation,
which is seen as too cumbersome or constraining to be used
effectively.
Alternatives to XML Schema include
OASIS-Open's
RelaxNG (http://www.oasis-open.org/committees/relax-ng/)
and Rick Jelliffe's
Schematron
(http://www.ascc.net/xml/resource/schematron/schematron.html).
Like XML Schema, these specifications detail XML-based languages used
to describe other XML-based languages and let a program that knows
how to speak that schema use it to validate other XML documents. We
find Schematron particularly interesting because it has had a Perl
module attached to it for a while (in the form of Kip
Hampton's XML::Schematron
family).
Schematron is especially interesting to many Perl and XML hackers
because it builds on existing popular XML technologies that already
have venerable Perl implementations. Schematron defines a very simple
language with which you list and group together assertions of what
things should look like based on XPath expressions. Instead of a
forward-looking grammar that must list and define everything that can
possibly appear in the document, you can choose to validate a
fraction of it. You can also choose to have elements and attributes
validate based on conditions involving anything anywhere else in the
document (wherever an XPath expression can reach). In practice, a
Schematron document looks and feels like an XSLT stylesheet, and with
good reason: it's intended to be fully implementable
by way of XSLT. In fact, two of the
XML::Schematron Perl modules work by first transforming
the user-specified schema document into an XSLT stylesheet, which they then
simply pass through an XSLT processor.
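To make the idea of XPath-based assertions concrete, here is a minimal sketch in plain Perl. It does not use XML::Schematron or the Schematron language itself; it only imitates the approach with the XML::XPath module. The sample document, the rule list, and the check_rules function are all hypothetical, and each test is simplified to "this path must select at least one node relative to the context node."

```perl
use strict;
use warnings;
use XML::XPath;

# Hypothetical sample document; element and attribute names are made up.
my $xml = <<'XML';
<orders>
  <order id="1" customer="c7"><item>rope</item></order>
  <order id="2"><item>helmet</item></order>
  <order id="3" customer="c9"/>
</orders>
XML

# Each rule pairs a context path with a test path and a message.
# Simplification: a test passes when it selects at least one node
# relative to the context node. Real Schematron tests are arbitrary
# XPath expressions.
my @rules = (
    { context => '//order', test => '@customer',
      message => 'order has no customer attribute' },
    { context => '//order', test => 'item',
      message => 'order contains no items' },
);

# Return one message per context node that fails a rule.
sub check_rules {
    my ($xp, @rules) = @_;
    my @failures;
    for my $rule (@rules) {
        for my $node ($xp->findnodes($rule->{context})) {
            my @hits = $xp->findnodes($rule->{test}, $node);
            next if @hits;
            my $id = '' . $xp->findvalue('@id', $node);
            push @failures, "order $id: $rule->{message}";
        }
    }
    return @failures;
}

my $xp       = XML::XPath->new(xml => $xml);
my @failures = check_rules($xp, @rules);
print "$_\n" for @failures;
```

Note that the rules mention only the parts of the document they care about; anything else in the file passes through unchecked, which is the partial validation described above.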
Schematron lacks any kind of built-in data typing, so you
can't, for example, do a one-word check to insist
that an attribute conforms to the W3C date format. You can, however,
have your Perl program take a separate step using any method
you'd like (perhaps through the
XML::XPath module) to comb through date attributes
and run a good old Perl regular expression on them. Also note that no
schema language will ever provide a way to query an
element's content against a database, or perform any
other action outside the realm of the document. This is where mixing
Perl and schemas can come in very handy.
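As a rough sketch of that separate step, the following program uses XML::XPath to collect every date attribute in a document and tests each value with a regular expression. The sample document, the looks_like_w3c_date and bad_dates functions, and the decision to treat "W3C date format" as the complete-date form YYYY-MM-DD are all assumptions made for illustration.

```perl
use strict;
use warnings;
use XML::XPath;

# Hypothetical sample; the <trip> element and its attributes are made up.
my $xml = <<'XML';
<log>
  <trip cave="Mammoth" date="2001-07-14"/>
  <trip cave="Lechuguilla" date="14 July 2001"/>
  <trip cave="Carlsbad" date="2001-13-02"/>
</log>
XML

# Assumption: a valid date is the complete-date form YYYY-MM-DD.
# This checks the pattern and the month/day ranges only; it does not
# know about month lengths or leap years.
sub looks_like_w3c_date {
    my ($value) = @_;
    return 0 unless $value =~ /\A(\d{4})-(\d{2})-(\d{2})\z/;
    my ($month, $day) = ($2, $3);
    return ($month >= 1 && $month <= 12 && $day >= 1 && $day <= 31) ? 1 : 0;
}

# Return the values of all date attributes that fail the check,
# in document order.
sub bad_dates {
    my ($xp) = @_;
    my @bad;
    for my $attr ($xp->findnodes('//@date')) {
        my $value = $attr->getNodeValue;
        push @bad, $value unless looks_like_w3c_date($value);
    }
    return @bad;
}

my $xp = XML::XPath->new(xml => $xml);
print "bad date: $_\n" for bad_dates($xp);
```

The same loop is a natural place for checks no schema can express, such as looking a value up in a database before accepting it.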