2.4. Producer-Side Validation
All uses of SAX2 parsers will involve extending and
customizing the basic scenario we saw earlier.
Our next example illustrates two basic configuration mechanisms:
error handling options, which let you apply the
appropriate policy when you see errors, and
parser configuration through feature flags, which
let you control some details of how the parser works.
(Some event handlers are managed with a configuration mechanism
that is quite similar to the feature flag mechanism.)
The example also shows how SAX2 parsers expose the core XML
notion of DTD-based validation.
Validity and XML
Validation is particularly important when you
are interchanging documents that have been wholly
or partially authored by hand, but it can also be helpful
when working with XML that's generated by custom code.
When you validate an XML document, you ensure that it
meets certain rules needed to process it -- such as
requiring a
<title> element as the first child of
every <chapter> element or prohibiting
dangling internal cross-references.
Validation is done at several levels in most
applications. Lower levels tend to use
rule-based logic,
such as the DTD validation that's defined by XML 1.0.
The various types of XML schema provide different
kinds of rule-based sanity checks, which are
usually done before applications see the data.
(W3C's schemas also extract additional information items,
beyond the
XML data model of elements, attributes, and text.
This information is called the "Post-Schema-Validation
Infoset" or PSVI.)
Higher-level validation processes tend to involve
richer notions of data validity and tend to be expressed
as procedural logic.
For example, "business logic" often involves
ad hoc relationships, policies, and heuristics; it relies
on information not normally expressible by
DTD or schema-style rules.
Such logic is often captured in application-level methods.
As a rule, no single data validation technology is
sufficient for all purposes.
Your development process should try to ensure
that you create only valid documents; you will likely
send XML to applications that don't handle invalid
data very well.
Safe operational practice involves validating all documents
received from other parties and accepting the small
costs involved.
(Use local copies of DTDs, or schemas, to avoid depending
on remote files that might disappear. Techniques to achieve this
are discussed in Section 3.4, "The EntityResolver Interface" in Chapter 3, "Producing SAX2 Events".)
The cost of rule-based validation is usually smaller than
routine system load variations for real applications; even in
parsing speed benchmarks it's rarely high.
It's usually worth the cost since it can prevent someone
else's data from accidentally breaking your software.
Validation against a good DTD (or schema) provides a
useful base level of input data checking, but it will
rarely be sufficient.
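The advice about local DTD copies can be sketched with an entity resolver. This is only an illustration under stated assumptions: the system identifier http://example.com/dtds/chapter.dtd, the DTD text, and the class name LocalDtdDemo are all invented here, and the parser is obtained through JAXP's SAXParserFactory rather than through any mechanism this chapter prescribes. The resolver hands the parser an in-memory copy of the DTD, so no remote file is fetched for that identifier.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class LocalDtdDemo {
    // Hypothetical identifier and DTD text, invented for this sketch.
    static final String DTD_ID = "http://example.com/dtds/chapter.dtd";
    static final String LOCAL_DTD =
          "<!ELEMENT chapter (title, para*)>\n"
        + "<!ELEMENT title (#PCDATA)>\n"
        + "<!ELEMENT para (#PCDATA)>\n";

    /** Returns null if the document is valid, else the first error message. */
    static String validate(String xml) {
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                if (DTD_ID.equals(systemId)) {
                    // Serve the cached copy instead of fetching the remote file.
                    InputSource local = new InputSource(new StringReader(LOCAL_DTD));
                    local.setSystemId(systemId);
                    return local;
                }
                return null; // anything else gets the parser's default handling
            }

            @Override
            public void error(SAXParseException e) throws SAXParseException {
                throw e; // treat validity errors as fatal
            }
        };
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature("http://xml.org/sax/features/validation", true);
            reader.setEntityResolver(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (Exception e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String doctype = "<!DOCTYPE chapter SYSTEM \"" + DTD_ID + "\">";
        System.out.println(validate(doctype + "<chapter><title>T</title></chapter>"));
        System.out.println(validate(doctype + "<chapter/>"));
    }
}
```

A production resolver would more likely map identifiers to files on disk, and might refuse unknown identifiers outright rather than returning null.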
You will often tell XML parsers to validate XML as
they produce events.
Because SAX2 provides access to most of the data
in XML documents, including declarations from DTDs,
it also supports performing such validation
on the event consumer side, possibly with
a cached DTD or schema.
(The consumer side is the only place to perform procedural
validation.)
Such consumer-side validation can be important when you're trying
to make your program output meet the constraints
of a particular information interchange agreement; just add
a streaming validation stage to your output processing.
This approach can also be used for DOM revalidation and
similar purposes.
Here, we look at how to validate data that is already in
the form of XML text.
Keep in mind that some important DTD-related
processing does not involve validation.
Documents with DTDs can
use entity substitution for document modularity and text portability,
and can have attributes defaulted and normalized.
Validation with DTDs only involves checking a set of rules.
Disabling DTD validation turns off only the rule checks,
not the processing for entities and attributes.
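To make that distinction concrete, here is a minimal sketch, assuming the parser bundled with the JDK (obtained through JAXP) and a made-up document. The class name DtdWithoutValidation and the sample text are hypothetical. The document is not valid, since it never declares its note element, yet with validation switched off the parser should still supply the defaulted attribute and expand the entity.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class DtdWithoutValidation {
    // Hypothetical sample: a defaulted attribute and an internal entity,
    // but no element declaration, so the document is well formed but invalid.
    static final String DOC =
          "<!DOCTYPE note [\n"
        + "  <!ATTLIST note status CDATA 'draft'>\n"
        + "  <!ENTITY who 'world'>\n"
        + "]>\n"
        + "<note>Hello, &who;!</note>";

    /** Returns "<status attribute>|<text content>" as seen by the handler. */
    static String summarize(String xml) {
        final StringBuilder status = new StringBuilder();
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                status.append(atts.getValue("status"));
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may arrive in several chunks
            }
        };
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature("http://xml.org/sax/features/validation", false);
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return status + "|" + text;
    }

    public static void main(String[] args) {
        System.out.println(summarize(DOC)); // prints: draft|Hello, world!
    }
}
```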
2.4.1. SAX2 Feature Flags
SAX2 exposes many parser behaviors, including DTD
validation, using a "feature flag" mechanism.
These flags are Boolean settings,
which may have values
or be unspecified.
Parsers can have up to four different modes for any feature flag.
For example, with the validation flag
SAX2 implies four kinds of XML parsers:
- Optionally validating parsers
The feature flag is read/write and
can be either true or false.
If it's set to false, few nonfatal errors
will be reported and parsing will be a bit faster
(maybe 5 or 10 percent of the cost of parsing
XML, which is usually negligible to start with).
- Nonvalidating parsers
The feature flag is read-only and
always false. Some nonfatal errors might be
reported (the XML specification demands them in
some cases).
- Always validating parsers
The feature flag is read-only and
always true. Validity errors are always
reported as nonfatal.
(By default, such errors are ignored; see Section 2.4.2, "Handling Validity Errors" later in this chapter.)
- Unknown validation behavior
The feature flag is not recognized,
so its value can't be determined.
(This mode is uncommon for the SAX2 validation flag,
but you'll see it with other feature flags.)
Later in this chapter, we look at the feature
flags used to characterize namespace processing.
Those flags are not optional, so fewer potential
parser modes are possible.
All the standardized feature flags are detailed
in Section 3.3.2, "XMLReader Feature Flags" in Chapter 3, "Producing SAX2 Events".
In SAX, URIs identify feature flags.
These are used purely as unique identifiers.
This is the same approach used in XML namespaces: don't
use these URIs to retrieve data, even if they
do look like URLs you could type into a browser. The URI
http://xml.org/sax/features/validation
identifies the flag controlling validation.
URIs = URLs + URNs
The use of URIs in XML namespaces has been
confusing, and since SAX2 also uses URIs
to identify parser feature flags and properties, the same
sort of confusion can show up.
Think of URIs as names: you can talk about "Fred" even if
he's not there, or about "Godot" even if he may not exist,
and "the third house on the left" probably makes sense to
someone standing at your side.
Classically, a Uniform Resource
Identifier (URI)
is either a Uniform Resource Locator (URL) or a
Uniform Resource Name (URN).
Both types of URIs are represented as strings.
You're used to seeing URLs in web browsers; they serve
as detailed addresses. They often look like
http://www.example.com/
but they may use other URI schemes -- for example, they may use https:,
ftp: and file:. The scheme indicates the way to access the resource.
URNs use the urn: scheme. You probably have not seen many URNs;
one example is urn:uuid:221ffe10-ae3c-11d1-b66c-00805f8a2676.
URN namespaces (like uuid in this example) describe what the resource is,
more than how to access it.
Filenames are never URIs,
but you can convert a filename into a URL (hence URI)
that works on systems where the original filename was legal.
Just to be confusing, there are also "relative URIs,"
which often look like POSIX-style filenames.
Like filenames, relative URIs should never be handed directly
to a SAX parser or be used as namespace identifiers.
With XML namespaces and SAX2, the term URI is used
to emphasize that the string is being used
as a pure identifier: it's more like a URN than a URL, even when
the URI is syntactically a URL.
It's explicitly irrelevant whether any resource is actually
associated with the URI.
Don't assume you can fetch resources using those URIs.
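The filename-to-URL conversion mentioned in this sidebar can be done in Java with java.io.File. The following is a small sketch; the class name, method name, and sample path are hypothetical. The resulting string is an absolute file: URI, which is suitable for use as a system identifier.

```java
import java.io.File;

final class FileNameToUri {
    /** Converts a (possibly relative) filename into an absolute file: URI string. */
    static String toSystemId(String fileName) {
        // File.toURI() resolves against the current directory and
        // escapes characters, such as spaces, that are not legal in URIs.
        return new File(fileName).toURI().toString();
    }

    public static void main(String[] args) {
        // The exact output depends on the platform and current directory.
        System.out.println(toSystemId("docs/chapter 2.xml"));
    }
}
```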
To check how a given XML parser handles validation, use
code similar to Example 2-5.
Code for any other kind of parser feature will look much the same,
as long as you
use the correct ID for the feature flag; you'll see
the same exception types working in the same way.
(The same is true for parser "properties," which you'll
see in Section 3.3.1, "XMLReader Properties" in Chapter 3, "Producing SAX2 Events".)
Example 2-5. Checking for validation support
XMLReader producer;
String uri = "http://xml.org/sax/features/validation";

// ... get the parser

// Try getting and setting the flag
try {
    System.out.println ("Initial validation setting: "
        + producer.getFeature (uri));

    // if we get here, validation behavior is known
    producer.setFeature (uri, true);

    // if we get here, the parser either validates by
    // default or is optionally validating
} catch (SAXNotSupportedException e) {
    // value not supported; parser is nonvalidating
    System.out.println ("Can't enable validation: "
        + e.getMessage ());
    System.exit (1);
} catch (SAXNotRecognizedException e) {
    // feature not understood; parser has weak SAX2 support.
    // maybe it's a SAX1 parser inside a ParserAdapter
    System.out.println ("Doesn't understand validation: "
        + e.getMessage ());
    System.exit (1);
}
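Because the same exception types apply to every feature flag, the check generalizes to a small helper that reports which of the four modes a parser is in for any flag. This is a sketch, not part of SAX2 itself: the FeatureProbe class, its Mode enum, and the use of JAXP's SAXParserFactory to obtain a parser are assumptions made for illustration. It probes by briefly flipping the flag and restoring it, so it should only be run when no parse is in progress.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

final class FeatureProbe {
    /** The four modes a parser can be in for a given feature flag. */
    enum Mode { READ_WRITE, ALWAYS_TRUE, ALWAYS_FALSE, UNKNOWN }

    static Mode probe(XMLReader reader, String featureUri) {
        boolean current;
        try {
            current = reader.getFeature(featureUri);
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return Mode.UNKNOWN; // the value can't be determined
        }
        try {
            reader.setFeature(featureUri, !current); // try the other value
            reader.setFeature(featureUri, current);  // then put it back
            return Mode.READ_WRITE;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            // The flag is readable but can't be changed.
            return current ? Mode.ALWAYS_TRUE : Mode.ALWAYS_FALSE;
        }
    }

    /** Convenience wrapper (hypothetical) that hides JAXP's checked exceptions. */
    static XMLReader newReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("No SAX parser available", e);
        }
    }

    public static void main(String[] args) {
        XMLReader reader = newReader();
        String[] flags = {
            "http://xml.org/sax/features/validation",
            "http://xml.org/sax/features/namespaces",
            "http://example.com/no-such-feature" // deliberately bogus
        };
        for (String flag : flags) {
            System.out.println(flag + " -> " + probe(reader, flag));
        }
    }
}
```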
As a rule, programs will probably set the validation
flag to true only when they really need reports of
validity errors.
(Why? As we'll see in a moment, it's natural to ignore
reports of validity errors when they're not important,
so it doesn't much matter if you validate when you don't
need to.)
The skeleton program in Example 2-1 really just needs a
setFeature() call and a small update
to the diagnostic message, to be sure it's always
validating.
(The diagnostics could be more precise using some more-specialized
exceptions that we haven't discussed yet.)
// Get an instance of the default XML parser class
try {
    producer = XMLReaderFactory.createXMLReader ();
    producer.setFeature (
        "http://xml.org/sax/features/validation",
        true);
} catch (SAXException e) {
    System.err.println (
        "Can't get validating parser, check configuration: "
        + e.getMessage ());
    return;
}
The validation feature flag is probably the most
widely used, with the possible exception of the flags
controlling namespace handling.
Most parsers leave validation off by default
to save some minor parsing overhead.
2.4.2. Handling Validity Errors
If you modify the skeleton program to set the
parser's validation flag and then run it on a well-formed
but invalid document (perhaps one without a DTD), you will
probably be surprised to discover that it doesn't seem to
report any errors. That's exactly what should happen since
it's the default behavior specified by SAX.
To make validity errors cause anything interesting to happen,
you have to change how they're handled.
If you don't change this handling, you won't
be able to tell a validating parser apart from a
nonvalidating one!
The simplest way to change the handling of validity
errors is to make them work just like well-formedness errors:
by aborting the parse.
This uses the ErrorHandler interface
that we look at later in this chapter, in Section 2.5.2, "ErrorHandler Interface", but for now it's simpler to
focus on one method.
In terms of the skeleton program shown earlier,
such a change can be an update to just one line,
using an anonymous inner class to make the
code look simple. (Of course, avoid using anonymous
classes for anything complex; they can make code
hard to maintain.)
// Get a consumer for all the parser events
consumer = new DefaultHandler () {
    public void error (SAXParseException e)
    throws SAXException
        { throw e; }
};
XML parsers call ErrorHandler.error()
whenever they find a validity error, or when they see certain
other nonfatal errors. In this
case, our custom handler adopts a policy that whenever it
sees such an error, it will abort the parse
by throwing the exception reported to it.
Later in this chapter we look at some alternative policies.
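The difference between the default policy and the abort-on-error policy can be seen side by side in a short, self-contained sketch. The class name, helper methods, and sample document are invented for illustration, and the parser is assumed to come from JAXP's SAXParserFactory. The document declares that every chapter must contain a title and then omits it, so it is well formed but invalid.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class ValidityPolicyDemo {
    // Hypothetical sample: <chapter> requires a <title>, which is missing.
    static final String INVALID =
          "<!DOCTYPE chapter [\n"
        + "  <!ELEMENT chapter (title)>\n"
        + "  <!ELEMENT title (#PCDATA)>\n"
        + "]>\n"
        + "<chapter/>";

    /** Default policy: DefaultHandler.error() does nothing. */
    static DefaultHandler lenient() {
        return new DefaultHandler();
    }

    /** Strict policy: rethrow the reported exception to abort the parse. */
    static DefaultHandler strict() {
        return new DefaultHandler() {
            @Override
            public void error(SAXParseException e) throws SAXException {
                throw e;
            }
        };
    }

    /** Returns true if the validating parse ran to completion. */
    static boolean parses(String xml, DefaultHandler handler) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature("http://xml.org/sax/features/validation", true);
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            return false; // the handler aborted the parse
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("lenient completes: " + parses(INVALID, lenient()));
        System.out.println("strict completes:  " + parses(INVALID, strict()));
    }
}
```

With the lenient handler the validity error is reported to error() and silently dropped, so the parse finishes; with the strict handler the same report ends the parse.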
When your callback detects serious application-level
errors, you can throw a SAXException
from any SAX event handler callback to abort parsing.
That doesn't have to be done only from an ErrorHandler. For example, when input data is valid XML but doesn't meet essential semantic requirements of the application, report it using some kind of SAXException. If your code only knows how to process shipping invoices, then greeting cards should be rejected immediately.
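An application-level rejection of this kind might look like the following sketch. The handler name, the invoice root element, and the check() helper are hypothetical, chosen only to mirror the shipping-invoice scenario; a real application would apply whatever semantic tests it needs. The handler throws a SAXException from startElement() as soon as it sees a root element it doesn't know how to process.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class InvoiceOnlyHandler extends DefaultHandler {
    private boolean sawRoot = false;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts)
            throws SAXException {
        if (!sawRoot) {
            sawRoot = true;
            // Hypothetical rule: this application only understands invoices.
            if (!"invoice".equals(qName)) {
                throw new SAXException(
                    "Expected <invoice> but found <" + qName + ">");
            }
        }
        // ... normal invoice processing would go here
    }

    /** Returns null if the document was accepted, else the rejection message. */
    static String check(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new InvoiceOnlyHandler());
            reader.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(check("<invoice><item/></invoice>"));
        System.out.println(check("<greeting>Happy birthday!</greeting>"));
    }
}
```

The handler keeps per-document state, so a fresh instance is created for each parse.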
Copyright © 2002 O'Reilly & Associates. All rights reserved.