3.3. Configuring XMLReader Behavior
A configuration mechanism was one of the key features
added in the SAX2 release. Parsers can support
extensible sets of named Boolean feature
flags and property objects.
These function in similar ways, including using URIs to
identify any number of features and properties.
The exception model presented in Chapter 2, "Introducing SAX2" (Section 2.4.1, "SAX2 Feature Flags") is used
to distinguish the three basic types of feature or property:
the current value may be read-only, read/write, or undefined.
Some flags and properties may have rules
about when they can be changed (typically not while parsing)
or read.
Applications access property objects and feature flags
through get*() and set*()
methods and use URIs to identify the characteristic of interest.
Since SAX does not provide a way to enumerate such URIs
as supported by a parser, you will need to rely on parser
documentation, or the tables in this section, to identify the
legal identifiers. (Or consult the source code, if you have access to it.)
If you happen to be defining new handlers or features using
the SAX2 framework, you don't have to ask for permission
to define new property or feature flag IDs. Since they are
identified using URIs, just start your ID with a base URI that
you control. (Only the SAX maintainers would start with the
http://xml.org/sax/
URI, for example.) Typically, it will be easiest to make up some HTTP URL
based on a fully qualified domain name that you control. As
with namespace URIs, these are used purely as identifiers rather than
as locations from which data would be retrieved. (The "I" in URI stands for "identifier.")
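A minimal sketch of that naming convention follows. The example.com domain and both identifiers are hypothetical placeholders, not IDs that any parser defines; substitute a domain you actually control.

```java
// Hypothetical identifiers: example.com and both names are placeholders.
final class MyParserIds {
    /** Base URI under our control; used only as an identifier, never fetched. */
    static final String BASE = "http://example.com/sax/";

    /** A made-up Boolean feature flag ID. */
    static final String TRACE_FEATURE = BASE + "features/trace-events";

    /** A made-up property ID whose value would be some handler object. */
    static final String AUDIT_HANDLER_PROPERTY = BASE + "properties/audit-handler";

    private MyParserIds() { }  // constants only; not instantiable

    public static void main(String[] args) {
        System.out.println(TRACE_FEATURE);
        System.out.println(AUDIT_HANDLER_PROPERTY);
    }
}
```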
3.3.1. XMLReader Properties
SAX2 defines two XMLReader calls for accessing
named property objects. One of the most common uses for such
objects is to install non-core event handlers. Accessing
properties is like accessing feature flags, except that the
values associated with these names are objects rather than Booleans:
XMLReader producer = ...;
String uri = ...;
Object value = ...;
// Try getting and setting the property
try {
    System.out.println ("Initial property setting: "
        + producer.getProperty (uri));
    // if we get here, the property is supported
    producer.setProperty (uri, value);
    // if we get here, the parser set the property
} catch (SAXNotSupportedException e) {
    // bad value for property ... maybe wrong type, or parser state
    System.out.println ("Can't set property: "
        + e.getMessage ());
    System.exit (1);
} catch (SAXNotRecognizedException e) {
    // property not supported by this parser
    System.out.println ("Doesn't understand property: "
        + e.getMessage ());
    System.exit (1);
}
You'll notice the URIs for these standard
properties happen to have a common prefix. This means that
you can declare the prefix (http://xml.org/sax/properties/) as a constant
string and construct the identifiers by string catenation.
Here are the standard properties:
- http://xml.org/sax/properties/declaration-handler
This property holds an implementation of
org.xml.sax.ext.DeclHandler,
used for reporting the DTD declarations that aren't
reported through
org.xml.sax.DTDHandler callbacks or, for
the root element name declaration,
org.xml.sax.ext.LexicalHandler
callbacks. This handler is presented in Section 4.3.1, "The DeclHandler Interface".
Ælfred, Crimson, and Xerces support this property.
In fact, all JAXP-compliant processors must do so.
- http://xml.org/sax/properties/dom-node
Only specialized parsers will support
this property:
parsers that traverse DOM document nodes to
produce streams of corresponding SAX events. (Typical
SAX2 parsers parse XML text instead of DOM content.) When read, this property returns the DOM node corresponding
to the current SAX2 callback.
The property can only be written before a parse, to
specify that the DOM node
beginning and ending the SAX event stream need not be an
org.w3c.dom.Document.
This type of parser is presented later in this chapter,
in Section 3.5.1, "DOM-to-SAX Event Production (and DOM4J, JDOM)".
One example of such a parser is
gnu.xml.util.DomParser,
which is currently packaged along with the
Ælfred parser. At this time, neither Crimson nor Xerces includes
such functionality.
- http://xml.org/sax/properties/lexical-handler
This property holds an implementation of
org.xml.sax.ext.LexicalHandler,
used for reporting various events mostly (but
not exclusively) relating to details of XML text that
have no semantic or structural meaning, such as comments. This handler is presented in Chapter 4, "Consuming SAX2 Events" in Section 4.2, "The LexicalHandler Interface".
Ælfred, Crimson, and Xerces support this property.
In fact, all JAXP-compliant processors must do so.
- http://xml.org/sax/properties/xml-string
This property returns a literal string of
characters associated with the current parser callback event. Exactly which characters are returned isn't specified by SAX2.
An example
would be returning all the characters in the start tag
of an element, including unexpanded entity and character
references as well as excess whitespace and the exact
type of quote characters (single, double) used to delimit
attribute values.
(This feature is intended to be of use when constructing
certain kinds of XML editors, or DTD analyzers, that are
willing to re-parse this data.)
No widely available open source SAX2 parser
currently supports this property.
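As a rough sketch of how the property prefix and the lexical-handler property fit together, the following class builds the identifiers by concatenation and installs a handler that collects comments. It assumes a JAXP parser is available (here, whatever SAXParserFactory returns by default) and uses org.xml.sax.ext.DefaultHandler2 as a convenient LexicalHandler implementation; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class PropertyDemo {
    // Declare the common prefix once and build each ID from it.
    static final String PROPERTY_PREFIX = "http://xml.org/sax/properties/";
    static final String LEXICAL_HANDLER = PROPERTY_PREFIX + "lexical-handler";
    static final String DECLARATION_HANDLER = PROPERTY_PREFIX + "declaration-handler";

    /** Parses the XML text and returns the body of every comment reported. */
    static List<String> collectComments(String xml) throws Exception {
        final List<String> comments = new ArrayList<>();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        // DefaultHandler2 implements ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                comments.add(new String(ch, start, length));
            }
        };
        reader.setContentHandler(handler);
        // The handler is an object, so it is installed as a property, not a feature.
        reader.setProperty(LEXICAL_HANDLER, handler);

        reader.parse(new InputSource(new StringReader(xml)));
        return comments;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collectComments("<doc><!-- hello --><a/></doc>"));
    }
}
```

A DeclHandler would be installed the same way, using the DECLARATION_HANDLER identifier.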
Applications may find it useful to define their own
types of handler interfaces, assembling sequences of SAX event
"atoms" into higher-level event "molecules" that incorporate
essential application-level semantics (and probably
some procedural validation).
This is the same kind of process model used by W3C's XML schema
processing model: the Post-Schema-Validation Infoset (PSVI)
additions incorporate semantics suited to
processing with that kind of schema. Most applications need
to associate even more semantics with data than are easily captured by such simple
rules (including DTDs and all types of schema).
Those semantics would likely not be understood by any common
XMLReader, but other kinds of SAX
processing components
can help manage such application-level handlers.
You can see an example of this technique in
Example 6-3.
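To make the "atoms into molecules" idea concrete, here is a small sketch that is independent of Example 6-3. It assumes a hypothetical document format in which each item element carries a name attribute and an integer quantity as its text; the ItemHandler interface and the ItemAssembler class are invented for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Assembles startElement/characters/endElement "atoms" into item() "molecules". */
class ItemAssembler extends DefaultHandler {
    /** Hypothetical application-level callback: one call per complete item. */
    interface ItemHandler {
        void item(String name, int quantity);
    }

    private final ItemHandler consumer;
    private final StringBuilder text = new StringBuilder();
    private boolean inItem;
    private String name;

    ItemAssembler(ItemHandler consumer) { this.consumer = consumer; }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("item".equals(qName)) {
            inItem = true;
            name = atts.getValue("name");
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so accumulate it.
        if (inItem) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (!"item".equals(qName)) return;
        inItem = false;
        try {
            // A bit of procedural validation before the higher-level event fires.
            consumer.item(name, Integer.parseInt(text.toString().trim()));
        } catch (NumberFormatException e) {
            throw new SAXException("bad quantity for item " + name, e);
        }
    }

    /** Convenience driver: returns "name=quantity" for each item seen. */
    static List<String> parse(String xml) throws Exception {
        List<String> seen = new ArrayList<>();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new ItemAssembler((n, q) -> seen.add(n + "=" + q)));
        reader.parse(new InputSource(new StringReader(xml)));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse("<order><item name='nut'>3</item></order>"));
    }
}
```

The application code behind ItemHandler never sees individual SAX callbacks; it only receives complete, already-checked items.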
3.3.2. XMLReader Feature Flags
The previous chapter showed how to access feature
flags from SAX parsers and used the standard validation flag
as the primary example.
Accessing feature flags follows the same model as accessing properties, except that the
values are boolean, not Object.
There are a handful of standard SAX2 feature flags,
which are all you normally need.
The namespace for features is different from the namespace
for properties. You can't set a property to a
java.lang.Boolean value and expect
to have the same effect as setting the feature flag that
happens to use the same identifier.
As with properties, the URIs for these standard
feature flags happen to have a common prefix: http://xml.org/sax/features/.
It's good programming practice to declare the prefix as a
constant and construct these feature identifiers by string
catenation, helping reduce errors.
Also, remember that flags aren't necessarily either
settable (read/write)[17] or readable (supported); some
parsers won't recognize all these flags, and in some cases
these flags expose parser behaviors that don't change.
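One way to find out how a particular parser treats a flag is to probe it. The sketch below is only an illustration: the FeatureProbe class and its result strings are invented, and it assumes the reader is not currently parsing. It reads the flag, tries to flip it, and restores the original value.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

class FeatureProbe {
    static final String FEATURE_PREFIX = "http://xml.org/sax/features/";

    /** Classifies a flag as unrecognized, unreadable, read-only, or read/write. */
    static String classify(XMLReader reader, String uri) throws SAXException {
        boolean current;
        try {
            current = reader.getFeature(uri);
        } catch (SAXNotRecognizedException e) {
            return "unrecognized";
        } catch (SAXNotSupportedException e) {
            return "recognized, but not readable right now";
        }
        try {
            reader.setFeature(uri, !current);      // try to flip the flag
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return "read-only (" + current + ")";
        }
        reader.setFeature(uri, current);           // restore the original value
        return "read/write (" + current + ")";
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String[] names = {
            "namespaces", "namespace-prefixes", "validation", "string-interning"
        };
        for (String name : names) {
            System.out.println(name + ": " + classify(reader, FEATURE_PREFIX + name));
        }
    }
}
```

The results depend on the parser and its version, so treat the output as documentation of one installation rather than of SAX2 itself.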
The standard flags are as follows:
- http://xml.org/sax/features/external-general-entities
The default value for this flag is
parser-specific.
When the parser is validating, and in most other cases,
the flag is true, indicating that the
parser reads all external entities used outside the DTD. When the flag is false, the XML parser won't expand references
to external general entities, so applications won't see
the entire body of documents using such entities.
This value can't be changed during parsing.
Crimson and Xerces only support true
for this flag.
(For such parsers, you can get most of the effect of setting this flag
to false by using an EntityResolver
that returns zero-length entities after the first
startElement() event.) Ælfred supports changing the value of this flag.
- http://xml.org/sax/features/external-parameter-entities
The default value for this flag is
parser-specific. When the parser is validating, and in most other cases,
the flag is true, indicating the DTD
will be completely processed.
When the flag is false, the XML parser will skip any external DTD subset,
as well as named external parameter entities, so it
won't necessarily read the entire DTD for a document.
This value can't be changed during parsing.
Skipping these entities means attributes
declared in them will not be defaulted or normalized as
expected, and their types won't be known.
As a result, default namespace declarations may get dropped.
Parts of the internal subset after a reference to a skipped
external parameter entity will be ignored.
It also means some general entities might not be declared,
making it impossible to correctly distinguish whether
references to undefined entities are well-formedness
errors.
Normally, you are better off providing an entity
resolver that accesses locally cached copies of your DTD
components, or not using DTDs, rather than disabling
processing of external parameter entities. But don't assume
all the XML you work with will have these DTD entities processed; the XML processors in some web browsers
will not read these entities by default.
Xerces and Crimson only support true
for this flag.
(For such parsers, you can get an effect similar to setting
this to false by using an EntityResolver
that returns zero-length entities before the first
startElement() event. The parser won't correctly ignore declarations found later in the DTD.)
Ælfred supports changing the value of this flag.
- http://xml.org/sax/features/is-standalone
This feature flag derives its value
from the document being
parsed, so it is read-only and only available after the
first part of the document has been parsed.
When the flag is true, the document has been declared to be standalone.
If that declaration is correct, then
all external entities may be safely ignored.
This feature is part of XML 1.0 and is intended to reduce
the cost of parsing some documents.
This flag should be part of an upcoming SAX
extensions release.
- http://xml.org/sax/features/lexical-handler/parameter-entities
The default value for this flag is
parser-specific
and is implicitly false if the parser doesn't support
the LexicalHandler
through a parser property.
When the flag is true, the parser will report the beginning and end
of parameter entities through LexicalHandler calls.
(Skipped parameter entities are always reported,
through the appropriate ContentHandler call.)
Parameter entities are distinguished from general entities
because the first character of their entity name will
be a percent sign (%).
The value can't be changed during parsing.
Currently, only the Ælfred parser reports
parameter entities.
- http://xml.org/sax/features/namespaces
This flag defaults to
true in XML
parsers, which indicates the parser performs
namespace processing, reporting xmlns
attributes by
startPrefixMapping() and
endPrefixMapping() calls
and providing namespace URIs for each
element or attribute. Otherwise no such processing
is done at the parser level.
This can't be changed during parsing.
You will leave this flag at its default setting
unless your XML documents aren't guaranteed to conform
to the XML Namespaces specification.
Setting this to false usually gives
some degree of parsing speed improvement, although
it will likely not provide a significant impact on
overall application performance.
If you disable namespaces, make sure you first enable the
namespace-prefixes feature.
This is supported by all SAX2 XML parsers. Ælfred, Crimson, and Xerces support changing the value of this flag.
- http://xml.org/sax/features/namespace-prefixes
This flag defaults to
false in XML
parsers, indicating the parser
will not present xmlns* attributes in
its startElement() callbacks.
Unless the flag is true, parsers won't portably present the
qualified names (which include the prefix) used in an
XML document for elements or attributes.
The value can't be changed during parsing.
If you want to see the namespace prefixes for any reason,
including for generating output without further postprocessing
or for performing layered DTD validation, make sure this flag
is set. Also make sure this flag is set if you completely
disable namespace processing (with the
namespaces feature flag), because
otherwise the behavior of a SAX2 parser is undefined.
This is supported by all SAX2 parsers. Ælfred, Crimson, and Xerces support changing the value of this flag.
- http://xml.org/sax/features/string-interning
The default value for this flag is
parser-specific.
When true, this indicates that all XML name strings
(except those inside attribute values) and namespace URIs
returned by this parser will have been interned using
String.intern(). Some kind of interning is almost always done to improve the performance of parsers, and this flag exposes this work for the benefit of applications. This value can't be changed during parsing.
When applications know interning has been done,
they know they can rely on fast, identity-based tests
for string equality
(== or !=)
rather than the more expensive
String.equals() method. Using equality testing for strings will always work, but it can be much slower than identity testing. Java automatically interns all string constants. Lots of startElement() processing needs to match element and attribute name strings (as sketched in Example 2-8), so this kind of optimization can often be a win.
Ælfred interns all strings.
Some older versions of Crimson don't recognize this flag,
but all versions should correctly intern those strings.
Xerces reports that it does not intern these strings.
- http://xml.org/sax/features/validation
The default value for this flag is
parser-specific; in most cases it is false.
When the flag is true, the parser is performing XML validation
(with a DTD, unless you've requested otherwise).
When the flag is false, the parser isn't validating.
The value can't be changed while parsing.
Ælfred, when packaged with its optional validator,
Crimson, and Xerces support both settings.
Ælfred also includes a nonvalidating parser, which supports only false for this flag.
A few additional standard extension features will
likely be defined, providing even more complete Infoset
support from SAX2 XML parsers.
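To see what the namespaces and namespace-prefixes flags described above change in practice, the following sketch lists the attributes reported for a root element with namespace-prefixes off and on. The NamespaceFlags class is hypothetical. Note one assumption specific to this sketch: it gets its XMLReader through JAXP, whose factories turn namespace processing off unless asked, so the code enables it explicitly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class NamespaceFlags {
    static final String FEATURE_PREFIX = "http://xml.org/sax/features/";

    /** Returns the qualified names of the attributes reported for the root element. */
    static List<String> rootAttributes(String xml, boolean namespacePrefixes) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);   // JAXP factories default to no namespace processing
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setFeature(FEATURE_PREFIX + "namespace-prefixes", namespacePrefixes);

        final List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            private boolean seenRoot;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (seenRoot) return;      // only record the first element
                seenRoot = true;
                for (int i = 0; i < atts.getLength(); i++) {
                    names.add(atts.getQName(i));
                }
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<p:doc xmlns:p='urn:example' id='x'/>";
        System.out.println("namespace-prefixes off: " + rootAttributes(xml, false));
        System.out.println("namespace-prefixes on:  " + rootAttributes(xml, true));
    }
}
```

With the flag off, the xmlns:p declaration is reported only through startPrefixMapping(); with it on, the declaration also shows up as an attribute.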
Of the widely available parsers, only Xerces has
nonstandard feature flags. (The Xerces distribution
includes full documentation for those flags.) As a rule,
avoid most of these, because they are parser-specific and even version-specific. Some are used to disable warnings about extra definitions that aren't errors. (Most parsers don't bother reporting such nonerrors; Xerces reports them by default.) Others promote noncompliant XML validation semantics. Here are a few flags that you may want to use.
- http://apache.org/xml/features/validation/schema
This tells the parser to validate with W3C-style schemas.
The document needs to identify a schema,
and the parser must have namespaces
and validation enabled.
(Defaults to false.)
W3C XML schema validation does not need to be built
into XML parsers. In fact, most currently available
schema validators are layered.
- http://apache.org/xml/features/validation/schema-full-checking
This flag controls whether W3C schema
validation involves all the specified tests.
By default, some of the more expensive checks are
not performed; Xerces is not "fully conforming" by default.
- http://apache.org/xml/features/allow-java-encodings
This flag defaults to false,
limiting the encodings that
the parser accepts to a handful. When the flag is set to true, more encoding names are supported. Most other SAX2 parsers effectively have true as their default. A few of those additional encoding names are Java-specific (such as "UTF8"); most of them are standard encoding names, either the preferred version or recognized alternatives.
- http://apache.org/xml/features/continue-after-fatal-error
When set, this flag permits Xerces to continue
parsing after it invokes
ErrorHandler.fatalError() to
report a nonrecoverable error.
If the error handler doesn't abort parsing by throwing
an exception, Xerces will continue.
The XML specification requires that no more event data be
reported after fatal errors, but it allows additional errors
to be reported.
(Of course, depending on the initial error,
many of the subsequent reports might be nonsense.)
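Because these flags are parser-specific, code that uses them should cope with parsers that reject them. The sketch below is one hedged way to request W3C schema validation: it first enables the two standard flags that the schema flag depends on, then tries the Xerces flag and reports whether the parser accepted it. The XercesSchemaSetup class is invented for this example, and whether a given parser recognizes the Xerces identifier is an assumption to verify against its documentation.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

class XercesSchemaSetup {
    static final String SAX_FEATURES = "http://xml.org/sax/features/";
    static final String XERCES_SCHEMA =
        "http://apache.org/xml/features/validation/schema";

    /**
     * Enables namespaces and validation, then tries the Xerces schema flag.
     * Returns true if the parser accepted the flag; if it returns false,
     * the parser is left validating with DTDs only.
     */
    static boolean enableSchemaValidation(XMLReader reader) throws SAXException {
        // Prerequisites named in the flag's description.
        reader.setFeature(SAX_FEATURES + "namespaces", true);
        reader.setFeature(SAX_FEATURES + "validation", true);
        try {
            reader.setFeature(XERCES_SCHEMA, true);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;   // not a parser that understands this Xerces flag
        }
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        System.out.println("schema validation enabled: " + enableSchemaValidation(reader));
    }
}
```

Keeping the parser-specific identifier in one helper like this makes it easier to swap parsers later.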
Copyright © 2002 O'Reilly & Associates. All rights reserved.