4.3. Exposing DTD Information
SAX2 exposes DTD information through three different
interfaces. Part of it is exposed through the LexicalHandler
extension interface: the DTD's root element type declaration
and boundaries of the various entities. The rest
is exposed through two DTD-specific interfaces, presented here.
When you're working with streams of SAX event data, remember
that all DTD event data is seen before the document data
it describes.
This means that if you need it inside the document, you'll
need to plan ahead to save the DTD data.
It also means that if you need to merge streams of event data,
such DTD data may create a problem.
Unless you know the DTD data in advance,
you'd need to dam up the event stream until all data that
needs to go into downstream DTD events is in hand. Only
then can you send the events downstream (with the DTD first).
Luckily, merging event streams with unknown DTD data isn't common.
XML parsers use DTD information automatically when they parse
XML documents. That includes processing conditional sections and expanding parameter entities in
DTDs, expanding general entities, and normalizing or defaulting attributes. Most DTD validation
can be cleanly layered on top of SAX2 since these declaration
callbacks provide all the most important information.[20]
SAX2 enables application-level processing of DTD constraints;
the only internal support it provides for DTDs is
a feature flag to expose parser support for validation.
When applications need to construct valid documents, they
can use DTD information as they make changes, instead of
needing to save the document and reparse the whole thing.
The support for working with DTDs provided by most XML
tools is not as good as the support provided by SAX2. For example,
DOM Level 2 provides weaker support, and the TRAX support for SAX
(javax.xml.transform.sax) doesn't
support DeclHandler at all.
Note that while a fully featured SAX2 parser will let you re-create the internal subset, it will not let you round-trip any external parameter entities. That's because parameter entities will be expanded. You will not see conditional sections in external PEs, or declarations being built up from parameter entities. Instead, you'll see the actual declarations that apply to your documents. This may help you to understand exactly what a complex DTD is doing.
4.3.1. The DeclHandler Interface
This extension interface is new in SAX2.
It's in the org.xml.sax.ext package,
which means among other things that it is optional and not
all SAX APIs support it. (DefaultHandler
is one example of an API that does not.)
However, any SAX2 parser that can be bootstrapped with JAXP
must support this interface.
There is no setDeclHandler() method;
bind these handlers to parsers like this:
XMLReader producer = ...;
DeclHandler handler = ...;
producer.setProperty ("http://xml.org/sax/properties/declaration-handler",
    handler);
// throws SAXNotSupportedException if parameter isn't a DeclHandler.
// throws SAXNotRecognizedException if parser doesn't support it.
Parsers that support DeclHandler
are essential for applications that need to work with declarations
of elements and attributes or with parsed entities.
DOM requires such support for parsed entities, although even
Level 2 hides or ignores element and attribute
type data.
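The binding shown above can be exercised end to end with a small program. What follows is only a sketch: the class name DeclLogger and its declarationsOf() helper are hypothetical, and it assumes a JAXP-bootstrapped parser (such as the one bundled with the JDK) and a document whose DTD sits entirely in the internal subset. The handler just records each callback as a line of text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DeclHandler;

// Hypothetical demo class: records every DeclHandler callback as text.
class DeclLogger implements DeclHandler {
    final List<String> log = new ArrayList<>();

    public void elementDecl(String name, String model) {
        log.add("element " + name + " " + model);
    }

    public void attributeDecl(String eName, String aName, String type,
                              String mode, String value) {
        log.add("attribute " + eName + "/" + aName + " " + type
                + " " + mode + " " + value);
    }

    public void internalEntityDecl(String name, String value) {
        log.add("internal entity " + name + " = " + value);
    }

    public void externalEntityDecl(String name, String publicId, String systemId) {
        log.add("external entity " + name + " " + publicId + " " + systemId);
    }

    // Parses the given document text and returns the recorded declarations.
    static List<String> declarationsOf(String xml) throws Exception {
        XMLReader producer =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DeclLogger handler = new DeclLogger();
        // There is no setDeclHandler(); the handler is bound as a property.
        producer.setProperty(
            "http://xml.org/sax/properties/declaration-handler", handler);
        producer.parse(new InputSource(new StringReader(xml)));
        return handler.log;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE doc [\n"
            + "<!ELEMENT doc (#PCDATA)>\n"
            + "<!ATTLIST doc id ID #IMPLIED>\n"
            + "<!ENTITY greeting \"hello\">\n"
            + "]>\n"
            + "<doc>&greeting;</doc>";
        for (String line : declarationsOf(xml)) {
            System.out.println(line);
        }
    }
}
```

Because no ContentHandler is set, the document events themselves are discarded; only the declaration callbacks are kept.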
This interface is the most common way SAX2 exposes type
constraints (the
primary role of a Document Type Declaration) from DTDs, so
if you need to see those constraints, you'll use this handler.
It has four API callbacks:
-
void attributeDecl(eName,aName,type,mode,value)
This callback reports
<!ATTLIST ... >
declarations in a DTD. A given declaration
produces
one callback for each attribute in the declaration.
Much of this information will also be provided through
Attributes methods if
an instance of that element appears in a document.
-
String eName
This is the name of the element
whose attribute is being declared.
-
String aName
This is the name of the
attribute associated with that element.
-
String type
This is one of the strings
CDATA, ID,
IDREF, IDREFS,
NMTOKEN,
NMTOKENS,
ENTITY,
or ENTITIES,
or two types of enumerated values. Enumerated
values are encoded with parenthesized strings such
as (a|b|c) to indicate that
strings a,
b, or c
are permissible. If the string is an
enumeration of notation names,
"NOTATION " (which
includes one space) precedes that
parenthesized string.
This type information is more complete
than information you get through the
Attributes object provided
with startElement(), because
Attributes reports enumerations only
as being either NOTATION or NMTOKEN. However, at this time several widely available SAX2 parsers conform to a beta test version of this API and don't correctly report enumerations. You may need to get a bug-fixed version of your parser if you're depending on this support.
-
String mode
This describes the kind of default
value applied to this attribute:
#IMPLIED (the application
determines the value),
#REQUIRED (the value must be given;
defaulting is not permitted),
#FIXED (only one value
is permitted),
or null indicating that
value is the default.
Unless the document provided a value,
you won't see #IMPLIED
attributes in the
Attributes object provided
with startElement(); if you
need to know this information, save it when you
get this callback.
-
String value
This parameter is either null
or a string with the
default value for this attribute. That might be
the only permitted value if the attribute mode is
#FIXED.
The value will be reported exactly as applications will
see it: normalized and with character and
entity references replaced.
XML structure editors can use this information
to constrain the choices presented to document
authors so that only valid documents can be created.
Other tools that construct documents will also benefit
from having this information.
When you're mostly reading documents rather than
creating them, the most important data here tends to
be the declarations of ID,
IDREF, and IDREFS
attributes, which are
used to build links within and between XML
documents.
If more than one declaration for an attribute
is provided, only the first one will be used.
(The second one will be ignored;
unlike the analogous case for element declarations,
attribute redeclaration is not a validity error.)
Normally code to implement this callback would
first retrieve any existing per-element data
structure, or it would create one
(with a null content model)
if none is yet known.
Then if there is no record of an attribute with this
name for that element, a per-attribute data structure
instance would be created and saved in the element
data structure, keyed by attribute name.
-
void elementDecl(name,model)
This method reports
<!ELEMENT ... >
declarations in a DTD.
-
String name
This is the element name.
-
String model
This is the element content model,
with all whitespace removed.
For example, element content models like
(a,(b|c)+,d?), mixed content models
like (#PCDATA|one|two|three)*,
and simple models like ANY
and EMPTY
may all be found in the same document. Note that parsers may do more than just remove the whitespace, as long as an equivalent content model is reported.
Because the content model is provided as a string,
applications using it must always parse it themselves. Similarly, if applications want to validate against that model, they must provide code to do that. Except for the case of element content, such work is straightforward. Validating element content models requires constructing and using some sort of finite state automaton, and it takes a bit of work to parse the model. Mixed content models are easier to handle since they can be parsed with a java.util.StringTokenizer and because the validation logic is simpler.
If more than one declaration for an element
is provided, only the first one will be used.
(The second one will be considered a validity error;
element type redeclaration is not allowed.)
Normally the code implementing this callback would
create a new per-element data structure to save the
name and content model and store it in a data structure
(hash table or other map) keyed by element name. Such a data structure might already exist if an element attribute was declared before the element. In this case, this callback just provides the content model, which was previously unknown.
-
void externalEntityDecl(name,publicId,systemId)
This callback reports
<!ENTITY ... >
declarations in a DTD for parsed external entities. These may be either general or parameter entities.
-
String name
This is the entity name; it is
always provided. Names that start with
% are
parameter entities; all others are general entities.
-
String publicId
This is the public ID for the
entity and can be omitted (provided as null).
If public IDs are provided, any embedded
whitespace is normalized,
so these strings may be directly compared.
They may be used to determine a location for
the entity, for example, by using an SGML Formal
Public Identifier with some sort of catalog.
-
String systemId
This is the system ID for the entity
and is always provided. It is an absolute
URI, which parsers normally use
to retrieve the entity before parsing it. However, some SAX2 parsers have a bug, and won't report the absolute URI here.
Applications usually ignore all parameter entity
declarations and use the
org.xml.sax.EntityResolver when
they want to provide local copies of these entities to
a parser.
If applications don't ignore these declarations,
redeclaration should be ignored
(it is not an error).
XML editors may want to offer menus of external
(and internal) entities when editing element content.
And in some cases you may want to track external entities
by name so that you can tell when
LexicalHandler.startEntity()
is reporting the start of one; this is useful for
applications that use xml:base
attributes to change applications' views of the actual
URI that contains an element, using the
Locator.getSystemId() method.
(Perhaps the actual location was not known, or
should for some reason be ignored.)
-
void internalEntityDecl(name,value)
This callback reports
<!ENTITY ... >
declarations in a DTD for (parsed) internal entities.
These may be either general or parameter entities.
-
String name
This is the entity name.
Names that start with % are
parameter entities; all others are general entities.
-
String value
This is the entity value, which
contains arbitrary XML content (including
elements and nested entity references) that will
be reparsed when this entity is expanded.
Applications normally ignore all parameter entity
declarations.
If applications don't ignore these declarations,
redeclaration for a name should be ignored
(it is not an error).
XML editors may want to offer menus of
internal entities when they edit attribute values or
element content. However, SAX2 does not report
entity references inside the attribute values it parses.
This means that you won't be able to re-create such
text without heuristics.
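The bookkeeping described for attributeDecl() and elementDecl() can be sketched as a pair of nested tables. This is an illustration, not a prescribed design: the names DtdTables, ElementInfo, AttributeInfo, and mixedContentNames() are all hypothetical. The first declaration wins in both callbacks, an element record is created with a null content model when an attribute is declared before its element, and the helper shows the StringTokenizer approach for the simpler mixed content case only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import org.xml.sax.ext.DeclHandler;

// Hypothetical names throughout; only the DeclHandler callbacks are SAX2 API.
class DtdTables implements DeclHandler {
    static class AttributeInfo {
        final String type, mode, value;
        AttributeInfo(String type, String mode, String value) {
            this.type = type;
            this.mode = mode;
            this.value = value;
        }
    }

    static class ElementInfo {
        String model;   // stays null until elementDecl() reports it
        final Map<String, AttributeInfo> attributes = new LinkedHashMap<>();
    }

    // Per-element records, keyed by element name.
    final Map<String, ElementInfo> elements = new LinkedHashMap<>();

    private ElementInfo element(String name) {
        return elements.computeIfAbsent(name, n -> new ElementInfo());
    }

    public void elementDecl(String name, String model) {
        ElementInfo info = element(name);
        if (info.model == null) {
            info.model = model;   // only the first declaration is used
        }
    }

    public void attributeDecl(String eName, String aName, String type,
                              String mode, String value) {
        // Later declarations of the same attribute are ignored.
        element(eName).attributes.putIfAbsent(
            aName, new AttributeInfo(type, mode, value));
    }

    public void internalEntityDecl(String name, String value) { }

    public void externalEntityDecl(String name, String publicId, String systemId) { }

    // Element names permitted by a mixed content model such as
    // (#PCDATA|one|two|three)*; not suitable for element content models.
    static List<String> mixedContentNames(String model) {
        List<String> names = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(model, "()|* \t\r\n");
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (!token.equals("#PCDATA")) {
                names.add(token);
            }
        }
        return names;
    }
}
```

This handler silently keeps the first element declaration; reporting a redeclaration as a validity error is left to a validating parser or to additional application code.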
4.3.2. The DTDHandler Interface
The DTDHandler interface
was carried unchanged from SAX1
into SAX2 and is primarily useful for applications that work
with two specific SGML notions: notations and unparsed
entities. Some DTDs, such as XML DocBook, use notations in such traditional roles. DOM also requires such support. Use XMLReader.setDTDHandler() to bind this handler to a parser. You probably won't ever need to use it for new code. On the Web, those SGML notions correspond roughly to MIME types and URIs respectively, web concepts that are much more widely understood and supported. The interface has only two API callbacks, provided to meet specific requirements in the XML 1.0 specification:
-
void notationDecl(name,publicId,systemId)
This callback reports a
<!NOTATION ...>
declaration in a DTD.
-
String name
This is the notation name; it is
always provided.
These names are used explicitly in unparsed entity
declarations and in some kinds of
attribute declaration (elements can have one such
attribute, used to associate a type with the element).
Also, some applications follow a
convention that they may be used to identify
processing instruction targets.
-
String publicId
This is the public ID for the
notation and may be omitted (provided as null). If public IDs are supplied, then any embedded whitespace is normalized, so these strings may be directly compared. These may be used to assign a meaning to the notation, for example, by using an SGML Formal Public Identifier in a role much like a MIME type.
-
String systemId
This is the system ID for the
notation and may be omitted (provided as null).
When provided, it is an absolute URI. However, some SAX2 parsers have a bug, and won't report the absolute URI here. These may be used to assign a meaning to the notation, for example, by using a URI to identify a type or command.
In addition to assigning types to
unparsed entities, a NOTATION
attribute may also
associate a type with an element
or processing instruction.
Some DTDs provide extensive catalogs of notation
declarations specifically for such uses.
Note that notation declarations are the one place
in XML syntax where you can provide a public ID without
a system ID, and that at least one identifier (public
or system) must always be provided.
If applications don't ignore these declarations,
redeclaration should be ignored
(it is not an error).
-
void unparsedEntityDecl(name,publicId,systemId,notation)
This callback reports
<!ENTITY ... >
declarations with
NDATA annotations to associate
them with a notation (such as jpeg
or png).
Unparsed entities are used only in
attributes that are declared to be of type
ENTITY or
ENTITIES.
-
String name
This is the name of the unparsed
entity; it is always provided.
-
String publicId
This is the public ID for the
entity and may be omitted (provided as null).
If public IDs are provided, any embedded
whitespace is normalized,
so these strings may be directly compared.
These may be used to assign a location to the
entity, for example, by using an SGML Formal
Public Identifier in a role much like a URN.
-
String systemId
This is the system ID for the
entity and is always provided.
It is normally an absolute URI. However, some SAX2 parsers have a bug, and won't report the absolute URI here. These may be used to assign a location to the entity.
-
String notation
This is the name of the notation
associated with the entity; it is always provided.
The role of these names is much like that of
an external MIME type annotation for the entity.
In XML, unparsed entities are declared to parsers
but pass through them without being parsed. Classic
examples of unparsed entities include JPEG or PNG image
files.
Such entities may also be used for XML text that just
doesn't need to be parsed in a given processing stage.
If applications don't ignore these declarations,
redeclaration should be ignored
(it is not an error).
Most XML applications that care about unparsed entities
and notations do so because they interface with SGML systems that
use them or are migrating such systems to use the
XML generation of tools.
XML editors supporting this functionality might use
these event callbacks to create menus of notations or unparsed
entities when they are editing attributes that hold such values.
Applications that use this interface will normally
use the callbacks to create two tables, keyed by
entity or notation name respectively, that are used to
interpret element attributes.
More rarely, notations will be used to determine the operation
corresponding to a given processing instruction target name.
Secure applications will never use notations to directly encode
system commands, but will always redirect through application
controlled tables. For example, it would be foolish to
rely on system IDs found in a document.
System IDs such as rm -rf /,
when run through a Unix or Linux shell, would remove
all files accessible through the local system.
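The two tables, and the advice to redirect through application-controlled tables, might look like the following sketch. The class name NotationTables, its nested record classes, and the handlerFor() method are hypothetical, as is the idea of mapping notation names to handler labels; the point is only that the declared system IDs are stored as data and never executed. An instance would be bound with XMLReader.setDTDHandler().

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.DTDHandler;

// Hypothetical names throughout; only the DTDHandler callbacks are SAX API.
class NotationTables implements DTDHandler {
    static class Notation {
        final String publicId, systemId;
        Notation(String publicId, String systemId) {
            this.publicId = publicId;
            this.systemId = systemId;
        }
    }

    static class UnparsedEntity {
        final String publicId, systemId, notation;
        UnparsedEntity(String publicId, String systemId, String notation) {
            this.publicId = publicId;
            this.systemId = systemId;
            this.notation = notation;
        }
    }

    // The two tables, keyed by notation name and by entity name.
    final Map<String, Notation> notations = new HashMap<>();
    final Map<String, UnparsedEntity> entities = new HashMap<>();

    // Application-controlled: notation name -> label of a trusted handler.
    private final Map<String, String> trustedHandlers;

    NotationTables(Map<String, String> trustedHandlers) {
        this.trustedHandlers = new HashMap<>(trustedHandlers);
    }

    public void notationDecl(String name, String publicId, String systemId) {
        // Redeclarations are ignored; the first declaration is kept.
        notations.putIfAbsent(name, new Notation(publicId, systemId));
    }

    public void unparsedEntityDecl(String name, String publicId,
                                   String systemId, String notation) {
        entities.putIfAbsent(name,
            new UnparsedEntity(publicId, systemId, notation));
    }

    // Looks up the handler for an unparsed entity named in an ENTITY
    // attribute. Only the notation NAME is consulted; the notation's
    // system ID from the document is never treated as a command.
    String handlerFor(String entityName) {
        UnparsedEntity entity = entities.get(entityName);
        if (entity == null) {
            return null;
        }
        return trustedHandlers.get(entity.notation);
    }
}
```

A notation that the application has not listed simply has no handler, whatever its declaration claims.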
Copyright © 2002 O'Reilly & Associates. All rights reserved.