public void characters(char[] ch, int start, int length)
    throws SAXException {

    String s = new String(ch, start, length);
    DefaultMutableTreeNode data =
        new DefaultMutableTreeNode("Character Data: '" + s + "'");
    current.add(data);
}
Although it looks like a simple callback, this method causes a good
deal of confusion because the SAX interface and standards do not
strictly define how it must be used for lengthy pieces of character
data. A parser may choose to return all contiguous character data in
one invocation, or split this data up into multiple invocations. For
any given element, this method may not be called at all (if no
character data is present within the element), or it may be called
one or more times. Parsers implement this behavior differently, often
using algorithms designed to increase parsing speed. Never count on
having all the textual data for an element within one callback;
conversely, never assume that an element's contiguous character data
will be split across multiple callbacks.
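One common way to cope with this is to buffer the text and use it only
when the next start or end tag arrives. The following is a minimal
sketch of that idea, not part of the example program: the
TextCollector class, its collect( ) helper, and the sample document
are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: joins however many characters() calls the
// parser makes into one string per run of contiguous text.
public class TextCollector extends DefaultHandler {
    // Text gathered since the most recent start or end tag
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> texts = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        flush();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called once or many times for the same run of text
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();
    }

    // Record the buffered text as one complete piece, then reset
    private void flush() {
        if (buffer.length() > 0) {
            texts.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    public static List<String> collect(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.texts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect("<title>Java &amp; XML</title>"));
    }
}
```

Because the handler acts on the text only at element boundaries, the
result is the same whether the parser delivers "Java & XML" in one
call or in several (an entity reference such as &amp; is a typical
place for a parser to split the data).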
Whether whitespace is ignorable depends on the constraints in effect:
a DTD (or schema) details the content model for an element. For
example, in the JavaXML.dtd file, the
contents element can only have
chapter elements within it. Any whitespace between
the start of the contents element and the start of
a chapter element is (by logic) ignorable. It
doesn't mean anything, because the DTD says not to expect any
character data (whitespace or otherwise). The same thing applies for
whitespace between the end of a chapter element
and the start of another chapter element, or
between it and the end of the contents element.
Because the constraints (in DTD or schema form) specify that no
character data is allowed, this whitespace cannot be meaningful.
However, without a constraint conveying that information to the
parser, the whitespace cannot be dismissed as meaningless. So if you
remove the reference to a DTD, this whitespace triggers the
characters( ) callback, where previously it triggered the
ignorableWhitespace( ) callback. Thus whitespace is never inherently
ignorable or nonignorable; it all depends on what (if any)
constraints are referenced. Change the constraints, and you might
change the meaning of the whitespace.
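This difference can be observed by parsing the same content twice,
once with a DTD and once without. The sketch below is illustrative
only: the WhitespaceDemo class and its small inline document and
internal DTD subset are assumptions made for this example (they are
not the JavaXML.dtd file), and validation is switched on for the DTD
case so that the parser is certain to consult the content model.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts how whitespace is reported
public class WhitespaceDemo extends DefaultHandler {
    // Assumed sample content and DTD, for illustration only
    public static final String BODY =
        "<contents>\n  <chapter>One</chapter>\n"
        + "  <chapter>Two</chapter>\n</contents>";
    public static final String DTD =
        "<!DOCTYPE contents [\n"
        + "  <!ELEMENT contents (chapter+)>\n"
        + "  <!ELEMENT chapter (#PCDATA)>\n"
        + "]>\n";

    public int ignorable = 0;      // calls to ignorableWhitespace()
    public int whitespaceText = 0; // whitespace-only calls to characters()

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        ignorable++;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (length > 0 && new String(ch, start, length).trim().isEmpty()) {
            whitespaceText++;
        }
    }

    public static WhitespaceDemo parse(String xml, boolean validating)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validating);
        WhitespaceDemo handler = new WhitespaceDemo();
        factory.newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        WhitespaceDemo withDtd = parse(DTD + BODY, true);
        WhitespaceDemo noDtd = parse(BODY, false);
        System.out.println("With DTD:    ignorable=" + withDtd.ignorable
            + " characters=" + withDtd.whitespaceText);
        System.out.println("Without DTD: ignorable=" + noDtd.ignorable
            + " characters=" + noDtd.whitespaceText);
    }
}
```

With the DTD in place, the whitespace around the chapter elements
should arrive through ignorableWhitespace( ); without it, the same
whitespace should arrive through characters( ). The exact number of
calls can vary from parser to parser.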
Let's dive even deeper. In the case where an element can only
have other elements within it, things are reasonably clear.
Whitespace in between elements is ignorable. However, consider a
mixed content model:
<!ELEMENT p (#PCDATA | b | i | a)*>
If this looks like gibberish, think of HTML; it represents (in part)
the constraints for the p element, or paragraph
tag. Text can appear within this tag, as can bold (b), italics (i),
and link (a) elements. In this model, there is no
whitespace between the starting and ending p tags
that will ever be reported as ignorable (with or without a DTD or
schema reference). That's because it's impossible to
distinguish between whitespace used for readability and whitespace
that is supposed to be in the document. For example:
<p>
<i>Java and XML</i>, 2nd edition, is now available at bookstores, as
well as through O'Reilly at
<a href="http://www.oreilly.com">http://www.oreilly.com</a>.
</p>