Chapter 19. Simple API for XML (SAX)
The Simple API for XML
(SAX) is a straightforward, event-based API for reading XML
documents. Many different XML parsers, including Xerces, Crimson,
MSXML, the Oracle XML Parser for Java, and Ælfred,
implement the SAX API. SAX was originally defined as a Java API and
is primarily intended for parsers written in Java. Therefore, this
chapter focuses on the Java version of the API. However, SAX has been
ported to most other major object-oriented languages, including C++,
Python, Perl, and Eiffel. The translation from Java is usually fairly
obvious.
The SAX API
is unusual among XML APIs because it's an
event-based push model rather than a tree-based pull model. As the
XML parser reads an XML document, it sends your program information
from the document in real time. Each time the parser sees a
start-tag, an end-tag, character data, or a processing instruction,
it tells your program. The document is presented to your program one
piece at a time from beginning to end. You can either save the pieces
you're interested in until the entire document has
been read or process the information as soon as you receive it. You
do not have to wait for the entire document to be read before acting
on the data at the beginning of the document. Most importantly, the
entire document does not have to reside in memory. This feature makes
SAX the API of choice for very large documents that do not fit into
available memory.
TIP:
This chapter covers SAX2 exclusively. In 2002 all major parsers that
support SAX support SAX2. The major change in SAX2 from SAX1 is the
addition of namespace support. This addition necessitated changing
the names and signatures of almost every method and class in SAX. The
old SAX1 methods and classes are still available, but
they're now deprecated, and you
shouldn't use them.
SAX is primarily a collection of interfaces in the org.xml.sax
package. One such interface is XMLReader. This interface represents
the XML parser. It declares methods to parse a document and configure
the parsing process, for instance, by turning validation on or off.
To parse a document with SAX, first create an instance of XMLReader
with the XMLReaderFactory class in the org.xml.sax.helpers package.
This class has a static createXMLReader( ) factory method that
produces the parser-specific implementation of the XMLReader
interface. The Java system property org.xml.sax.driver specifies the
concrete class to instantiate:
try {
XMLReader parser = XMLReaderFactory.createXMLReader( );
// parse the document...
}
catch (SAXException e) {
// couldn't create the XMLReader
}
The call to XMLReaderFactory.createXMLReader( ) is
wrapped in a try-catch block that catches SAXException.
This is the generic checked exception superclass for almost anything
that can go wrong while parsing an XML document. In this case, it
means either that the org.xml.sax.driver system
property wasn't set or that it was set to the name
of a class that Java couldn't find in the class
path.
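As a rough illustration of the failure case just described, the
following sketch sets the org.xml.sax.driver property from code and
reports whether a parser could be created. The class name
DriverPropertyDemo, the canLoad( ) helper, and the driver name
com.example.NoSuchParser are all hypothetical, and the sketch assumes
an XMLReaderFactory that consults the system property before trying
anything else.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class DriverPropertyDemo {

  // Hypothetical helper: returns true if XMLReaderFactory can instantiate
  // the class named by the org.xml.sax.driver system property.
  public static boolean canLoad(String driverClassName) {
    String previous = System.getProperty("org.xml.sax.driver");
    System.setProperty("org.xml.sax.driver", driverClassName);
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();
      return parser != null;
    }
    catch (SAXException e) {
      // the named class could not be found or instantiated
      return false;
    }
    finally {
      // restore the original setting so other code is unaffected
      if (previous == null) {
        System.clearProperty("org.xml.sax.driver");
      }
      else {
        System.setProperty("org.xml.sax.driver", previous);
      }
    }
  }

  public static void main(String[] args) {
    // com.example.NoSuchParser is a made-up name used to force the failure
    System.out.println(canLoad("com.example.NoSuchParser"));
  }
}
```

In practice the property is more often set on the command line with
the -D option of the java launcher than from code.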
You can choose which concrete class to instantiate by passing its
name as a string to the createXMLReader( ) method.
This code fragment instantiates the Xerces parser by name:
try {
XMLReader parser = XMLReaderFactory.createXMLReader(
"org.apache.xerces.parsers.SAXParser");
// parse the document...
}
catch (SAXException e) {
// couldn't create the XMLReader
}
Now that you've created a parser,
you're ready to parse some documents with it. Pass
the system ID of the document you want to parse to the
parse( ) method. The
system ID is either an absolute or a relative URL encoded in a
string. For example, this code fragment parses the document at
http://www.slashdot.org/slashdot.xml:
try {
XMLReader parser = XMLReaderFactory.createXMLReader( );
parser.parse("http://www.slashdot.org/slashdot.xml");
}
catch (SAXParseException e) {
// Well-formedness error
}
catch (SAXException e) {
// Could not find an XMLReader implementation class
}
catch (IOException e) {
// Some sort of I/O error prevented the document from being completely
// downloaded from the server
}
The parse( ) method throws a
SAXParseException if the document is malformed, an
IOException if an I/O error such as a broken
socket occurs while the document is being read, and a
SAXException if anything else goes wrong.
Otherwise, it returns void. To receive information
from the parser as it reads the document, you must configure it with
a ContentHandler.
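The three exceptions can be seen in isolation with a small sketch
like the following. To stay self-contained, it assumes the document
is held in a string and wraps it in an InputSource instead of passing
a system ID; the class name WellFormednessCheck and the check( )
helper are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class WellFormednessCheck {

  // Hypothetical helper: parses the XML text and describes the outcome.
  public static String check(String xml) {
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();
      // An InputSource wrapping a Reader stands in for a system ID here
      parser.parse(new InputSource(new StringReader(xml)));
      return "well-formed";
    }
    catch (SAXParseException e) {
      // well-formedness error; the exception knows where it happened
      return "malformed at line " + e.getLineNumber();
    }
    catch (SAXException e) {
      return "other SAX problem: " + e.getMessage();
    }
    catch (IOException e) {
      return "I/O problem: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(check("<a><b/></a>"));
    System.out.println(check("<a><b></a>"));
  }
}
```

The SAXParseException clause must come before the SAXException
clause because SAXParseException is a subclass of SAXException.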
19.1. The ContentHandler Interface
ContentHandler, shown
in stripped-down form in Example 19-1, is an
interface in the org.xml.sax package.
You implement this interface in a class of your own devising. Next,
you configure an XMLReader with an
instance of your implementation. As the XMLReader
reads the document, it invokes the methods in your object to tell
your program what's in the XML document. You can
respond to these method invocations in any way you see fit.
TIP:
This interface has no relation to the moribund
java.net.ContentHandler class. However, you may encounter a
name conflict if you import both java.net.* and
org.xml.sax.* in the same class.
It's better to import just the
java.net classes you actually need, rather than
the entire package.
Example 19-1. The org.xml.sax.ContentHandler Interface
package org.xml.sax;
public interface ContentHandler {
public void setDocumentLocator(Locator locator);
public void startDocument( ) throws SAXException;
public void endDocument( ) throws SAXException;
public void startPrefixMapping(String prefix, String uri)
throws SAXException;
public void endPrefixMapping(String prefix) throws SAXException;
public void startElement(String namespaceURI, String localName,
String qualifiedName, Attributes atts) throws SAXException;
public void endElement(String namespaceURI, String localName,
String qualifiedName) throws SAXException;
public void characters(char[] text, int start, int length)
throws SAXException;
public void ignorableWhitespace(char[] text, int start, int length)
throws SAXException;
public void processingInstruction(String target, String data)
throws SAXException;
public void skippedEntity(String name) throws SAXException;
}
Every time the XMLReader reads a piece of the
document, it calls a method in its ContentHandler.
Suppose a parser reads the simple document shown in Example 19-2.
Example 19-2. A simple XML document
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
<name:name xmlns:name="http://xml.oreilly.com/name">
<name:first>Sydney</name:first>
<name:last>Lee</name:last>
</name:name>
<assignment project_id="p2"/>
</person>
The parser will call these methods in its
ContentHandler with these arguments in this order.
The values of the arguments passed to each method are given after
each method name:
-
setDocumentLocator(Locator locator)
locator: org.apache.xerces.readers.DefaultEntityHandler@1f953d
-
startDocument( )
-
processingInstruction(String target, String data)
target: "xml-stylesheet"
data: "type='text/css' href='person.css'"
-
startPrefixMapping(String prefix, String namespaceURI)
prefix: ""
namespaceURI: "http://xml.oreilly.com/person"
-
startElement(String namespaceURI, String localName,
String qualifiedName, Attributes atts)
namespaceURI: "http://xml.oreilly.com/person"
localName: "person"
qualifiedName: "person"
atts: {} (no attributes, an empty list)
-
ignorableWhitespace(char[] text, int start, int length)
text: <?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
<name:name xmlns:name="http://xml.oreilly.com/name">
<name:first>Sydney</name:first>
<name:last>Lee</name:last>
</name:name>
<assignment project_id="p2"/>
</person>
start: 181
length: 3
-
startPrefixMapping(String prefix, String uri)
prefix: "name"
uri: "http://xml.oreilly.com/name"
-
startElement(String namespaceURI, String localName,
String qualifiedName, Attributes atts)
namespaceURI: "http://xml.oreilly.com/name"
localName: "name"
qualifiedName: "name:name"
atts: {} (no attributes, an empty list)
-
ignorableWhitespace(char[] text, int start, int length)
text: <?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
<name:name xmlns:name="http://xml.oreilly.com/name">
<name:first>Sydney</name:first>
<name:last>Lee</name:last>
</name:name>
<assignment project_id="p2"/>
</person>
start: 236
length: 5
-
startElement(String namespaceURI, String localName,
String qualifiedName, Attributes atts)
namespaceURI: "http://xml.oreilly.com/name"
localName: "first"
qualifiedName: "name:first"
atts: {} (no attributes, an empty list)
-
characters(char[] text, int start, int length)
text: <?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
<name:name xmlns:name="http://xml.oreilly.com/name">
<name:first>Sydney</name:first>
<name:last>Lee</name:last>
</name:name>
<assignment project_id="p2"/>
</person>
start: 253
length: 6
-
endElement(String namespaceURI, String localName, String qualifiedName)
namespaceURI: "http://xml.oreilly.com/name"
localName: "first"
qualifiedName: "name:first"
-
ignorableWhitespace(char[] text, int start, int length)
text: <?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
<name:name xmlns:name="http://xml.oreilly.com/name">
<name:first>Sydney</name:first>
<name:last>Lee</name:last>
</name:name>
<assignment project_id="p2"/>
</person>
start: 272
length: 5
-
startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts)
namespaceURI: "http://xml.oreilly.com/name"
localName: "last"
qualifiedName: "name:last"
atts: {} (no attributes, an empty list)
-
characters(char[] text, int start, int length)
text: <?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
<name:name xmlns:name="http://xml.oreilly.com/name">
<name:first>Sydney</name:first>
<name:last>Lee</name:last>
</name:name>
<assignment project_id="p2"/>
</person>
start: 288
length: 3
-
endElement(String namespaceURI, String localName, String qualifiedName)
namespaceURI: "http://xml.oreilly.com/name"
localName: "last"
qualifiedName: "name:last"
-
ignorableWhitespace(char[] text, int start, int length)
text: <?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
<name:name xmlns:name="http://xml.oreilly.com/name">
<name:first>Sydney</name:first>
<name:last>Lee</name:last>
</name:name>
<assignment project_id="p2"/>
</person>
start: 303
length: 3
-
endElement(String namespaceURI, String localName, String qualifiedName)
namespaceURI: "http://xml.oreilly.com/name"
localName: "name"
qualifiedName: "name:name"
-
endPrefixMapping(String prefix)
prefix: "name"
-
ignorableWhitespace(char[] text, int start, int length)
text: <?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
<name:name xmlns:name="http://xml.oreilly.com/name">
<name:first>Sydney</name:first>
<name:last>Lee</name:last>
</name:name>
<assignment project_id="p2"/>
</person>
start: 318
length: 3
-
startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts)
namespaceURI: "http://xml.oreilly.com/person"
localName: "assignment"
qualifiedName: "assignment"
atts: {project_id="p2"}
-
endElement(String namespaceURI, String localName, String qualifiedName)
namespaceURI: "http://xml.oreilly.com/person"
localName: "assignment"
qualifiedName: "assignment"
-
ignorableWhitespace(char[] text, int start, int length)
text: <?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
<name:name xmlns:name="http://xml.oreilly.com/name">
<name:first>Sydney</name:first>
<name:last>Lee</name:last>
</name:name>
<assignment project_id="p2"/>
</person>
start: 350
length: 1
-
endElement(String namespaceURI, String localName, String qualifiedName)
namespaceURI: "http://xml.oreilly.com/person"
localName: "person"
qualifiedName: "person"
-
endPrefixMapping(String prefix)
prefix: ""
-
endDocument( )
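A sequence like the one above can be reproduced with a handler that
simply records each callback as it arrives. The following is a
minimal sketch, not the program used to generate the list: the
EventLogger class and its log( ) helper are hypothetical, the
document is read from a string through an InputSource, and the
document type declaration is omitted so that no external DTD has to
be loaded. It extends the DefaultHandler adapter class from
org.xml.sax.helpers so that only the callbacks of interest need to be
written.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class EventLogger extends DefaultHandler {

  // One entry per callback, in the order the parser made them
  private final List<String> events = new ArrayList<String>();

  @Override
  public void startDocument() {
    events.add("startDocument");
  }

  @Override
  public void endDocument() {
    events.add("endDocument");
  }

  @Override
  public void startPrefixMapping(String prefix, String uri) {
    events.add("startPrefixMapping '" + prefix + "'");
  }

  @Override
  public void endPrefixMapping(String prefix) {
    events.add("endPrefixMapping '" + prefix + "'");
  }

  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    events.add("startElement " + localName);
  }

  @Override
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) {
    events.add("endElement " + localName);
  }

  @Override
  public void processingInstruction(String target, String data) {
    events.add("processingInstruction " + target);
  }

  // Hypothetical helper: parses the XML text and returns the recorded events
  public static List<String> log(String xml) {
    EventLogger logger = new EventLogger();
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();
      parser.setContentHandler(logger);
      parser.parse(new InputSource(new StringReader(xml)));
    }
    catch (SAXException | IOException e) {
      throw new IllegalArgumentException("Could not parse the document", e);
    }
    return logger.events;
  }

  public static void main(String[] args) {
    String xml = "<?xml-stylesheet type='text/css' href='person.css'?>"
     + "<person xmlns='http://xml.oreilly.com/person'>"
     + "<assignment project_id='p2'/></person>";
    for (String event : log(xml)) {
      System.out.println(event);
    }
  }
}
```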
Some pieces of this are not deterministic. Note that the char array
passed to each call to characters( ) and ignorableWhitespace( )
actually contains the entire document! The specific text block that
the parser really returns is indicated by the last two arguments,
start and length. This is an optimization that Xerces-J performs.
Other parsers are free to pass different char arrays as long as they
set the start and length arguments to match. Indeed, the parser is
also free to split a long run of plain text across multiple calls to
characters( ) or ignorableWhitespace( ), so you cannot assume that
these methods necessarily return the longest possible contiguous run
of plain text. Other details that may change from parser to parser
include attribute order within a tag and whether a Locator object is
provided by calling setDocumentLocator( ).
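Because of these rules, a handler should copy only the slice of the
array that start and length identify, and it should append each slice
to a buffer rather than assume that one call delivers a complete run
of text. The following sketch shows one way to do that; the
TextCollector class and its collect( ) helper are hypothetical, and
the document is read from a string for the sake of a self-contained
example.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class TextCollector extends DefaultHandler {

  private final StringBuilder text = new StringBuilder();

  @Override
  public void characters(char[] ch, int start, int length) {
    // Copy only the slice the parser identified. The rest of the array
    // may hold other parts of the document or unrelated data.
    text.append(ch, start, length);
  }

  // Hypothetical helper: returns all character data in the document
  public static String collect(String xml) {
    TextCollector collector = new TextCollector();
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();
      parser.setContentHandler(collector);
      parser.parse(new InputSource(new StringReader(xml)));
    }
    catch (SAXException | IOException e) {
      throw new IllegalArgumentException("Could not parse the document", e);
    }
    return collector.text.toString();
  }

  public static void main(String[] args) {
    System.out.println(
     collect("<name><first>Sydney</first> <last>Lee</last></name>"));
  }
}
```

Appending to the buffer gives the same result whether the parser
reports a run of text in one call or in several.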
Suppose you want to count the
number of elements, attributes, processing instructions, and
characters of plain text that exist in a given XML document. To do
so, first write a class that implements the
ContentHandler interface. The current count of
each of the four items of interest is stored in a field. The field
values are initialized to zero in the
startDocument( )
method, which is called exactly once for each document parsed. Each
callback method in the class increments the relevant field. The
endDocument( ) method reports the total for that
document. Example 19-3 is such a class.
Example 19-3. The XMLCounter ContentHandler
import org.xml.sax.*;
public class XMLCounter implements ContentHandler {
private int numberOfElements;
private int numberOfAttributes;
private int numberOfProcessingInstructions;
private int numberOfCharacters;
public void startDocument( ) throws SAXException {
numberOfElements = 0;
numberOfAttributes = 0;
numberOfProcessingInstructions = 0;
numberOfCharacters = 0;
}
// We should count either the start-tag of the element or the end-tag,
// but not both. Empty elements are reported by each of these methods.
public void startElement(String namespaceURI, String localName,
String qualifiedName, Attributes atts) throws SAXException {
numberOfElements++;
numberOfAttributes += atts.getLength( );
}
public void endElement(String namespaceURI, String localName,
String qualifiedName) throws SAXException {}
public void characters(char[] text, int start, int length)
throws SAXException {
numberOfCharacters += length;
}
public void ignorableWhitespace(char[] text, int start, int length)
throws SAXException {
numberOfCharacters += length;
}
public void processingInstruction(String target, String data)
throws SAXException {
numberOfProcessingInstructions++;
}
// Now that the document is done, we can print out the final results
public void endDocument( ) throws SAXException {
System.out.println("Number of elements: " + numberOfElements);
System.out.println("Number of attributes: " + numberOfAttributes);
System.out.println("Number of processing instructions: "
+ numberOfProcessingInstructions);
System.out.println("Number of characters of plain text: "
+ numberOfCharacters);
}
// Do-nothing methods we have to implement only to fulfill
// the interface requirements:
public void setDocumentLocator(Locator locator) {}
public void startPrefixMapping(String prefix, String uri)
throws SAXException {}
public void endPrefixMapping(String prefix) throws SAXException {}
public void skippedEntity(String name) throws SAXException {}
}
TIP:
This class provides real implementations for most methods in the
ContentHandler interface. However, if you only really want to provide
one or two ContentHandler methods, you may want to subclass the
DefaultHandler class in the org.xml.sax.helpers package instead. This
adapter class implements all methods in the ContentHandler interface
with do-nothing methods, so you only have to override methods in
which you're genuinely interested.
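For instance, a handler that counts only elements might be sketched
as follows. The ElementCounter class and its count( ) helper are
hypothetical names, and the document is supplied as a string through
an InputSource to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class ElementCounter extends DefaultHandler {

  private int numberOfElements;

  // Reset the count so the same handler can be reused for many documents
  @Override
  public void startDocument() {
    numberOfElements = 0;
  }

  // The only other callback this handler cares about; every other
  // ContentHandler method keeps DefaultHandler's do-nothing behavior
  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    numberOfElements++;
  }

  public int getNumberOfElements() {
    return numberOfElements;
  }

  // Hypothetical helper: counts the elements in the XML text
  public static int count(String xml) {
    ElementCounter counter = new ElementCounter();
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();
      parser.setContentHandler(counter);
      parser.parse(new InputSource(new StringReader(xml)));
    }
    catch (SAXException | IOException e) {
      throw new IllegalArgumentException("Could not parse the document", e);
    }
    return counter.getNumberOfElements();
  }

  public static void main(String[] args) {
    System.out.println(count("<person><name><first>Sydney</first>"
     + "<last>Lee</last></name><assignment project_id='p2'/></person>"));
  }
}
```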
Next, build an XMLReader, and
configure it with an instance of this class. Finally, parse the
documents you want to count, as in Example 19-4.
Example 19-4. The DocumentStatistics driver class
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.IOException;
public class DocumentStatistics {
public static void main(String[] args) {
XMLReader parser;
try {
parser = XMLReaderFactory.createXMLReader( );
}
catch (SAXException e) {
// fall back on Xerces parser by name
try {
parser = XMLReaderFactory.createXMLReader(
"org.apache.xerces.parsers.SAXParser");
}
catch (SAXException ee) {
System.err.println("Couldn't locate a SAX parser");
return;
}
}
if (args.length == 0) {
System.out.println(
"Usage: java DocumentStatistics URL1 URL2...");
return;
}
// Install the Content Handler
parser.setContentHandler(new XMLCounter( ));
// start parsing...
for (int i = 0; i < args.length; i++) {
// command line should offer URIs or file names
try {
parser.parse(args[i]);
}
catch (SAXParseException e) { // well-formedness error
System.out.println(args[i] + " is not well formed.");
System.out.println(e.getMessage( )
+ " at line " + e.getLineNumber( )
+ ", column " + e.getColumnNumber( ));
}
catch (SAXException e) { // some other kind of error
System.out.println(e.getMessage( ));
}
catch (IOException e) {
System.out.println("Could not report on " + args[i]
+ " because of the IOException " + e);
}
}
}
}
Running the program in Example 19-4 across the
document in Example 19-2 results in the following
output:
D:\books\xian\examples\18>java DocumentStatistics 18-2.xml
Number of elements: 5
Number of attributes: 1
Number of processing instructions: 1
Number of characters of plain text: 29
This generic program of Example 19-4 works on any
well-formed XML document. Most SAX programs are more specific and
only work with certain XML applications. They look for particular
elements or attributes in particular places and respond to them
accordingly. They may rely on patterns that are enforced by a
validating parser. Still, this behavior comprises the fundamentals of
SAX.
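As a sketch of what a more specific program might look like, the
following handler looks only for assignment elements in the
http://xml.oreilly.com/person namespace of Example 19-2 and collects
their project_id attributes. The AssignmentFinder class and its
find( ) helper are hypothetical, and the document is read from a
string rather than from a URL.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class AssignmentFinder extends DefaultHandler {

  private static final String PERSON_NS = "http://xml.oreilly.com/person";

  private final List<String> projectIds = new ArrayList<String>();

  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    // Respond only to the one element this application cares about
    if (PERSON_NS.equals(namespaceURI) && "assignment".equals(localName)) {
      // An unprefixed attribute is in no namespace, hence the empty URI
      String id = atts.getValue("", "project_id");
      if (id != null) {
        projectIds.add(id);
      }
    }
  }

  // Hypothetical helper: returns the project IDs found in the XML text
  public static List<String> find(String xml) {
    AssignmentFinder finder = new AssignmentFinder();
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();
      parser.setContentHandler(finder);
      parser.parse(new InputSource(new StringReader(xml)));
    }
    catch (SAXException | IOException e) {
      throw new IllegalArgumentException("Could not parse the document", e);
    }
    return finder.projectIds;
  }

  public static void main(String[] args) {
    System.out.println(find("<person xmlns='http://xml.oreilly.com/person'>"
     + "<assignment project_id='p2'/><assignment project_id='p7'/>"
     + "</person>"));
  }
}
```

Matching on the namespace URI and local name rather than on the
qualified name keeps the handler working if a document binds the same
namespace to a prefix.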
The complicated
part of most SAX programs is the data structure you must build to
store information returned by the parser until
you're ready to use it. Sometimes this information
can be as complicated as the XML document itself, in which case you
may be better off using DOM, which at least provides a ready-made
data structure for an XML document. You usually want only some
information, though, and the data structure you construct should be
less complex than the document itself.
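For the document in Example 19-2, such a structure could be as small
as two strings. The sketch below keeps a text buffer plus one field
each for the first and last name; the NameExtractor class and its
extract( ) helper are hypothetical, and the document type declaration
is left out so that no external DTD has to be loaded.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class NameExtractor extends DefaultHandler {

  private static final String NAME_NS = "http://xml.oreilly.com/name";

  // The whole data structure: a scratch buffer and two fields
  private final StringBuilder buffer = new StringBuilder();
  private String first;
  private String last;

  @Override
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    // Discard whatever text preceded this start-tag
    buffer.setLength(0);
  }

  @Override
  public void characters(char[] text, int start, int length) {
    buffer.append(text, start, length);
  }

  @Override
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) {
    // Save the buffered text only for the two elements of interest
    if (NAME_NS.equals(namespaceURI)) {
      if ("first".equals(localName)) {
        first = buffer.toString();
      }
      else if ("last".equals(localName)) {
        last = buffer.toString();
      }
    }
  }

  // Hypothetical helper: returns "first last" for the XML text
  public static String extract(String xml) {
    NameExtractor extractor = new NameExtractor();
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();
      parser.setContentHandler(extractor);
      parser.parse(new InputSource(new StringReader(xml)));
    }
    catch (SAXException | IOException e) {
      throw new IllegalArgumentException("Could not parse the document", e);
    }
    return extractor.first + " " + extractor.last;
  }

  public static void main(String[] args) {
    System.out.println(extract(
     "<person xmlns='http://xml.oreilly.com/person'>"
     + "<name:name xmlns:name='http://xml.oreilly.com/name'>"
     + "<name:first>Sydney</name:first><name:last>Lee</name:last>"
     + "</name:name><assignment project_id='p2'/></person>"));
  }
}
```

This approach relies on the first and last elements containing only
text. Elements with mixed content would need a stack of buffers or
some similar bookkeeping.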
Copyright © 2002 O'Reilly & Associates. All rights reserved.