Simple API for XML (SAX) (XML in a Nutshell, 2nd Edition)

19.1. The ContentHandler Interface

ContentHandler, shown in stripped-down form in Example 19-1, is an interface in the org.xml.sax package. You implement this interface in a class of your own devising. Next, you configure an XMLReader with an instance of your implementation. As the XMLReader reads the document, it invokes the methods in your object to tell your program what's in the XML document. You can respond to these method invocations in any way you see fit.

TIP: This class has no relation to the moribund java.net.ContentHandler class. However, you may encounter a name conflict if you import both java.net.* and org.xml.sax.* in the same class. It's better to import just the java.net classes you actually need, rather than the entire package.
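As a rough illustration of that advice, the sketch below imports a single java.net class alongside the whole org.xml.sax package, so the simple name ContentHandler stays unambiguous. The class ImportDemo and its two methods are hypothetical names made up for this example, and java.net.URI merely stands in for whichever java.net class your program needs.

```java
import java.net.URI;                        // single-type import: only the class needed
import org.xml.sax.ContentHandler;          // the SAX interface, not java.net.ContentHandler
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class, used only to show the two imports coexisting.
class ImportDemo {

  // ContentHandler resolves to org.xml.sax.ContentHandler because
  // java.net.* was not imported wholesale.
  static ContentHandler newHandler() {
    return new DefaultHandler();
  }

  // A java.net class is still available through its own import.
  static URI namespace() {
    return URI.create("http://xml.oreilly.com/person");
  }

  public static void main(String[] args) {
    System.out.println(newHandler() instanceof ContentHandler);
    System.out.println(namespace().getHost());
  }
}
```

If you do need both classes in one source file, referring to one of them by its fully qualified name is another way around the conflict.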

Example 19-1. The org.xml.sax.ContentHandler Interface

package org.xml.sax;

public interface ContentHandler {
    public void setDocumentLocator(Locator locator);
    public void startDocument( ) throws SAXException;
    public void endDocument( ) throws SAXException;
    public void startPrefixMapping(String prefix, String uri)
     throws SAXException;
    public void endPrefixMapping(String prefix) throws SAXException;
    public void startElement(String namespaceURI, String localName,
     String qualifiedName, Attributes atts) throws SAXException;
    public void endElement(String namespaceURI, String localName,
     String qualifiedName) throws SAXException;
    public void characters(char[] text, int start, int length)
     throws SAXException;
    public void ignorableWhitespace(char[] text, int start, int length)
     throws SAXException;
    public void processingInstruction(String target, String data)
     throws SAXException;
    public void skippedEntity(String name) throws SAXException;

}

Every time the XMLReader reads a piece of the document, it calls a method in its ContentHandler. Suppose a parser reads the simple document shown in Example 19-2.

Example 19-2. A simple XML document

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
  <name:name xmlns:name="http://xml.oreilly.com/name">
    <name:first>Sydney</name:first>
    <name:last>Lee</name:last>
  </name:name>
  <assignment project_id="p2"/>
</person>

The parser will call these methods in its ContentHandler with these arguments in this order. The values of the arguments passed to each method are given after each method name:

setDocumentLocator(Locator locator)
locator: org.apache.xerces.readers.DefaultEntityHandler@1f953d

```
startDocument( )
```

processingInstruction(String target, String data)
target: "xml-stylesheet"
data: "type='text/css' href='person.css'"

startPrefixMapping(String prefix, String namespaceURI)
prefix: ""
namespaceURI: "http://xml.oreilly.com/person"

startElement(String namespaceURI, String localName, 
String qualifiedName, Attributes atts)
namespaceURI: "http://xml.oreilly.com/person"
localName: "person"
qualifiedName: "person"
atts: {} (no attributes, an empty list)

ignorableWhitespace(char[] text, int start, int length)
text: <?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
  <name:name xmlns:name="http://xml.oreilly.com/name">
    <name:first>Sydney</name:first>
    <name:last>Lee</name:last>
  </name:name>
  <assignment project_id="p2"/>
</person>
start: 181
length: 3

startPrefixMapping(String prefix, String uri)
prefix: "name"
uri: "http://xml.oreilly.com/name"

startElement(String namespaceURI, String localName, 
String qualifiedName, Attributes atts)
namespaceURI: "http://xml.oreilly.com/name"
localName: "name"
qualifiedName: "name:name"
atts: {} (no attributes, an empty list)

ignorableWhitespace(char[] text, int start, int length)
text: <?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
  <name:name xmlns:name="http://xml.oreilly.com/name">
    <name:first>Sydney</name:first>
    <name:last>Lee</name:last>
  </name:name>
  <assignment project_id="p2"/>
</person>
start: 236
length: 5

startElement(String namespaceURI, String localName, 
String qualifiedName, Attributes atts)
namespaceURI: "http://xml.oreilly.com/name"
localName: "first"
qualifiedName: "name:first"
atts: {} (no attributes, an empty list)

characters(char[] text, int start, int length)
text: <?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
  <name:name xmlns:name="http://xml.oreilly.com/name">
    <name:first>Sydney</name:first>
    <name:last>Lee</name:last>
  </name:name>
  <assignment project_id="p2"/>
</person>
start: 253
length: 6

endElement(String namespaceURI, String localName, String qualifiedName)
namespaceURI: "http://xml.oreilly.com/name"
localName: "first"
qualifiedName: "name:first"

ignorableWhitespace(char[] text, int start, int length)
text: <?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
  <name:name xmlns:name="http://xml.oreilly.com/name">
    <name:first>Sydney</name:first>
    <name:last>Lee</name:last>
  </name:name>
  <assignment project_id="p2"/>
</person>
start: 272
length: 5

startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts)
namespaceURI: "http://xml.oreilly.com/name"
localName: "last"
qualifiedName: "name:last"
atts: {} (no attributes, an empty list)

characters(char[] text, int start, int length)
text: <?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
  <name:name xmlns:name="http://xml.oreilly.com/name">
    <name:first>Sydney</name:first>
    <name:last>Lee</name:last>
  </name:name>
  <assignment project_id="p2"/>
</person>
start: 288
length: 3

endElement(String namespaceURI, String localName, String qualifiedName)
namespaceURI: "http://xml.oreilly.com/name"
localName: "last"
qualifiedName: "name:last"

ignorableWhitespace(char[] text, int start, int length)
text: <?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
  <name:name xmlns:name="http://xml.oreilly.com/name">
    <name:first>Sydney</name:first>
    <name:last>Lee</name:last>
  </name:name>
  <assignment project_id="p2"/>
</person>
start: 303
length: 3

endElement(String namespaceURI, String localName, String qualifiedName)
namespaceURI: "http://xml.oreilly.com/name"
localName: "name"
qualifiedName: "name:name"

endPrefixMapping(String prefix)
prefix: "name"

ignorableWhitespace(char[] text, int start, int length)
text: <?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
  <name:name xmlns:name="http://xml.oreilly.com/name">
    <name:first>Sydney</name:first>
    <name:last>Lee</name:last>
  </name:name>
  <assignment project_id="p2"/>
</person>
start: 318
length: 3

startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts)
namespaceURI: "http://xml.oreilly.com/person"
localName: "assignment"
qualifiedName: "assignment"
atts: {project_id="p2"}

endElement(String namespaceURI, String localName, String qualifiedName)
namespaceURI: "http://xml.oreilly.com/person"
localName: "assignment"
qualifiedName: "assignment"

ignorableWhitespace(char[] text, int start, int length)
text: <?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type='text/css' href='person.css'?>
<!DOCTYPE person SYSTEM "person.dtd">
<person xmlns="http://xml.oreilly.com/person">
  <name:name xmlns:name="http://xml.oreilly.com/name">
    <name:first>Sydney</name:first>
    <name:last>Lee</name:last>
  </name:name>
  <assignment project_id="p2"/>
</person>
start: 350
length: 1

endElement(String namespaceURI, String localName, String qualifiedName)
namespaceURI: "http://xml.oreilly.com/person"
localName: "person"
qualifiedName: "person"

endPrefixMapping(String prefix)
prefix: ""

```
endDocument( )
```

Some details of this sequence vary from parser to parser. Note that the char array passed to each call to characters( ) and ignorableWhitespace( ) actually contains the entire document! The specific block of text that the parser is reporting is indicated by the last two arguments, start and length. This is an optimization that Xerces-J performs. Other parsers are free to pass different char arrays as long as they set the start and length arguments to match. Indeed, the parser is also free to split a long run of plain text across multiple calls to characters( ) or ignorableWhitespace( ), so you cannot assume that a single call delivers the longest possible contiguous run of plain text. Other details that may change from parser to parser include the order of attributes within a tag and whether a Locator object is provided by calling setDocumentLocator( ).
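One way a handler might cope with both points is sketched below: it copies only the slice of the array marked by start and length, and it accumulates text across calls rather than assuming one call delivers a complete run. This is a minimal sketch, not a definitive implementation. It assumes the parser bundled with the JDK, obtained through JAXP's SAXParserFactory, and TextCollector and its collect( ) method are hypothetical names invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers all character data of a document into one string.
class TextCollector extends DefaultHandler {

  private final StringBuilder buffer = new StringBuilder();

  @Override
  public void characters(char[] text, int start, int length) {
    // Only text[start] through text[start + length - 1] belongs to this
    // event; the rest of the array may hold unrelated parts of the document.
    buffer.append(text, start, length);
  }

  String getText() {
    return buffer.toString();
  }

  // Convenience method for this sketch: parses an XML string, returns its text.
  // Assumption: the JDK's bundled parser, located through JAXP.
  static String collect(String xml) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    XMLReader reader = factory.newSAXParser().getXMLReader();
    TextCollector collector = new TextCollector();
    reader.setContentHandler(collector);
    reader.parse(new InputSource(new StringReader(xml)));
    return collector.getText();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(collect("<first>Sydney</first>")); // prints "Sydney"
  }
}
```

Because the handler appends on every call, it produces the same string whether the parser reports the text in one piece or in several.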

Suppose you want to count the number of elements, attributes, processing instructions, and characters of plain text that exist in a given XML document. To do so, first write a class that implements the ContentHandler interface. The current count of each of the four items of interest is stored in a field. The field values are initialized to zero in the startDocument( ) method, which is called exactly once for each document parsed. Each callback method in the class increments the relevant field. The endDocument( ) method reports the total for that document. Example 19-3 is such a class.

Example 19-3. The XMLCounter ContentHandler

import org.xml.sax.*;

public class XMLCounter implements ContentHandler {

  private int numberOfElements;
  private int numberOfAttributes;
  private int numberOfProcessingInstructions;
  private int numberOfCharacters;

  public void startDocument( ) throws SAXException {
    numberOfElements = 0;
    numberOfAttributes = 0;
    numberOfProcessingInstructions = 0;
    numberOfCharacters = 0;
  }

  // We should count either the start-tag of the element or the end-tag,
  // but not both. Empty elements are reported by each of these methods.
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) throws SAXException {
    numberOfElements++;
    numberOfAttributes += atts.getLength( );
  }

  public void endElement(String namespaceURI, String localName,
   String qualifiedName) throws SAXException {}

  public void characters(char[] text, int start, int length)
   throws SAXException {
    numberOfCharacters += length;
  }

  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {
    numberOfCharacters += length;
  }

  public void processingInstruction(String target, String data)
   throws SAXException {
    numberOfProcessingInstructions++;
  }

  // Now that the document is done, we can print out the final results
  public void endDocument( ) throws SAXException {
    System.out.println("Number of elements: " + numberOfElements);
    System.out.println("Number of attributes: " + numberOfAttributes);
    System.out.println("Number of processing instructions: "
     + numberOfProcessingInstructions);
    System.out.println("Number of characters of plain text: "
     + numberOfCharacters);
  }

  // Do-nothing methods we have to implement only to fulfill
  // the interface requirements:
  public void setDocumentLocator(Locator locator) {}
  public void startPrefixMapping(String prefix, String uri)
   throws SAXException {}
  public void endPrefixMapping(String prefix) throws SAXException {}
  public void skippedEntity(String name) throws SAXException {}

}

TIP: XMLCounter has to implement every method in the ContentHandler interface, including those it has no use for. If you only want to provide one or two ContentHandler methods, you may want to subclass the org.xml.sax.helpers.DefaultHandler class instead. This adapter class implements all methods in the ContentHandler interface with do-nothing methods, so you only have to override the methods in which you're genuinely interested.
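For comparison, here is a sketch of what a DefaultHandler subclass that counts only elements might look like. ElementCounter and its count( ) helper are hypothetical names, and the sketch assumes the JDK's bundled parser obtained through JAXP's SAXParserFactory rather than XMLReaderFactory.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: overrides only the two callbacks it needs.
// DefaultHandler supplies do-nothing versions of all the others.
class ElementCounter extends DefaultHandler {

  private int numberOfElements;

  @Override
  public void startDocument() {
    numberOfElements = 0; // reset so the handler can be reused
  }

  @Override
  public void startElement(String namespaceURI, String localName,
      String qualifiedName, Attributes atts) {
    numberOfElements++;
  }

  int getNumberOfElements() {
    return numberOfElements;
  }

  // Convenience method for this sketch: counts the elements in an XML string.
  static int count(String xml) throws Exception {
    XMLReader reader =
        SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    ElementCounter counter = new ElementCounter();
    reader.setContentHandler(counter);
    reader.parse(new InputSource(new StringReader(xml)));
    return counter.getNumberOfElements();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(count("<person><name/><assignment/></person>")); // prints 3
  }
}
```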

Next, build an XMLReader, and configure it with an instance of this class. Finally, parse the documents you want to count, as in Example 19-4.

Example 19-4. The DocumentStatistics driver class

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.IOException;

public class DocumentStatistics {

  public static void main(String[] args) {

    XMLReader parser;
    try {
     parser = XMLReaderFactory.createXMLReader( );
    }
    catch (SAXException e) {
      // fall back on Xerces parser by name
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException ee) {
        System.err.println("Couldn't locate a SAX parser");
        return;
      }
    }

    if (args.length == 0) {
      System.out.println(
       "Usage: java DocumentStatistics URL1 URL2...");
      return;
    }

    // Install the Content Handler
    parser.setContentHandler(new XMLCounter( ));

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage( )
         + " at line " + e.getLineNumber( )
         + ", column " + e.getColumnNumber( ));
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage( ));
      }
      catch (IOException e) {
        System.out.println("Could not report on " + args[i]
         + " because of the IOException " + e);
      }

    }

  }

}

Running the program in Example 19-4 across the document in Example 19-2 results in the following output:

D:\books\xian\examples\18>java DocumentStatistics 18-2.xml
Number of elements: 5
Number of attributes: 1
Number of processing instructions: 1
Number of characters of plain text: 29

The generic program in Example 19-4 works on any well-formed XML document. Most SAX programs are more specific and only work with certain XML applications. They look for particular elements or attributes in particular places and respond to them accordingly. They may rely on patterns that are enforced by a validating parser. Still, this behavior comprises the fundamentals of SAX.

The complicated part of most SAX programs is the data structure you must build to store information returned by the parser until you're ready to use it. Sometimes this information can be as complicated as the XML document itself, in which case you may be better off using DOM, which at least provides a ready-made data structure for an XML document. You usually want only some information, though, and the data structure you construct should be less complex than the document itself.
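To make this concrete, the following sketch builds a data structure much simpler than the document: a list of full names assembled from the first and last elements of Example 19-2. It is only one possible design. NameCollector and its parse( ) helper are hypothetical names, the handler assumes that each name element contains one first and one last child, and the parser is assumed to be the JDK's bundled one obtained through JAXP's SAXParserFactory with namespace support turned on.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps only the names and ignores everything else.
class NameCollector extends DefaultHandler {

  private static final String NAME_NS = "http://xml.oreilly.com/name";

  private final List<String> fullNames = new ArrayList<>();
  private StringBuilder currentText; // non-null only inside first or last
  private String first;
  private String last;

  @Override
  public void startElement(String namespaceURI, String localName,
      String qualifiedName, Attributes atts) {
    if (NAME_NS.equals(namespaceURI)
        && (localName.equals("first") || localName.equals("last"))) {
      currentText = new StringBuilder();
    }
  }

  @Override
  public void characters(char[] text, int start, int length) {
    // Text may arrive in several pieces, so accumulate it.
    if (currentText != null) currentText.append(text, start, length);
  }

  @Override
  public void endElement(String namespaceURI, String localName,
      String qualifiedName) {
    if (!NAME_NS.equals(namespaceURI)) return;
    if (localName.equals("first")) {
      first = currentText.toString();
      currentText = null;
    } else if (localName.equals("last")) {
      last = currentText.toString();
      currentText = null;
    } else if (localName.equals("name")) {
      // The name element is complete; store the result and reset.
      fullNames.add(first + " " + last);
      first = null;
      last = null;
    }
  }

  // Convenience method for this sketch: returns the names found in an XML string.
  static List<String> parse(String xml) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true); // needed for namespace URIs and local names
    XMLReader reader = factory.newSAXParser().getXMLReader();
    NameCollector collector = new NameCollector();
    reader.setContentHandler(collector);
    reader.parse(new InputSource(new StringReader(xml)));
    return collector.fullNames;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<person xmlns='http://xml.oreilly.com/person'>"
        + "<name:name xmlns:name='http://xml.oreilly.com/name'>"
        + "<name:first>Sydney</name:first><name:last>Lee</name:last>"
        + "</name:name><assignment project_id='p2'/></person>";
    System.out.println(parse(xml)); // prints [Sydney Lee]
  }
}
```

The handler holds just three pieces of state between callbacks, which is typical: the parser forgets each event as soon as it has reported it, so anything you need later must be saved in fields.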
