Filters (XML in a Nutshell, 2nd Edition)

19.3. Filters

A SAX filter sits in between the parser and the client application and intercepts the messages that these two objects pass to each other. It can pass these messages unchanged or modify, replace, or block them. To a client application, the filter looks like a parser, that is, an XMLReader. To the parser, the filter looks like a client application, that is, a ContentHandler.

SAX filters are implemented by subclassing the org.xml.sax.helpers.XMLFilterImpl class.[8] This class implements all the required interfaces of SAX for both parsers and client applications. Its declaration is as follows (XMLFilter itself extends XMLReader, so the class is also an XMLReader):

public class XMLFilterImpl implements XMLFilter, EntityResolver,
 DTDHandler, ContentHandler, ErrorHandler

[8] There's also an org.xml.sax.XMLFilter interface. However, this interface is arranged exactly backwards for most use cases. It filters messages from the client application to the parser, but not the much more important messages from the parser to the client application. Furthermore, implementing the XMLFilter interface directly requires a lot more work than subclassing XMLFilterImpl. Almost no experienced SAX programmer would choose to implement XMLFilter directly rather than subclassing the XMLFilterImpl adapter class.
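Because XMLFilterImpl implements both sides of the conversation, a single instance can be handed to code that expects a parser and to code that expects a handler. The following is only a small sketch to make that concrete; the class name DualRoleDemo and the method playsBothRoles( ) are our own inventions, not part of SAX.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical demo class; only XMLFilterImpl, XMLReader, and
// ContentHandler come from SAX itself.
public class DualRoleDemo {

  // Returns true if the object can act as both a parser (XMLReader)
  // and a client application (ContentHandler).
  public static boolean playsBothRoles(Object candidate) {
    return candidate instanceof XMLReader
        && candidate instanceof ContentHandler;
  }

  public static void main(String[] args) {
    XMLFilterImpl filter = new XMLFilterImpl();

    // The same object, seen from the two sides of the filter
    XMLReader whatTheClientSees = filter;
    ContentHandler whatTheParserSees = filter;

    System.out.println(playsBothRoles(filter));
  }
}
```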

Your own filters will extend this class and override those methods that correspond to the messages you want to filter. For example, if you wanted to filter out all processing instructions, you would write a filter that would override the processingInstruction( ) method to do nothing, as shown in Example 19-5.

Example 19-5. A SAX filter that removes processing instructions

import org.xml.sax.helpers.XMLFilterImpl;

public class ProcessingInstructionStripper extends XMLFilterImpl {

  public void processingInstruction(String target, String data) {
    // Because we do nothing, processing instructions read in the
    // document are *not* passed to client application
  }

}

If instead you wanted to replace a processing instruction with an element whose name was the same as the processing instruction's target and whose text content was the processing instruction's data, you'd call the startElement( ), characters( ), and endElement( ) methods from inside the processingInstruction( ) method after filling in the arguments with the relevant data from the processing instruction, as shown in Example 19-6.

Example 19-6. A SAX filter that converts processing instructions to elements

import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class ProcessingInstructionConverter extends XMLFilterImpl {

  public void processingInstruction(String target, String data)
   throws SAXException {

    // AttributesImpl is an adapter class in the org.xml.sax.helpers package
    // for precisely this case. We don't really want to add any attributes
    // here, but we need to pass something as the fourth argument to
    // startElement( ).
    Attributes emptyAttributes = new AttributesImpl( );

    // We won't use any namespace for the element
    startElement("", target, target, emptyAttributes);
    // converts String data to char array
    char[] text = data.toCharArray( );
    characters(text, 0, text.length);

    endElement("", target, target);

  }

}

We ran Example 19-2 through this filter and into a program that echoes an XML document onto System.out and were a little surprised to see this come out:

<xml-stylesheet>type="text/css" href="person.css"</xml-stylesheet>
<person xmlns="http://xml.oreilly.com/person">
  <name:name xmlns:name="http://xml.oreilly.com/name">
    <name:first>Sydney</name:first>
    <name:last>Lee</name:last>
  </name:name>
  <assignment project_id="p2"></assignment>
</person>

This document is not well-formed! The specific problem is that there are two independent root elements. However, on further consideration that's really not too surprising. Well-formedness checking is normally done by the underlying parser when it reads the text form of an XML document. SAX filters should provide well-formed XML data to client applications, but nothing forces them to. Indeed, they can produce substantially more malformed data than this by including start-tags that are not matched by end-tags, text that contains illegal characters such as the formfeed or the vertical tab, and XML names that contain non-name characters such as * and §. Be very careful before assuming that data you receive from a filter is valid or even well-formed.
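If a client application cannot trust the filters upstream of it, one defensive option is to put a checking filter last in the chain. The following is a minimal sketch under our own assumptions, not a complete well-formedness checker: the class name StructureGuard is hypothetical, it compares qualified names only (assuming the parser reports them), and it catches just two of the problems described above, a second root element and mismatched start- and end-tags. Illegal characters and bad names pass through unexamined.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical guard filter: rejects event streams with more than one
// root element or with start-tags and end-tags that do not pair up.
public class StructureGuard extends XMLFilterImpl {

  private final Deque<String> openElements = new ArrayDeque<>();
  private boolean rootSeen = false;

  @Override
  public void startDocument() throws SAXException {
    // Reset state so the guard can be reused for several documents
    openElements.clear();
    rootSeen = false;
    super.startDocument();
  }

  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes atts) throws SAXException {
    if (openElements.isEmpty()) {
      if (rootSeen) {
        throw new SAXException("Second root element: " + qName);
      }
      rootSeen = true;
    }
    openElements.push(qName);
    super.startElement(uri, localName, qName, atts);
  }

  @Override
  public void endElement(String uri, String localName, String qName)
      throws SAXException {
    if (openElements.isEmpty() || !openElements.pop().equals(qName)) {
      throw new SAXException("End-tag without matching start-tag: " + qName);
    }
    super.endElement(uri, localName, qName);
  }

  @Override
  public void endDocument() throws SAXException {
    if (!openElements.isEmpty()) {
      throw new SAXException("Unclosed element: " + openElements.peek());
    }
    super.endDocument();
  }
}
```

Since a filter is itself an XMLReader, the guard can be stacked on top of another filter by passing that filter to the guard's setParent( ) method. With Example 19-6 as its parent, this guard would throw a SAXException at the person start-tag instead of passing two root elements on to the client.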

If you want to invoke a method without filtering it or you want to invoke the same method in the underlying handler, you can prefix a call to it with the super keyword. This invokes the variant of the method from the superclass. By default, each callback method in XMLFilterImpl just passes the same arguments to the equivalent method in the handler that the client application has installed on the filter. Example 19-7 demonstrates with a filter that changes all character data to uppercase by overriding the characters( ) method.

Example 19-7. A SAX filter that converts text to uppercase

import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class UpperCaseFilter extends XMLFilterImpl {

  public void characters(char[] text, int start, int length)
   throws SAXException {

    String temp = new String(text, start, length);
    temp = temp.toUpperCase( );
    text = temp.toCharArray( );
    super.characters(text, 0, text.length);

  }

}
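Blocking and passing through can be combined in one filter: call super for the events you want the client to see, and simply return for the ones you don't. As a rough sketch of that pattern, the filter below drops every element with a given qualified name, along with its element and text content, and forwards everything else. The class name ElementStripper and its constructor argument are our own choices for illustration. For brevity it does not suppress ignorable whitespace or processing instructions that occur inside a stripped element.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter that removes one kind of element and its content.
public class ElementStripper extends XMLFilterImpl {

  private final String blockedName;
  // Greater than zero while we are inside a blocked element
  private int skipDepth = 0;

  public ElementStripper(String blockedName) {
    this.blockedName = blockedName;
  }

  @Override
  public void startDocument() throws SAXException {
    skipDepth = 0;
    super.startDocument();
  }

  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes atts) throws SAXException {
    if (skipDepth > 0 || qName.equals(blockedName)) {
      skipDepth++;   // block this start-tag
      return;
    }
    super.startElement(uri, localName, qName, atts);   // pass through
  }

  @Override
  public void endElement(String uri, String localName, String qName)
      throws SAXException {
    if (skipDepth > 0) {
      skipDepth--;   // block the matching end-tag
      return;
    }
    super.endElement(uri, localName, qName);
  }

  @Override
  public void characters(char[] text, int start, int length)
      throws SAXException {
    if (skipDepth == 0) {
      super.characters(text, start, length);
    }
  }
}
```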

Using a filter involves these steps:

1. Create a filter object, normally by invoking its own constructor.
2. Create the XMLReader that will actually parse the document, normally by calling XMLReaderFactory.createXMLReader( ).
3. Attach the filter to the parser using the filter's setParent( ) method.
4. Install a ContentHandler in the filter.
5. Parse the document by calling the filter's parse( ) method.

Details can vary a little from application to application. For instance, you might install other handlers besides the ContentHandler or change the parent between documents. However, once the filter has been attached to the underlying XMLReader, you should not directly invoke any methods on this underlying parser; you should only talk to it through the filter. For example, this is how you'd use the filter in Example 19-7 to parse a document:

XMLFilter filter = new UpperCaseFilter( );
filter.setParent(XMLReaderFactory.createXMLReader( ));
filter.setContentHandler(yourContentHandlerObject);
filter.parse(document);

Notice specifically that you invoke the filter's parse( ) method, not the underlying parser's parse( ) method.
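To see the pieces working together, here is one way the fragment above might be fleshed out into a complete program. Treat it as a sketch under a few assumptions of ours: the class names FilterDemo and TextCollector and the method filteredText( ) are hypothetical, the document is read from a string rather than a file, and the filter from Example 19-7 is repeated as a nested class only so that the listing compiles on its own. Note also that XMLReaderFactory has been deprecated since Java 9, so compiling this on a recent JDK produces a deprecation warning.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLFilter;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

public class FilterDemo {

  // Same logic as Example 19-7, repeated to keep this listing self-contained
  static class UpperCaseFilter extends XMLFilterImpl {
    @Override
    public void characters(char[] text, int start, int length)
        throws SAXException {
      char[] upper = new String(text, start, length)
          .toUpperCase().toCharArray();
      super.characters(upper, 0, upper.length);
    }
  }

  // Hypothetical client handler that just accumulates character data
  static class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
      text.append(ch, start, length);
    }
  }

  // Parses the XML in the string through the filter and returns the
  // character data that reached the client handler.
  public static String filteredText(String xml)
      throws SAXException, IOException {
    // Create the filter
    XMLFilter filter = new UpperCaseFilter();
    // Create the real parser and attach the filter to it
    filter.setParent(XMLReaderFactory.createXMLReader());
    // Install the client's ContentHandler in the filter
    TextCollector collector = new TextCollector();
    filter.setContentHandler(collector);
    // Parse through the filter, never through the underlying parser
    filter.parse(new InputSource(new StringReader(xml)));
    return collector.text.toString();
  }

  public static void main(String[] args) throws Exception {
    String xml = "<name><first>Sydney</first><last>Lee</last></name>";
    System.out.println(filteredText(xml));   // prints SYDNEYLEE
  }
}
```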