DOM Level 2 Modules (Java & XML, 2nd Edition)

Table 6-1. DOM specifications and purpose

| Specification | Module name | Summary of purpose |
|---|---|---|
| DOM Level 2 Core | XML | Extends the DOM Level 1 specification; deals with basic DOM structures like Element, Attr, Document, etc. |
| DOM Level 2 Views | Views | Provides a model for scripts to dynamically update a DOM structure. |
| DOM Level 2 Events | Events | Defines an event model for programs and scripts to use in working with DOM. |
| DOM Level 2 Style | CSS | Provides a model for CSS (Cascading Style Sheets) based on the DOM Core and DOM Views specifications. |
| DOM Level 2 Traversal and Range | Traversal/Range | Defines extensions to the DOM for traversing a document and identifying the range of content within that document. |
| DOM Level 2 HTML | HTML | Extends the DOM to provide interfaces for dealing with HTML structures in a DOM format. |

Example 6-4. Checking features on a DOM implementation

package javaxml2;

import org.w3c.dom.DOMImplementation;

public class DOMModuleChecker {

    /** Vendor DOMImplementation impl class */
    private String vendorImplementationClass =
        "org.apache.xerces.dom.DOMImplementationImpl";

    /** Modules to check */
    private String[] moduleNames =
        {"XML", "Views", "Events", "CSS", "Traversal", "Range", "HTML"};

    public DOMModuleChecker( ) {
    }

    public DOMModuleChecker(String vendorImplementationClass) {
        this.vendorImplementationClass = vendorImplementationClass;
    }

    public void check( ) throws Exception {
        DOMImplementation impl = 
            (DOMImplementation)Class.forName(vendorImplementationClass)
                                    .newInstance( );
        for (int i=0; i<moduleNames.length; i++) {
            if (impl.hasFeature(moduleNames[i], "2.0")) {
                System.out.println("Support for " + moduleNames[i] +
                    " is included in this DOM implementation.");
            } else {
                System.out.println("Support for " + moduleNames[i] +
                    " is not included in this DOM implementation.");
            }
        }
    }

    public static void main(String[] args) {
        if ((args.length != 0) && (args.length != 1)) {
            System.out.println("Usage: java javaxml2.DOMModuleChecker " +
                "[DOMImplementation impl class to query]");
            System.exit(-1);
        }

        try {
            DOMModuleChecker checker = null;
            if (args.length == 1) {
                // A single argument lives at index 0, not index 1
                checker = new DOMModuleChecker(args[0]);
            } else {
                checker = new DOMModuleChecker( );
            }
            checker.check( );
        } catch (Exception e) {
            e.printStackTrace( );
        }
    }
}

C:\javaxml2\build>java javaxml2.DOMModuleChecker
Support for XML is included in this DOM implementation.
Support for Views is not included in this DOM implementation.
Support for Events is included in this DOM implementation.
Support for CSS is not included in this DOM implementation.
Support for Traversal is included in this DOM implementation.
Support for Range is not included in this DOM implementation.
Support for HTML is not included in this DOM implementation.
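If you would rather not hardcode a vendor class name, the same hasFeature( ) check can be run against whatever parser JAXP hands you. The following is only a sketch, assuming a JAXP-capable JDK; the class name JaxpModuleChecker and its check( ) method are my own inventions for illustration, and the answers you get depend entirely on the parser found on your classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;

// Hypothetical helper (not from the original example): queries the JAXP
// default DOM implementation instead of a hardcoded vendor class.
final class JaxpModuleChecker {

    /** The same feature names that DOMModuleChecker queries. */
    static final String[] MODULES =
        {"XML", "Views", "Events", "CSS", "Traversal", "Range", "HTML"};

    /** Returns module name -> supported?, in the order listed above. */
    static Map<String, Boolean> check() {
        DOMImplementation impl;
        try {
            impl = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .getDOMImplementation();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable JAXP parser", e);
        }
        Map<String, Boolean> support = new LinkedHashMap<>();
        for (String module : MODULES) {
            support.put(module, impl.hasFeature(module, "2.0"));
        }
        return support;
    }

    public static void main(String[] args) {
        check().forEach((module, ok) -> System.out.println(
            "Support for " + module + (ok ? " is" : " is not")
            + " included in this DOM implementation."));
    }
}
```

Going through DocumentBuilderFactory trades the explicit vendor class for whatever implementation is configured, which is usually what you want when the code has to run on more than one parser.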

6.3.2. Traversal

First up on the list is the DOM Level 2 Traversal module. This is intended to provide tree-walking capability, but also to allow you to refine the nature of that behavior. In the earlier section on DOM mutation, I mentioned that most of your DOM code will know something about the structure of a DOM tree being worked with; this allows for quick traversal and modification of both structure and content. However, for those times when you do not know the structure of the document, the traversal module comes into play.

Consider the auction site again, and the items input by the user. Most critical are the item name and the description. Since most popular auction sites provide some sort of search, you would want to provide the same in this fictional example. Just searching item titles isn't going to cut it in the real world; instead, a set of key words should be extracted from the item descriptions. I say key words because you don't want a search on "adirondack top" (which to a guitar lover obviously applies to the wood on the top of a guitar) to return toys ("top") from a particular mountain range ("Adirondack"). The best way to do this in the format discussed so far is to extract words that are formatted in a certain way. So the words in the description that are bolded, or in italics, are perfect candidates. Of course, you could grab all the nontextual child elements of the description element. However, you'd have to weed through links (the a element), image references (img), and so forth. What you really want is to specify a custom traversal. Good news; you're in the right place.

The whole of the traversal module is contained within the org.w3c.dom.traversal package. Just as everything within core DOM begins with a Document interface, everything in DOM Traversal begins with the org.w3c.dom.traversal.DocumentTraversal interface. This interface provides two methods:

NodeIterator createNodeIterator(Node root, int whatToShow, NodeFilter filter,
                                boolean expandEntityReferences);
TreeWalker createTreeWalker(Node root, int whatToShow, NodeFilter filter,
                            boolean expandEntityReferences);

Most DOM implementations that support traversal choose to have their org.w3c.dom.Document implementation class implement the DocumentTraversal interface as well; this is how it works in Xerces. In a nutshell, using a NodeIterator provides a list view of the elements it iterates over; the closest analogy is a standard Java List (in the java.util package). TreeWalker provides a tree view, which you may be more used to in working with XML by now.
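To make the list view and tree view distinction concrete, here is a minimal sketch. It assumes the JDK's bundled JAXP parser, whose Document objects implement DocumentTraversal; the class ListVersusTreeView, its helper methods, and the tiny sample document are all hypothetical names of my own choosing.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; assumes the parser's Document supports traversal.
final class ListVersusTreeView {

    /** Parses an XML string, wrapping checked exceptions to keep the sketch short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Casts only after confirming the Document really implements traversal. */
    static DocumentTraversal traversalOf(Document doc) {
        if (!(doc instanceof DocumentTraversal)) {
            throw new UnsupportedOperationException(
                "DOM Traversal is not supported by " + doc.getClass().getName());
        }
        return (DocumentTraversal) doc;
    }

    /** List view: the root element and everything below it, in document order. */
    static List<String> listView(Document doc) {
        NodeIterator it = traversalOf(doc).createNodeIterator(
            doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, null, true);
        List<String> names = new ArrayList<>();
        for (Node n = it.nextNode(); n != null; n = it.nextNode()) {
            names.add(n.getNodeName());
        }
        it.detach();
        return names;
    }

    /** Tree view: step down one level and across, just like a normal DOM tree. */
    static List<String> childrenInTreeView(Document doc) {
        TreeWalker walker = traversalOf(doc).createTreeWalker(
            doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, null, true);
        List<String> names = new ArrayList<>();
        for (Node n = walker.firstChild(); n != null; n = walker.nextSibling()) {
            names.add(n.getNodeName());
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse("<a><b><c/></b><d/></a>");
        System.out.println(listView(doc));            // [a, b, c, d]
        System.out.println(childrenInTreeView(doc));  // [b, d]
    }
}
```

The iterator flattens the hierarchy into one sequence, while the walker keeps parent, child, and sibling relationships intact; that is the whole difference between the two.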

6.3.2.1. NodeIterator

I want to get past all the conceptualization and into the code sample I referred to earlier. I want access to all content within the description of an item from the auction site that is within a specific set of formatting tags. To do this, I first need access to the DOM tree itself. Since this doesn't fit into the servlet approach (you probably wouldn't have a servlet building the search phrases, you'd have some standalone class), I need a new class, ItemSearcher (Example 6-5). This class takes any number of item files to search through as arguments.

Example 6-5. The ItemSearcher class

package javaxml2;

import java.io.File;

// DOM imports
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;

// Vendor parser
import org.apache.xerces.parsers.DOMParser;

public class ItemSearcher {

    private String docNS = "http://www.oreilly.com/javaxml2";

    public void search(String filename) throws Exception {
        // Parse into a DOM tree
        File file = new File(filename);
        DOMParser parser = new DOMParser( );
        parser.parse(file.toURL().toString( ));
        Document doc = parser.getDocument( );

        // Get node to start iterating with
        Element root = doc.getDocumentElement( );
        NodeList descriptionElements = 
            root.getElementsByTagNameNS(docNS, "description");
        Element description = (Element)descriptionElements.item(0);

        // Get a NodeIterator
        NodeIterator i = ((DocumentTraversal)doc)
            .createNodeIterator(description, NodeFilter.SHOW_ALL, null, true);

        Node n;
        while ((n = i.nextNode( )) != null) {
            if (n.getNodeType( ) == Node.ELEMENT_NODE) {
                System.out.println("Encountered Element: '" + 
                    n.getNodeName( ) + "'");
            } else if (n.getNodeType( ) == Node.TEXT_NODE) {
                System.out.println("Encountered Text: '" + 
                    n.getNodeValue( ) + "'");
            }
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("No item files to search through specified.");
            return;
        }

        try {
            ItemSearcher searcher = new ItemSearcher( );
            for (int i=0; i<args.length; i++) {
                System.out.println("Processing file: " + args[i]);
                searcher.search(args[i]);
            }
        } catch (Exception e) {
            e.printStackTrace( );
        }
    }
}

As you can see, I've created a NodeIterator and supplied it the description element as the starting point for iteration. The constant passed as the whatToShow argument, NodeFilter.SHOW_ALL, instructs the iterator to show all nodes. You could just as easily provide values like NodeFilter.SHOW_ELEMENT and NodeFilter.SHOW_TEXT, which would show only elements or textual nodes, respectively. I haven't yet provided a NodeFilter implementation (I'll get to that next), so null is passed for the filter, and I allowed for entity reference expansion with the final true argument. What is nice about all this is that the iterator, once created, doesn't have just the child nodes of description. Instead, it has all nodes under description, even when nested multiple levels deep. This is extremely handy for dealing with unknown XML structure!
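The whatToShow constants are bit flags, so they can be combined with the bitwise OR operator. The sketch below illustrates that under a few assumptions: it uses the JDK's bundled JAXP parser rather than the Xerces DOMParser class, and the class WhatToShowDemo, its visit( ) method, and the sample description are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; assumes the parser's Document supports traversal.
final class WhatToShowDemo {

    static final String SAMPLE =
        "<description>A <b>great <i>guitar</i></b><!--sold--></description>";

    /** Iterates from the root element and labels each node that is shown. */
    static List<String> visit(String xml, int whatToShow) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        NodeIterator it = ((DocumentTraversal) doc).createNodeIterator(
            doc.getDocumentElement(), whatToShow, null, true);
        List<String> seen = new ArrayList<>();
        for (Node n = it.nextNode(); n != null; n = it.nextNode()) {
            switch (n.getNodeType()) {
                case Node.ELEMENT_NODE: seen.add("Element:" + n.getNodeName()); break;
                case Node.TEXT_NODE:    seen.add("Text:" + n.getNodeValue()); break;
                case Node.COMMENT_NODE: seen.add("Comment:" + n.getNodeValue()); break;
                default:                seen.add("Other:" + n.getNodeName());
            }
        }
        it.detach();
        return seen;
    }

    public static void main(String[] args) {
        // Everything, including the comment
        System.out.println(visit(SAMPLE, NodeFilter.SHOW_ALL));
        // Elements and text only; the comment is never shown
        System.out.println(visit(SAMPLE,
            NodeFilter.SHOW_ELEMENT | NodeFilter.SHOW_TEXT));
        // Elements only: description, b, i
        System.out.println(visit(SAMPLE, NodeFilter.SHOW_ELEMENT));
    }
}
```

Notice that the starting node itself (description here) is the first node returned, followed by its descendants in document order.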

At this point, you still have all the nodes, which is not what you want. I added some code (the last while loop) to show you how to print out the element and text node results. You can run the code as is, but it's not going to help much. Instead, the code needs to provide a filter, so it only picks up elements with the formatting desired: the text within an i or b block. You can provide this customized behavior by supplying a custom implementation of the NodeFilter interface, which defines only a single method:

public short acceptNode(Node n);

This method should return NodeFilter.FILTER_SKIP, NodeFilter.FILTER_REJECT, or NodeFilter.FILTER_ACCEPT. The first skips the examined node, but continues to iterate over its children; the second rejects the examined node and its children (a distinction that matters only in TreeWalker; a NodeIterator treats a rejected node just like a skipped one); and the third accepts and passes on the examined node. It behaves a lot like SAX, in that you can intercept nodes as they are being iterated and decide if they should be passed on to the calling method. Add the following nonpublic class to the ItemSearcher.java source file:

class FormattingNodeFilter implements NodeFilter {

    public short acceptNode(Node n) {
        if (n.getNodeType( ) == Node.TEXT_NODE) {
            Node parent = n.getParentNode( );
            if ((parent.getNodeName( ).equalsIgnoreCase("b")) ||
                (parent.getNodeName( ).equalsIgnoreCase("i"))) {
                return FILTER_ACCEPT;
            }
        }
        // If we got here, not interested
        return FILTER_SKIP;
    }
}

This is just plain old DOM code, and shouldn't pose any difficulty to you. First, the code only wants text nodes; the text of the formatted elements is desired, not the elements themselves. Next, the parent is determined, and since it's safe to assume that Text nodes have Element node parents, the code immediately invokes getNodeName( ). If the element name is either "b" or "i", the code has found search text, and returns FILTER_ACCEPT. Otherwise, FILTER_SKIP is returned.

All that's left now is a change to the iterator creation call instructing it to use the new filter implementation, and to the output, both in the existing search( ) method of the ItemSearcher class:

// Get a NodeIterator
NodeIterator i = ((DocumentTraversal)doc)
    .createNodeIterator(description, NodeFilter.SHOW_ALL, 
        new FormattingNodeFilter( ), true);

Node n;
while ((n = i.nextNode( )) != null) {
    System.out.println("Search phrase found: '" + n.getNodeValue( ) + "'");
}

NOTE: Some astute readers will wonder what happens when a NodeFilter implementation conflicts with the constant supplied as the whatToShow argument to the createNodeIterator( ) method (in this case that constant is NodeFilter.SHOW_ALL). The whatToShow constant is applied first, and only the nodes that pass it are handed to the filter implementation. If I had supplied the constant NodeFilter.SHOW_ELEMENT, I would not have gotten any search phrases, because my filter would not have received any Text nodes to examine; just Element nodes. Be careful to use the two together in a way that makes sense. In the example, I could have safely used NodeFilter.SHOW_TEXT also.
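You can see this interplay for yourself with a small experiment. This is just a sketch, assuming the JDK's bundled JAXP parser; FilterOrderDemo, its phrases( ) method, and the sample text are hypothetical, and the filter is a lambda that mimics FormattingNodeFilter rather than the class itself.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; assumes the parser's Document supports traversal.
final class FilterOrderDemo {

    static final String SAMPLE = "<description>A <b>beautiful</b> guitar "
        + "with a <i>huge sound</i>.</description>";

    /** Same idea as FormattingNodeFilter: accept text whose parent is b or i. */
    static final NodeFilter FORMATTED_TEXT = n -> {
        Node parent = n.getParentNode();
        boolean formatted = n.getNodeType() == Node.TEXT_NODE
            && parent != null
            && (parent.getNodeName().equalsIgnoreCase("b")
                || parent.getNodeName().equalsIgnoreCase("i"));
        return formatted ? NodeFilter.FILTER_ACCEPT : NodeFilter.FILTER_SKIP;
    };

    /** Collects the phrases that survive both whatToShow and the filter. */
    static List<String> phrases(String xml, int whatToShow) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        NodeIterator it = ((DocumentTraversal) doc).createNodeIterator(
            doc.getDocumentElement(), whatToShow, FORMATTED_TEXT, true);
        List<String> found = new ArrayList<>();
        for (Node n = it.nextNode(); n != null; n = it.nextNode()) {
            found.add(n.getNodeValue());
        }
        it.detach();
        return found;
    }

    public static void main(String[] args) {
        System.out.println(phrases(SAMPLE, NodeFilter.SHOW_ALL));     // [beautiful, huge sound]
        System.out.println(phrases(SAMPLE, NodeFilter.SHOW_TEXT));    // [beautiful, huge sound]
        System.out.println(phrases(SAMPLE, NodeFilter.SHOW_ELEMENT)); // []
    }
}
```

With SHOW_ELEMENT, the Text nodes are screened out before the filter ever sees them, so the result is empty even though the filter would have accepted them.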

Now, the class is useful and ready to run. Executing it on the item-bourgOM.xml file I explained in the first section, I get the following results:

bmclaugh@GANDALF ~/javaxml2/build
$ java javaxml2.ItemSearcher ../ch06/xml/item-bourgOM.xml
Processing file: ../ch06/xml/item-bourgOM.xml
Search phrase found: 'beautiful'
Search phrase found: 'Sitka-topped'
Search phrase found: 'Indian Rosewood'
Search phrase found: 'huge sound'
Search phrase found: 'great action'
Search phrase found: 'fossilized ivory'
Search phrase found: 'ebony'
Search phrase found: 'great guitar'

This is perfect: all of the bolded and italicized phrases are now ready to be added to a search facility. (Sorry; you'll have to write that yourself!)

6.3.2.2. TreeWalker

The TreeWalker interface is almost exactly the same as the NodeIterator interface; the only difference is that you get a tree view instead of a list view. This is primarily useful if you want to deal with only a certain type of node within a tree; for instance, the tree with only elements or without any comments. By using the constant filter value (such as NodeFilter.SHOW_ELEMENT) and a filter implementation (like one that passes on FILTER_SKIP for all comments), you can essentially get a view of a DOM tree without extraneous information. The TreeWalker interface provides all the basic node operations, such as firstChild( ), parentNode( ), nextSibling( ), and of course getCurrentNode( ), which tells you where you are currently walking.

I'm not going to give an example here. By now, you should see that this is identical to dealing with a standard DOM tree, except that you can filter out unwanted items by using the NodeFilter constants. This is a great, simple way to limit your view of XML documents to only information you are interested in seeing. Use it well; it's a real asset, as is NodeIterator! You can also check out the complete specification online at http://www.w3.org/TR/DOM-Level-2-Traversal-Range/.
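If you do want something to experiment with, the one place where TreeWalker behaves differently from NodeIterator is the FILTER_SKIP versus FILTER_REJECT distinction, so here is a minimal sketch of just that. It assumes the JDK's bundled JAXP parser; SkipVersusReject, its walk( ) method, and the sample markup are hypothetical names I made up for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; assumes the parser's Document supports traversal.
final class SkipVersusReject {

    static final String SAMPLE = "<description><b>great</b>"
        + "<a href=\"x\"><i>photos</i></a></description>";

    /** Walks all elements below the root, applying the given verdict to links. */
    static List<String> walk(String xml, short verdictForLinks) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        // Every element is accepted except a, which gets the supplied verdict
        NodeFilter filter = n -> n.getNodeName().equals("a")
            ? verdictForLinks : NodeFilter.FILTER_ACCEPT;
        TreeWalker walker = ((DocumentTraversal) doc).createTreeWalker(
            doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, filter, true);

        // The walker starts positioned on the root; nextNode( ) moves forward
        List<String> names = new ArrayList<>();
        for (Node n = walker.nextNode(); n != null; n = walker.nextNode()) {
            names.add(n.getNodeName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Skipping a hides the link itself but still visits its children
        System.out.println(walk(SAMPLE, NodeFilter.FILTER_SKIP));   // [b, i]
        // Rejecting a hides the link and its entire subtree
        System.out.println(walk(SAMPLE, NodeFilter.FILTER_REJECT)); // [b]
    }
}
```

Use FILTER_REJECT when an entire branch is irrelevant, and FILTER_SKIP when only the container is uninteresting but its contents still matter.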

        // Load document
        try {
            DOMParser parser = new DOMParser( );
            parser.parse(xmlFile.toURL().toString( ));
            doc = parser.getDocument( );

            Element root = doc.getDocumentElement( );

            // Name of item
            NodeList nameElements = 
                root.getElementsByTagNameNS(docNS, "name");
            Element nameElement = (Element)nameElements.item(0);
            Text nameText = (Text)nameElement.getFirstChild( );
            nameText.setData(name);

            // Description of item
            NodeList descriptionElements = 
                root.getElementsByTagNameNS(docNS, "description");
            Element descriptionElement = (Element)descriptionElements.item(0);

            // Remove and recreate description
            Range range = ((DocumentRange)doc).createRange( );
            range.setStartBefore(descriptionElement.getFirstChild( ));
            range.setEndAfter(descriptionElement.getLastChild( ));
            range.deleteContents( );
            Text descriptionText = doc.createTextNode(description);
            descriptionElement.appendChild(descriptionText);
            range.detach( );
        } catch (SAXException e) {
            // Print error
            PrintWriter out = res.getWriter( );
            res.setContentType("text/html");
            out.println("<HTML><BODY>Error in reading XML: " +
                e.getMessage( ) + ".</BODY></HTML>");
            out.close( );
            return;
        }

            // Remove and recreate description
            Range range = ((DocumentRange)doc).createRange( );
            range.setStartBefore(descriptionElement.getFirstChild( ));
            range.setEndAfter(descriptionElement.getLastChild( ));
            Node oldContents = range.extractContents( );
            Text descriptionText = doc.createTextNode(description);
            descriptionElement.appendChild(descriptionText);

            // Set this as content to some other, archival, element
            archivalElement.appendChild(oldContents);
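The two fragments above differ only in what happens to the old content: deleteContents( ) throws it away, while extractContents( ) hands it back as a DocumentFragment that can be attached somewhere else. Here is a self-contained sketch of the second approach. It assumes the JDK's bundled JAXP parser, whose Document objects implement DocumentRange, and a description that has at least one child node; RangeArchiveDemo, replaceAndArchive( ), and the archive element name are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.ranges.DocumentRange;
import org.w3c.dom.ranges.Range;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; assumes the parser's Document supports ranges.
final class RangeArchiveDemo {

    static final String SAMPLE = "<item><description>Old <b>text</b>"
        + "</description><archive/></item>";

    /** Parses an XML string, wrapping checked exceptions to keep the sketch short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /**
     * Replaces the description's content and moves the old content under
     * the archive element. Returns the text of both elements afterward.
     */
    static String[] replaceAndArchive(Document doc, String newDescription) {
        Element description =
            (Element) doc.getElementsByTagName("description").item(0);
        Element archive =
            (Element) doc.getElementsByTagName("archive").item(0);

        // Select everything inside description (assumed to be non-empty)
        Range range = ((DocumentRange) doc).createRange();
        range.setStartBefore(description.getFirstChild());
        range.setEndAfter(description.getLastChild());

        // Pull the old content out of the tree, keeping it for later use
        DocumentFragment oldContents = range.extractContents();
        range.detach();

        description.appendChild(doc.createTextNode(newDescription));
        archive.appendChild(oldContents);
        return new String[] {
            description.getTextContent(), archive.getTextContent()
        };
    }

    public static void main(String[] args) {
        String[] result = replaceAndArchive(parse(SAMPLE), "Brand new");
        System.out.println("description: " + result[0]); // Brand new
        System.out.println("archive:     " + result[1]); // Old text
    }
}
```

Appending a DocumentFragment moves its children into the target element, so the b element and its text end up under archive with their structure intact.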

package javaxml2;

import org.w3c.dom.views.AbstractView;

// An interface extends (rather than implements) another interface
public interface StyledView extends AbstractView {

    public void setStylesheet(String stylesheetURI);

    public String getStylesheetURI( );
}
