Java and XML, 2nd Edition

6.3. DOM Level 2 Modules

Now that you've seen what the DOM and the Level 2 core offering provide, I will talk about some additions to DOM Level 2. These are the various modules that add functionality to the core. They are useful from time to time, in certain DOM applications.

First, though, you must have a DOM Level 2 parser available. If you are using a parser that you have purchased or downloaded on your own, this is pretty easy. For example, you can go to the Apache XML web site at http://xml.apache.org, download the latest version of Xerces, and you've got DOM Level 2. However, if you're using a parser bundled with another technology, things can get a little trickier. For example, if you've got Jakarta's Tomcat servlet engine, you will find xml.jar and parser.jar in the lib/ directory and in the Tomcat classpath. This isn't so good, as these are DOM Level 1 implementations and won't support many of the features I talk about in this section; in that case, download a DOM Level 2 parser manually and ensure that it is loaded before any DOM Level 1 parsers.

WARNING: Beware of the newer versions of Tomcat. They do something ostensibly handy: load all jar files in the lib/ directory at startup. Unfortunately, because this is done alphabetically, putting xerces.jar in the lib/ directory means that parser.jar, a DOM Level 1 parser, will still be loaded first and you won't get DOM Level 2 support. A common trick to solve this problem is to rename the files: parser.jar becomes z_parser.jar, and xml.jar becomes z_xml.jar. This causes them to be loaded after Xerces, and then you will get DOM Level 2 support. This is the problem I mentioned earlier in the servlet example.
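When jar ordering is in doubt, it can help to ask the JVM where a class actually came from. The following is a minimal sketch, not part of the original example code: the class name ParserOrigin is hypothetical, and it assumes a JAXP-capable runtime so that DocumentBuilderFactory is available.

```java
import java.security.CodeSource;
import javax.xml.parsers.DocumentBuilderFactory;

public class ParserOrigin {

    /** Describes where a class was loaded from; runtime-supplied classes may have no code source. */
    static String originOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return type.getName() + " <- (supplied by the Java runtime)";
        }
        return type.getName() + " <- " + source.getLocation();
    }

    public static void main(String[] args) {
        // The DOM interfaces themselves: if an old jar wins, Level 1 signatures are what you get
        System.out.println(originOf(org.w3c.dom.Node.class));
        // The parser factory that JAXP selected on this classpath
        System.out.println(originOf(DocumentBuilderFactory.newInstance().getClass()));
    }
}
```

If the first line of output points at a DOM Level 1 jar such as parser.jar, the load order still needs adjusting.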

Once you've got a capable parser, you're ready to go. Before diving into the new modules, though, I want to show you a high-level overview of what these modules are all about.

6.3.1. Branching Out

When the DOM Level 1 specification came out, it was a single specification. It was defined basically as you read in Chapter 5, "DOM", with a few minor exceptions. However, when activity began on DOM Level 2, a whole slew of specifications resulted, each called a module. If you take a look at the complete set of DOM Level 2 specifications, you'll see six different modules listed. Seems like a lot, doesn't it? I'm not going to cover all of these modules; you'd be reading about DOM for the next four or five chapters. However, I will give you the rundown on the purpose of each module, summarized in Table 6-1. I've included the module's specification, name, and purpose, which you'll need to use shortly.

Table 6-1. DOM specifications and purpose

Specification                    Module name      Summary of purpose
-------------------------------  ---------------  ------------------
DOM Level 2 Core                 XML              Extends the DOM Level 1 specification; deals with basic DOM structures like Element, Attr, Document, etc.
DOM Level 2 Views                Views            Provides a model for scripts to dynamically update a DOM structure.
DOM Level 2 Events               Events           Defines an event model for programs and scripts to use in working with DOM.
DOM Level 2 Style                CSS              Provides a model for CSS (Cascading Style Sheets) based on the DOM Core and DOM Views specifications.
DOM Level 2 Traversal and Range  Traversal/Range  Defines extensions to the DOM for traversing a document and identifying the range of content within that document.
DOM Level 2 HTML                 HTML             Extends the DOM to provide interfaces for dealing with HTML structures in a DOM format.

If views, events, CSS, HTML, and traversal were all in a single specification, nothing would ever get done at the W3C! To facilitate all of this moving along, and yet not hamstringing the DOM in the process, the different concepts were broken up into separate specifications.

Once you figure out which specifications to use, you're almost ready to roll. A DOM Level 2 parser is not required to support each of these specifications; as a result, you need to verify that the features you want to use are present in your XML parser. Happily, this is fairly simple to accomplish. Remember the hasFeature( ) method I showed you on the DOMImplementation class? Well, if you supply it a module name and version, it will let you know if the module and feature requested are supported. Example 6-4 is a small program that queries an XML parser's support for the DOM modules listed in Table 6-1. You will need to change the name of your vendor's DOMImplementation implementation class, but other than that adjustment, it should work for any parser.

Example 6-4. Checking features on a DOM implementation

package javaxml2;

import org.w3c.dom.DOMImplementation;

public class DOMModuleChecker {

    /** Vendor DOMImplementation impl class (Xerces assumed; change for your parser) */
    private String vendorImplementationClass =
        "org.apache.xerces.dom.DOMImplementationImpl";

    /** Modules to check */
    private String[] moduleNames =
        {"XML", "Views", "Events", "CSS", "Traversal", "Range", "HTML"};

    public DOMModuleChecker( ) {
    }

    public DOMModuleChecker(String vendorImplementationClass) {
        this.vendorImplementationClass = vendorImplementationClass;
    }

    public void check( ) throws Exception {
        // Load the vendor class by name and create an instance of it
        DOMImplementation impl = 
            (DOMImplementation)Class.forName(vendorImplementationClass)
                                    .newInstance( );
        for (int i=0; i<moduleNames.length; i++) {
            if (impl.hasFeature(moduleNames[i], "2.0")) {
                System.out.println("Support for " + moduleNames[i] +
                    " is included in this DOM implementation.");
            } else {
                System.out.println("Support for " + moduleNames[i] +
                    " is not included in this DOM implementation.");
            }
        }
    }

    public static void main(String[] args) {
        if ((args.length != 0) && (args.length != 1)) {
            System.out.println("Usage: java javaxml2.DOMModuleChecker " +
                "[DOMImplementation impl class to query]");
            return;
        }

        try {
            DOMModuleChecker checker = null;
            if (args.length == 1) {
                // The single argument is at index 0
                checker = new DOMModuleChecker(args[0]);
            } else {
                checker = new DOMModuleChecker( );
            }
            checker.check( );
        } catch (Exception e) {
            e.printStackTrace( );
        }
    }
}

Running this program with xerces.jar in my classpath, I got the following output:

C:\javaxml2\build>java javaxml2.DOMModuleChecker
Support for XML is included in this DOM implementation.
Support for Views is not included in this DOM implementation.
Support for Events is included in this DOM implementation.
Support for CSS is not included in this DOM implementation.
Support for Traversal is included in this DOM implementation.
Support for Range is not included in this DOM implementation.
Support for HTML is not included in this DOM implementation.

By specifying the DOMImplementation implementation class for your vendor, you can check the supported modules in your own DOM parser. In the next few subsections, I will address a few of the modules that I've found useful, and that you will want to know about as well.
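If you would rather not hardcode a vendor class name at all, the same check can be sketched through JAXP. This is an alternative to Example 6-4 rather than a restatement of it: the class name JaxpModuleChecker is hypothetical, and it assumes your runtime supplies JAXP's DocumentBuilderFactory. The module names and the "2.0" version string are the ones used above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;

public class JaxpModuleChecker {

    static final String[] MODULES =
        {"XML", "Views", "Events", "CSS", "Traversal", "Range", "HTML"};

    /** Asks whichever DOM implementation JAXP selects about each module. */
    static Map<String, Boolean> probe() throws Exception {
        DOMImplementation impl = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().getDOMImplementation();
        Map<String, Boolean> support = new LinkedHashMap<>();
        for (String module : MODULES) {
            support.put(module, impl.hasFeature(module, "2.0"));
        }
        return support;
    }

    public static void main(String[] args) throws Exception {
        probe().forEach((module, supported) ->
            System.out.println(module + ": " + (supported ? "reported" : "not reported")));
    }
}
```

The results depend entirely on the parser JAXP finds, so treat the output as a report about your environment, not as a fixed list.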

6.3.2. Traversal

First up on the list is the DOM Level 2 Traversal module. This is intended to provide tree-walking capability, but also to allow you to refine the nature of that behavior. In the earlier section on DOM mutation, I mentioned that most of your DOM code will know something about the structure of a DOM tree being worked with; this allows for quick traversal and modification of both structure and content. However, for those times when you do not know the structure of the document, the traversal module comes into play.

Consider the auction site again, and the items input by the user. Most critical are the item name and the description. Since most popular auction sites provide some sort of search, you would want to provide the same in this fictional example. Just searching item titles isn't going to cut it in the real world; instead, a set of key words should be extracted from the item descriptions. I say key words because you don't want a search on "adirondack top" (which to a guitar lover obviously applies to the wood on the top of a guitar) to return toys ("top") from a particular mountain range ("Adirondack"). The best way to do this in the format discussed so far is to extract words that are formatted in a certain way. So the words in the description that are bolded, or in italics, are perfect candidates. Of course, you could grab all the nontextual child elements of the description element. However, you'd have to weed through links (the a element), image references (img), and so forth. What you really want is to specify a custom traversal. Good news; you're in the right place.

The whole of the traversal module is contained within the org.w3c.dom.traversal package. Just as everything within core DOM begins with a Document interface, everything in DOM Traversal begins with the org.w3c.dom.traversal.DocumentTraversal interface. This interface provides two methods:

NodeIterator createNodeIterator(Node root, int whatToShow, NodeFilter filter,
                                boolean expandEntityReferences);
TreeWalker createTreeWalker(Node root, int whatToShow, NodeFilter filter,
                            boolean expandEntityReferences);

Most DOM implementations that support traversal choose to have their org.w3c.dom.Document implementation class implement the DocumentTraversal interface as well; this is how it works in Xerces. In a nutshell, using a NodeIterator provides a list view of the elements it iterates over; the closest analogy is a standard Java List (in the java.util package). TreeWalker provides a tree view, which you may be more used to in working with XML by now.

NodeIterator

I want to get past all the conceptualization and into the code sample I referred to earlier. I want access to all content within the description of an item from the auction site that is within a specific set of formatting tags. To do this, I first need access to the DOM tree itself. Since this doesn't fit into the servlet approach (you probably wouldn't have a servlet building the search phrases, you'd have some standalone class), I need a new class, ItemSearcher (Example 6-5). This class takes any number of item files to search through as arguments.

Example 6-5. The ItemSearcher class

package javaxml2;

import java.io.File;

// DOM imports
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;

// Vendor parser
import org.apache.xerces.parsers.DOMParser;

public class ItemSearcher {

    private String docNS = "http://www.oreilly.com/javaxml2";

    public void search(String filename) throws Exception {
        // Parse into a DOM tree
        File file = new File(filename);
        DOMParser parser = new DOMParser( );
        parser.parse(file.toURL().toString( ));
        Document doc = parser.getDocument( );

        // Get node to start iterating with
        Element root = doc.getDocumentElement( );
        NodeList descriptionElements = 
            root.getElementsByTagNameNS(docNS, "description");
        Element description = (Element)descriptionElements.item(0);

        // Get a NodeIterator
        NodeIterator i = ((DocumentTraversal)doc)
            .createNodeIterator(description, NodeFilter.SHOW_ALL, null, true);

        // Print every element and text node the iterator returns
        Node n;
        while ((n = i.nextNode( )) != null) {
            if (n.getNodeType( ) == Node.ELEMENT_NODE) {
                System.out.println("Encountered Element: '" + 
                    n.getNodeName( ) + "'");
            } else if (n.getNodeType( ) == Node.TEXT_NODE) {
                System.out.println("Encountered Text: '" + 
                    n.getNodeValue( ) + "'");
            }
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("No item files to search through specified.");
            return;
        }

        try {
            ItemSearcher searcher = new ItemSearcher( );
            for (int i=0; i<args.length; i++) {
                System.out.println("Processing file: " + args[i]);
                searcher.search(args[i]);
            }
        } catch (Exception e) {
            e.printStackTrace( );
        }
    }
}

As you can see, I've created a NodeIterator, and supplied it the description element to start with for iteration. The constant value passed as the whatToShow argument, NodeFilter.SHOW_ALL, instructs the iterator to show all nodes. You could just as easily provide values like NodeFilter.SHOW_ELEMENT and NodeFilter.SHOW_TEXT, which would show only elements or textual nodes, respectively. I haven't yet provided a NodeFilter implementation (I'll get to that next), and I allowed for entity reference expansion. What is nice about all this is that the iterator, once created, doesn't have just the child nodes of description. Instead, it actually has all nodes under description, even when nested multiple levels deep. This is extremely handy for dealing with unknown XML structure!
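To see what the whatToShow constants do to the flattened view, here is a small self-contained sketch. It is not one of the chapter's classes: WhatToShowDemo and its inline markup are hypothetical, and it assumes a JAXP parser whose Document implements DocumentTraversal (it throws a ClassCastException otherwise).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

public class WhatToShowDemo {

    // Hypothetical sample markup, not one of the book's item files
    static final String XML =
        "<description>A <b>beautiful</b> guitar with <i>great <b>action</b></i>.</description>";

    /** Lists element names or text values seen by an unfiltered iterator. */
    static List<String> visit(int whatToShow) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        NodeIterator it = ((DocumentTraversal) doc).createNodeIterator(
            doc.getDocumentElement(), whatToShow, null, true);
        List<String> seen = new ArrayList<>();
        for (Node n = it.nextNode(); n != null; n = it.nextNode()) {
            seen.add(n.getNodeType() == Node.TEXT_NODE ? n.getNodeValue() : n.getNodeName());
        }
        it.detach();
        return seen;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(visit(NodeFilter.SHOW_ELEMENT));
        System.out.println(visit(NodeFilter.SHOW_TEXT));
    }
}
```

With this markup, the first call lists description, b, i, and b in document order: the starting node is included, and the nested b is reached even though it is two levels down.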

At this point, you still have all the nodes, which is not what you want. I added some code (the last while loop) to show you how to print out the element and text node results. You can run the code as is, but it's not going to help much. Instead, the code needs to provide a filter, so it only picks up elements with the formatting desired: the text within an i or b block. You can provide this customized behavior by supplying a custom implementation of the NodeFilter interface, which defines only a single method:

public short acceptNode(Node n);

This method should return NodeFilter.FILTER_SKIP, NodeFilter.FILTER_REJECT, or NodeFilter.FILTER_ACCEPT. The first skips the examined node, but continues to iterate over its children; the second rejects the examined node and its children (only applicable in TreeWalker); and the third accepts and passes on the examined node. It behaves a lot like SAX, in that you can intercept nodes as they are being iterated and decide if they should be passed on to the calling method. Add the following nonpublic class to the ItemSearcher.java source file:

class FormattingNodeFilter implements NodeFilter {

    public short acceptNode(Node n) {
        // Only text nodes are candidates for search phrases
        if (n.getNodeType( ) == Node.TEXT_NODE) {
            Node parent = n.getParentNode( );
            if ((parent.getNodeName( ).equalsIgnoreCase("b")) ||
                (parent.getNodeName( ).equalsIgnoreCase("i"))) {
                return FILTER_ACCEPT;
            }
        }

        // If we got here, not interested
        return FILTER_SKIP;
    }
}

This is just plain old DOM code, and shouldn't pose any difficulty to you. First, the code only wants text nodes; the text of the formatted elements is desired, not the elements themselves. Next, the parent is determined, and since it's safe to assume that Text nodes have Element node parents, the code immediately invokes getNodeName( ). If the element name is either "b" or "i", the code has found search text, and returns FILTER_ACCEPT. Otherwise, FILTER_SKIP is returned.

All that's left now is a change to the iterator creation call instructing it to use the new filter implementation, and to the output, both in the existing search( ) method of the ItemSearcher class:

// Get a NodeIterator
NodeIterator i = ((DocumentTraversal)doc)
    .createNodeIterator(description, NodeFilter.SHOW_ALL, 
        new FormattingNodeFilter( ), true);

Node n;
while ((n = i.nextNode( )) != null) {
    System.out.println("Search phrase found: '" + n.getNodeValue( ) + "'");
}

NOTE: Some astute readers will wonder what happens when a NodeFilter implementation conflicts with the constant supplied to the createNodeIterator( ) method (in this case that constant is NodeFilter.SHOW_ALL). Actually, the whatToShow constant is applied first, and only the nodes that pass it are handed to the filter implementation. If I had supplied the constant NodeFilter.SHOW_ELEMENT, I would not have gotten any search phrases, because my filter would not have received any Text nodes to examine; just Element nodes. Be careful to use the two together in a way that makes sense. In the example, I could have safely used NodeFilter.SHOW_TEXT also.
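The interaction between the constant and the filter is easy to verify for yourself. The sketch below is hypothetical (MaskThenFilterDemo and its inline markup are not part of the auction example) and assumes a JAXP parser with traversal support; the filter applies the same test as FormattingNodeFilter, written as a lambda.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

public class MaskThenFilterDemo {

    // Hypothetical sample markup with one bold and one italic phrase
    static final String XML =
        "<description>Top of <b>Sitka</b> spruce, <i>huge sound</i>.</description>";

    // Accepts only text whose parent element is b or i
    static final NodeFilter FORMATTED_TEXT = n -> {
        if (n.getNodeType() == Node.TEXT_NODE) {
            String parent = n.getParentNode().getNodeName();
            if (parent.equalsIgnoreCase("b") || parent.equalsIgnoreCase("i")) {
                return NodeFilter.FILTER_ACCEPT;
            }
        }
        return NodeFilter.FILTER_SKIP;
    };

    /** Counts the nodes that survive both the whatToShow mask and the filter. */
    static int countMatches(int whatToShow) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        NodeIterator it = ((DocumentTraversal) doc).createNodeIterator(
            doc.getDocumentElement(), whatToShow, FORMATTED_TEXT, true);
        int count = 0;
        while (it.nextNode() != null) {
            count++;
        }
        it.detach();
        return count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("SHOW_ALL:     " + countMatches(NodeFilter.SHOW_ALL));
        System.out.println("SHOW_TEXT:    " + countMatches(NodeFilter.SHOW_TEXT));
        System.out.println("SHOW_ELEMENT: " + countMatches(NodeFilter.SHOW_ELEMENT));
    }
}
```

SHOW_ALL and SHOW_TEXT both find the two formatted phrases; SHOW_ELEMENT finds none, because the Text nodes never reach the filter.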

Now, the class is useful and ready to run. Executing it on the item-bourgOM.xml file I explained in the first section, I get the following results:

bmclaugh@GANDALF ~/javaxml2/build
$ java javaxml2.ItemSearcher ../ch06/xml/item-bourgOM.xml
Processing file: ../ch06/xml/item-bourgOM.xml
Search phrase found: 'beautiful'
Search phrase found: 'Sitka-topped'
Search phrase found: 'Indian Rosewood'
Search phrase found: 'huge sound'
Search phrase found: 'great action'
Search phrase found: 'fossilized ivory'
Search phrase found: 'ebony'
Search phrase found: 'great guitar'

This is perfect: all of the bolded and italicized phrases are now ready to be added to a search facility. (Sorry; you'll have to write that yourself!)
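I mentioned that FILTER_REJECT only makes a difference with a TreeWalker. Here is a sketch of that difference, using the same idea of collecting bold text but deciding what to do with links (the a element). Everything here is hypothetical rather than part of ItemSearcher: the class name, the inline markup, and the assumption that your JAXP parser's Document implements DocumentTraversal.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

public class TreeWalkerRejectDemo {

    // Hypothetical sample markup: one bold phrase inside a link, one outside
    static final String XML =
        "<description><a href='#'>see <b>link text</b></a> and <b>bold text</b></description>";

    /** Collects bold text, applying the given verdict to every a element. */
    static List<String> boldText(short verdictForLinks) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        NodeFilter filter = n -> {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                // Elements are never reported; the verdict only controls descent
                return n.getNodeName().equals("a") ? verdictForLinks : NodeFilter.FILTER_SKIP;
            }
            return n.getParentNode().getNodeName().equals("b")
                ? NodeFilter.FILTER_ACCEPT : NodeFilter.FILTER_SKIP;
        };
        TreeWalker walker = ((DocumentTraversal) doc).createTreeWalker(
            doc.getDocumentElement(),
            NodeFilter.SHOW_ELEMENT | NodeFilter.SHOW_TEXT, filter, true);
        List<String> found = new ArrayList<>();
        for (Node n = walker.nextNode(); n != null; n = walker.nextNode()) {
            found.add(n.getNodeValue());
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Skipping links:  " + boldText(NodeFilter.FILTER_SKIP));
        System.out.println("Rejecting links: " + boldText(NodeFilter.FILTER_REJECT));
    }
}
```

Skipping the link still walks into it and finds both phrases; rejecting it prunes the whole subtree, so only the phrase outside the link is found.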

6.3.3. Range

The DOM Level 2 Range module is one of the least commonly used modules, probably due to a lack of understanding of DOM Range rather than any lack of usefulness. This module provides a way to deal with a set of content within a document. Once you've defined that range of content, you can insert into it, copy it, delete parts of it, and manipulate it in various ways. The most important thing to start with is realizing that "range" in this sense refers to a number of pieces of a DOM tree grouped together. It does not refer to a set of allowed values, where a high and low or start and end are defined. Therefore, DOM Range has nothing at all to do with validation of data values. Get that, and you're already ahead of the pack.

Like traversal, working with Range involves a new DOM package: org.w3c.dom.ranges. There are actually only two interfaces and one exception within this package, so it won't take you long to get your bearings. First is the analog to Document (and DocumentTraversal): that's org.w3c.dom.ranges.DocumentRange. As with the DocumentTraversal interface, Xerces' Document implementation class implements DocumentRange. And also like DocumentTraversal, it has very few interesting methods; in fact, only one:

public Range createRange( );

All other range operations operate upon the Range class (rather, an implementation of the interface; but you get the idea). Once you've got an instance of the Range interface, you can set the starting and ending points, and edit away. As an example, let's go back to the UpdateItemServlet. I mentioned that it's a bit of a hassle to try and remove all the children of the description element and then set the new description text; that's because there is no way to tell if a single Text node is within the description, or if many elements and text nodes, as well as nested nodes, exist within a description that is primarily HTML. I showed you how to simply remove the old description element and create a new one. However, DOM Range makes this unnecessary. Take a look at this modification to the doPost( ) method of that servlet:

            // Load document
            try {
                DOMParser parser = new DOMParser( );
                parser.parse(xmlFile.toURL().toString( ));
                doc = parser.getDocument( );

                Element root = doc.getDocumentElement( );
                // Name of item
                NodeList nameElements = 
                    root.getElementsByTagNameNS(docNS, "name");
                Element nameElement = (Element)nameElements.item(0);
                Text nameText = (Text)nameElement.getFirstChild( );
                // Description of item
                NodeList descriptionElements = 
                    root.getElementsByTagNameNS(docNS, "description");
                Element descriptionElement = (Element)descriptionElements.item(0);

                // Remove and recreate description
                Range range = ((DocumentRange)doc).createRange( );
                range.setStartBefore(descriptionElement.getFirstChild( ));
                range.setEndAfter(descriptionElement.getLastChild( ));
                range.deleteContents( );
                Text descriptionText = doc.createTextNode(description);
                descriptionElement.appendChild(descriptionText);

                range.detach( );
            } catch (SAXException e) {
                // Print error
                PrintWriter out = res.getWriter( );
                out.println("<HTML><BODY>Error in reading XML: " +
                    e.getMessage( ) + ".</BODY></HTML>");
                out.close( ); 
            }

To remove all the content, I first create a new Range, using the DocumentRange cast. You'll need to add import statements for the DocumentRange and Range interfaces to your servlet, too (they are both in the org.w3c.dom.ranges package).

NOTE: In the first part of the DOM Level 2 Modules section, I showed you how to check which modules a parser implementation supports. I realize that Xerces reported that it did not support Range. However, running this code with Xerces 1.3.0, 1.3.1, and 1.4 all worked without a hitch. Strange, isn't it?

Once the range is ready, set the starting and ending points. Since I want all content within the description element, I start before the first child of that Element node (using setStartBefore( )), and end after its last child (using setEndAfter( )). There are other, similar methods for this task, setStartAfter( ) and setEndBefore( ). Once that's done, it's simple to call deleteContents( ). Just like that, not a bit of content is left. Then the servlet creates the new textual description and appends it. Finally, I let the JVM know that it can release any resources associated with the Range by calling detach( ). While this step is commonly overlooked, it can really help with lengthy bits of code that use the extra resources.
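If you want to try these steps outside the servlet, the following standalone sketch runs the same sequence against an inline document. The class name ReplaceDescriptionDemo and the sample markup are hypothetical, and the code assumes a JAXP parser whose Document implements DocumentRange.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ranges.DocumentRange;
import org.w3c.dom.ranges.Range;
import org.xml.sax.InputSource;

public class ReplaceDescriptionDemo {

    // Hypothetical sample item with mixed text and markup in the description
    static final String XML = "<item><name>Sample guitar</name>"
        + "<description>A <b>beautiful</b> guitar with <i>great action</i>.</description></item>";

    /** Empties the description with a Range, then appends the new text. */
    static Element replaceDescription(String newText) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        Element description = (Element) doc.getElementsByTagName("description").item(0);

        Range range = ((DocumentRange) doc).createRange();
        // Bracket everything from the first child through the last child
        range.setStartBefore(description.getFirstChild());
        range.setEndAfter(description.getLastChild());
        range.deleteContents();
        description.appendChild(doc.createTextNode(newText));
        // Release the range once editing is finished
        range.detach();
        return description;
    }

    public static void main(String[] args) throws Exception {
        Element description = replaceDescription("A plain new description.");
        System.out.println(description.getTextContent());
    }
}
```

Note that getFirstChild( ) returns null for an empty description, so code that may see empty elements needs a guard; the Range interface's selectNodeContents( ) method is one way to avoid the problem.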

Another option is to use extractContents( ) instead of deleteContents( ). This method removes the content, then returns the content that has been removed. You could insert this as an archived element, for example:

// Remove and recreate description
Range range = ((DocumentRange)doc).createRange( );
range.setStartBefore(descriptionElement.getFirstChild( ));
range.setEndAfter(descriptionElement.getLastChild( ));
Node oldContents = range.extractContents( );
Text descriptionText = doc.createTextNode(description);
descriptionElement.appendChild(descriptionText);

// Set this as content to some other, archival, element
archivalElement.appendChild(oldContents);

Don't try this in your servlet; there is no archivalElement in this code, and it is just for demonstration purposes. However, it should be starting to sink in that the DOM Level 2 Range module can really help you in editing documents' contents. It also provides yet another way to get a handle on content when you aren't sure of the structure of that content ahead of time.

There's a lot more to ranges in DOM; check this out on your own, along with all of the DOM modules covered in this chapter. However, you should now have enough of an understanding of the basics to get you going. Most importantly, realize that at any point in an active Range instance, you can simply invoke range.insertNode(Node newNode) and add new content, wherever you are in a document! It is this robust editing quality of ranges that makes them so attractive. The next time you need to delete, copy, extract, or add content to a structure that you know little about, think about using ranges. The specification gives you information on all this and more, and is located online at http://www.w3.org/TR/DOM-Level-2-Traversal-Range/.
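As a rough illustration of extractContents( ) and insertNode( ) working together, consider the sketch below. It is hypothetical throughout: the class name, the archive element, and the inline markup are all invented for the demonstration, and it assumes a JAXP parser with Range support.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.ranges.DocumentRange;
import org.w3c.dom.ranges.Range;
import org.xml.sax.InputSource;

public class ArchiveAndInsertDemo {

    // Hypothetical markup; the archive element exists only for this sketch
    static final String XML =
        "<item><description>Old <b>text</b></description><archive/></item>";

    /** Moves the old description content to archive, then inserts new text. */
    static Document run(String newText) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        Element description = (Element) doc.getElementsByTagName("description").item(0);
        Element archive = (Element) doc.getElementsByTagName("archive").item(0);

        Range range = ((DocumentRange) doc).createRange();
        // Select everything inside description, whatever its structure
        range.selectNodeContents(description);
        DocumentFragment oldContents = range.extractContents();
        archive.appendChild(oldContents);

        // The range is now collapsed inside the empty description
        range.insertNode(doc.createTextNode(newText));
        range.detach();
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = run("New description");
        System.out.println(doc.getElementsByTagName("description").item(0).getTextContent());
        System.out.println(doc.getElementsByTagName("archive").item(0).getTextContent());
    }
}
```

The old content, markup included, ends up under archive, while description holds only the new text.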

6.3.4. Events, Views, and Style

Aside from the HTML module, which I'll talk about next, there are three other DOM Level 2 modules: Events, Views, and Style. I'm not going to cover these three in depth in this book, largely because I believe that they are more useful for client programming. So far, I've focused on server-side programming, and I'm going to keep in that vein throughout the rest of the book. These three modules are most often used on client software such as IDEs, web pages, and the like. Still, I want to briefly touch on each so you'll still be on top of the DOM heap at the next alpha-geek soirée.

Events

The Events module provides just what you are probably expecting: a means of "listening" to a DOM document. The relevant classes are in the org.w3c.dom.events package, and the class that gets things going is DocumentEvent. No surprise here; compliant parsers (like Xerces) implement this interface in the same class that implements org.w3c.dom.Document. The interface defines only one method:

public Event createEvent(String eventType);

The string passed in is the type of event; valid values in the DOM Level 2 specification include "UIEvents", "MutationEvents", and "MouseEvents". Each of these has a corresponding interface: UIEvent, MutationEvent, and MouseEvent. You'll note, in looking at the Xerces Javadoc, that they provide only the MutationEvent interface, which is the only event type Xerces supports. When an event is "fired" off, it can be handled (or "caught") by an EventListener.

This is where the DOM core support comes in; a parser supporting DOM events should have its org.w3c.dom.Node implementation classes also implement the org.w3c.dom.events.EventTarget interface. So every node can be the target of an event. This means that you have the following method available on those nodes:

public void addEventListener(String type, EventListener listener, 
                             boolean capture);

Here's the process. You create a new EventListener (which is a custom class you would write) implementation. You need to implement only a single method:

public void handleEvent(Event event);

Register that listener on any and all nodes you want to work with. The code in handleEvent( ) typically does some useful task, like emailing users that their information has been changed (in some XML file), revalidating the XML (think XML editors), or asking users if they are sure they want to perform the action.

At the same time, you'll want your code to trigger a new Event on certain actions, like the user clicking on a node in an IDE and entering new text, or deleting a selected element. When the Event is triggered, it is passed to the available EventListener instances, starting with the active node and moving up. This is where your listener's code executes, if the event types are the same. Additionally, you can have the event stop propagating at that point (once you've handled it), or bubble up the event chain and possibly be handled by other registered listeners.
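Here is a sketch of the listener side of that process, relying on the mutation events a DOM implementation fires on its own when the tree changes. The class names are hypothetical, and the code assumes a JAXP parser whose nodes implement EventTarget and which fires the DOM Level 2 "DOMNodeInserted" mutation event; it throws an exception if the first assumption does not hold.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.events.Event;
import org.w3c.dom.events.EventListener;
import org.w3c.dom.events.EventTarget;

public class MutationListenerDemo {

    /** Records the type and target of every event it is handed. */
    static class RecordingListener implements EventListener {
        final List<String> seen = new ArrayList<>();

        public void handleEvent(Event event) {
            Node target = (Node) event.getTarget();
            seen.add(event.getType() + " on " + target.getNodeName());
        }
    }

    static List<String> run() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element root = doc.createElement("item");
        doc.appendChild(root);

        if (!(root instanceof EventTarget)) {
            throw new UnsupportedOperationException(
                "This DOM implementation does not expose EventTarget");
        }

        // Listen on the root, in the bubbling phase (capture = false)
        RecordingListener listener = new RecordingListener();
        ((EventTarget) root).addEventListener("DOMNodeInserted", listener, false);

        // The event targets the new child and bubbles up to the root
        root.appendChild(doc.createElement("description"));
        return listener.seen;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

Because the listener is registered on the root rather than on the new node, this also shows the "moving up" behavior: the event is targeted at description but handled one level higher.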

So there you have it; events in only a page! And you thought specifications were hard to read. Seriously, this is some useful stuff, and if you are working with client-side code, or software that will be deployed standalone on users' desktops (like that XML editor I keep talking about), this should be a part of your DOM toolkit. Check out the full specification online at http://www.w3.org/TR/DOM-Level-2-Events/.

Views

Next on the list is DOM Level 2 Views. The reason I don't cover views in much detail is that, really, there is very little to be said. From every reading I can make of the (one-page!) specification, it's simply a basis for future work, perhaps in vertical markets. The specification defines only two interfaces, both in the org.w3c.dom.views package. Here's the first:

package org.w3c.dom.views;

public interface AbstractView {
    public DocumentView getDocument( );
}

And here's the second:

package org.w3c.dom.views;

public interface DocumentView {
    public AbstractView getDefaultView( );
}

Seems a bit cyclical, doesn't it? A single source document (a DOM tree) can have multiple views associated with it. In this case, view refers to a presentation, like a styled document (after XSL or CSS has been applied), or perhaps a version with Shockwave and one without. By implementing the AbstractView interface, you can define your own customized versions of displaying a DOM tree. For example, consider this example subinterface:

package javaxml2;

import org.w3c.dom.views.AbstractView;

// An interface extends another interface; it does not implement it
public interface StyledView extends AbstractView {

    public void setStylesheet(String stylesheetURI);

    public String getStylesheetURI( );    
}

I've left out any implementing class, but you can see how this could be used to provide stylized views of a DOM tree. Additionally, a compliant parser implementation would have the org.w3c.dom.Document implementation implement DocumentView, which allows you to query a document for its default view. It's expected that in a later version of the specification you will be able to register multiple views for a document, and more closely tie a view or views to a document.

Look for this to be fleshed out more as browsers like Netscape, Mozilla, and Internet Explorer provide these sorts of views of XML. Additionally, you can read the short specification and know as much as I do by checking it out online at http://www.w3.org/TR/DOM-Level-2-Views/.

6.3.5. HTML

For HTML, DOM provides a set of interfaces that model the various HTML elements. For example, you can use the HTMLDocument class, the HTMLAnchorElement, and the HTMLSelectElement (all in the org.w3c.dom.html package) to represent their analogs in HTML (<HTML>, <A>, and <SELECT> in this case). All of these provide convenience methods like setTitle( ) (on HTMLDocument), setHref( ) (on HTMLAnchorElement), and getOptions( ) (on HTMLSelectElement). All of these extend core DOM structures like Document and Element, and so can be used as any other DOM Node could.

However, it turns out that the HTML bindings are rarely used (at least directly). It's not because they aren't useful; instead, many tools have already been written to provide this sort of access through even more user-friendly tools. XMLC, a project within the Enhydra application server framework, is one such example (located online at http://xmlc.enhydra.org), and Cocoon, covered in Chapter 10, "Web Publishing Frameworks", is another. These allow developers to work with HTML and web pages in a way that does not necessarily require even basic DOM knowledge, making it more accessible to web designers and newer Java developers. The end result of using these tools is that the HTML DOM bindings are rarely needed. But if you know about them, you can use them if you need to. Additionally, you can use standard DOM functionality on well-formed HTML documents (XHTML), treating elements as Element nodes and attributes as Attr nodes. Even without the HTML bindings, you can use DOM to work with HTML. Piece of cake.
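To make that last point concrete, here is a short sketch that pulls link targets out of a well-formed XHTML fragment using nothing but core DOM calls. The class name and the inline markup are hypothetical, and a JAXP parser is assumed; no org.w3c.dom.html interfaces are involved.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XhtmlLinksDemo {

    // Hypothetical well-formed XHTML; no DOCTYPE, so nothing is fetched remotely
    static final String XHTML = "<html><head><title>Items</title></head><body>"
        + "<p><a href=\"item-bourgOM.xml\">First item</a> and "
        + "<a href=\"other.xml\">another</a></p></body></html>";

    /** Returns the href of every a element, treating HTML as ordinary XML. */
    static List<String> hrefs() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(XHTML)));
        NodeList anchors = doc.getElementsByTagName("a");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            result.add(anchor.getAttribute("href"));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hrefs());
    }
}
```

With the HTML bindings, the equivalent call would be getHref( ) on an HTMLAnchorElement; here getAttribute("href") on a plain Element does the same job.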

Copyright © 2002 O'Reilly & Associates. All rights reserved.