Gotcha! (Java & XML, 2nd Edition)

5.4. Gotcha!

As in previous chapters, I want to revisit some of the common pitfalls for new XML Java developers. In this chapter, I have focused on the Document Object Model, and this section continues that emphasis. Although some of the points made here are more informational than directly relevant to your programming, they can be helpful in making design decisions about when to use DOM, and instrumental in understanding what is going on under the hood of your XML applications.

5.4.1. Memory, Performance, and Deferred DOMs

Earlier, I described the reasons to use DOM or SAX. Although I emphasized that using the DOM requires that the entire XML document be read into memory and stored in a tree structure, enough cannot be said on the subject. All too common is the scenario where a developer loads up his extensive collection of complex XML documents into an XSLT processor and begins a series of offline transformations, leaving the process running while he grabs a bite to eat. Upon returning, he finds that his Windows machine is showing the dreaded "blue screen of death" and his Linux box is screaming about memory problems. For this developer and the hundreds like him, beware the DOM for excessively large data!

Using the DOM requires an amount of memory proportional to the size and complexity of an XML document. Still, you should dig a bit further into your parser's documentation. Often, today's parsers contain a feature modeled on what is typically called a deferred DOM. A deferred DOM tries to lower the memory cost of using DOM by not reading and allocating all information needed by a DOM node until that node is requested. Until that time, nodes that exist but are not in use are simply nulled out. This reduces the memory overhead for large documents when only a specific portion of the document must be processed. Realize, though, that this decrease in memory use comes with an increase in processing. Since nodes are not in memory, and must be filled with data when requested, there is generally more lag time when a node not previously accessed is requested. It's a tradeoff, but a deferred DOM can often help save the day when dealing with large documents.
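As a rough illustration of toggling this behavior, here is a minimal sketch. It assumes a Xerces-derived parser reached through the JAXP DocumentBuilderFactory, and it uses the Xerces feature URI http://apache.org/xml/features/dom/defer-node-expansion. The class name DeferredDomSketch and its parse helper are hypothetical names used only for illustration. Other parsers may not recognize the feature at all, so the sketch falls back to the parser's defaults in that case.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DeferredDomSketch {

    // Xerces feature URI that controls deferred node expansion
    static final String DEFER_FEATURE =
        "http://apache.org/xml/features/dom/defer-node-expansion";

    // Hypothetical helper: parse XML text, asking for a deferred DOM or not
    static Document parse(String xml, boolean deferred) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        try {
            factory.setFeature(DEFER_FEATURE, deferred);
        } catch (ParserConfigurationException e) {
            // Assumption: a parser that is not Xerces-derived rejects the
            // feature, so carry on with whatever its default behavior is
            System.err.println("Deferred DOM not supported: " + e.getMessage());
        }
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<catalog><book>Java and XML</book></catalog>";
        Document eager = parse(xml, false);
        Document lazy = parse(xml, true);

        // The implementation class may differ between the two modes;
        // the data seen through the DOM API should not
        System.out.println(eager.getClass().getName());
        System.out.println(lazy.getClass().getName());
        System.out.println(lazy.getDocumentElement().getNodeName());
    }
}
```

Either way, the tree looks identical through the DOM API. Only the timing of node creation and the memory footprint should change, so measure with your own documents before depending on the setting.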

5.4.2. Polymorphism and the Node Interface

Previously in this chapter I stressed the tree model that DOM is built upon. I also told you that the key to this was a common interface, org.w3c.dom.Node. This interface provides common functionality for all DOM types, but sometimes it provides more than you need. For example, it defines a method called getNodeValue( ), which returns a String. Sounds like a good idea, right? Without having to cast the Node to a specific type, you can quickly get its value. However, things get a little sticky when you consider types like Element. Remember that an Element has no textual content of its own, but instead has children of type Text. So an Element in DOM has no value that has any meaning; the DOM specification defines the node value of an Element as null, and that is what getNodeValue( ) returns.

The same situation applies to other methods on the Node interface, like getNodeName( ). For Text nodes, you get the fixed string #text, which doesn't help too much. So what exactly is the gotcha here? You simply need to be careful when working with different DOM types through the Node interface. You may get some unexpected results along with the convenience of the common interface.
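To make this concrete, here is a small sketch that calls getNodeName() and getNodeValue() on an Element and on its Text child. The class NodeInterfaceSketch and the helpers describe and textOf are hypothetical names for illustration, and parsing goes through the JAXP DocumentBuilder bundled with the JDK. The values noted in the comments are the ones the DOM specification lists for these node types.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeInterfaceSketch {

    // Report what the generic Node methods return for any node type
    static String describe(Node node) {
        return node.getNodeName() + " = " + node.getNodeValue();
    }

    // Collect the text of an element's direct Text children, checking
    // the node type rather than trusting getNodeValue() on the Element
    static String textOf(Element element) {
        StringBuilder text = new StringBuilder();
        for (Node child = element.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            }
        }
        return text.toString();
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Element greeting = parse("<greeting>Hello</greeting>").getDocumentElement();

        System.out.println(describe(greeting));                  // greeting = null
        System.out.println(describe(greeting.getFirstChild()));  // #text = Hello
        System.out.println(textOf(greeting));                    // Hello
    }
}
```

Checking getNodeType() before relying on a name or value is one simple way to keep the convenience of the common interface without being surprised by it.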

5.4.3. DOM Parsers Throwing SAX Exceptions

In this chapter's example of using DOM, I did not explicitly list the exceptions that could result from a document parse; instead a higher-level exception was caught. This was because, as I mentioned, the process of generating a DOM tree is left up to the parser implementation, and is not always the same. However, it is typically good practice to catch the specific exceptions that can occur and react to them differently, as the type of exception gives information about the problem that occurred. Rewriting the parser invocation in the SerializerTest class this way brings a surprising facet of the process to the surface. For Apache Xerces this could be done as follows:

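    // Imports needed at the top of SerializerTest.java
    import java.io.File;
    import java.io.IOException;
    import org.apache.xerces.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.xml.sax.SAXException;

    public void test(String xmlDocument, String outputFilename)
        throws Exception {

        // Declare doc outside the try block so it stays in scope below
        Document doc = null;

        try {
            DOMParser parser = new DOMParser( );
            parser.parse(xmlDocument);
            doc = parser.getDocument( );
        } catch (IOException e) {
            System.out.println("Error reading URI: " + e.getMessage( ));
            return;  // nothing to serialize
        } catch (SAXException e) {
            System.out.println("Error in parsing: " + e.getMessage( ));
            return;  // nothing to serialize
        }

        // Serialize
        DOMSerializer serializer = new DOMSerializer( );
        serializer.serialize(doc, new File(outputFilename));
    }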

The IOException seen here should not come as a surprise, as it signifies an error in locating the specified file, just as it did in the earlier SAX examples. Something else from the SAX section might make you think something was amiss; did you notice the SAXException that can be thrown? The DOM parser throws a SAX exception? Surely I have imported the wrong set of classes! Not so; these are the right classes. Remember that it would be possible to build a tree structure of the data in an XML document yourself, using SAX, but the DOM provides an alternative. However, this does not preclude SAX from being used in that alternative! In fact, SAX provides a lightweight, fast way to parse a document; in this case, it just happens that as the document is parsed, its data is inserted into a DOM tree. Because no standard for DOM creation exists, this is acceptable and not even uncommon. So don't be surprised or taken aback when you find yourself importing and catching org.xml.sax.SAXException in your DOM applications.
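The same thing shows up outside of Xerces-specific classes. The sketch below is only an illustration: it swaps in the JAXP DocumentBuilder for the DOMParser used above, and the class DomSaxExceptionSketch and its tryParse helper are hypothetical names. It feeds one well-formed and one malformed document to a DOM parser and reports which exception, if any, comes back.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DomSaxExceptionSketch {

    // Hypothetical helper: build a DOM tree from XML text and describe
    // the outcome, handling each exception type separately
    static String tryParse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return "parsed: " + doc.getDocumentElement().getNodeName();
        } catch (SAXParseException e) {
            // Most specific SAX exception first; it carries a location
            return "SAX parse error at line " + e.getLineNumber();
        } catch (SAXException e) {
            return "SAX error: " + e.getMessage();
        } catch (IOException e) {
            return "I/O error: " + e.getMessage();
        } catch (ParserConfigurationException e) {
            return "configuration error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Well-formed document: a DOM tree is produced
        System.out.println(tryParse("<book><title>DOM</title></book>"));
        // Mismatched end tag: the DOM parser reports it as a SAX exception
        System.out.println(tryParse("<book><title>DOM</book>"));
    }
}
```

Catching SAXParseException ahead of SAXException matters because it is the subclass, and catching it separately gives you line information to report to the user.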