Serialization (Java & XML, 2nd Edition)

5.2.3. DOMSerializer

I've been throwing the term serialization around quite a bit, and should probably make sure you know what I mean. When I say serialization, I simply mean outputting the XML. The output target could be a file (using a Java File), an OutputStream, or a Writer. There are certainly more output forms available in Java, but these three cover most of the bases (in fact, the latter two do, as a File can be easily converted to a Writer, but accepting a File is a nice convenience feature). In this case, the serialization taking place is in an XML format; the DOM tree is converted back to a well-formed XML document in a textual format. It's important to note that the XML format is used, as you could just as easily code serializers to write HTML, WML, XHTML, or any other format. In fact, Apache Xerces provides classes for these various formats, and I'll touch on them briefly at the end of this chapter.

5.2.3.1. Getting started

To get you past the preliminaries, Example 5-2 is the skeleton for the DOMSerializer class. It imports all the needed classes to get the code going, and defines the different entry points (for a File, OutputStream, and Writer) to the class. Two of these three methods simply defer to the third (with a little I/O magic). The example also sets up some member variables for the indentation to use, the line separator, and methods to modify those properties.

Example 5-2. The DOMSerializer skeleton

package javaxml2;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DOMSerializer {

    /** Indentation to use */
    private String indent;

    /** Line separator to use */
    private String lineSeparator;

    public DOMSerializer( ) {
        indent = "";
        lineSeparator = "\n";
    }

    public void setIndent(String indent) {
        this.indent = indent;
    }

    public void setLineSeparator(String lineSeparator) {
        this.lineSeparator = lineSeparator;
    }

    public void serialize(Document doc, OutputStream out)
        throws IOException {
        
        Writer writer = new OutputStreamWriter(out);
        serialize(doc, writer);
    }

    public void serialize(Document doc, File file)
        throws IOException {

        Writer writer = new FileWriter(file);
        serialize(doc, writer);
    }

    public void serialize(Document doc, Writer writer)
        throws IOException {

        // Serialize document
    }
}

Save this code in a DOMSerializer.java source file. Whichever entry point a caller uses, the work ends up in the version of the serialize( ) method that takes a Writer. Nice and tidy.
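
One detail worth keeping in mind about the "I/O magic": an OutputStreamWriter created without a charset (and likewise a FileWriter) uses the JVM's default character encoding. The following is a minimal sketch, not part of the DOMSerializer class, of the same funnel-to-a-Writer pattern with the encoding stated explicitly. The class name WriterFunnelSketch and its write( ) methods are hypothetical stand-ins for the serialize( ) overloads, and UTF-8 is an assumed choice.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the serialize( ) overloads; not the book's class.
class WriterFunnelSketch {

    // Stand-in for serialize(Document, Writer): every entry point ends up here.
    static void write(String xml, Writer writer) throws IOException {
        writer.write(xml);
        writer.flush();
    }

    // OutputStream entry point: wrap the stream with an explicit encoding
    // (UTF-8 is an assumption), then defer to the Writer version.
    static void write(String xml, OutputStream out) throws IOException {
        write(xml, new OutputStreamWriter(out, StandardCharsets.UTF_8));
    }

    // Sends a string with a non-ASCII character through the stream and
    // decodes the resulting bytes with the same encoding.
    static String demo() {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            write("<title>Caf\u00e9</title>", out);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the writer and the reader agree on the encoding, the accented character survives the round trip regardless of the platform default.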

5.2.3.2. Launching serialization

With the setup in place for starting serialization, it's time to define the process of working through the DOM tree. One nice facet of DOM already mentioned is that all of the specific DOM structures that represent XML (including the Document object) extend the DOM Node interface. This enables the coding of a single method that handles serialization of all DOM node types. Within that method, you can differentiate between node types, but by accepting a Node as input, it enables a very simple way of handling all DOM types. Additionally, it sets up a methodology that allows for recursion, any programmer's best friend. Add the serializeNode( ) method shown here, as well as the initial invocation of that method in the serialize( ) method (the common code point just discussed):

    public void serialize(Document doc, Writer writer)
        throws IOException {

        // Start serialization recursion with no indenting
        serializeNode(doc, writer, "");
        writer.flush( );
    }
 
    public void serializeNode(Node node, Writer writer, 
                              String indentLevel)
        throws IOException {
    }

Additionally, an indentLevel parameter is put in place; this sets us up for recursion. In other words, the serializeNode( ) method can indicate how much the node being worked with should be indented, and when recursion takes place, can add another level of indentation (using the indent member variable). Starting out (within the serialize( ) method), there is an empty String for indentation; if indent is set to two spaces, for example, the next level is indented two spaces, the level after that four spaces, and so on. (With the constructor's default of an empty indent, no indentation is added at any level.) Of course, as recursive calls unravel, things head back up to no indentation. All that's left now is to handle the various node types.
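
To see the indentation scheme on its own, here is a small sketch, assuming a JAXP parser is available (one ships with the JDK). It is not the serializer itself: the hypothetical outline( ) method only prints element names, but it threads indentLevel and indent through the recursion in the way just described. The class name, the sample document, and the two-space indent are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration of the indentLevel/indent recursion.
class IndentSketch {

    // Appends one line per element, prefixed by the indentation for its depth.
    static void outline(Node node, StringBuilder out,
                        String indentLevel, String indent) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return;
        }
        out.append(indentLevel).append(node.getNodeName()).append('\n');
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            // Each recursive call adds one more unit of indentation
            outline(children.item(i), out, indentLevel + indent, indent);
        }
    }

    static String demo() {
        try {
            String xml = "<book><chapter><title/></chapter><chapter/></book>";
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            // Start with no indentation, as serialize( ) does
            outline(doc.getDocumentElement(), out, "", "  ");
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

The root element prints flush left, each chapter two spaces in, and the title four spaces in; as the recursion unwinds, the second chapter is back at two spaces.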

5.2.3.3. Working with nodes

Once within the serializeNode( ) method, the first task is to determine what type of node has been passed in. Although you could approach this with a Java methodology, using the instanceof keyword and Java reflection, the DOM language bindings for Java make this task much simpler. The Node interface defines a helper method, getNodeType( ), which returns an integer value. This value can be compared against a set of constants (also defined within the Node interface), and the type of Node being examined can be quickly and easily determined. This also fits very naturally into the Java switch construct, which can be used to break up serialization into logical sections. The code here covers almost all DOM node types; although there are some additional node types defined (see Figure 5-2), these are the most common, and the concepts here can be applied to the less common node types as well:

    public void serializeNode(Node node, Writer writer, 
                              String indentLevel)
        throws IOException {

        // Determine action based on node type
        switch (node.getNodeType( )) {
            case Node.DOCUMENT_NODE:
                break;
            
            case Node.ELEMENT_NODE:
                break;
            
            case Node.TEXT_NODE:
                break;

            case Node.CDATA_SECTION_NODE:
                break;

            case Node.COMMENT_NODE:
                break;
            
            case Node.PROCESSING_INSTRUCTION_NODE:
                break;
            
            case Node.ENTITY_REFERENCE_NODE:
                break;
                
            case Node.DOCUMENT_TYPE_NODE: 
                break;                
        }
    }

This code is fairly useless; however, it helps to see all of the DOM node types laid out here in a line, rather than mixed in with all of the code needed to perform actual serialization. I want to get to that now, though, starting with the first node passed into this method, an instance of the Document interface.
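
Before filling in the cases, the getNodeType( ) dispatch can be tried in isolation. This is a minimal sketch, assuming a default JAXP parser configuration (which keeps comments and CDATA sections as separate nodes); the class name NodeTypeSketch, the labels it returns, and the sample document are all made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration of switching on Node.getNodeType( ).
class NodeTypeSketch {

    // Maps a node to a short label using the constants on the Node interface.
    static String describe(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:               return "DOCUMENT";
            case Node.ELEMENT_NODE:                return "ELEMENT";
            case Node.TEXT_NODE:                   return "TEXT";
            case Node.CDATA_SECTION_NODE:          return "CDATA";
            case Node.COMMENT_NODE:                return "COMMENT";
            case Node.PROCESSING_INSTRUCTION_NODE: return "PI";
            default:                               return "OTHER";
        }
    }

    // Labels each child of the root element, in document order.
    static String demo() {
        try {
            String xml = "<root>text<!--note--><![CDATA[raw]]>"
                       + "<?target data?><child/></root>";
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < children.getLength(); i++) {
                if (i > 0) {
                    out.append(' ');
                }
                out.append(describe(children.item(i)));
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```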

Because the Document interface is an extension of the Node interface, it can be used interchangeably with the other node types. However, it is a special case, as it contains the root element as well as the XML document's DTD and some other special information not within the XML element hierarchy. As a result, you need to extract the root element and pass that back to the serialization method (starting recursion). Additionally, the XML declaration itself is printed out:

            case Node.DOCUMENT_NODE:	
                writer.write("<?xml version=\"1.0\"?>");
                writer.write(lineSeparator);

                Document doc = (Document)node;
                serializeNode(doc.getDocumentElement( ), writer, "");
                break;

WARNING: DOM Level 2 (as well as SAX 2.0) does not expose the XML declaration. This may not seem like a big deal, until you consider that the encoding of the document is included in this declaration. DOM Level 3 is expected to address this deficiency, and I'll cover that in the next chapter. Be careful not to write DOM applications that depend on this information until this feature is in place.

Since the code needs to access a Document-specific method (as opposed to one defined in the generic Node interface), the Node implementation must be cast to the Document interface. Then invoke the object's getDocumentElement( ) method to obtain the root element of the XML input document, and in turn pass that on to the serializeNode( ) method, starting the recursion and traversal of the DOM tree.

Of course, the most common task in serialization is to take a DOM Element and print out its name, attributes, and value, and then print its children. As you would suspect, all of these can be easily accomplished with DOM method calls. First you need to get the name of the XML element, which is available through the getNodeName( ) method within the Node interface. The code then needs to get the children of the current element and serialize these as well. A Node's children can be accessed through the getChildNodes( ) method, which returns an instance of a DOM NodeList. It is trivial to obtain the length of this list, and then iterate through the children, calling the serialization method on each and continuing the recursion. There's also quite a bit of logic that ensures correct indentation and line feeds; these are really just formatting issues, and I won't spend time on them here. Finally, the closing tag of the element can be output:

            case Node.ELEMENT_NODE:
                String name = node.getNodeName( );
                writer.write(indentLevel + "<" + name);
                writer.write(">");
                
                // recurse on each child
                NodeList children = node.getChildNodes( );
                if (children != null) {
                    if ((children.item(0) != null) &&
                        (children.item(0).getNodeType( ) == 
                        Node.ELEMENT_NODE)) {
                            
                        writer.write(lineSeparator);
                    }
                    for (int i=0; i<children.getLength( ); i++) {  
                        serializeNode(children.item(i), writer,
                            indentLevel + indent);
                    }
                    if ((children.item(0) != null) &&
                        (children.item(children.getLength( )-1)
                                .getNodeType( ) ==
                        Node.ELEMENT_NODE)) {
                     
                        writer.write(indentLevel);       
                    }
                }
                
                writer.write("</" + name + ">");
                writer.write(lineSeparator);
                break;

Of course, astute readers (or DOM experts) will notice that I left out something important: the element's attributes! Attributes are the one pseudo-exception to the strict tree that DOM builds. They should be an exception, though, since an attribute is not really a child of an element; it's (sort of) lateral to it. Basically, the relationship is a little muddy. In any case, the attributes of an element are available through the getAttributes( ) method on the Node interface. This method returns a NamedNodeMap, which can be iterated through by index. Each Node within this map can be polled for its name and value, and suddenly the attributes are handled! Enter the code as shown here to take care of this:

            case Node.ELEMENT_NODE:
                String name = node.getNodeName( );
                writer.write(indentLevel + "<" + name);
                NamedNodeMap attributes = node.getAttributes( );
                for (int i=0; i<attributes.getLength( ); i++) {
                    Node current = attributes.item(i);
                    writer.write(" " + current.getNodeName( ) +
                                 "=\"" + current.getNodeValue( ) +
                                 "\"");
                }
                writer.write(">");
                
                // recurse on each child
                NodeList children = node.getChildNodes( );
                if (children != null) {
                    if ((children.item(0) != null) &&
                        (children.item(0).getNodeType( ) == 
                        Node.ELEMENT_NODE)) {
                            
                        writer.write(lineSeparator);
                    }
                    for (int i=0; i<children.getLength( ); i++) {
                      serializeNode(children.item(i), writer,
                            indentLevel + indent);
                    }
                    if ((children.item(0) != null) &&
                        (children.item(children.getLength( )-1)
                                .getNodeType( ) ==
                        Node.ELEMENT_NODE)) {
                     
                        writer.write(indentLevel);       
                    }
                }
                
                writer.write("</" + name + ">");
                writer.write(lineSeparator);
                break;
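
One thing the listing above does not do is escape attribute values: getNodeValue( ) returns the parsed value, so an ampersand or double quote in it would be written out raw. The sketch below shows one possible way to handle that. The escapeAttribute( ) and startTag( ) helpers and the class name are hypothetical additions rather than part of DOMSerializer, and the set of characters escaped is an assumption about what a simple serializer needs.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration of writing attributes with escaping.
class AttributeSketch {

    // Escapes the characters that cannot appear raw in a quoted attribute value.
    // The ampersand is replaced first so later replacements are not re-escaped.
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace("\"", "&quot;");
    }

    // Builds the start tag of an element, including all of its attributes.
    static String startTag(Element element) {
        StringBuilder out = new StringBuilder("<").append(element.getNodeName());
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node current = attributes.item(i);
            out.append(' ').append(current.getNodeName())
               .append("=\"")
               .append(escapeAttribute(current.getNodeValue()))
               .append('"');
        }
        return out.append('>').toString();
    }

    static String demo() {
        try {
            String xml = "<link title=\"Tom &amp; &quot;Jerry&quot;\"/>";
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return startTag(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note that the DOM does not promise any particular order for the entries in a NamedNodeMap, so a serializer should not rely on attributes coming back in source order.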

Next on the list of node types is Text nodes. Output is quite simple, as you only need to use the now-familiar getNodeValue( ) method of the DOM Node interface to get the textual data and print it out; the same is true for CDATA nodes, except that the data within a CDATA section should be enclosed within the CDATA XML semantics (surrounded by <![CDATA[ and ]]>). You can add the logic within those two cases now:

            case Node.TEXT_NODE:
                writer.write(node.getNodeValue( ));
                break;

            case Node.CDATA_SECTION_NODE:
                writer.write("<![CDATA[" +
                             node.getNodeValue( ) + "]]>");
                break;
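
The difference between the two cases shows up as soon as the content includes markup characters. getNodeValue( ) hands back parsed text, so a Text node holding a less-than sign needs escaping on the way out, while a CDATA section can be written verbatim between its delimiters. The following is a rough sketch of that idea; the escapeText( ) and content( ) helpers and the class name are hypothetical, and the listing above does not perform this escaping.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration of writing Text and CDATA nodes.
class TextSketch {

    // Escapes markup characters in ordinary character data.
    static String escapeText(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
    }

    // Serializes the Text and CDATA children of a node; other types are skipped.
    static String content(Node parent) {
        StringBuilder out = new StringBuilder();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            switch (child.getNodeType()) {
                case Node.TEXT_NODE:
                    out.append(escapeText(child.getNodeValue()));
                    break;
                case Node.CDATA_SECTION_NODE:
                    // CDATA content is written as is, inside its delimiters
                    out.append("<![CDATA[")
                       .append(child.getNodeValue())
                       .append("]]>");
                    break;
                default:
                    break;
            }
        }
        return out.toString();
    }

    static String demo() {
        try {
            String xml = "<p>1 &lt; 2 <![CDATA[& 3 < 4]]></p>";
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return content(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```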

Dealing with comments in DOM is about as simple as it gets. The getNodeValue( ) method returns the text within the <!-- and --> XML constructs. That's really all there is to it; see this code addition:

            case Node.COMMENT_NODE:
                writer.write(indentLevel + "<!-- " +
                             node.getNodeValue( ) + " -->");
                writer.write(lineSeparator);
                break;

Moving on to the next DOM node type: the DOM bindings for Java define an interface to handle processing instructions that are within the input XML document, rather obviously called ProcessingInstruction. This is useful, as these instructions do not follow the same markup model as XML elements and attributes, but are still important for applications to know about. In the table of contents XML document, there aren't any PIs present (although you could easily add some for testing).

The PI node in the DOM is a little bit of a break from what you have seen so far: to fit the syntax into the Node interface model, the getNodeValue( ) method returns all of the data within a PI as one String. This allows quick output of the PI; however, you still need to use getNodeName( ) to get the name (the target) of the PI. If you were writing an application that received PIs from an XML document, you might prefer to use the actual ProcessingInstruction interface; although it exposes the same data, the method names (getTarget( ) and getData( )) are more in line with a PI's format. With this understanding, you can add in the code to print out any PIs in supplied XML documents:

            case Node.PROCESSING_INSTRUCTION_NODE:
                writer.write("<?" + node.getNodeName( ) +
                             " " + node.getNodeValue( ) +
                             "?>");                
                writer.write(lineSeparator);
                break;
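
As a quick check that the two sets of method names really do expose the same data, here is a small sketch using the ProcessingInstruction interface directly. The class name PiSketch, its helper methods, and the sample stylesheet PI are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical illustration of the ProcessingInstruction interface.
class PiSketch {

    static final String XML =
        "<?xml-stylesheet href=\"style.xsl\" type=\"text/xsl\"?><root/>";

    // Returns the first top-level PI in the document, or null if there is none.
    static ProcessingInstruction firstPi(Document doc) {
        NodeList children = doc.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                return (ProcessingInstruction) child;
            }
        }
        return null;
    }

    static Document parse() {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Rebuilds the PI from its target and data.
    static String demo() {
        ProcessingInstruction pi = firstPi(parse());
        return "<?" + pi.getTarget() + " " + pi.getData() + "?>";
    }

    // True if the generic Node methods agree with the PI-specific ones.
    static boolean namesAgree() {
        ProcessingInstruction pi = firstPi(parse());
        return pi.getNodeName().equals(pi.getTarget())
            && pi.getNodeValue().equals(pi.getData());
    }

    public static void main(String[] args) {
        System.out.println(demo());
        System.out.println(namesAgree());
    }
}
```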

While the code to deal with PIs is perfectly workable, there is a problem. In the case that handled document nodes, all the serializer did was pull out the document element and recurse. The problem is that this approach ignores any other child nodes of the Document object, such as top-level PIs and any DOCTYPE declarations. Those node types are actually lateral to the document element (root element), and are ignored. Instead of just pulling out the document element, then, the following code serializes all child nodes on the supplied Document object:

            case Node.DOCUMENT_NODE:
                writer.write("<?xml version=\"1.0\"?>");
                writer.write(lineSeparator);

                // recurse on each child
                NodeList nodes = node.getChildNodes( );
                if (nodes != null) {
                    for (int i=0; i<nodes.getLength( ); i++) {
                        serializeNode(nodes.item(i), writer, "");
                    }
                }
                /*
                Document doc = (Document)node;
                serializeNode(doc.getDocumentElement( ), writer, "");
                */
                break;
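
To see why iterating over the Document's children matters, consider this short sketch. It parses a made-up document with a PI and a comment ahead of the root element, then compares what getChildNodes( ) reports with what getDocumentElement( ) returns. The class name and sample XML are assumptions for illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration of the top-level children of a Document.
class DocumentChildrenSketch {

    static Document parse() {
        try {
            String xml = "<?app mode=\"test\"?><!-- header --><root/>";
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Lists the node names of every top-level child, in document order.
    static String topLevelNames() {
        NodeList nodes = parse().getChildNodes();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < nodes.getLength(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(nodes.item(i).getNodeName());
        }
        return out.toString();
    }

    // The document element alone: the PI and comment are not reachable from it.
    static String rootName() {
        return parse().getDocumentElement().getNodeName();
    }

    public static void main(String[] args) {
        System.out.println(topLevelNames());
        System.out.println(rootName());
    }
}
```

The first line lists three nodes (the PI, the comment, and the root element), while the second shows only the root; a serializer that starts from getDocumentElement( ) would silently drop the other two.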

With this in place, the code can deal with DocumentType nodes, which represent a DOCTYPE declaration. Like PIs, a DOCTYPE declaration can be helpful in exposing external information that might be needed in processing an XML document. However, since there can be public and system IDs as well as other DTD-specific data, the code needs to cast the Node instance to the DocumentType interface to access this additional data. Then, use that interface's helper methods to get the name of the node (which is the name of the element in the document that is being constrained), the public ID (if it exists), and the system ID of the DTD referenced. Using this information, the original DOCTYPE declaration can be serialized:

            case Node.DOCUMENT_TYPE_NODE: 
                DocumentType docType = (DocumentType)node;
                writer.write("<!DOCTYPE " + docType.getName( ));
                if (docType.getPublicId( ) != null)  {
                    writer.write(" PUBLIC \"" + 
                        docType.getPublicId( ) + "\" ");              
                } else {
                    writer.write(" SYSTEM ");
                }
                writer.write("\"" + docType.getSystemId( ) + "\">");
                writer.write(lineSeparator);
                break;
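
The DocumentType accessors can be exercised without parsing anything, by building nodes through DOMImplementation.createDocumentType( ). The sketch below formats both the SYSTEM and PUBLIC forms; unlike the listing above, it also skips the system ID when getSystemId( ) returns null, which is an assumption about how you might want to handle a DOCTYPE that has only an internal subset. The class name, the doctype( ) helper, and the identifiers passed in are illustrative.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.DocumentType;

// Hypothetical illustration of serializing a DocumentType node.
class DoctypeSketch {

    // Formats a DOCTYPE declaration from the name, public ID, and system ID.
    static String doctype(DocumentType docType) {
        StringBuilder out = new StringBuilder("<!DOCTYPE ")
            .append(docType.getName());
        String publicId = docType.getPublicId();
        String systemId = docType.getSystemId();
        if (publicId != null) {
            out.append(" PUBLIC \"").append(publicId).append('"');
            if (systemId != null) {
                out.append(" \"").append(systemId).append('"');
            }
        } else if (systemId != null) {
            out.append(" SYSTEM \"").append(systemId).append('"');
        }
        return out.append('>').toString();
    }

    // Creates a DocumentType in memory; the arguments are sample values.
    static DocumentType create(String name, String publicId, String systemId) {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .getDOMImplementation();
            return impl.createDocumentType(name, publicId, systemId);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(doctype(create("book", null, "DTD/JavaXML.dtd")));
        System.out.println(doctype(create("html",
            "-//W3C//DTD XHTML 1.0 Strict//EN",
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd")));
    }
}
```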

All that's left at this point is handling entities and entity references. In this chapter, I will skim over entities and focus on entity references; more details on entities and notations are in the next chapter. For now, a reference can simply be output with the & and ; characters surrounding it:

            case Node.ENTITY_REFERENCE_NODE:
                writer.write("&" + node.getNodeName( ) + ";");    
                break;

There are a few surprises that may trip you up when it comes to the output from a node such as this. The definition of how entity references should be processed within DOM allows a lot of latitude, and also relies heavily on the underlying parser's behavior. In fact, most XML parsers have expanded and processed entity references before the XML document's data ever makes its way into the DOM tree. Often, when expecting to see an entity reference within your DOM structure, you will find the text or values referenced rather than the entity reference itself. To test this for your parser, you'll want to run the SerializerTest class on the contents.xml document (which I'll cover in the next section) and see what it does with the OReillyCopyright entity reference. With Apache Xerces, this comes across as an entity reference, by the way.
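
If you are using JAXP, one way to probe this behavior is the setExpandEntityReferences( ) switch on DocumentBuilderFactory. The following sketch counts EntityReference nodes with the switch on and off. The class name is hypothetical, and the inline DTD and replacement text are invented for the example (only the OReillyCopyright entity name comes from the discussion above). What you see with expansion turned off depends on your parser and its version, so treat the second number as something to observe rather than something guaranteed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical probe of how a parser treats entity references.
class EntityRefSketch {

    // Sample document with an internal entity; the replacement text is made up.
    static final String XML =
        "<!DOCTYPE note [<!ENTITY OReillyCopyright \"Copyright notice\">]>"
        + "<note>&OReillyCopyright;</note>";

    static Document parse(boolean expand) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // true asks the parser to replace references with their content
            factory.setExpandEntityReferences(expand);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Recursively counts EntityReference nodes at or below the given node.
    static int countEntityRefs(Node node) {
        int count =
            (node.getNodeType() == Node.ENTITY_REFERENCE_NODE) ? 1 : 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            count += countEntityRefs(children.item(i));
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("expanded:     "
            + countEntityRefs(parse(true)) + " entity reference node(s)");
        System.out.println("not expanded: "
            + countEntityRefs(parse(false)) + " entity reference node(s)");
    }
}
```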

And that's it! As I mentioned, there are a few other node types, but covering them isn't worth the trouble at this point; you get the idea about how DOM works. In the next chapter, I'll take you deeper than you probably ever wanted to go. For now, let's put the pieces together and see some results.
