Checking Documents for Well-Formedness (XML in a Nutshell, 2nd Edition)

2.10. Checking Documents for Well-Formedness

Every XML document, without exception, must be well-formed. This means it must adhere to a number of rules, including the following:

Every start-tag must have a matching end-tag.
Elements may nest, but may not overlap.
There must be exactly one root element.
Attribute values must be quoted.
An element may not have two attributes with the same name.
Comments and processing instructions may not appear inside tags.
No unescaped < or & signs may occur in the character data of an element or attribute.

This is not an exhaustive list. There are many, many ways a document can be malformed. You'll find a complete list in Chapter 20. Some of these involve constructs that we have not yet discussed such as DTDs. Others are extremely unlikely to occur if you follow the examples in this chapter (for example, including whitespace between the opening < and the element name in a tag).

Whether the error is small or large, likely or unlikely, an XML parser reading a document is required to report it. It may or may not report multiple well-formedness errors it detects in the document. However, the parser is not allowed to try to fix the document and make a best-faith effort of providing what it thinks the author really meant. It can't fill in missing quotes around attribute values, insert an omitted end-tag, or ignore the comment that's inside a start-tag. The parser is required to return an error. The objective here is to avoid the bug-for-bug compatibility wars that plagued early web browsers and continue to this day. Consequently, before you publish an XML document, whether that document is a web page, input to a database, or something else, you'll want to check it for well-formedness.

The simplest way to do this is by loading the document into a web browser that understands XML documents such as Mozilla. If the document is well-formed, the browser will display it. If it isn't, then it will show an error message.

Instead of loading the document into a web browser, you can use an XML parser directly. Most XML parsers are not intended for end users. They are class libraries designed to be embedded into an easier-to-use program such as Mozilla. They provide a minimal command-line interface, if that; that interface is often not particularly well documented. Nonetheless, it can sometimes be quicker to run a batch of files through a command-line interface than loading each of them into a web browser. Furthermore, once you learn about DTDs and schemas, you can use the same tools to validate documents, which most web browsers won't do.

There are many XML parsers available in a variety of languages. Here, we'll demonstrate checking for well-formedness with the Apache XML Project's Xerces-J 1.4, which you can download from http://xml.apache.org/xerces-j. This open source package is written in pure Java so it should run across all major platforms. The procedure should be similar for other parsers, though details will vary.

To use this parser, you'll first need a Java 1.1 or later compatible virtual machine. Virtual machines for Windows, Solaris, and Linux are available from http://java.sun.com/. To install Xerces-J 1.4.4, just add xerces.jar and xercesSamples.jar files to your Java class path. In Java 2 you can simply put those .jar files into your jre/lib/ext directory.

The class that actually checks files for well-formedness is called sax.SAXCount. It's run from a Unix shell or DOS prompt like any other standalone Java program. The command-line arguments are the URLs to or filenames of the documents you want to check. Here's the result of running SAXCount against an early version of Example 2-5. The very first line of output tells you where the first problem in the file is. The rest of the output is a more or less irrelevant stack trace.

D:\xian\examples\02>java sax.SAXCount 2-5.xml
[Fatal Error] 2-5.xml:3:30: The value of attribute "height" must not contain the '<' character.
Stopping after fatal error: The value of attribute "height" must not contain the '<' character.
at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:
1282)
at org.apache.xerces.framework.XMLDocumentScanner.reportFatalXMLError(
XMLDocumentScanner.java:644)
at org.apache.xerces.framework.XMLDocumentScanner.scanAttValue(
XMLDocumentScanner.java:519)
at org.apache.xerces.framework.XMLParser.scanAttValue(
XMLParser.java:1932)
at org.apache.xerces.framework.XMLDocumentScanner.scanElement(
XMLDocumentScanner.java:1800)
at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.
dispatch(XMLDocumentScanner.java:1223)
at org.apache.xerces.framework.XMLDocumentScanner.parseSome(
XMLDocumentScanner.java:381)
at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1138)
at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1177)
at sax.SAXCount.print(SAXCount.java:135)
at sax.SAXCount.main(SAXCount.java:331)

As you can see, it found an error. In this case the error message wasn't particularly helpful. The actual problem wasn't that an attribute value contained a < character. It was that the closing quote was missing from the height attribute value. Still, that was enough for us to locate and fix the problem. Despite the long list of output, SAXCount only reports the first error in the document, so you may have to run it multiple times until all the mistakes are found and fixed. Once we fixed Example 2-5 to make it well-formed, SAXCount simply reported how long it took to parse the document and what it saw when it did:

D:\xian\examples\02>java sax.SAXCount 2-5.xml
2-5.xml: 140 ms (17 elems, 12 attrs, 0 spaces, 564 chars)

Now that the document has been corrected to be well-formed, it can be passed to a web browser, a database, or whatever other program is waiting to receive it. Almost any nontrivial document crafted by hand will contain well-formedness mistakes. That makes it important to check your work before publishing it.

TIP: This example works with Xerces-J 1.0 through 1.4.4. The recently released Xerces-J 2.0 provides a similar program named sax.Counter.