Document Permanence (XML in a Nutshell, 2nd Edition)

6.5. Document Permanence

XML documents that are intended for computers to read are often transitory. For instance, if you create a SOAP document that represents a request to a Windows server running .NET, then that document exists for just as long as it takes the client to send it to the server and for the server to parse it into its internal data structures. After that's done, the document will be discarded. It probably won't be around for two minutes, much less two years. It's an ephemeral communication between two systems, with no more permanence than any of the billions of other messages that computers exchange on a daily basis, most of which are never even written to disk, much less archived for posterity.

Some applications do store more permanent computer-oriented data in XML. For instance, XML is the native file format of the Gnumeric spreadsheet. On the other hand, this format is really only understood by Gnumeric and perhaps the other Gnome applications. It's designed to meet the specific needs of that one program. Exchanging data with other applications, including ones that haven't even been invented yet, is a secondary concern.

XML documents meant for humans tend to be more permanent and less software bound, however. If you encode the Declaration of Independence in XML, you want people to be able to read it in two, two hundred, or two thousand years. You also want them to be able to read it with any convenient tool, including ones not invented yet. These requirements have some important implications for both the XML applications you design to hold the data and the tools you use to read and write them.

The first rule is that the format should be very well documented. There should be a DTD, and that DTD should be very well commented. Furthermore, there should be a significant amount of prose documentation as well. Prose documentation can't substitute for the formal documentation of a DTD, but it's an invaluable asset in understanding the DTD.

Standard formats like DocBook and TEI should be preferred to custom, one-off XML applications. You should avoid proprietary DTDs that are owned by any one person or company and whose future may depend on the fortunes of that company or individual. Even DTDs that come from nonprofit consortia like OASIS or TEI should be licensed liberally enough that intellectual property restrictions won't let anyone throw up roadblocks in your path. At least one DTD purveyor has gone so far as to file for patents on its DTDs. These DTDs should be avoided like the plague. Stick to DTDs that may be freely copied and shared and that can be retrieved from many different locations.

Once you've settled on a standard DTD, try to avoid modifying it if you possibly can. If you absolutely must modify it, then document your changes in excruciating, redundant detail. Include comments in both your DTDs and documents, explaining what you've done. Use the parameter entities built into the DTDs to add new element types or subtract old ones, rather than modifying the DTD files themselves.

However good the documentation is, the format shouldn't be too hard to reverse engineer if that documentation is lost. Make sure full names are used throughout for element and attribute names. DocBook's para element is superior to TEI's p element. An element named paragraph would be better still.
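One rough way to spot names that a future reader would have to guess at is to list the short ones. The following Python sketch uses only the standard library's xml.etree.ElementTree. The function name terse_names, the sample document, and the four-character threshold are all our own inventions for illustration. Name length is only a crude stand-in for readability, so treat the result as a list of names to review, not as a verdict.

```python
import xml.etree.ElementTree as ET


def terse_names(xml_text, minimum_length=4):
    """Return the sorted element and attribute names shorter than minimum_length.

    The threshold is an arbitrary assumption, not a rule from any standard.
    Namespaced names appear in ElementTree's "{uri}local" form and would need
    the namespace part stripped before measuring; this sketch ignores that.
    """
    found = set()
    for element in ET.fromstring(xml_text).iter():
        # Check the element's own name and each of its attribute names.
        for name in (element.tag, *element.attrib):
            if len(name) < minimum_length:
                found.add(name)
    return sorted(found)


# Hypothetical sample mixing abbreviated and spelled-out names.
SAMPLE = '<div n="1"><p>First text</p><paragraph>Second text</paragraph></div>'

print(terse_names(SAMPLE))  # prints ['div', 'n', 'p']
```

The spelled-out paragraph element passes, while div, n, and p are flagged as names whose meaning depends on documentation that may not survive.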

All of the inherent structure of the document should be indicated by markup and markup alone. It should not be left for the user to infer, nor should it be encoded using whitespace or other separators. For instance, here's an example of what not to do from SVG:

<polygon style="fill: blue; stroke: green; stroke-width: 12"
         points="350,75  379,161 469,161 397,215 423,301 350,250
                 277,301 303,215 231,161 321,161" />

The style attribute contains three separate and barely related items. Understanding this element requires parsing the non-XML CSS format. The points attribute is even worse. It's a long list of numbers, but there's no information about what each number is. You can't, for instance, see which are the x and which are the y coordinates. An approach like this is preferable:

<polygon fill="blue" stroke="green" stroke-width="12">
  <point x="350" y="75"/>
  <point x="379" y="161"/>
  <point x="469" y="161"/>
  <point x="397" y="215"/>
  <point x="423" y="301"/>
  <point x="350" y="250"/>
  <point x="277" y="301"/>
  <point x="303" y="215"/>
  <point x="231" y="161"/>
  <point x="321" y="161"/>
</polygon>
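The practical cost of the first form shows up as soon as a program tries to read it. The following Python sketch, using only the standard library's xml.etree.ElementTree, extracts the same data from both forms. The helper names (parse_css_style, parse_points_attribute, read_compact, read_structured) are ours, invented for illustration. Note that pairing the numbers into x and y coordinates relies on knowledge of SVG that the compact document itself does not carry.

```python
import re
import xml.etree.ElementTree as ET

COMPACT = '''<polygon style="fill: blue; stroke: green; stroke-width: 12"
         points="350,75  379,161 469,161 397,215 423,301 350,250
                 277,301 303,215 231,161 321,161" />'''

STRUCTURED = '''<polygon fill="blue" stroke="green" stroke-width="12">
  <point x="350" y="75"/>
  <point x="379" y="161"/>
  <point x="469" y="161"/>
  <point x="397" y="215"/>
  <point x="423" y="301"/>
  <point x="350" y="250"/>
  <point x="277" y="301"/>
  <point x="303" y="215"/>
  <point x="231" y="161"/>
  <point x="321" y="161"/>
</polygon>'''


def parse_css_style(style):
    """Hand-written mini-parser for 'name: value; name: value' strings."""
    pairs = (item.split(":", 1) for item in style.split(";") if item.strip())
    return {name.strip(): value.strip() for name, value in pairs}


def parse_points_attribute(points):
    """Split on whitespace and commas, then pair the numbers up."""
    numbers = [int(n) for n in re.split(r"[\s,]+", points.strip())]
    # Assumption that comes from outside the document: numbers alternate x, y.
    return list(zip(numbers[0::2], numbers[1::2]))


def read_compact(xml_text):
    """The XML parser delivers two opaque strings; two more parsers are needed."""
    polygon = ET.fromstring(xml_text)
    return (parse_css_style(polygon.get("style")),
            parse_points_attribute(polygon.get("points")))


def read_structured(xml_text):
    """Everything is already labeled by markup; no secondary parsing required."""
    polygon = ET.fromstring(xml_text)
    points = [(int(p.get("x")), int(p.get("y"))) for p in polygon.findall("point")]
    return dict(polygon.attrib), points


print(read_compact(COMPACT) == read_structured(STRUCTURED))  # prints True
```

Both readers arrive at the same data, but only the second does so using nothing beyond an XML parser and the names in the markup.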

The attribute-based style syntax is actually allowed in SVG. However, the debate over which form to use for coordinates was quite heated in the W3C SVG working group. In the end the working group decided (wrongly, in our opinion) that the more verbose form would never be adopted because of its size, even though most members felt it was more in keeping with the spirit of XML. We think the working group overemphasized the importance of document size in an era of exponentially growing hard disks and network bandwidth, not to mention ignoring the ease with which the second format could be compressed for transport or storage.
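The compression argument is easy to check informally. The sketch below, in Python with the standard zlib module, compresses both polygon examples from this section and compares the sizes. It is a rough illustration on one tiny document, not a benchmark. Exact byte counts depend on whitespace and on the compressor, so none are quoted here. The function name sizes is ours.

```python
import zlib

COMPACT = '''<polygon style="fill: blue; stroke: green; stroke-width: 12"
         points="350,75  379,161 469,161 397,215 423,301 350,250
                 277,301 303,215 231,161 321,161" />'''

VERBOSE = '''<polygon fill="blue" stroke="green" stroke-width="12">
  <point x="350" y="75"/>
  <point x="379" y="161"/>
  <point x="469" y="161"/>
  <point x="397" y="215"/>
  <point x="423" y="301"/>
  <point x="350" y="250"/>
  <point x="277" y="301"/>
  <point x="303" y="215"/>
  <point x="231" y="161"/>
  <point x="321" y="161"/>
</polygon>'''


def sizes(xml_text):
    """Return (raw byte count, zlib-compressed byte count) for a document."""
    raw = xml_text.encode("utf-8")
    return len(raw), len(zlib.compress(raw, 9))


compact_raw, compact_packed = sizes(COMPACT)
verbose_raw, verbose_packed = sizes(VERBOSE)

print(f"compact: {compact_raw} bytes raw, {compact_packed} bytes compressed")
print(f"verbose: {verbose_raw} bytes raw, {verbose_packed} bytes compressed")
print(f"gap before compression: {verbose_raw - compact_raw} bytes")
print(f"gap after compression:  {verbose_packed - compact_packed} bytes")
```

Because the verbose form repeats the same markup on every line, a general-purpose compressor removes most of the extra bulk, and the gap between the two forms after compression comes out much smaller than the gap before it.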

Stylesheets are important. We're all familiar with the injunction to separate presentation from content. You've heard enough warnings about not including mere style information like italics and font choices in your XML documents. However, be careful not to go the other way and include content in your stylesheets either. Author names, titles, copyrights, and other such items that change from document to document belong in the document, not the stylesheet, even if they are metainformation about the document rather than the actual content of the document.
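The same discipline applies to any code that formats a document, not just to stylesheets. In the Python sketch below, a small formatting function stands in for a stylesheet. The element names (article, metadata, title, author, paragraph), the sample values, and the function name render_heading are all hypothetical. The point is only that the function contributes layout and punctuation, while every value that varies from document to document is read from the document itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the metadata travels with the content.
DOCUMENT = '''<article>
  <metadata>
    <title>Example Title</title>
    <author>Example Author</author>
  </metadata>
  <paragraph>Example body text.</paragraph>
</article>'''


def render_heading(xml_text):
    """Format a heading using only values found in the document.

    Nothing document-specific is hard-coded here, so the same function
    works unchanged for any document that follows this structure.
    """
    article = ET.fromstring(xml_text)
    title = article.findtext("metadata/title")
    author = article.findtext("metadata/author")
    return f"{title}, by {author}"


print(render_heading(DOCUMENT))  # prints Example Title, by Example Author
```

If the title or author were written into the formatting code instead, the archived document would lose that information the moment it was separated from the code.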

Always keep in mind that you're not just writing for the next couple of months or years, but possibly for the next couple of thousand years. Have pity on the poor historians who are going to have to decipher your markup with limited tools to help them.