Some applications do store more permanent computer-oriented data in
XML. For instance, XML is the native file format of the Gnumeric
spreadsheet. However, this format is really understood only by
Gnumeric and perhaps the other GNOME applications.
It's designed to meet the specific needs of that one
program. Exchanging data with other applications, including ones that
haven't even been invented yet, is a secondary
concern.
XML documents meant for humans, however, tend to be more permanent and
less software-bound. If you encode the Declaration of
Independence in XML, you want people to be able to read it in two,
two hundred, or two thousand years. You also want them to be able to
read it with any convenient tool, including ones not invented yet.
These requirements have some important implications for both the XML
applications you design to hold the data and the tools you use to
read and write them.
Standard formats like DocBook and TEI should be preferred to custom,
one-off XML applications. You should avoid proprietary DTDs that are
owned by any one person or company and whose future may depend on the
fortunes of that company or individual. Even DTDs that come from
nonprofit consortia like OASIS or TEI should be licensed liberally
enough that nobody can use intellectual property restrictions to
throw up roadblocks in your path.
At least one DTD purveyor has gone so far as to file for patents on
its DTDs. These DTDs should be avoided like the plague. Stick to DTDs
that may be freely copied and shared and that can be retrieved from
many different locations.
Once you've settled on a standard DTD, try to avoid
modifying it if you possibly can. If you absolutely must modify it,
then document your changes in excruciating, redundant detail. Include
comments in both your DTDs and documents, explaining what
you've done. Use the parameter entities built into
the DTDs to add new element types or subtract old ones, rather than
modifying the DTD files themselves.
Conversely, the format shouldn't be too hard to reverse engineer if
the documentation is lost. Use full names throughout for elements and
attributes. DocBook's para element is superior to TEI's p element;
Paragraph would be better still.
Attribute values deserve the same scrutiny. Consider this SVG polygon
element:
<polygon style="fill: blue; stroke: green; stroke-width: 12"
         points="350,75 379,161 469,161 397,215 423,301 350,250
                 277,301 303,215 231,161 321,161" />
The style attribute contains three separate and barely related items.
Understanding this element requires parsing the non-XML CSS format.
The points attribute is even worse. It's a long list of numbers, but
nothing in the markup says what each number represents. You can't,
for instance, see which are the x coordinates and which are the y
coordinates. An approach like this is preferable:
<polygon fill="blue" stroke="green" stroke-width="12">
  <point x="350" y="75"/>
  <point x="379" y="161"/>
  <point x="469" y="161"/>
  <point x="397" y="215"/>
  <point x="423" y="301"/>
  <point x="350" y="250"/>
  <point x="277" y="301"/>
  <point x="303" y="215"/>
  <point x="231" y="161"/>
  <point x="321" y="161"/>
</polygon>
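The cost of the compact form shows up as soon as a program tries to
read it. The following Python sketch is our own illustration, not part
of SVG or of any SVG library. The function names read_compact and
read_verbose are hypothetical, and the code assumes integer
coordinates, as in the example. It uses only the standard
xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

COMPACT = """<polygon style="fill: blue; stroke: green; stroke-width: 12"
 points="350,75 379,161 469,161 397,215 423,301 350,250
 277,301 303,215 231,161 321,161" />"""

VERBOSE = """<polygon fill="blue" stroke="green" stroke-width="12">
  <point x="350" y="75"/>
  <point x="379" y="161"/>
  <point x="469" y="161"/>
  <point x="397" y="215"/>
  <point x="423" y="301"/>
  <point x="350" y="250"/>
  <point x="277" y="301"/>
  <point x="303" y="215"/>
  <point x="231" y="161"/>
  <point x="321" y="161"/>
</polygon>"""


def read_compact(xml_text):
    """Read the compact form (hypothetical helper).

    The XML parser stops at the attribute boundary, so the CSS-like
    style string and the coordinate list each need a hand-written
    micro-parser.
    """
    polygon = ET.fromstring(xml_text)

    style = {}
    for declaration in polygon.get("style").split(";"):
        if declaration.strip():
            name, value = declaration.split(":", 1)
            style[name.strip()] = value.strip()

    points = []
    for pair in polygon.get("points").split():
        x, y = pair.split(",")
        points.append((int(x), int(y)))
    return style, points


def read_verbose(xml_text):
    """Read the verbose form (hypothetical helper).

    Every value is already a named attribute, so the XML parser
    does all of the structural work.
    """
    polygon = ET.fromstring(xml_text)
    style = dict(polygon.attrib)
    points = [(int(p.get("x")), int(p.get("y")))
              for p in polygon.findall("point")]
    return style, points


if __name__ == "__main__":
    print(read_compact(COMPACT))
    print(read_verbose(VERBOSE))
```

Both functions return the same style dictionary and the same list of
points, but only the first depends on string-splitting rules that no
XML parser, DTD, or schema knows anything about.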
The attribute-based style syntax, with fill, stroke, and stroke-width
as separate attributes, is actually allowed in SVG. However, the
debate over which form to use for coordinates was quite heated in the
W3C SVG working group. In the end, the working group decided
(wrongly, in our opinion) that the more verbose form would never be
adopted because of its size, even though most members felt it was
more in keeping with the spirit of XML. We think the working group
overemphasized the importance of document size in an era of
exponentially growing hard disks and network bandwidth, and ignored
the ease with which the second format could be compressed for
transport or storage.
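The compression point is easy to check informally. This Python sketch
is an illustration of ours that uses the standard zlib module; the
helper name sizes is hypothetical. It prints the raw and compressed
sizes of the two polygon encodings. Exact byte counts depend on
whitespace and compression settings, so treat the output as
indicative only.

```python
import zlib

COMPACT = """<polygon style="fill: blue; stroke: green; stroke-width: 12"
 points="350,75 379,161 469,161 397,215 423,301 350,250
 277,301 303,215 231,161 321,161" />"""

VERBOSE = """<polygon fill="blue" stroke="green" stroke-width="12">
  <point x="350" y="75"/>
  <point x="379" y="161"/>
  <point x="469" y="161"/>
  <point x="397" y="215"/>
  <point x="423" y="301"/>
  <point x="350" y="250"/>
  <point x="277" y="301"/>
  <point x="303" y="215"/>
  <point x="231" y="161"/>
  <point x="321" y="161"/>
</polygon>"""


def sizes(xml_text):
    """Return (raw size, zlib-compressed size) in bytes (hypothetical helper)."""
    raw = xml_text.encode("utf-8")
    return len(raw), len(zlib.compress(raw, 9))


for label, text in (("compact", COMPACT), ("verbose", VERBOSE)):
    raw_size, packed_size = sizes(text)
    print(f"{label}: {raw_size} bytes raw, {packed_size} bytes compressed")
```

The repeated point markup in the verbose form is exactly the kind of
redundancy a general-purpose compressor removes, so much of its size
penalty disappears once the document is compressed.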