Narrative Document Structures (XML in a Nutshell, 2nd Edition)

6.2. Narrative Document Structures

All XML documents are trees. However, trees are very general-purpose data structures. If you've been formally trained in computer science (and very possibly even if you haven't been), you've encountered binary trees, red-black trees, balanced trees, B-trees, ordered trees, and more. However, when working with XML, it's highly unlikely that any given document matches any of these structures. Instead, XML documents are the most general sort of tree, with no particular restrictions on how nodes are ordered or how or which nodes are connected to which other nodes. Narrative XML documents are even less likely than data-oriented XML documents to have an identifiable structure beyond their mere treeness.

So what does a narrative-oriented XML document look like? Of course, there's a root element. All XML documents have one. Generally speaking, this root element represents the document itself. That is, if the document is a book, the root element is book. If the document is an article, the root element is article, and so on.

Beyond that, large documents are generally broken up into sections of some kind, perhaps chapters for a book, parts for an article, or claims for a legal brief. Most of the document consists of these primary sections. In some cases, there'll be several different kinds of sections; for instance, one for the table of contents, one for the index, and one for the chapters of a book.

Generally, the root element also contains one or more elements providing metainformation about the document, for example, the title of the work, the author of the document, the dates the document was written and last modified, and so forth. One common pattern is to place the metainformation in one child of the root element and the main content of the work in another. This is how HTML documents are written. The root element is html. The metainformation goes in a head element, and the main content goes in the body element. TEI and DocBook also follow this pattern.

Sections of the document can be further divided into subsections. The subsections themselves may be further divided. How many levels of subsection appear generally depends on how large the document is. An encyclopedia will have many levels of sectioning--a pamphlet or flier almost none. Each section and subsection normally has a title. It may also have elements or attributes that indicate metainformation about the section, such as the author or date it was last modified.

Up to this point, mixed content is mostly avoided. Elements contain child elements and whitespace, and that's likely all they contain. However, at some level it becomes necessary to insert the actual text of the document--the words that people will read. In most Western languages these will probably be divided into paragraphs and other block-level elements like headlines, figures, sidebars, and footnotes. Generic document DTDs like DocBook won't be able to say more about these items than this.

The paragraphs and other block-level items will mostly contain words in a row, that is, text. Some of this text may be marked up with inline elements. For instance, you may wish to indicate that a particular string of text inside the block-level element is a date, a person, or simply important. However, most of the text will not be so annotated.

One area in which different XML applications diverge is the question of whether block-level items may contain other block-level items. For instance, can a paragraph contain a list? Or can a list item contain a paragraph? It's probably easier to work with more structured documents in which blocks can't contain other blocks (particularly other instances of the same kind). However, it's very often the case that a block has a very good reason to contain other blocks. For instance, a long list item or quotation may contain several paragraphs.

For the most part, this entire structure from the root down to the most deeply nested inline item tends to be quite linear; that is, you expect that a person will read the words in pretty much the same order they appear in the document. If all the markup were suddenly removed and you were left with nothing but the raw text, the result should be more or less legible. The markup can be used to index or format the document, but it's not a fundamental part of the content.

Another important point about these sorts of XML documents: not only are they composed of words in a row; they're composed of words. What they contain is text intended for human beings to read. They're not numbers or dates or money, except insofar as these things occur as part of the normal flow of the narrative. The #PCDATA content of the lowest-level elements of the tree mostly have one type: string. If anything has a real type beyond string it's likely metainformation about the document (figure number, date last modified, and so on) rather than the content of the document itself.

This explains, in detail, why DTDs don't provide strong (or really any) data typing. The documents for which SGML was designed didn't need it. XML documents are doing jobs for which SGML wasn't designed, such as tracking inventories or census data, do need data typing; that's why various people and organizations have invented a plethora of schema languages. However, schemas really don't improve on DTDs for narrative documents.

Not all XML documents are like those we've described here. Not even all narrative-oriented XML documents are like this. However, a surprising number of narrative-oriented XML applications do follow this basic pattern, perhaps with a nip here or a tuck there. The reason is that this is the basic structure narratives follow, and that has proven its usefulness in the thousands of years since writing was invented. If you were to define your own DTDs for general narrative-oriented documents, you'd probably come up with something a lot like this. If you define your own DTDs for more specialized narrative-oriented documents, then the names of your elements may change to reflect your domain--for instance, if you were writing the next edition of the Boy Scout handbook, one of your subsections might be called MeritBadge--however, the basic hierarchy of document, metainformation, sections and subsections, block-level elements, and marked-up text would likely remain.