Transformation and Presentation (XML in a Nutshell, 2nd Edition)

6.6. Transformation and Presentation

The markup in a typical XML document describes the document's structure, but it tends not to describe the document's presentation. That is, it says how the document is organized but not how it looks. Although XML documents are text, and a person could read them in native form if they really wanted to, much more commonly an XML document is rendered into some other format before being presented to a human audience. One of the key ideas of markup languages in general and XML in particular is that the input format need not be the same as the output format. To put it another way, what you see is not what you get, nor is it what you want to get. The input markup language is designed for the convenience of the writer. The output language is designed for the convenience of the reader.

Of course this requires a means of transforming the input format into the output format. Most XML documents undergo some kind of transformation before being presented to the reader. The transformation may be to a different XML vocabulary like XHTML or XSL-FO, or it may be to a non-XML format like PostScript or RTF.

XML's semiofficial transformation language is Extensible Stylesheet Language Transformations (XSLT). An XSLT document contains a list of template rules. Each template rule has a pattern noting which elements and other nodes it matches. An XSLT processor reads the input document. When it sees something in the input document that matches a template rule in the stylesheet, it outputs the template rule's template. Part of the template is normally an instruction that tells the processor to include content from the input in the output. This allows, for example, the text of the output document to be the same while all the markup is changed. For instance, you could write a stylesheet that would transform DocBook documents into TEI documents. XSLT will be discussed in much more detail in Chapter 8.
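
The template-rule idea can be sketched outside XSLT as well. The following Python fragment is only an illustration under stated assumptions: it is not XSLT, and it is not a real DocBook-to-TEI converter. The RULES table, the transform function, and the three element mappings are hypothetical names chosen for the example. Each rule matches an element by name and outputs a replacement element, while the text of the document passes through unchanged.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: input element name -> output element name.
# Illustrative only; a real DocBook-to-TEI stylesheet needs far more than this.
RULES = {
    "article": "text",
    "para": "p",
    "emphasis": "hi",
}

def transform(node):
    """Apply the matching rule to node, then recurse into its children."""
    # An element with no rule keeps its own name.
    # Attributes are deliberately dropped to keep the sketch short.
    out = ET.Element(RULES.get(node.tag, node.tag))
    out.text = node.text   # text before the first child element
    out.tail = node.tail   # text that follows this element in its parent
    for child in node:
        out.append(transform(child))
    return out

source = "<article><para>XML is <emphasis>not</emphasis> HTML.</para></article>"
result = ET.tostring(transform(ET.fromstring(source)), encoding="unicode")
print(result)  # <text><p>XML is <hi>not</hi> HTML.</p></text>
```

The markup changes completely, but every character of text survives, which is the behavior the paragraph above describes.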

However, XSLT is not the only transformation language you can use with your XML documents. Other stylesheet languages such as the Document Style Semantics and Specification Language (DSSSL, http://www.jclark.com/dsssl/) are also available. So are a variety of proprietary tools like OmniMark (http://www.omnimark.com/). Most of these have particular strengths and weaknesses for particular kinds of documents. Custom programs written in a variety of programming languages, such as Java, C++, Perl, and Python, can use a plethora of APIs, such as SAX, DOM, and JDOM, to transform documents. This is sometimes useful when you need something more than a mere transformation: for instance, interpreting certain elements as database queries and actually inserting the results of those queries into the output document, or asking the user to answer questions in the middle of the transformation. However, the biggest single factor when choosing which tool to use is simply which language and syntax you're most comfortable with. De linguis non disputandum est.
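
As a rough sketch of the "more than a mere transformation" case, the Python program below uses the DOM API from the standard library to treat certain elements as database queries and insert the results into the output. Everything specific here is an assumption made for illustration: the query, results, and row element names, the expand_queries function, and the in-memory SQLite table with its sample rows.

```python
import sqlite3
from xml.dom import minidom

# Hypothetical in-memory database standing in for a real data source.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE books (title TEXT, year INTEGER)")
db.executemany("INSERT INTO books VALUES (?, ?)",
               [("Alpha", 2001), ("Beta", 1999)])  # sample data only

def expand_queries(xml_text, connection):
    """Replace each <query> element with the rows its SQL text returns."""
    doc = minidom.parseString(xml_text)
    # Copy the node list so that replacing nodes cannot disturb the iteration.
    for query in list(doc.getElementsByTagName("query")):
        # Running SQL taken from a document is only safe for trusted input.
        sql = "".join(n.data for n in query.childNodes
                      if n.nodeType == n.TEXT_NODE)
        results = doc.createElement("results")
        for row in connection.execute(sql):
            item = doc.createElement("row")
            item.appendChild(doc.createTextNode(", ".join(str(v) for v in row)))
            results.appendChild(item)
        query.parentNode.replaceChild(results, query)
    return doc.documentElement.toxml()

source = "<report><query>SELECT title, year FROM books ORDER BY year</query></report>"
output = expand_queries(source, db)
print(output)
```

A pure stylesheet language has no standard way to reach the database at this step, which is why a general-purpose language can be the more practical choice here.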

There are many different choices for the output format from a transformation. A PostScript file can be printed on paper, overhead transparencies, slides, or even T-shirts. A PDF document can be printed in all these ways and shown on the screen as well. However, for screen display, PDF is vastly inferior to simple HTML, which has the advantages of being very broadly accessible across platforms and being very easy to generate via XSLT from source XML documents. Generating a PDF or a PostScript file normally requires an additional conversion step in which special software converts an intermediate XML format like XSL-FO to what you actually want.

An alternative to a transformation-based presentation is to provide a descriptive stylesheet that simply states how each element in the original document should be formatted. This is the realm of Cascading Style Sheets (CSS). This works particularly well for narrative documents where all that's needed is a list of the fonts, styles, sizes, and so on to apply to the content of each element. The key is that when all markup is stripped from the document, what remains is more or less a plain-text version of what you want to see. No reordering or rearrangement is necessary. This approach works less well for data-oriented documents where the raw content may be nothing more than an undifferentiated mass of numbers, dates, or other information that's hard to understand without the context and annotations provided by the markup. However, in this case a combination of the two approaches works well. First a transformation can produce a new document containing rearranged and annotated information. Then a CSS stylesheet can apply style rules to the elements in this transformed document.
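
The combined approach can be sketched in two steps. The Python below is a minimal illustration, not a recommended pipeline: the reading, d, and v input elements, the entry, field, and label output elements, the caption strings, and the reading.css file name are all hypothetical. Step 1 reorders and annotates the raw data; step 2 attaches a CSS stylesheet to the transformed document through an xml-stylesheet processing instruction.

```python
import xml.etree.ElementTree as ET

# Hypothetical data-oriented input; stripped of markup it reads "21.52002-06-01".
SOURCE = "<reading><v>21.5</v><d>2002-06-01</d></reading>"

# Hypothetical style rules for the transformed document.
CSS_HREF = "reading.css"
CSS_RULES = """\
field { display: block; font-family: sans-serif }
label { font-weight: bold }
"""

def annotate(xml_text):
    """Step 1: put the date first and add a readable label to each value."""
    src = ET.fromstring(xml_text)
    entry = ET.Element("entry")
    for tag, caption in (("d", "Date: "), ("v", "Value: ")):
        field = ET.SubElement(entry, "field")
        label = ET.SubElement(field, "label")
        label.text = caption
        label.tail = src.findtext(tag)  # the raw value follows its label
    return entry

def with_stylesheet(root, href):
    """Step 2: prepend an xml-stylesheet processing instruction."""
    pi = '<?xml-stylesheet type="text/css" href="%s"?>' % href
    return pi + "\n" + ET.tostring(root, encoding="unicode")

styled = with_stylesheet(annotate(SOURCE), CSS_HREF)
print(styled)
print(CSS_RULES)
```

With the markup stripped, the transformed document reads "Date: 2002-06-01Value: 21.5", and the display: block rule puts each field on its own line, so plain CSS rules are now enough to present it.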