XML and Cocoon (Apache: The Definitive Guide, 3rd Edition)

19.1. XML

Like HTML, Extensible Markup Language (XML) uses markup (elements, attributes, comments, etc.) to identify content within a document. Unlike HTML, XML lets developers create their own vocabularies to describe that content, encouraging a much greater separation of content from presentation. When we wrote this page, we put the chapter title at the top right-hand corner of a blank page: "XML and Cocoon." Then we started on the text:

So far we have talked about different ways of writing scripts, worrying more about the logic they contain than their content...

If you put this book down open and come back to it tomorrow, a glance at the top of the page reminds you of the subject of this chapter, and a glance at the top of the paragraph reminds you where we have got to in that chapter.

It is not necessary to explain what these typographic page elements are telling you because we have all been reading books for years in a civilization that has had cheap printing and widespread literacy for half a millennium, so we don't even think about the conventions that have developed.

Putting the right message in the right sort of type in the right place on the page in order to convey the right meaning to the reader was originally a specialized technical job done by the book editor and the printer.

Now, computing is changing all that. We typeset our own manuscripts with the help of publishing packages. We publish our own books without the help of trained editors. We don't have to bother with the book format: we publish our own web pages by the billion, often without recourse to any standards of layout, intelligibility, or even sanity. Since computer data has no inherent format to tell us what it means, there is, and has been for a long time, an urgent need for some sort of markup language to tell us what we are looking at.

A start was made on solving the problem many decades ago with the Standard Generalized Markup Language (SGML). This evolved informally for a long time and then was accepted by the International Organization for Standardization (ISO) in 1986. SGML has been taken up in a number of industries and used to define more specific tag languages: ATA-2100 for aircraft maintenance manuals, PCIS in the semiconductor industry, and DocBook for software documentation in the computer industry.

HTML is an application of SGML. It uses a very small subset of SGML's functionality with a single vocabulary. Its limitations are growing clearer, even though millions of lines of it are in use every second of the day around the world. The trouble is that HTML simply says how text should appear on the client's computer screen. You might be a nurse looking at a web page containing a patient's medical record. The patient is lying unconscious on a stretcher and desperately needs penicillin. Is she allergic to the drug? The word "penicillin" might appear 20 times in her record, because she was given it on various dates scattered here and there. Did one of these turn out badly? Is there a note somewhere about allergies? You might have to read a hundred pages, and you haven't the time. What you need is a standard medical markup:

<allergies><drug-reactions>....</drug-reactions></allergies>

and a quick way of finding it, probably through an applet.
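To make the idea concrete, here is a minimal sketch in Python (our choice for illustration; the text does not prescribe a language) of how a program might pull the allergy information straight out of such a record. The record layout, the <reaction> element, and its drug and severity attributes are all invented for this example and are not taken from any real medical vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical patient record: every element and attribute name is illustrative only.
RECORD = """
<patient-record>
  <treatments>
    <treatment date="1999-03-02" drug="penicillin"/>
    <treatment date="2000-11-17" drug="penicillin"/>
  </treatments>
  <allergies>
    <drug-reactions>
      <reaction drug="penicillin" severity="severe"/>
    </drug-reactions>
  </allergies>
</patient-record>
"""

def drug_reactions(record_xml, drug):
    """Return the severity of each recorded reaction to the given drug."""
    root = ET.fromstring(record_xml)
    # Look only inside <allergies><drug-reactions>, ignoring the treatment history.
    return [
        reaction.get("severity")
        for reaction in root.findall("./allergies/drug-reactions/reaction")
        if reaction.get("drug") == drug
    ]

print(drug_reactions(RECORD, "penicillin"))
```

Because the markup says what each piece of data means, the program can skip every other mention of the drug and go directly to the recorded reactions.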

In principle, SGML could do what is wanted on the Web. Unfortunately, it is very complicated: it was first specified in the days when every byte mattered, so it is full of cunning shortcuts. It is too big for developers to learn and too big for browsers to implement. So XML is a cut-down version that does what is needed and not too much more. XML requires much stricter attention to document structure but offers a much wider choice of vocabularies in return.

On the other hand, XML differs from HTML in that it is a completely generalized markup language. HTML has a small list of prespecified tags: <HEAD>, <H2>, <A HREF...>, etc. XML has no prespecified tags at all. Its tags are invented by its users as necessary to define the information that a page will carry, as with <allergies><drug-reactions> earlier. The tags to be used are stored in a Document Type Definition (DTD) (soon to be replaced by XML Schemas). The DTD also defines the structure of the document as a tree: <book>s contain <chapter>s and <chapter>s contain <paragraph>s. A <paragraph> never contains a <book>. A <drug-reactions> comes inside the more general <allergies>, and so on. It is technically quite simple to write a DTD, but in most applications much more work goes into getting the agreement of other people about the structure of the document and the types of information that need to be in it. (For more information on writing DTDs, see Erik Ray's Learning XML (O'Reilly, 2000).)
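The tree rule can be illustrated with a short Python sketch. This is not a DTD validator: the ALLOWED_CHILDREN table is a hand-written stand-in for the containment rules a DTD would declare, and the function name is our own.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the containment rules a DTD would declare,
# roughly what <!ELEMENT book (chapter+)> and its siblings express.
ALLOWED_CHILDREN = {
    "book": {"chapter"},
    "chapter": {"paragraph"},
    "paragraph": set(),
}

def containment_errors(xml_text):
    """List (parent, child) tag pairs that break ALLOWED_CHILDREN."""
    errors = []
    root = ET.fromstring(xml_text)
    for parent in root.iter():  # visits every element in document order
        allowed = ALLOWED_CHILDREN.get(parent.tag, set())
        for child in parent:
            if child.tag not in allowed:
                errors.append((parent.tag, child.tag))
    return errors

GOOD = "<book><chapter><paragraph>Text</paragraph></chapter></book>"
BAD = "<book><chapter><paragraph><book/></paragraph></chapter></book>"

print(containment_errors(GOOD))  # []
print(containment_errors(BAD))   # [('paragraph', 'book')]
```

A real validating parser does the same job from the DTD itself, and also checks ordering, repetition, and attributes, which this sketch ignores.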

The idea of XML goes way beyond formatting and displaying information, though that is a very useful consequence. It is a way of handling information to produce other information. The usefulness of this approach is well explained by Brett McLaughlin in his Java and XML.[67] He uses as an illustration the process of selling a network line to a customer.

[67]Brett McLaughlin, Java and XML (O'Reilly & Associates, Inc., 2001).

...When a network line, such as a DSL or T1, is sold to a customer, a variety of things must happen. The provider of the line, such as UUNet, must be informed of the request for a new line. A router must be configured by the CLEC and the setup of the router must be coordinated with the Internet service provider. Then an installation must occur, which may involve another company if this process is outsourced. This relatively common and simple sale of a network line already involves three companies. Add to this the technical service group for the manufacturer of the router, the phone company for the customer's other communication services, and the InterNIC to register a domain, and the process becomes significant.

This rather intimidating process can be made extremely simple with the use of XML. Imagine that the original request for a line is put into a system that converts the request into an XML document. The document is then transformed, via XSL, into a format that can be sent to the line provider, UUNet in our example. UUNet then adds line-specific information, transforming the request into yet another XML document, which is returned to the CLEC. This new document is passed on to the installation company with additional information about where the client is located. Upon installation, notes about whether or not the installation was successful are added to the document, which is transformed again via XSL and passed back to the original CLEC application. The beauty of this solution is that instead of multiple systems, each using vendor-specific formatting, the same set of XML APIs can be used at every step, allowing a standard interface for the XML data across the applications, systems, and even businesses.

One might add that if all the participants in the process subscribe to an industry-standard DTD, it would not even be necessary to transform the documents using XSL.
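The following Python sketch shows the shape of such a workflow when everyone shares one vocabulary: each party simply adds its own elements to the same document. The element names, the function names, and the circuit identifier are all hypothetical, and no real provider interface is implied.

```python
import xml.etree.ElementTree as ET

def new_request(customer, line_type):
    """Build the initial line request as an XML document."""
    request = ET.Element("line-request")
    ET.SubElement(request, "customer").text = customer
    ET.SubElement(request, "line-type").text = line_type
    return request

def provider_step(request, circuit_id):
    """The line provider appends its own line-specific information."""
    ET.SubElement(request, "circuit-id").text = circuit_id
    return request

def installer_step(request, succeeded):
    """The installer records whether the installation succeeded."""
    status = "ok" if succeeded else "failed"
    ET.SubElement(request, "installation").set("status", status)
    return request

# Pass one document through all three hypothetical parties in turn.
doc = new_request("Example Ltd", "T1")
doc = provider_step(doc, "HYPOTHETICAL-001")
doc = installer_step(doc, True)
print(ET.tostring(doc, encoding="unicode"))
```

Every step uses the same XML API on the same document, which is the point of the quoted passage. Where the parties use different vocabularies, an XSL transformation would sit between the steps.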

As this process proceeds, hard copies of documents will need to be printed out and signed to show that legally important stages in the transaction have been reached. This can be done by stylesheets written in XSL (Extensible Stylesheet Language). The stylesheet specifies the font, type size, and position of all the elements of the document. It can control a certain amount of reformatting: a long document might start with a list of contents generated by collecting the section headers and their page numbers. Different but similar stylesheets could produce the same document in a variety of different formats: HTML, PDF, WML (for WAP devices), even voice for the blind, or Braille.
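Collecting section headers into a list of contents is easy to picture in code. The sketch below does it in Python rather than XSL, purely as an illustration. The <document>, <section>, and <title> names are invented, and page numbers are left out because they are known only after the pages have been laid out.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element names are illustrative only.
DOCUMENT = """
<document>
  <section><title>Order</title><para>Details of the order.</para></section>
  <section><title>Installation</title><para>Installer's notes.</para></section>
  <section><title>Sign-off</title><para>Signatures.</para></section>
</document>
"""

def contents_list(document_xml):
    """Collect the title of every top-level section, in document order."""
    root = ET.fromstring(document_xml)
    return [title.text for title in root.findall("./section/title")]

for number, heading in enumerate(contents_list(DOCUMENT), start=1):
    print(f"{number}. {heading}")
```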

Clearly the Web has to have something like XML, and sooner or later we will all be using it if we want to publish serious amounts of information. No one suggests that HTML will vanish overnight because it is very suitable for small jobs, just as you wouldn't use a full-blown book-production software package to write a letter. The W3C is rebuilding HTML on an XML foundation, called XHTML, to facilitate that transition. For the moment, XML's use on the Web is more impending than actual, but it is growing rapidly. A few of the many vocabularies include the following:

Math Markup Language: http://www.w3.org/Math/
CML (Chemical Markup Language): http://www.oasis-open.org/cover/gen-apps.html
Astronomical Instrument Markup Language: http://pioneer.gsfc.nasa.gov/public/aiml/
Bioinformation Sequence Markup Language: http://www.visualgenomics.com/bsml/index.html
MusicML (for sharing sheet music): http://195.108.47.160/index.html
Weather Observation Definition Format: http://zowie.metnet.navy.mil
Newspaper Classified Ad ML: http://www.naa.org/technology/clsstdtf/index.html

For a huge list of vocabularies and supporting technologies, see the XML Cover Pages at http://xml.coverpages.com.

People supplying and exchanging information use XML as a medium that allows them to specify the meaning and the value of bits of information. Often several XML documents are merged to create a new output. In theory you can send the resulting XML and a CSS or XSLT stylesheet to a browser, and something will appear that can be read on a screen. However, in practice, few browsers will properly interpret XML. Microsoft Internet Explorer v5 and later offer some capability, while Opera Version 4 or later, Netscape 6 or later, and all of the Mozilla builds offer more control over the presentation of XML documents. Older browsers that appeared before XML's 1998 release have little idea what to do with the unfamiliar markup.

It would be nice if browsers did the conversion because it shifts the processing burden from the server to the client (and since we are buyers of server hardware, this is better). For the moment and possibly for a long time in the future, people who want to display XML data on the Web have to convert their pages to HTML (or perhaps PDF or some other format) by putting them through some more or less clever program. Although it is possible in principle to transform XML into, say, HTML by applying a stylesheet, the "applying" bit may not be so easy. You might have to write (but see later) a script in Perl to make the transformation. Clearly, this isn't something that every webmaster wants to do, and software to do the job properly is available as a "publishing framework." There are a number of contenders, but a package well suited to Apache users is Cocoon, which is produced under the auspices of the Apache XML project.
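To give a feel for what such a hand-written transformation script involves, here is a minimal sketch. It is written in Python rather than Perl, it is not how Cocoon works, and the <chapter>, <title>, and <paragraph> vocabulary and the TAG_MAP table are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source document in an invented vocabulary.
SOURCE = """
<chapter>
  <title>XML &amp; Cocoon</title>
  <paragraph>Like HTML, XML uses markup.</paragraph>
  <paragraph>Unlike HTML, XML lets you invent tags.</paragraph>
</chapter>
"""

# Assumed mapping from our invented tags to HTML tags.
TAG_MAP = {"title": "h1", "paragraph": "p"}

def to_html(xml_text):
    """Turn the top-level children of the document into a simple HTML page."""
    root = ET.fromstring(xml_text)
    parts = ["<html><body>"]
    for child in root:
        html_tag = TAG_MAP.get(child.tag)
        if html_tag is None:
            continue  # skip elements we have no rule for
        # Re-escape the text, since the parser has already decoded entities.
        parts.append(f"<{html_tag}>{escape(child.text or '')}</{html_tag}>")
    parts.append("</body></html>")
    return "\n".join(parts)

print(to_html(SOURCE))
```

Even this toy version has to make decisions about unknown elements and character escaping, and it handles no nesting at all, which is why a publishing framework is the more attractive option for real sites.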
