Chapter 15. XML
HTML is a maverick. It follows the rules of formal electronic document-markup design and implementation only loosely. HTML was born out of the need to assemble text, graphics, and other digital content into electronic documents that could be sent over the global Internet. In the early days of the World Wide Web boom, the demand for better browsers and document servers -- driven by hordes of new users with insatiable appetites for more and cooler web pages -- left little time for worrying about things like standards and practices.
Of course, without guiding standards, HTML would have eventually devolved into Babel. That almost happened during the browser wars in the mid- to late 1990s. Chaos is not an acceptable foundation for an industry whose value is already measured in the trillions of dollars. Although the standards people at the W3C managed to rein in the maverick HTML with standard Version 4, it is still too wild for the royal herd of markup languages.
The HTML 4.01 standard is defined using the Standard Generalized Markup Language (SGML). While more than adequate for formalizing HTML, SGML is far too complex to use as a general tool for extending and enhancing HTML. Instead, the W3C has devised a new standard known as the Extensible Markup Language, or XML. Based upon the simpler features of SGML, XML is kinder, gentler, and more flexible, well-suited to guide the birth and orderly development of new markup languages. With XML, HTML itself is being reborn as XHTML.
In this chapter, we cover the basics of XML, including how to read it, how to create simple XML Document Type Definitions (DTDs), and the ways you might use XML to enhance your use of the Internet. In the next chapter, we explore the depths of XHTML.
You don't have to understand all about XML to write XHTML. We think it's helpful, but if you want to cut to the chase, feel free to skip to the next chapter. However, you may want to take a look at some of the up-and-coming uses of XML covered at the end of this chapter, starting in Section 15.8, "Using XML".
This chapter provides only an overview of XML. Our goal is to whet your appetite and make you conversant in XML. For full fluency, consult books such as XML: A Primer by Simon St. Laurent (IDG Books Worldwide), or The XML Handbook by Paul Prescod and Charles Goldfarb (Prentice Hall).
15.1. Languages and Metalanguages
A language is composed of symbols that we assemble according to rules in order to express ourselves and pass along information that is intelligible to others. For example, English is a language with rules (grammar) that define how to put its symbols (words) together to form sentences, paragraphs, and, ultimately, books like the one you are holding. If you know the words and understand the grammar, you can read the book, even if you don't necessarily understand its contents.
An important difference between human and computer-based languages is that human languages are self-describing. We use English sentences and paragraphs to define how to create correct English sentences and paragraphs. Our brains are marvelous machines that have no problem understanding that you can use a language to describe itself. However, computer languages are not so rich and computers are not so bright that you could easily define a computer language with itself. Instead, we can define one language -- a metalanguage -- that defines the rules and symbols of another language.
Software developers can use a metalanguage to define the rules for defining a language and then define one or more languages based on those rules. The metalanguage also guides developers creating the automated agents that display or otherwise process the contents of documents that authors have created using that language.
XML is a metalanguage created by the W3C and is used by developers to define markup languages such as XHTML. Browser developers rely on XML's metalanguage rules to create automated processes that read the language definition of XHTML and implement the processes that ultimately display or otherwise process XHTML documents.
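To make the idea of "rules first, language second" concrete, here is a minimal sketch in Python. It is not how XML or any browser actually works: the tiny memo language, the MEMO_LANGUAGE table, and the find_violations function are all hypothetical names invented for this illustration, and a real DTD can express much more than a list of permitted child elements. The point is only that once a language definition exists as data, an automated agent can read it and check documents against it.

```python
import xml.etree.ElementTree as ET

# Hypothetical language definition: each element name maps to the set of
# child elements it may contain. A real DTD expresses far more than this.
MEMO_LANGUAGE = {
    "memo": {"to", "from", "body"},
    "to": set(),
    "from": set(),
    "body": set(),
}

def find_violations(element, language):
    """Return messages for elements that the language definition disallows."""
    if element.tag not in language:
        return [f"<{element.tag}> is not defined in this language"]
    problems = []
    for child in element:
        if child.tag not in language[element.tag]:
            problems.append(
                f"<{child.tag}> is not allowed inside <{element.tag}>"
            )
        else:
            problems.extend(find_violations(child, language))
    return problems

good = ET.fromstring(
    "<memo><to>Pat</to><from>Sam</from><body>Lunch?</body></memo>"
)
bad = ET.fromstring("<memo><to>Pat</to><subject>Lunch</subject></memo>")

print(find_violations(good, MEMO_LANGUAGE))  # no violations reported
print(find_violations(bad, MEMO_LANGUAGE))   # reports the <subject> element
```

Swapping in a different table defines a different language without touching the checking code, which is roughly the division of labor a metalanguage makes possible.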
Why bother with a markup metalanguage? Because as the familiar proverb goes, the W3C wants to teach us how to fish so we can feed ourselves for a lifetime. With XML, there is now a standardized way to define markup languages that are customized for different needs rather than having to rely upon HTML extensions. Mathematicians need a way to express mathematical notations; composers need a way to present musical scores; businesses want their web sites to take sales orders from customers; physicians look to exchange medical records; plant managers want to run their factories from web-based documents. All these groups need an acceptable, resilient way to express these different kinds of information, so that the software industry can develop the programs that process and display these diverse documents.
XML provides the answer. Each content sector -- the business group, the factory-automation consortium, the trade association -- may now define a markup language to suit its particular needs for information exchange and processing over the Web. Computer programmers can create XML-compliant processes -- parsers -- that read the new language definitions and allow the server to process the documents of those languages.
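As a rough illustration of a program processing documents written in a custom markup language, the sketch below uses the XML parser in Python's standard library to read a sales order. The order markup is an assumption made up for this example: the order and item elements, and the customer, sku, quantity, and unit-price attributes, do not come from any real standard.

```python
import xml.etree.ElementTree as ET

# A document in a hypothetical sales-order markup language.
ORDER_DOCUMENT = """<order customer="C-1001">
  <item sku="KQ-01" quantity="3" unit-price="2.50">Kumquats</item>
  <item sku="JR-07" quantity="1" unit-price="4.00">Jam jar</item>
</order>"""

def order_total(xml_text):
    """Parse an order document and add up quantity times unit price."""
    root = ET.fromstring(xml_text)
    total = 0.0
    for item in root.findall("item"):
        quantity = int(item.get("quantity"))
        unit_price = float(item.get("unit-price"))
        total += quantity * unit_price
    return total

print(order_total(ORDER_DOCUMENT))
```

The parser used here only checks that the document is well-formed XML; it knows nothing about what an order means. That knowledge lives in order_total, which is why each new markup language still needs software written to process it.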
15.1.1. Creation Versus Display
While there is no limit to the kinds of markup languages you can create with XML, displaying your new documents may be more complicated. When you write HTML, a browser understands what to do with the <h1> tag because it is defined in the HTML DTD and browsers have been programmed to display all standard HTML tags.
With XML, you might create a new DTD for describing recipes. It would be a great way to capture and standardize all those kumquat recipes you've been collecting in your kitchen drawers. With special <ingredient> and <portion> tags, the recipes are easy to define and understand. However, browsers won't know what to do with these new tags unless you attach a stylesheet that defines their handling. Without a stylesheet, XML-capable browsers such as Internet Explorer 5 and Netscape 6 will render these tags in a very generic way, certainly not the flourishing presentation your kumquat recipes deserve.
Even with stylesheets, there are limitations to presenting XML-based information. Let's say you want to create something more challenging, such as a DTD for musical notation or silicon chip design. While describing these data types in a DTD is possible, displaying this information graphically is certainly beyond the capabilities of any stylesheets we've seen yet. It would require a specialized rendering tool to properly display this type of graphically rich information.
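The difference between generic and styled presentation can be sketched in a few lines of Python. Everything here is an assumption for illustration: the recipe DTD is a guess at what such a definition might look like, the STYLES table is only a stand-in for a real stylesheet, and render is a hypothetical function, not a model of how any browser works. Note also that the parser in Python's standard library reads the DTD but does not validate the document against it.

```python
import xml.etree.ElementTree as ET

# A hypothetical recipe document carrying its own small DTD.
RECIPE = """<!DOCTYPE recipe [
  <!ELEMENT recipe (title, ingredient+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT ingredient (#PCDATA | portion)*>
  <!ELEMENT portion (#PCDATA)>
]>
<recipe>
  <title>Kumquat Marmalade</title>
  <ingredient><portion>2 cups</portion> sliced kumquats</ingredient>
  <ingredient><portion>1 cup</portion> sugar</ingredient>
</recipe>"""

# A stand-in for a stylesheet: one formatting rule per element name.
STYLES = {
    "title": lambda text: text.upper(),
    "ingredient": lambda text: "  * " + text,
}

def render(xml_text, styles=None):
    """Render each child of the root, using a style rule when one exists."""
    styles = styles or {}
    lines = []
    for element in ET.fromstring(xml_text):
        # Gather the element's text, including text inside nested tags.
        text = "".join(element.itertext()).strip()
        rule = styles.get(element.tag)
        lines.append(rule(text) if rule else text)
    return "\n".join(lines)

print(render(RECIPE))          # generic: plain text, one line per element
print(render(RECIPE, STYLES))  # styled: uppercase title, bulleted ingredients
```

Without the style table, every element comes out as undifferentiated text, which is the kind of generic treatment an unstyled document receives.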
Nonetheless, your recipe DTD is a great tool for capturing and sharing recipes. As we'll see later in this chapter, XML isn't simply about creating markup languages for displaying content in browsers. It has great promise for sharing and managing information, so that those precious kumquat dishes will be preserved for many generations to come. Just bear in mind that in addition to writing a DTD to describe your new XML-based markup language, you will in most cases want to supplement the DTD with a stylesheet.
15.1.2. A Little History
To complete your education in the whys and wherefores of markup languages, it helps to know how all these markup languages came to be.
In the beginning, there was SGML, the Standard Generalized Markup Language. SGML was intended to be the only markup metalanguage, from which all other markup languages would be created. Everything from hieroglyphics to HTML can be defined using SGML, negating the need for any other metalanguage.
The problem with SGML is that it is so broad and all-encompassing that mere mortals cannot use it. Using SGML effectively requires very expensive and complex tools that are completely beyond the scope of regular people who just want to bang out an HTML document in their spare time. As a result, other markup languages that are greatly reduced in scope and much easier to use have been created. The HTML standards themselves were initially defined using a subset of SGML that eliminated many of the more esoteric features. The DTD in Appendix D, "The HTML 4.01 DTD", uses this subset of SGML to define the HTML 4.01 standard.
Recognizing that SGML was too unwieldy to describe HTML in a useful way and that there was a growing need to define other HTML-like markup languages, the World Wide Web Consortium defined XML. XML is a formal markup metalanguage that uses select features of SGML to define markup languages in a style similar to that of HTML. It eliminates many SGML features that aren't applicable to languages like HTML and simplifies others to make them easier to use and understand.
XML is a middle ground between SGML and HTML, a useful tool for defining a wide variety of markup languages. XML will become increasingly important as the Web extends beyond browsers and moves into the realm of direct data interchange between people, computers, and disparate systems. A small number of people may wind up creating new markup languages with XML, and many more people will want to be able to understand XML DTDs in order to use all these new markup languages.
Copyright © 2002 O'Reilly & Associates. All rights reserved.