Chapter 15. XML
HTML is a maverick. It follows the rules of formal electronic
document-markup design and implementation only loosely. HTML was born
out of the need to assemble text, graphics, and other digital content
into electronic documents that could be sent over the global
Internet. In the early days of the World Wide Web boom, the demand
for better browsers and document servers -- driven by hordes of
new users with insatiable appetites for more and cooler web
pages -- left little time for worrying about things like standards
and practices.
Of course, without guiding standards, HTML would have eventually
devolved into Babel. That almost happened during the browser wars in
the mid- to late 1990s. Chaos is not an acceptable foundation for an
industry whose value is already measured in the trillions of dollars.
Although the standards people at the W3C managed to rein in the
maverick HTML with standard Version 4, it is still too wild for the
royal herd of markup languages.
The HTML 4.01 standard is defined using the Standard Generalized Markup Language
(SGML). While more than adequate for formalizing HTML, SGML is far
too complex to use as a general tool for extending and enhancing
HTML. Instead, the W3C has devised a new standard known as the
Extensible Markup Language, or XML.
Based upon the simpler features of SGML, XML is kinder, gentler, and
more flexible, well-suited to guide the birth and orderly development
of new markup languages. With XML, HTML itself is being reborn as
XHTML.
In this chapter, we cover the basics of XML, including how to read
it, how to create simple XML Document Type Definitions (DTDs), and
the ways you might use XML to enhance your use of the Internet. In
the next chapter, we explore the depths of XHTML.
You don't have to understand all about XML to write XHTML. We
think it's helpful, but if you want to cut to the chase, feel
free to skip to the next chapter. However, you may want to take a
look at some of the up-and-coming uses of XML covered at the end of
this chapter, starting in Section 15.8, "Using XML".
This chapter provides only an overview of XML. Our goal is to whet
your appetite and make you conversant in XML.
For full fluency, consult books such as XML: A
Primer by Simon St. Laurent (IDG Books Worldwide), or
The XML Handbook by Paul Prescod and Charles
Goldfarb (Prentice Hall).
15.1. Languages and Metalanguages
A language is composed of symbols that
we assemble in a meaningful way to express ourselves and pass along
information in a way that is intelligible to others. For example,
English is a language with rules (grammar) that define how to put its
symbols (words) together to form sentences, paragraphs, and,
ultimately, books like the one you are holding. If you know the words
and understand the grammar, you can read the book, even if you
don't necessarily understand its contents.
An important difference between human and computer-based languages is
that human languages are self-describing. We use English sentences
and paragraphs to define how to create correct English sentences and
paragraphs. Our brains are marvelous machines that have no problem
understanding that you can use a language to describe itself.
However, computer languages are not so rich and computers are not so
bright that you could easily define a computer language with itself.
Instead, we can define one language -- a metalanguage
-- that defines the rules and symbols of another
language.
Software developers can use a metalanguage to establish the rules for
defining a language and then create one or more languages based on
those rules.[75] The metalanguage
also guides developers creating the automated agents that display or
otherwise process the contents of documents that authors have created
using that language.
XML is a metalanguage created by the W3C and is used by developers to
define markup languages such as XHTML. Browser developers rely on
XML's metalanguage rules to create automated processes that
read the language definition of XHTML and implement the processes
that ultimately display or otherwise process XHTML documents.
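To make the idea of a language definition driving an automated process more concrete, here is a minimal sketch in Python. It is not how XML DTDs or real browsers work: the "definition" is a hypothetical table of which elements may contain which others, the memo language and its tag names are invented for illustration, and a real DTD also controls ordering, repetition, and attributes. The point is only that one generic routine can check documents in any language once it is handed that language's rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified "language definition" for an invented
# memo language: each element name maps to the child elements it may hold.
# A real XML DTD expresses much more than this table does.
MEMO_RULES = {
    "memo": {"to", "from", "body"},
    "to": set(),
    "from": set(),
    "body": {"emphasis"},
    "emphasis": set(),
}

def check(element, rules):
    """Return a list of rule violations in element and its descendants."""
    if element.tag not in rules:
        return [f"unknown element <{element.tag}>"]
    problems = []
    for child in element:
        if child.tag not in rules[element.tag]:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            problems.extend(check(child, rules))
    return problems

good = ET.fromstring(
    "<memo><to>Chuck</to><from>Bill</from>"
    "<body>Ship it <emphasis>today</emphasis>.</body></memo>"
)
bad = ET.fromstring("<memo><to>Chuck</to><blink>Hi</blink></memo>")

good_problems = check(good, MEMO_RULES)
bad_problems = check(bad, MEMO_RULES)
print(good_problems)
print(bad_problems)
```

Swapping in a different rules table would let the same check routine handle a different markup language, which is the division of labor a metalanguage is meant to provide.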
Why bother with a markup metalanguage? Because as the familiar
proverb goes, the W3C wants to teach us how to fish so we can feed
ourselves for a lifetime. With XML, there is now a standardized way
to define markup languages that are customized for different needs
rather than having to rely upon HTML extensions. Mathematicians need
a way to express mathematical notations; composers need a way to
present musical scores; businesses want their web sites to take sales
orders from customers; physicians look to exchange medical records;
plant managers want to run their factories from web-based documents.
All these groups need an acceptable, resilient way to express these
different kinds of information, so that the software industry can
develop the programs that process and display these diverse
documents.
XML provides the answer. Each content sector -- the business
group, the factory-automation consortium, the trade
association -- may now define a markup language to suit its
particular needs for information exchange and processing over the
Web. Computer programmers can create XML-compliant
processes -- parsers -- that read the new language definitions
and allow the server to process the documents of those languages.
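As a rough illustration of what such a process might do once a parser has read a document, the following Python sketch totals a sales order. The order markup, its element and attribute names, and the order_total function are all assumptions made up for this example; no real industry vocabulary is implied. The parsing itself uses the ElementTree module from Python's standard library, which checks that the document is well formed but does not validate it against a DTD.

```python
import xml.etree.ElementTree as ET

# A document in a hypothetical sales-order language (names are invented).
ORDER = """<order customer="C-1001">
  <item sku="KQ-01" quantity="3" price="2.50">Kumquats</item>
  <item sku="JR-07" quantity="1" price="4.00">Jam jar</item>
</order>"""

def order_total(xml_text):
    """Parse an order document and add up quantity times price per item."""
    root = ET.fromstring(xml_text)
    total = 0.0
    for item in root.findall("item"):
        total += int(item.get("quantity")) * float(item.get("price"))
    return total

print(order_total(ORDER))
```

Nothing here displays the order. The program simply extracts the data the markup identifies, which is the kind of server-side processing this paragraph has in mind.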
15.1.1. Creation Versus Display
While there is no limit to the kinds of markup languages you can
create with XML, displaying your new documents may be
more complicated. When you write HTML, a browser understands what to
do with the <h1> tag because it is defined
in the HTML DTD and browsers have been programmed to display all
standard HTML tags.
With XML, you might create a new DTD for describing recipes. It would
be a great way to capture and standardize all those kumquat recipes
you've been collecting in your kitchen drawers. With special
<ingredient> and
<portion> tags, the recipes are easy to
define and understand. However, browsers won't know what to do
with these new tags unless you attach a stylesheet that defines
their handling. Without a stylesheet, XML-capable browsers such as
Internet Explorer 5 and Netscape 6 will render these tags
in a very generic way, certainly not the flourishing presentation your
kumquat recipes deserve.
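The difference between generic handling and handling guided by presentation rules can be sketched in a few lines of Python. This is only an analogy: real browsers apply CSS or XSL stylesheets, not Python functions, and the recipe document, the title tag, and both rendering functions are assumptions invented for the example. Only the ingredient and portion tags come from the recipe idea described above.

```python
import xml.etree.ElementTree as ET

# A hypothetical recipe document; <title> is an invented addition.
RECIPE = """<recipe>
  <title>Kumquat Marmalade</title>
  <ingredient>kumquats<portion>2 cups</portion></ingredient>
  <ingredient>sugar<portion>1 cup</portion></ingredient>
</recipe>"""

def render_generic(root):
    """With no presentation rules, emit all the text in document order."""
    return " ".join(t.strip() for t in root.itertext() if t.strip())

def render_styled(root):
    """Apply invented rules, standing in for a stylesheet: an uppercase
    title, then one bulleted line per ingredient with the portion first."""
    lines = [root.findtext("title").upper()]
    for ingredient in root.findall("ingredient"):
        name = (ingredient.text or "").strip()
        portion = ingredient.findtext("portion", default="").strip()
        lines.append(f"  * {portion} {name}")
    return "\n".join(lines)

recipe = ET.fromstring(RECIPE)
print(render_generic(recipe))
print(render_styled(recipe))
```

The generic version runs every piece of text together, while the styled version knows that a portion belongs in front of its ingredient. That knowledge has to come from somewhere outside the document itself.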
Even with stylesheets, there are limitations to
presenting XML-based information. Let's say you want to create
something more challenging, such as a DTD for musical notation or
silicon chip design. While describing these data types in a DTD is
possible, displaying this information graphically is certainly beyond
the capabilities of any stylesheets we've seen yet. It would
require a specialized rendering tool to properly display this type of
graphically rich information.
Nonetheless, your recipe DTD is a great tool for capturing and
sharing recipes. As we'll see later in this chapter, XML
isn't simply about creating markup languages for displaying
content in browsers. It has great promise for sharing and managing
information, so that those precious kumquat dishes will be preserved
for many generations to come. Just bear in mind that in addition to
writing a DTD to describe your new XML-based markup language, you
will in most cases want to supplement the DTD with a
stylesheet.[76]
15.1.2. A Little History
To complete your education into the whys and wherefores of markup
languages, it helps to know how all these markup languages came to
be.
In the beginning, there was SGML, the Standard Generalized Markup
Language. SGML was intended to be the only markup metalanguage, from
which all other markup languages would be created. Everything from
hieroglyphics to HTML can be defined using SGML, negating the need
for any other metalanguage.
The problem with SGML is that it is so broad and all-encompassing
that mere mortals cannot use it. Using SGML effectively requires very
expensive and complex tools that are completely beyond the scope of
regular people who just want to bang out an HTML document in their
spare time. As a result, other markup languages that are greatly
reduced in scope and much easier to use have been created. The HTML
standards themselves were initially defined using a subset of SGML
that eliminated many of the more esoteric features. The DTD in
Appendix D, "The HTML 4.01 DTD", uses this subset of SGML to define
the HTML 4.01 standard.
Recognizing that SGML was too unwieldy to describe HTML in a useful
way and that there was a growing need to define other HTML-like
markup languages, the World Wide Web Consortium defined XML. XML is a
formal markup metalanguage that uses select features of SGML to
define markup languages in a style similar to that of HTML. It
eliminates many SGML features that aren't applicable to
languages like HTML and simplifies other features to make them easier
to use and understand.
XML is a middle ground between SGML and HTML, a useful tool for
defining a wide variety of markup languages. XML will become
increasingly important as the Web extends beyond browsers and moves
into the realm of direct data interchange between people, computers,
and disparate systems. A small number of people may wind up creating
new markup languages with XML, and many more people will want to be
able to understand XML DTDs in order to use all these new
markup languages.
Copyright © 2002 O'Reilly & Associates. All rights reserved.