Chapter 1. Introducing XML
XML, the Extensible Markup Language, is a W3C-endorsed standard for
document markup. It defines a generic syntax used to mark up data
with simple, human-readable tags. It provides a standard format for
computer documents. This format is flexible enough to be customized
for domains as diverse as web sites, electronic data interchange,
vector graphics, genealogy, real-estate listings, object
serialization, remote procedure calls, voice-mail systems, and more.
You can write your own programs that interact with, massage, and
manipulate the data in XML documents. If you do,
you'll have access to a wide range of free libraries
in a variety of languages that can read and write XML so that you can
focus on the unique needs of your program. Or you can use
off-the-shelf software, such as web browsers and text editors, to
work with XML documents. Some tools are able to work with any XML
document. Others are customized to support a particular XML
application in a particular domain, such as vector graphics, and may
not be of much use outside that domain. But in all cases, the same
underlying syntax is used, even if it's deliberately
hidden by the more user-friendly tools or restricted to a single
application.
1.1. The Benefits of XML
XML is a metamarkup language for text
documents. Data is included in XML documents as strings of text. The
data is surrounded by text markup that describes the data.
XML's basic unit of data and markup is called an
element. The XML specification defines the exact
syntax this markup must follow: how elements are delimited by tags,
what a tag looks like, what names are acceptable for elements, where
attributes are placed, and so forth. Superficially, the markup in an
XML document looks a lot like the markup in an HTML document, but
there are some crucial differences.
Most importantly, XML is a metamarkup language.
That means it doesn't have a fixed set of tags and
elements that are supposed to work for everybody in all areas of
interest for all time. Any attempt to create a finite set of such
tags is doomed to failure. Instead, XML allows developers and writers
to define the elements they need as they need them. Chemists can use
elements that describe molecules, atoms, bonds, reactions, and other
items encountered in chemistry. Real-estate agents can use elements
that describe apartments, rents, commissions, locations, and other
items needed for real estate. Musicians can use elements that
describe quarter notes, half notes, G-clefs, lyrics, and other
objects common in music. The X in XML stands for
Extensible. Extensible means that the language
can be extended and adapted to meet many different needs.
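As a minimal sketch of what defining your own elements looks like in practice, the following Python snippet reads a small document written in a made-up real-estate vocabulary. The element and attribute names (listing, apartment, location, rent, commission, id, currency) are assumptions invented for illustration, not part of any standard. The parser is Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical real-estate vocabulary; these names are invented for this
# sketch and do not come from any published XML application.
LISTING = """<listing id="A-17">
  <apartment>
    <location>Brooklyn</location>
    <rent currency="USD">2400</rent>
    <commission>0.15</commission>
  </apartment>
</listing>"""

root = ET.fromstring(LISTING)      # parse the text into a tree of elements
apartment = root.find("apartment")  # first child element named "apartment"
rent = apartment.find("rent")

print(root.tag, root.get("id"))           # element name and an attribute value
print(apartment.findtext("location"))     # text content of a child element
print(rent.get("currency"), int(rent.text))
```

The parser knows nothing about apartments or rents. It only needs the markup to follow XML's syntax rules, which is what lets one generic library serve any vocabulary.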
Although XML is quite flexible in the elements it allows to be
defined, it is quite strict in many other respects. It provides a
grammar for XML documents that says where tags may be placed, what
they must look like, which element names are legal, how attributes
are attached to elements, and so forth. This grammar is specific
enough to allow the development of XML parsers that can read any XML
document. Documents that satisfy this grammar are said to be
well-formed.
Documents that are not well-formed are not allowed, any more than a C
program that contains a syntax error is allowed. XML processors will
reject documents that contain well-formedness errors.
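The rejection of malformed documents can be observed with almost any parser. The sketch below uses Python's standard xml.etree.ElementTree module, which raises ParseError on a well-formedness error. The helper name is_well_formed and the sample documents are assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser refuses the whole document; it does not try to guess
        # what the author meant.
        return False
    return True

good = "<person><name>Alan Turing</name></person>"
bad = "<person><name>Alan Turing</person>"  # the name element is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

This check says nothing about validity. The parser used here only tests the document against XML's grammar, not against any schema.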
For reasons of interoperability, individuals or organizations may
agree to use only certain tags. These tag sets are called XML
applications. An XML application is not a software
application that uses XML, such as Mozilla or Microsoft Word. Rather,
it's an application of XML in a particular domain
like vector graphics or cooking.
The markup in an XML document describes the structure of the
document. It lets you see which elements are associated with which
other elements. In a well-designed XML document, the markup also
describes the document's semantics. For instance,
the markup can indicate that an element is a date or a person or a
bar code. In well-designed XML applications, the markup says nothing
about how the document should be displayed. That is, it does not say
that an element is bold or italicized or a list item. XML is a
structural and semantic markup language, not a presentation
language.[1]
The markup permitted in a particular XML application can be documented
in a schema. Particular document instances can be compared to the
schema. Documents that match the schema are said to be valid.
Documents that do not match are invalid.
Validity depends on the schema. That is, whether a document is valid
or invalid depends on which schema you compare it to. Not all
documents need to be valid. For many purposes it is enough that the
document merely be well-formed.
There are many different XML schema languages, with different
levels of expressivity. The most broadly supported schema language
and the only one defined by the XML 1.0 specification itself is the
document type definition (DTD). A DTD lists all
the legal markup and specifies where and how it may be included in a
document. DTDs are optional in XML. On the other hand, DTDs may not
always be enough. The DTD syntax is quite limited and does not allow
you to make many useful statements such as "This
element contains a number" or "This
string of text is a date between 1974 and 2032." The
W3C XML Schema Language (which sometimes goes by the misleadingly
generic label schemas) does allow you to express
constraints of this nature. Besides these two, there are many other
schema languages from which to choose, including RELAX NG,
Schematron, Hook, and Examplotron, and this is hardly an exhaustive
list.
All current schema languages are purely declarative. However, there
are always some constraints that cannot be expressed in anything less
than a Turing-complete programming language. For example, given an XML
document that represents an order, a Turing-complete language is
required to multiply the
price of each order_item by its
quantity, sum them all up, and verify that the sum
equals the value of the subtotal element.
Today's schema languages are also incapable of
verifying extra-document constraints such as "Every
SKU element matches the SKU field of a record in
the products table of the inventory database." If
you're writing programs to read XML documents, you
can add code to verify statements like these, just as you would if
you were writing code to read a tab-delimited text file. The
difference is that XML parsers present you with the data in a much
more convenient format and do more of the work for you before you
have to resort to your own custom code.
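The order check described above might be written along the following lines. This is only a sketch under stated assumptions: the text names the order_item, quantity, and subtotal elements, but the exact document layout (a price and a quantity child inside each order_item, and a subtotal child of the root order element) and the function name subtotal_matches are invented here. Decimal is used rather than floating point so that currency amounts compare exactly.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

# Assumed layout: the source names the elements but not their nesting.
ORDER = """<order>
  <order_item><price>19.95</price><quantity>2</quantity></order_item>
  <order_item><price>5.00</price><quantity>3</quantity></order_item>
  <subtotal>54.90</subtotal>
</order>"""

def subtotal_matches(order_xml):
    """Check that the sum of price * quantity equals the stated subtotal."""
    order = ET.fromstring(order_xml)
    total = sum(
        (Decimal(item.findtext("price")) * int(item.findtext("quantity"))
         for item in order.findall("order_item")),
        Decimal("0"),
    )
    return total == Decimal(order.findtext("subtotal"))

print(subtotal_matches(ORDER))
```

The parser has already split the document into elements and text by the time this code runs, so the custom logic is limited to the arithmetic itself.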
1.1.1. What XML Is Not
XML is a markup language, and it is only a markup language.
It's important to remember that. The XML hype has
gotten so extreme that some people expect XML to do everything up to
and including washing the family dog.
First of all, XML is not a programming language. There's no such
thing as an XML compiler that reads
XML files and produces executable code. You might perhaps define a
scripting language that used a native XML format and was interpreted
by a binary program, but even this application would be
unusual.[2] XML can be used as a format for instructions to programs
that do make things happen, just as a traditional program may read a
text config file and take different action depending on what it sees
there. Indeed, there's no reason a config file can't be XML instead of
unstructured text. Some more recent programs are beginning to use XML
config files, but in all cases it's the program taking action, not the
XML document itself. An XML document by itself simply is. It does not
do anything.
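To make the division of labor concrete, here is a minimal sketch of a program that reads an XML config file and acts on it. The config vocabulary (config, logging, cache, and their attributes) and the function name load_settings are hypothetical, made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical config vocabulary, invented for this sketch.
CONFIG = """<config>
  <logging level="debug"/>
  <cache enabled="false"/>
</config>"""

def load_settings(config_xml):
    """Turn the XML config text into a plain dictionary of settings."""
    root = ET.fromstring(config_xml)
    return {
        "log_level": root.find("logging").get("level", "info"),
        "cache_enabled": root.find("cache").get("enabled") == "true",
    }

settings = load_settings(CONFIG)

# The decision is made here, in program code. The XML only supplies data.
if settings["cache_enabled"]:
    print("cache on")
else:
    print("cache off")
```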
Secondly, XML is not a network transport protocol. XML won't send
data across the network, any more than
HTML will. Data sent across the network using HTTP, FTP, NFS, or some
other protocol might happen to be encoded in an XML format, but again
there has to be some software outside the XML document that actually
does the sending.
Finally, to mention the example where the hype most often obscures
the reality, XML is not a database. You're not going to replace
an Oracle or MySQL server with XML. A database can contain XML data, either as
a VARCHAR or a BLOB or as some custom XML data type, but the database
itself is not an XML document. You can store XML data in a database
on a server or retrieve data from a database in an XML format, but to
do this, you need to be running software written in a real
programming language such as C or Java. To store XML in a database,
software on the client side will send the XML data to the server
using an established network protocol such as TCP/IP. Software on the
server side will receive the XML data, parse it, and store it in the
database. To retrieve an XML document from a database,
you'll generally pass through some middleware
product like Enhydra that makes SQL queries against the database and
formats the result set as XML before returning it to the client.
Indeed, some databases may integrate this software code into their
core server or provide plug-ins to do it such as the Oracle XSQL
servlet. XML serves very well as a ubiquitous, platform-independent
transport format in these scenarios. However, it is not the database,
and it shouldn't be used as one.
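The retrieval path described above, in which software queries the database and formats the result set as XML, can be sketched in a few lines. This is an illustration only, not how Enhydra or the Oracle XSQL servlet work. It uses Python's standard sqlite3 module with an in-memory database, and the products table, its columns, the result and row element names, and the function rows_to_xml are all assumptions invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

def rows_to_xml(connection, query):
    """Run query and wrap each row of the result set in a row element.

    Assumes every column name is also a legal XML element name.
    """
    cursor = connection.execute(query)
    columns = [description[0] for description in cursor.description]
    root = ET.Element("result")
    for row in cursor:
        row_element = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            ET.SubElement(row_element, column).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical table and data, held in memory for the sake of the sketch.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE products (sku TEXT, name TEXT)")
connection.execute("INSERT INTO products VALUES ('X100', 'Widget')")

document = rows_to_xml(connection, "SELECT sku, name FROM products")
print(document)
```

The database does the storage and querying, the program does the formatting, and XML is only the shape the data takes on its way to the client.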
Copyright © 2002 O'Reilly & Associates. All rights reserved.