2.9. Free-Form XML and Well-Formed Documents
XML's grandfather, SGML, required that every element
and attribute be documented thoroughly with a long list of
declarations in the DTD. We'll describe what we mean
by that thorough documentation in the next section, but for now,
imagine it as a blueprint for a document. This blueprint added
considerable overhead to the processing of a document and was a
serious obstacle to SGML's adoption as a popular
markup language for the Internet. HTML, which was originally
developed as an SGML application, was hobbled by this enforced
structure, since any "valid" HTML
document had to conform to the HTML DTD. Hence, extending the
language was impossible without approval by a web committee.
XML does away with that requirement by allowing a special condition
called free-form XML. In this mode, a document has to follow
only minimal syntax rules to be acceptable. If it follows those
rules, the document is well-formed.
Having only these rules to follow is wonderfully liberating for a developer
because it means that you don't have to scan a DTD
every time you want to process a piece of XML. All a processor has to
do is make sure that the minimal syntax rules are followed.
In free-form XML, you can choose any name you like for an element. It
doesn't have to belong to a sanctioned vocabulary,
as is the case with HTML. Letting frivolous markup into your
program is a risk, but as long as you know what
you're doing, it's okay. If you
don't trust the markup to fit the pattern
you're looking for, then you need to use element and
attribute declarations, as we describe in the next section.
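As a minimal sketch of what free-form processing looks like in practice, the following Python snippet parses a document that has no DTD and uses invented element names. The document, its element names (`spaceship`, `name`, `crew`, `engine`), and the choice of Python's standard `xml.etree.ElementTree` module are assumptions made for illustration only. Because nothing declares the expected structure, the code looks elements up defensively instead of trusting them to be there.

```python
import xml.etree.ElementTree as ET

# Hypothetical free-form document: no DTD, element names made up on the spot.
doc = """<spaceship>
  <name>Rocinante</name>
  <crew count="4"/>
</spaceship>"""

# fromstring() raises ET.ParseError only if the text is not well-formed.
root = ET.fromstring(doc)

# The parser accepted the invented names, but nothing guarantees the
# structure we hope for, so every lookup allows for a missing element.
name = root.findtext("name", default="(unnamed)")
engine = root.find("engine")  # this element is absent from the document

print(root.tag, name, engine is None)
```

The parser is satisfied as soon as the syntax is right; whether `engine` ought to be present is a question only declarations (or your own checks) can answer.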
What are these rules? Here's a short list as seen
through a coarse-grained spyglass:
-
A document can have only one top-level element, the
document element, that contains all the other elements
and data. This element does not include the XML declaration and
document type declaration, which must precede it.
-
Every element with content must have both a start tag and an end tag.
-
Element and attribute names are case sensitive, and only certain
characters can be used (letters, underscores, hyphens, periods, and
numbers), with only letters and underscores eligible as the first
character. Colons are allowed, but only as part of a declared
namespace prefix.
-
All attributes must have values and all attribute values must be
quoted.
-
Elements may never overlap; an element's start and
end tags must both appear within the same element.
-
Certain characters, including angle brackets
(< >) and the
ampersand (&), are reserved for markup and are
not allowed in parsed content. Use character entity references
(such as &lt; and &amp;) instead, or just stick the offending
content into a CDATA section.
-
Empty elements must use a syntax distinguishing them from nonempty
element start tags. The syntax requires a slash (/) before
the closing bracket (>) of the tag.
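The rules above can be tried out with a few lines of Python. This is only a sketch: the helper name `is_well_formed` and all of the sample strings are invented for illustration, and it relies on the standard `xml.etree.ElementTree` module, whose parser is namespace-aware and therefore also rejects an undeclared prefix. Each sample breaks one rule from the list, and each repaired version shows the corresponding fix.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One illustrative violation per rule (sample strings are made up).
broken = {
    "two top-level elements":      "<a/><b/>",
    "missing end tag":             "<a><b>text</a>",
    "case mismatch in tag names":  "<Note>text</note>",
    "name starts with a digit":    "<1st>text</1st>",
    "undeclared namespace prefix": "<x:a/>",
    "attribute without a value":   "<a checked/>",
    "unquoted attribute value":    "<a size=10/>",
    "overlapping elements":        "<a><b>text</a></b>",
    "bare ampersand":              "<a>fish & chips</a>",
    "bare left angle bracket":     "<a>1 < 2</a>",
    "empty element without slash": "<p>line<br></p>",
}

# Repaired counterparts for a few of the samples above.
repaired = [
    "<doc><a/><b/></doc>",
    "<a checked='checked' size='10'/>",
    "<a>fish &amp; chips</a>",
    "<a><![CDATA[1 < 2 & 3]]></a>",
    "<p>line<br/></p>",
]

for label, text in broken.items():
    print(f"{label:30} well-formed: {is_well_formed(text)}")
for text in repaired:
    print(f"{text:30} well-formed: {is_well_formed(text)}")
```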
You will encounter more rules, so for a more complete understanding
of well-formedness, you should either read an introductory book on
XML or look at the W3C's official recommendation at
http://www.w3.org/XML.
If you want to be able to process your document with XML-using
programs, make sure it is always well formed. (After all,
there's no such thing as non-well-formed XML.) A
tool often used to check this status is called a
well-formedness checker, which is a type of XML
parser that reports errors to the user. Such a tool can often
give a detailed analysis, including the exact line number in a
file where the problem occurs.
We'll discuss checkers and parsers in Chapter 3, "XML Basics: Reading and Writing".
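To give a feel for what such a checker does, here is a small sketch built on Python's standard `xml.parsers.expat` module. The function name `check_well_formed` and the sample document are hypothetical; a real checker would typically read a file and offer more options. The sketch returns nothing for a well-formed document and otherwise the line, column, and message reported by the parser.

```python
import xml.parsers.expat

def check_well_formed(text):
    """Return None if text is well-formed, else (line, column, message)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)  # True marks this as the final chunk of input
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        return err.lineno, err.offset, message
    return None

# Hypothetical sample: the first <item> on line 3 is never closed, which
# the parser notices only when it reaches </report> on line 5.
sample = """<report>
  <title>Status</title>
  <item>first
  <item>second</item>
</report>"""

problem = check_well_formed(sample)
if problem is None:
    print("well-formed")
else:
    line, column, message = problem
    print(f"line {line}, column {column}: {message}")
```

Note that the reported line is where the parser detected the problem, which may be later than the line where the mistake was actually made.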
Copyright © 2002 O'Reilly & Associates. All rights reserved.