

Webmaster in a Nutshell, 3rd Edition

10.2. XML Reference

Now that you have had a quick taste of working with XML, here is an overview of the more common rules and constructs of the XML language.

10.2.1. Well-Formed XML

These are the rules for a well-formed XML document:

  • All element attribute values must be in quotation marks.

  • An element must have both an opening and a closing tag, unless it is an empty element.

  • If a tag is a standalone empty element, it must contain a closing slash (/) before the end of the tag.

  • All opening and closing element tags must nest correctly.

  • Isolated markup characters are not allowed in text; < or & must use entity references. In addition, the sequence ]]> must be expressed as ]]&gt; when used as regular text. (Entity references are discussed in further detail later.)

  • Well-formed XML documents without a corresponding DTD must have all attributes of type CDATA by default.
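
The rules above can be checked mechanically by handing a document to any conforming parser. The following is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are illustrative and not part of XML itself.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as well-formed XML, else False."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples, one or two per rule in the list above.
samples = {
    "quoted attribute":    '<a href="x.html">ok</a>',
    "unquoted attribute":  '<a href=x.html>bad</a>',
    "missing closing tag": '<p>text',
    "empty element":       '<doc><br/></doc>',
    "empty without slash": '<doc><br></doc>',
    "bad nesting":         '<b><i>text</b></i>',
    "bare ampersand":      '<p>AT&T</p>',
    "escaped ampersand":   '<p>AT&amp;T</p>',
}

for label, text in samples.items():
    print(f"{label:20} {is_well_formed(text)}")
```

A parser of this kind reports only whether the document is well-formed; it does not validate the document against a DTD.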

10.2.2. Special Markup

XML uses the following special markup constructs.

<?xml ...?>

Although they are not required to, XML documents typically begin with an XML declaration, which must start with the characters <?xml and end with the characters ?>. Attributes include:

Attributes

version
The version attribute specifies the correct version of XML required to process the document, which is currently 1.0. This attribute cannot be omitted.

encoding
The encoding attribute specifies the character encoding used in the document (e.g., UTF-8 or iso-8859-1). UTF-8 and UTF-16 are the only encodings that an XML processor is required to handle. This attribute is optional.

standalone
The optional standalone attribute specifies whether the document depends on declarations in an external DTD. The value must be either yes or no (the default). If the value is no or the attribute is not present, the document may rely on an external DTD, which is referenced with an XML <!DOCTYPE> instruction. If it is yes, no external DTD is required to parse the document.

For example:

<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
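
To see how a parser reports these attributes, here is a small sketch that assumes Python's xml.parsers.expat module. The function name read_declaration is hypothetical; XmlDeclHandler is the expat callback that receives the three values.

```python
import xml.parsers.expat

def read_declaration(data):
    """Return (version, encoding, standalone) from the XML declaration.

    expat reports standalone as 1 for "yes", 0 for "no", and -1 when
    the attribute is omitted. Returns None if there is no declaration.
    """
    found = []
    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = (
        lambda version, encoding, standalone:
            found.append((version, encoding, standalone)))
    parser.Parse(data, True)
    return found[0] if found else None

print(read_declaration(b'<?xml version="1.0"?><doc/>'))
print(read_declaration(
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><doc/>'))
print(read_declaration(b'<doc/>'))  # a document with no declaration
```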

Syntax

<?xml version="number"
[encoding="encoding"]
[standalone="yes|no"] ?>

<?...?>

A processing instruction allows developers to place information specific to an outside application within the document. Processing instructions always begin with the characters <? and end with the characters ?>. For example:

<?works document="hello.doc" data="hello.wks"?>

You can create your own processing instructions if the XML application processing the document is aware of what the data means and acts accordingly.
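
An application that understands a processing instruction has to pick it out of the document itself. The sketch below assumes Python's xml.parsers.expat module and wraps the works instruction shown above in a hypothetical <doc/> element; the parser hands over the target and the rest of the instruction as one uninterpreted string, and it is up to the application to decide what that string means.

```python
import xml.parsers.expat

def processing_instructions(data):
    """Return a list of (target, content) pairs found in the document."""
    found = []
    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = (
        lambda target, content: found.append((target, content)))
    parser.Parse(data, True)
    return found

doc = b'<?works document="hello.doc" data="hello.wks"?><doc/>'
for target, content in processing_instructions(doc):
    # The parser does not split the content into attributes.
    print(target, "->", content)
```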

Syntax

<?target attribute1="value"
attribute2="value" 
... ?>

<!DOCTYPE>

The <!DOCTYPE> instruction allows you to specify a DTD for an XML document. This instruction currently takes one of two forms:

<!DOCTYPE root-element SYSTEM "URI_of_DTD">
<!DOCTYPE root-element PUBLIC "name" "URI_of_DTD">

Keywords

SYSTEM
The SYSTEM variant specifies the URI location of a DTD for private use in the document. For example:

<!DOCTYPE Book SYSTEM
   "http://mycompany.com/dtd/mydoctype.dtd">

PUBLIC
The PUBLIC variant is used in situations in which a DTD has been publicized for widespread use. In these cases, the DTD is assigned a unique name, which the XML processor may use by itself to attempt to retrieve the DTD. If this fails, the URI is used:

<!DOCTYPE Book PUBLIC "-//O'Reilly//DTD//EN"
   "http://www.oreilly.com/dtd/xmlbk.dtd">

Public DTDs follow a specific naming convention. See the XML specification for details on naming public DTDs.
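
The two forms can be told apart programmatically. This sketch assumes Python's xml.parsers.expat module, whose StartDoctypeDeclHandler callback receives the root element name, the system identifier (the URI), and the public identifier (the name, or None for the SYSTEM form). The function name read_doctype is hypothetical, and by default expat records the identifiers without fetching the DTD.

```python
import xml.parsers.expat

def read_doctype(data):
    """Return (root_name, system_id, public_id), or None if no DOCTYPE."""
    found = []
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.append((name, system_id, public_id))

    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(data, True)
    return found[0] if found else None

system_doc = b'''<!DOCTYPE Book SYSTEM
   "http://mycompany.com/dtd/mydoctype.dtd">
<Book/>'''

public_doc = b'''<!DOCTYPE Book PUBLIC "-//O'Reilly//DTD//EN"
   "http://www.oreilly.com/dtd/xmlbk.dtd">
<Book/>'''

print(read_doctype(system_doc))
print(read_doctype(public_doc))
```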

Syntax

<!DOCTYPE root-element SYSTEM|PUBLIC
["name"] "URI_of_DTD">

<!-- ... -->

You can place comments anywhere in an XML document, except within element tags or before the XML declaration. Comments in an XML document always start with the characters <!-- and end with the characters -->. In addition, a comment may not contain a double hyphen (--) anywhere in its body. The contents of the comment are ignored by the XML processor. For example:

<!-- Sales Figures Start Here -->
<Units>2000</Units>
<Cost>49.95</Cost>
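
The placement rules for comments can be tried out with a parser. This is a minimal sketch assuming Python's xml.etree.ElementTree module; the helper name parses and the wrapping <Sales> element are illustrative additions.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<Sales><!-- Sales Figures Start Here --><Units>2000</Units></Sales>"
double_hyphen = "<Sales><!-- 2000 -- 2001 --><Units>2000</Units></Sales>"
inside_tag = "<Sales <!-- note --> ><Units>2000</Units></Sales>"
before_declaration = '<!-- early --><?xml version="1.0"?><Sales/>'

for label, text in [("good", good),
                    ("double hyphen", double_hyphen),
                    ("inside a tag", inside_tag),
                    ("before declaration", before_declaration)]:
    print(f"{label:20} {parses(text)}")

# With default settings the comment is dropped from the resulting tree,
# so <Sales> has a single child element.
print(len(ET.fromstring(good)))
```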

Syntax

<!-- comments -->

CDATA

You can define special sections of character data, or CDATA, which the XML processor does not attempt to interpret as markup. Anything included inside a CDATA section is treated as plain text. CDATA sections begin with the characters <![CDATA[ and end with the characters ]]>. For example:

<![CDATA[
   I'm now discussing the <element> tag of documents
   5 & 6: "Sales" and "Profit and Loss". Luckily,
   the XML processor won't apply rules of formatting
   to these sentences!
]]>

Note that entity references inside a CDATA section will not be expanded.
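
As an illustration, the sketch below parses the same characters once inside a CDATA section and once as ordinary text. It assumes Python's xml.etree.ElementTree module, and the <note> element is a hypothetical wrapper.

```python
import xml.etree.ElementTree as ET

# Inside CDATA, <, & and even &amp; are passed through as literal text.
in_cdata = ET.fromstring(
    "<note><![CDATA[The <element> tag of documents 5 & 6 &amp; 7]]></note>")
print(in_cdata.text)

# Outside CDATA, the markup characters must be escaped, and the
# entity references are expanded by the parser.
escaped = ET.fromstring(
    "<note>The &lt;element&gt; tag of documents 5 &amp; 6</note>")
print(escaped.text)
```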

Syntax

<![CDATA[ ... ]]>

10.2.3. Element and Attribute Rules

An element is either bound by its start and end tags or is an empty element. Elements can contain text, other elements, or a combination of both. For example:

<para>
   Elements can contain text, other elements, or
   a combination. For example, a chapter might 
   contain a title and multiple paragraphs, and 
   a paragraph might contain text and 
   <emphasis>emphasis elements</emphasis>.
</para>

An element name must start with a letter or an underscore. It can then have any number of letters, numbers, hyphens, periods, or underscores in its name. Elements are case-sensitive: <Para>, <para>, and <pArA> are considered three different element types.

Element type names should not start with the string xml in any variation of upper- or lowercase. Names beginning with xml are reserved for special uses by the W3C XML Working Group, although many parsers do not reject them. Colons (:) are permitted in element type names only for specifying namespaces; otherwise, colons are forbidden. For example:

Example          Comment
<Italic>         Legal
<_Budget>        Legal
<Punch line>     Illegal: has a space
<205Para>        Illegal: starts with number
<repair@log>     Illegal: contains @ character
<xmlbob>         Illegal: starts with xml
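
The names in the table can be tested against a real parser. The following sketch assumes Python's xml.etree.ElementTree module (built on expat), and the helper name name_parses is hypothetical. Note one difference from the table: names beginning with xml are reserved by the specification rather than forbidden by the grammar, so this parser accepts <xmlbob> even though you should not use it.

```python
import xml.etree.ElementTree as ET

def name_parses(name):
    """Return True if <name/> is accepted as a one-element document."""
    try:
        ET.fromstring(f"<{name}/>")
        return True
    except ET.ParseError:
        return False

names = ["Italic", "_Budget", "Punch line", "205Para", "repair@log", "xmlbob"]
for name in names:
    print(f"<{name}>".ljust(16), name_parses(name))
```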

Element type names can also include accented Roman characters, letters from other alphabets (e.g., Cyrillic, Greek, Hebrew, Arabic, Thai, Hiragana, Katakana, or Devanagari), and ideograms from the Chinese, Japanese, and Korean languages. Valid element type names can therefore include <são>, <peut-être>, <più>, and <niño>, plus a number of others our publishing system isn't equipped to handle.

If you use a DTD, the content of an element is constrained by its DTD declaration. DTD-aware XML editing applications can tell you which elements and attributes may appear inside a specific element. Otherwise, you should check the element declaration in the DTD to determine exactly what is allowed.

Attributes describe additional information about an element. They always consist of a name and a value, as follows:

<price currency="Euro">

The attribute value is always quoted, using either single or double quotes. Attribute names are subject to the same restrictions as element type names.
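
A short sketch of reading an attribute back, assuming Python's xml.etree.ElementTree module. The helper name currency_of and the 49.95 element content are illustrative; the point is that single and double quotes produce the same value.

```python
import xml.etree.ElementTree as ET

def currency_of(text):
    """Parse a <price> element and return its currency attribute."""
    return ET.fromstring(text).get("currency")

double_quoted = '<price currency="Euro">49.95</price>'
single_quoted = "<price currency='Euro'>49.95</price>"

print(currency_of(double_quoted))
print(currency_of(single_quoted))
print(ET.fromstring(double_quoted).attrib)  # all attributes as a dict
```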

10.2.4. XML Reserved Attributes

The following are reserved attributes in XML.

xml:lang

The xml:lang attribute can be used on any element. Its value indicates the language of the body of the element. This is useful in a multilingual context. For example, you might have:

<para xml:lang="en">Hello</para>
<para xml:lang="fr">Bonjour</para>

This format allows you to display one element or the other, depending on the user's language preference.
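
Selecting by language might look like the following sketch, which assumes Python's xml.etree.ElementTree module. ElementTree exposes xml:lang under the reserved XML namespace URI; the function name pick_paragraph and the <greeting> wrapper element are hypothetical, and the comparison is an exact case-insensitive match rather than full language-range matching.

```python
import xml.etree.ElementTree as ET

# ElementTree reports the xml: prefix as this fixed namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def pick_paragraph(doc_text, preferred):
    """Return the text of the first <para> whose xml:lang matches."""
    root = ET.fromstring(doc_text)
    for para in root.iter("para"):
        if para.get(XML_LANG, "").lower() == preferred.lower():
            return para.text
    return None

doc = """<greeting>
  <para xml:lang="en">Hello</para>
  <para xml:lang="fr">Bonjour</para>
</greeting>"""

print(pick_paragraph(doc, "fr"))
print(pick_paragraph(doc, "EN"))
```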

The syntax of the xml:lang value builds on the language codes of ISO 639. A two-letter language code is optionally followed by a hyphen and a two-letter country code from ISO 3166. Traditionally, the language is given in lowercase and the country in uppercase (and for safety, this rule should be followed), but processors are expected to use the values in a case-insensitive manner.

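In addition, the IETF language-tag rules that XML adopts (RFC 1766 and its successors) provide extensions for nonstandardized languages or language variants: the i- prefix marks IANA-registered tags, and the x- prefix marks private-use tags. Valid xml:lang values include notations such as en, en-US, en-GB, en-cockney, i-navajo, and x-minbari.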

Syntax

xml:lang="iso_639_identifier"

xml:space

The xml:space attribute indicates whether any whitespace inside the element is significant and should not be altered by the XML processor. The attribute can take one of two enumerated values:

preserve
The XML application preserves all whitespace (newlines, spaces, and tabs) present within the element.

default
The XML application uses its default processing rules when deciding to preserve or discard the whitespace inside the element.

You should set xml:space to preserve only if you want an element to behave like the HTML <pre> element, such as when it documents source code.
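
The parser itself passes all whitespace through; it is the application that decides what to do with it. The sketch below shows one way an application might honor the attribute, assuming Python's xml.etree.ElementTree module. The function name display_text and the sample document are hypothetical, and for brevity the sketch looks only at the element's own attribute, ignoring values inherited from ancestors.

```python
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def display_text(elem):
    """Return element text, collapsing whitespace unless preserved."""
    text = elem.text or ""
    if elem.get(XML_SPACE) == "preserve":
        return text                    # keep newlines, spaces, and tabs
    return " ".join(text.split())      # this application's default rule

doc = """<doc>
  <para>Some   loosely
     wrapped text.</para>
  <code xml:space="preserve">if x:
    y = 1</code>
</doc>"""

root = ET.fromstring(doc)
for child in root:
    print(repr(display_text(child)))
```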

Syntax

xml:space="default|preserve"



Copyright © 2003 O'Reilly & Associates. All rights reserved.