

XML in a Nutshell

20.3. XML Syntax

For each section of this reference that maps directly to an XML language structure, an informal syntax reference describes that structure's form. The following conventions are used with these syntax blocks:

Format           Meaning

DOCTYPE          Bold text indicates literal characters that must appear as written within the document (e.g., DOCTYPE).

encoding-name    Italicized text indicates that the user must replace the text with real data. The item indicates what type of data should be inserted (e.g., encoding-name = UTF-8).

|                The vertical bar | indicates that only one out of a list of possible values can be selected.

[ ]              Square brackets indicate that a particular portion of the syntax is optional.

20.3.1. Global Syntax Structures

Every XML document is broken into two primary sections: the prolog and the document element. A few documents may also have comments or processing instructions that follow the root element in a sort of epilog (an unofficial term). The prolog contains structural information about the particular type of XML document you are writing, including the XML declaration and document type declaration. The prolog is optional, and if a document does not need to be validated against a DTD, it can be omitted completely. The only required structure in a well-formed XML document is the top-level document element itself.

The following syntax structures are common to the entire XML document. Unless otherwise noted within a subsequent reference item, the following structures can appear anywhere within an XML document.

Predefined Entities

Besides user-defined entity references, XML includes the five named entity references shown in Table 20-1 that can be used without being declared. These references are a subset of those available in HTML documents.

Table 20-1. Predefined entities

Entity    Character    XML declaration

&lt;      <            <!ENTITY lt "&#38;#60;">
&gt;      >            <!ENTITY gt "&#62;">
&amp;     &            <!ENTITY amp "&#38;#38;">
&apos;    '            <!ENTITY apos "&#39;">
&quot;    "            <!ENTITY quot "&#34;">

The &lt; and &amp; entities must be used wherever < or & appear in document content. The &gt; entity is frequently used wherever > appears in document content, but is only mandatory to avoid putting the sequence ]]> into content. &apos; and &quot; are generally used only within attribute values to avoid conflicts between the value and the quotes used to contain the value.

Though the parser must recognize these entities regardless of whether they have been declared, you can declare them in your DTD without generating errors.

The presence of these "special" predefined entities creates a conundrum within an XML document. Because it is possible to use these references without declaring them, it is possible to have a valid XML document that includes references to entities that were never declared. The XML specification actually encourages document authors to declare these entities to maintain the integrity of the entity declaration-reference rule. In practical terms, declaring these entities only adds unnecessary complexity to your document.
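As a rough illustration of the rules above, the following Python sketch escapes character data with the standard library and then parses a document that has no DTD at all. The sample strings are made up for the example; the point is only that a conforming parser resolves the five predefined references without any declarations.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical sample text containing characters that must be escaped.
raw = "AT&T says 1 < 2"

# escape() replaces &, <, and > with the predefined entity references.
escaped = escape(raw)

# The document has no DTD, yet all five predefined entities are resolved.
elem = ET.fromstring("<p>" + escaped + " &gt; &apos;x&apos; &quot;y&quot;</p>")
print(escaped)
print(elem.text)
```

The application sees the literal characters in elem.text; the references exist only in the serialized document.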

CDATA (Character Data) Sections

<![CDATA[unescaped character & markup data]]>
XML documents consist of markup and character data. The < or & characters cannot be included inside normal character data without using a character or entity reference, such as &amp; or &#38;. By using a reference, the resulting < and & characters are not recognized as markup by the parser, but will become part of the data stream to the parser's client application.

For large blocks of character data--particularly if the data contains markup, such as an HTML or XML fragment--the CDATA section can be used. Within a CDATA block, every character between the opening <![CDATA[ and closing ]]> delimiters is considered character data. Thus, special characters can be included in a CDATA section with impunity, except for the CDATA closing sequence, ]]>.

CDATA sections are very useful for tasks such as enclosing XML or HTML documents inside of tutorials explaining how to use markup, but it is difficult to process the contents of CDATA sections using XSLT, the DOM, or SAX as anything other than text.

NOTE: CDATA sections cannot be nested. The character sequence ]]> cannot appear within data that is being escaped, or the CDATA block will be closed prematurely. This situation should not be a problem ordinarily, but if an application includes XML documents as unparsed character data, it is important to be aware of this constraint. If it is necessary to include the CDATA closing sequence in the data, close the open CDATA section, include the closing characters using character references to escape them, then reopen the CDATA section to contain the rest of the character data.
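The workaround described in the note can be sketched in Python as follows. The helper name wrap_cdata and the sample string are hypothetical; the function simply closes the section wherever ]]> occurs, emits that sequence as character references, and reopens the section. It does not check for characters that are illegal in XML.

```python
import xml.etree.ElementTree as ET

def wrap_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting the section around each ']]>'."""
    # Close the section, write ]]> as character references, reopen the section.
    safe = text.replace("]]>", "]]>&#93;&#93;&#62;<![CDATA[")
    return "<![CDATA[" + safe + "]]>"

# Hypothetical code fragment that happens to contain the closing sequence.
sample = "if (a[b[0]]>1 && x<y) {}"
wrapped = wrap_cdata(sample)

# A parser hands the application the original text, with no trace of the split.
round_trip = ET.fromstring("<code>" + wrapped + "</code>").text
print(wrapped)
print(round_trip == sample)
```

Because the parser reports the CDATA sections and the character references as one run of character data, the client application cannot tell that the data was split.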

Entity References

An XML entity can best be understood as a macro replacement facility, in which the replacement can be either parsed (the text becomes part of the XML document) or unparsed. If unparsed, the entity declaration points to external binary data that cannot be parsed. Additionally, the replacement text for parsed entities can come from a string or the contents of an external file. During parsing, a parsed entity reference is replaced by the substitution text that is specified in the entity declaration. The replacement text is then reparsed until no more entity or character references remain.

To simplify document parsing, two distinct types of entities are used in different situations: general and parameter. The basic syntax for referencing both entity types is almost identical, but specific rules apply to where each type can be used.

Processing Instructions

<?target [processing-instruction data]?>
Processing instructions provide an escape mechanism that allows an XML application to include instructions to an XML processor that are not part of the XML markup or character data. The processing instruction target can be any legal XML name, except xml in any combination of upper- and lowercase (see Chapter 2). Linking to a stylesheet to provide formatting instructions for a document is a common use of this mechanism. According to the principles of XML, formatting instructions should remain separate from the actual content of a document, but some mechanism must associate the two. Processing instructions are significant only to applications that recognize them.

The notation facility can indicate exactly what type of processing instruction is included, and each individual XML application must decide what to do with the additional data. No action is required by an XML parser when it recognizes that a particular processing instruction matches a declared notation. When this facility is used, applications that do not recognize the public or system identifiers of a given processing instruction target should realize that they could not properly interpret its data portion.
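To make the target/data split concrete, here is a small Python sketch that uses xml.dom.minidom to read a stylesheet-linking processing instruction. The stylesheet name style.xsl is a placeholder chosen for the example; the parser passes the data portion through as an opaque string, and interpreting it is left to the application.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/xsl" href="style.xsl"?>'
    '<doc/>'
)

# Collect processing instructions that appear at the top level of the document.
pis = [n for n in doc.childNodes
       if n.nodeType == Node.PROCESSING_INSTRUCTION_NODE]

target = pis[0].target  # the name that follows "<?"
data = pis[0].data      # everything after the target, uninterpreted
print(target, "->", data)
```

Note that the XML declaration does not show up in the list: although it looks similar, it is not a processing instruction.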

Character Encoding Autodetection

 

The XML declaration must be the very first item in a document so that the XML parser can determine which character encoding was used to store the document. A chicken-and-egg problem exists, involving the XML declaration's encoding="..." clause: the parser can't parse the clause if it doesn't know what character encoding the document uses. However, since the first five characters of your document must be the string <?xml (if it includes an XML declaration), the parser can read the first few bytes of your document and, in most cases, determine the character encoding before it has read the encoding declaration.
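The following Python sketch shows one way such a first-bytes check might look. It is a simplified reading of the approach, not a complete detector: the function name and the returned labels are invented for the example, and a real parser would go on to read the encoding declaration once it knows the encoding family.

```python
import codecs

def sniff_encoding(data: bytes) -> str:
    """Guess the encoding family of an XML document from its first bytes."""
    # Byte-order marks. UTF-32 is tested first because the UTF-32LE mark
    # begins with the same two bytes as the UTF-16LE mark.
    boms = [
        (codecs.BOM_UTF32_BE, "UTF-32BE"),
        (codecs.BOM_UTF32_LE, "UTF-32LE"),
        (codecs.BOM_UTF8, "UTF-8"),
        (codecs.BOM_UTF16_BE, "UTF-16BE"),
        (codecs.BOM_UTF16_LE, "UTF-16LE"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    # No byte-order mark: see how the leading "<?xm" (or "<") is laid out.
    patterns = [
        (b"\x00\x00\x00<", "UTF-32BE"),
        (b"<\x00\x00\x00", "UTF-32LE"),
        (b"\x00<\x00?", "UTF-16BE"),
        (b"<\x00?\x00", "UTF-16LE"),
        (b"<?xm", "ASCII-compatible"),  # UTF-8, ISO 8859-x, etc.; read the declaration
        (b"\x4c\x6f\xa7\x94", "EBCDIC"),
    ]
    for prefix, name in patterns:
        if data.startswith(prefix):
            return name
    return "UTF-8"  # no declaration and no mark: the XML default

print(sniff_encoding("<?xml version='1.0'?>".encode("utf-16-be")))
print(sniff_encoding(b"<?xml version='1.0' encoding='ISO-8859-1'?>"))
```

For the ASCII-compatible family the guess is only good enough to read the declaration itself; the encoding="..." clause then names the exact encoding.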

XML Declaration

<?xml version="1.0" [encoding="encoding-name"][ standalone="yes|no"]?>
The XML declaration serves several purposes. It tells the parser what version of the specification was used, how the document is encoded, and whether the document is completely self-contained or has references to external entities.

The XML declaration, if included, must be the first thing that appears in an XML document. Nothing, except possibly a Unicode byte-order mark, may appear before this structure's initial < character.
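The placement rule is easy to check with a non-validating parser. The Python sketch below feeds several byte strings to the standard library's ElementTree parser and records which ones it accepts; the helper name parses and the test documents are made up for the example.

```python
import codecs
import xml.etree.ElementTree as ET

def parses(data: bytes) -> bool:
    """Return True if the parser accepts the bytes as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

DECL = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'

first = parses(DECL + b"<doc/>")                          # declaration first
after_bom = parses(codecs.BOM_UTF8 + DECL + b"<doc/>")    # byte-order mark, then declaration
after_space = parses(b" " + DECL + b"<doc/>")             # whitespace before declaration
after_comment = parses(b"<!-- note -->" + DECL + b"<doc/>")  # comment before declaration
no_decl = parses(b"<doc/>")                               # declaration omitted entirely

print(first, after_bom, after_space, after_comment, no_decl)
```

Only the documents with something other than a byte-order mark in front of the declaration are rejected; omitting the declaration altogether is fine.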

20.3.2. DTD (Document Type Definition)

Chapter 2 explained the difference between well-formed and valid documents. Well-formed documents that include and conform to a given DTD are considered valid. Documents that include a DTD and violate the rules of that DTD are invalid. The DTD consists of the DOCTYPE declaration and both the internal subset (declarations contained directly within the document) and the external subset (declarations that are included from outside the main document).

General Entities

General entities are declared within the document type definition and then referenced within the document's text and attribute content. When the document is parsed, the entity's replacement text is substituted for the entity reference. The parser then resumes parsing, starting with the text that was just replaced.

General entities are declared within the DTD using a superset of the syntax used to declare parameter entities. Besides the ability to declare internal parsed entities and external parsed entities, you can declare external unparsed entities and associate an XML notation name with them.

Internal entities are used when the replacement text can be efficiently stored inline as a literal string. The replacement text within an internal entity is included completely in the entity declaration itself, obviating the need for an external file to contain the replacement text. This situation closely resembles the string replacement macro facilities found in many popular programming languages and environments:

<!ENTITY name "Replacement text">
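As a small illustration, the Python sketch below declares two internal general entities in an internal DTD subset and lets the parser expand them. The entity names pub and sig and their replacement text are invented for the example; the second entity refers to the first to show that replacement text is itself reparsed.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "sig" refers to "pub" and to the predefined "amp".
DOC = """<!DOCTYPE memo [
  <!ENTITY pub "O'Reilly">
  <!ENTITY sig "&pub; &amp; Associates">
]>
<memo>&sig;</memo>"""

# The parser substitutes the replacement text, then expands the nested references.
text = ET.fromstring(DOC).text
print(text)
```

Entities declared in the internal subset are expanded even by non-validating parsers, which is why parsers that handle untrusted input often restrict or disable entity expansion.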

There are two types of external entities: parsed and unparsed. When a parsed entity is referenced, the contents of the external entity are included in the document, and the XML parser resumes parsing, starting with the newly included text. When an unparsed entity is referenced, the parser supplies the application with the unparsed entity's URI, but it does not insert that data into the document or parse it. What to do with that URI is up to the application. Any entity declared with an XML notation name associated with it is an external unparsed entity, and any references to it within the document must be made using attribute values of type ENTITY or ENTITIES:

<!ENTITY name SYSTEM "system-literal">
<!ENTITY name PUBLIC "pubid-literal" "system-literal">
Element Type Declaration

Element type declarations provide a template for the actual element instances that appear within an XML document. The declaration determines what type of content, if any, can be contained within elements with the given name. The following sections describe the various element content options available.

NOTE: Since namespaces are not explicitly included in the XML 1.0 recommendation, element and attribute declarations within a DTD must give the complete (qualified) name that will be used in the target document. This means that if namespace prefixes will be used in instance documents, the DTD must declare them just as they will appear, prefixes and all. While parameter entities may allow instance documents to use different prefixes, this still makes complete and seamless integration of namespaces into a DTD-based application very awkward.

Empty Element Type

<!ELEMENT name EMPTY>

Elements that are declared empty cannot contain content or nested elements. Within the document, empty elements may use one of the following two syntax forms:

<name [attribute="value" ...]/>
<name [attribute="value" ...]></name>
Any Element Type

<!ELEMENT name ANY>

This content specifier acts as a wildcard, allowing elements of this type to contain character data or instances of any valid element types that are declared in the DTD.

Mixed Content Element Type

<!ELEMENT name (#PCDATA [ | name]+)*>
<!ELEMENT name (#PCDATA)>
Element declarations that include the #PCDATA token can include text content mixed with other nested elements that are declared in the optional portion of the element declaration. If the #PCDATA token is used, it is not possible to limit the number of times or sequence in which other nested elements are mixed with the parsed character data. If only text content is desired, the asterisk is optional.

NOTATION Attribute Type


... NOTATION (notation [| notation]*) ...
The NOTATION attribute mechanism lets XML document authors indicate that the character content of some elements obeys the rules of some formal language other than XML. The following short sample document shows how notations might be used to specify the type of programming language stored in the code_fragment element:

<?xml version="1.0"?>
<!DOCTYPE code_fragment
[
<!NOTATION java_code PUBLIC "Java source code">
<!NOTATION c_code PUBLIC "C source code">
<!NOTATION perl_code PUBLIC "Perl source code">
<!ELEMENT code_fragment (#PCDATA)>
<!ATTLIST code_fragment
          code_lang NOTATION (java_code | c_code | perl_code) #REQUIRED>

]>
<code_fragment code_lang="c_code">
    main( ) { printf("Hello, world."); }
</code_fragment>
Enumeration Attribute Type


... (name_token [| name_token]*) ...
This syntax limits the possible values of the given attribute to one of the name tokens from the provided list:

<!ELEMENT door EMPTY>
<!ATTLIST door
          state (open | closed | missing) "open">
. . .
<door state="closed"/>
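The Python sketch below runs the door declaration through the standard library's expat binding to show what a non-validating parser does with it. The surrounding house element and the invalid value ajar are additions made for the example. With expat's default settings the declared default is supplied for the door that omits the attribute, while the out-of-list value passes through untouched, because only a validating parser enforces the enumeration.

```python
from xml.parsers import expat

# Hypothetical document built around the door declaration.
DOC = """<!DOCTYPE house [
<!ELEMENT house (door*)>
<!ELEMENT door EMPTY>
<!ATTLIST door state (open | closed | missing) "open">
]>
<house><door/><door state="closed"/><door state="ajar"/></house>"""

states = []

def on_start(name, attrs):
    # attrs holds specified attributes plus any defaults taken from the DTD.
    if name == "door":
        states.append(attrs.get("state"))

parser = expat.ParserCreate()
parser.StartElementHandler = on_start
parser.Parse(DOC, True)
print(states)
```

Other parser front ends may be configured to report only the attributes that were actually written in the start-tag, so the defaulting shown here should not be assumed for every API.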
Notation Declaration


<!NOTATION notation_name SYSTEM "system-literal">
<!NOTATION notation_name PUBLIC "pubid-literal">
<!NOTATION notation_name PUBLIC "pubid-literal" "system-literal">
Notation declarations are used to provide information to an XML application about the format of the document's unparsed content. Notations are used by unparsed external entities, processing instructions, and some attribute values.

Notation information is not significant to the XML parser, but it is preserved for use by the client application. The public and system identifiers are made available to the client application so that it may correctly interpret non-XML data and processing instructions.

20.3.3. Document Body

Elements are an XML document's lifeblood. They provide the structure for character data and attribute values that make up a particular instance of an XML document type definition. The !ELEMENT and !ATTLIST declarations from the DTD restrict the possible contents of an element within a valid XML document. Combining elements and/or attributes that violate these restrictions generates an error in a validating parser.

Start-Tags and End-Tags



<element_name [attribute_name="attribute value"]*> ...</element_name>
Elements that have content (either character data, other elements, or both) must start with a start-tag and end with an element end-tag.

Attributes

attribute_name="attribute value"
attribute_name='attribute value'
Elements may include attributes. The order of attributes within an element tag is not significant and is not guaranteed to be preserved by an XML parser. Attribute values must appear within either single or double quotation marks. Attribute values within a document must conform to the rules explained in Section 20.4.1 of this chapter.

Note that whitespace may appear around the = character.

The value that appears in the quoted string is tested for validity, depending on the attribute type provided in the !ATTLIST declaration for the element type. Attribute values can contain general entity references, but cannot contain references to external parsed entities. See Section 20.4.1 of this chapter for more information about attribute-value restrictions.
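The following Python sketch illustrates these rules with two start-tags that differ in attribute order, quote style, and whitespace around the = character. The element and attribute names are arbitrary examples. After parsing, both yield the same set of attribute name/value pairs.

```python
import xml.etree.ElementTree as ET

# Double quotes for src; single quotes for alt so it can hold literal double quotes.
a = ET.fromstring("""<img src="a.png" alt='AT&amp;T "logo"'/>""")

# Reversed order, single quotes, spaces around "=", and &quot; inside the value.
b = ET.fromstring("""<img alt = 'AT&amp;T &quot;logo&quot;'   src = 'a.png'/>""")

same = a.attrib == b.attrib
print(a.attrib["alt"])
print(same)
```

The quotes used in the source are not part of the value, which is why &quot; and &apos; are needed only when a value contains the same kind of quote that delimits it.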

20.3.4. Namespaces

Although namespace support was not part of the original XML 1.0 recommendation, Namespaces in XML was approved less than a year later (January 14, 1999). Namespaces are used to distinguish the element and attribute names of a given XML application from those of other applications. See Chapter 4 for more detailed information.
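As a brief sketch of what this means in practice, the Python example below parses the same element written with a prefix, with a default namespace, and with no namespace. The namespace URI is a placeholder. ElementTree reports names as {namespace-URI}local-name, so the prefix chosen by the author drops out of the comparison.

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/ns/books"  # placeholder namespace URI

prefixed = ET.fromstring('<a:book xmlns:a="%s"><a:title/></a:book>' % NS)
default = ET.fromstring('<book xmlns="%s"><title/></book>' % NS)
plain = ET.fromstring("<book/>")

print(prefixed.tag)
print(prefixed.tag == default.tag)  # same URI and local name
print(prefixed.tag == plain.tag)    # no namespace, so a different name
```

Two elements are the same only if both the namespace URI and the local name match; the prefix is just shorthand within one document.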

The following sections describe how namespaces impact the formation and interpretation of element and attribute names within an XML document.




Copyright © 2002 O'Reilly & Associates. All rights reserved.