XML Syntax (XML in a Nutshell, 2nd Edition)

For each section of this reference that maps directly to an XML language structure, an informal syntax reference describes that structure's form. The following conventions are used with these syntax blocks:

Format          Meaning
DOCTYPE         Bold text indicates literal characters that must appear as written within the document (e.g., DOCTYPE).
encoding-name   Italicized text indicates that the user must replace the text with real data. The item indicates what type of data should be inserted (e.g., encoding-name = ISO-8859-1).
|               The vertical bar | indicates that only one out of a list of possible values can be selected.
[ ]             Square brackets indicate that a particular portion of the syntax is optional.

Table 20-1. Predefined entities

Entity    Character    XML declaration
&lt;      <            <!ENTITY lt "&#38;#60;">
&gt;      >            <!ENTITY gt "&#62;">
&amp;     &            <!ENTITY amp "&#38;#38;">
&apos;    '            <!ENTITY apos "&#39;">
&quot;    "            <!ENTITY quot "&#34;">
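As a quick check of the predefined entities, the following Python sketch parses a fragment that uses all five references and looks at the decoded text. It relies only on the standard library; the element name t and the sample strings are made up for illustration. No DTD is needed because these five entities are predefined.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical one-element document that uses all five predefined entities.
fragment = "<t>&lt;&gt;&amp;&apos;&quot;</t>"

# The parser replaces each reference with the single character it stands for.
decoded = ET.fromstring(fragment).text

# Going the other way: escape() replaces &, < and > by default.
encoded = escape("a < b & c")

print(repr(decoded))
print(encoded)
```

The escape() helper leaves quotation marks alone unless extra mappings are passed, which is usually enough for element content but not for attribute values.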

20.3.2. DTD (Document Type Definition)

Chapter 2 explained the difference between well-formed and valid documents. Well-formed documents that include and conform to a given DTD are considered valid. Documents that include a DTD and violate the rules of that DTD are invalid. The DTD consists of the DOCTYPE declaration and both the internal subset (declarations contained directly within the document) and the external subset (declarations that are included from outside the main document).

Parameter Entities

The parameter entity mechanism is a simple macro replacement facility that is only valid within the context of the DTD. Parameter entities are declared and then referenced from within markup or possibly from within other entity declarations. The source of the entity replacement text can be either a literal string or the contents of an external file. Parameter entities simplify maintenance of large, complex documents by allowing authors to build libraries of commonly used entity declarations.

Parameter Entity Declarations

<!ENTITY % name "Replacement text.">
<!ENTITY % name SYSTEM
     "system-literal">
<!ENTITY % name PUBLIC "pubid-literal" 
    "system-literal">

Parameter entities are declared within the document's DTD and must be declared before they are used. The declaration provides two key pieces of information:

The name of the entity, which is used when it is referenced
The replacement text, either directly or indirectly through a link to an external entity

Be aware that an XML parser performs some preprocessing on the replacement text before it is used in an entity reference. Most importantly, parameter entity references in the replacement text are recursively expanded before the final version of the replacement text is stored. Character references are also replaced immediately with the specified character. This replacement can lead to unexpected side effects, particularly when constructing parameter entities that declare other parameter entities. For full disclosure of how entity replacement is implemented by an XML parser and what kinds of unexpected side effects can occur, see Appendix D of the XML 1.0 specification. The specification is available on the World Wide Web Consortium web site (http://www.w3.org/TR/REC-xml#sec-entexpand).

General Entities

General entities are declared within the document type definition and then referenced within the document's text and attribute content. When the document is parsed, the entity's replacement text is substituted for the entity reference. The parser then resumes parsing, starting with the text that was just replaced.

General entities are declared within the DTD using a superset of the syntax used to declare parameter entities. Besides the ability to declare internal parsed entities and external parsed entities, you can declare external unparsed entities and associate an XML notation name with them.

Internal entities are used when the replacement text can be efficiently stored inline as a literal string. The replacement text within an internal entity is included completely in the entity declaration itself, obviating the need for an external file to contain the replacement text. This situation closely resembles the string replacement macro facilities found in many popular programming languages and environments:

<!ENTITY name "Replacement text">
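To see internal entity replacement in action, here is a minimal Python sketch using the standard library's ElementTree, which is built on the non-validating expat parser. The document, the element name note, and the entity name sig are hypothetical. The replacement text deliberately contains markup to show that the parser resumes parsing inside the substituted text.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: 'note' and 'sig' are illustrative names only.
DOC = """<!DOCTYPE note [
<!ENTITY sig "Replacement <em>text</em>">
]>
<note>&sig;</note>"""

root = ET.fromstring(DOC)

# The entity reference is gone; its replacement text was parsed as content.
leading_text = root.text                      # character data before <em>
child_tags = [child.tag for child in root]    # elements created from the entity
child_text = root[0].text

print(repr(leading_text), child_tags, repr(child_text))
```

Because entity expansion happens inside the parser, an application reading the tree cannot tell that the em element came from an entity rather than from the document text itself.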

There are two types of external entities: parsed and unparsed. When a parsed entity is referenced, the contents of the external entity are included in the document, and the XML parser resumes parsing, starting with the newly included text. When an unparsed entity is referenced, the parser supplies the application with the unparsed entity's URI, but it does not insert that data into the document or parse it. What to do with that URI is up to the application. Any entity declared with an XML notation name associated with it is an external unparsed entity, and any references to it within the document must be made using attribute values of type ENTITY or ENTITIES:

<!ENTITY name SYSTEM 
    "system-literal">
<!ENTITY name PUBLIC 
    "pubid-literal" "system-literal">

Text Declarations

<?xml[ version="1.0"] encoding="encoding-name"?>

Files that contain external parsed entities must include a text declaration if the entity file uses a character encoding other than UTF-8 or UTF-16. This declaration would be followed by the replacement text of the external parsed entity.

NOTE: External parsed entities may contain only document content or a completely well-formed subset of the DTD. This restriction is significant because it indicates that external parameter entities cannot be used to play token-pasting games by splitting XML syntax constructs into multiple files, then expecting the parser to reassemble them.

Unparsed Entities

It may be necessary at times to include data in your XML document that should not be parsed. For instance, your XML document may need to include pointers to graphics files that will be used by an application. These files are logically part of the document, but should not be parsed. The XML language allows you to declare external unparsed entities that can be included as attribute values within the content of your document:

<!ENTITY name SYSTEM
    "system-literal" NDATA notation_name>
<!ENTITY name PUBLIC "pubid-literal"
    "system-literal" NDATA notation_name>

To include unparsed entities, you must first declare a notation that will be referenced in the actual entity declaration:

<!NOTATION gif SYSTEM "images/gif">

Then declaring the entity itself is possible:

<!ENTITY bookcase_pic SYSTEM "bookcase.gif" NDATA gif>

As an unparsed general entity, it can be referenced only as an attribute value of type ENTITY or ENTITIES:

<picture src="bookcase_pic" type="gif"/>

When an XML parser parses this element, the information contained in the entity and notation declarations can be used to identify the actual type of data stored in the external entity. For example, a program could choose to display the contents of a GIF external entity on the screen, once the actual format is known.
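The hand-off from parser to application described above can be sketched with Python's xml.parsers.expat module, which reports notation and unparsed entity declarations through callbacks. The sample document reuses the bookcase_pic declarations; the ELEMENT and ATTLIST declarations for picture and the function name collect_unparsed are assumptions added for this sketch. Note that expat does not validate, so it never checks that src really names a declared entity.

```python
import xml.parsers.expat

# Sample document; the picture declarations are assumed for illustration.
DOC = """<!DOCTYPE picture [
<!NOTATION gif SYSTEM "images/gif">
<!ENTITY bookcase_pic SYSTEM "bookcase.gif" NDATA gif>
<!ELEMENT picture EMPTY>
<!ATTLIST picture src ENTITY #REQUIRED>
]>
<picture src="bookcase_pic"/>"""


def collect_unparsed(doc):
    """Return (notations, entities, attributes) as reported by the parser."""
    notations, entities, attrs_seen = {}, {}, {}
    parser = xml.parsers.expat.ParserCreate()

    def on_notation(name, base, system_id, public_id):
        notations[name] = system_id

    def on_unparsed(name, base, system_id, public_id, notation_name):
        # The parser passes along the identifiers; it never fetches the file.
        entities[name] = (system_id, notation_name)

    def on_start(tag, attrs):
        attrs_seen[tag] = attrs

    parser.NotationDeclHandler = on_notation
    parser.UnparsedEntityDeclHandler = on_unparsed
    parser.StartElementHandler = on_start
    parser.Parse(doc, True)
    return notations, entities, attrs_seen


notations, entities, attrs_seen = collect_unparsed(DOC)

# Application-level lookup: attribute value -> entity -> notation.
system_id, notation_name = entities[attrs_seen["picture"]["src"]]
print(system_id, notation_name, notations[notation_name])
```

The final lookup is the part left to the application: the attribute value is just a name, and resolving it to a file and a format is done by hand from the collected declarations.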

NOTE: XLink and similar mechanisms are commonly used in place of unparsed entities.

External Subset

The document type declaration can include part or all of the document type definition from an external file. This external portion of the DTD is referred to as the external DTD subset and may contain markup declarations, conditional sections, and parameter entity references. It must include a text declaration if the character encoding is not UTF-8 or UTF-16:

<?xml[ version="1.0"] encoding="encoding-name"?>

This declaration (if present) would then be followed by a series of complete DTD markup statements, including ELEMENT, ATTLIST, ENTITY, and NOTATION declarations, as well as conditional sections and processing instructions. For example:

<!ELEMENT furniture_item (desc, %extra_tags; user_tags?, parts_list, 
    assembly+)>

<!ATTLIST furniture_item
    xmlns CDATA #FIXED "http://namespaces.oreilly.com/furniture/"
>
...

Internal DTD Subset

The internal DTD subset is the portion of the document type definition included directly within the document type declaration between the [ and ] characters. The internal DTD subset can contain markup declarations and parameter entity references, but not conditional sections. A single document may have both internal and external DTD subsets, which, when taken together, form the complete document type definition. The following example shows the internal subset, which appears between the [ and ] characters:

<!DOCTYPE furniture_item SYSTEM "furniture.dtd"
[
<!ENTITY % bookcase_ex SYSTEM "Bookcase_ex.ent">

%bookcase_ex;

<!ENTITY bookcase_pic SYSTEM "bookcase.gif" NDATA gif>
<!ENTITY parts_list SYSTEM "parts_list.ent">
]>

Element Type Declaration

Element type declarations provide a template for the actual element instances that appear within an XML document. The declaration determines what type of content, if any, can be contained within elements with the given name. The following sections describe the various element content options available.

NOTE: Since namespaces are not explicitly included in the XML 1.0 recommendation, element and attribute declarations within a DTD must give the complete (qualified) name that will be used in the target document. This means that if namespace prefixes will be used in instance documents, the DTD must declare them just as they will appear, prefixes and all. While parameter entities may allow instance documents to use different prefixes, this still makes complete and seamless integration of namespaces into a DTD-based application very awkward.

Empty Element Type

<!ELEMENT name EMPTY>

Elements that are declared empty cannot contain content or nested elements. Within the document, empty elements may use one of the following two syntax forms:

<name [attribute="value" ...]/>
<name [attribute="value" ...]></name>
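A short Python sketch, using only the standard library, suggests that the two forms are indistinguishable once parsed. The helper name is_empty and the attribute value are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_empty(elem):
    """True when the element has no character data and no child elements."""
    return elem.text is None and len(elem) == 0


# The two syntax forms for an empty element.
short_form = ET.fromstring('<name attribute="value"/>')
long_form = ET.fromstring('<name attribute="value"></name>')

print(is_empty(short_form), is_empty(long_form))
print(short_form.attrib == long_form.attrib)
```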

Any Element Type

<!ELEMENT name ANY>

This content specifier acts as a wildcard, allowing elements of this type to contain character data or instances of any valid element types that are declared in the DTD.

Mixed Content Element Type

<!ELEMENT name (#PCDATA [ | name]+)*>
<!ELEMENT name (#PCDATA)>

Element declarations that include the #PCDATA token can include text content mixed with other nested elements that are declared in the optional portion of the element declaration. If the #PCDATA token is used, it is not possible to limit the number of times or the sequence in which other nested elements are mixed with the parsed character data. If only text content is desired, the asterisk is optional.

Constrained Child Nodes

<!ELEMENT name (child_node_regexp)[? | * | +]>

XML provides a simple regular-expression syntax that can be used to limit the order and number of child elements within a parent element. This language includes the following operators:

Operator    Meaning
Name        Matches an element of the given name
( ... )     Groups expressions for processing as sets of sequences (using the comma as a separator) or choices (using | as a separator)
?           Indicates that the preceding name or expression can occur zero or one times at this point in the document
*           Indicates that the preceding name or expression can occur zero or more times at this point in the document
+           Indicates that the preceding name or expression must occur one or more times at this point in the document
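Because the operators above mirror ordinary regular expressions, one way to get a feel for them is to translate a content model into a Python regex and match it against a list of child element names. This is only an illustrative sketch, not how a validating parser is implemented: the function names are invented here, mixed content (#PCDATA) is not handled, and the sample model is a simplified version of the furniture_item declaration without its parameter entity reference.

```python
import re


def content_model_to_regex(model):
    """Translate a DTD element-content model into a compiled Python regex.

    Each element name becomes a group matching 'name ' (the name plus one
    space). A comma (sequence) becomes plain concatenation, while '|', '?',
    '*' and '+' keep their usual regular-expression meaning.
    """
    tokens = re.findall(r"[A-Za-z_:][\w.:-]*|[(),|?*+]", model)
    parts = []
    for tok in tokens:
        if tok == ",":
            continue                      # sequence = concatenation
        elif tok == "(":
            parts.append("(?:")           # non-capturing group
        elif tok in ")|?*+":
            parts.append(tok)
        else:
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))


def children_match(model, child_names):
    """Check a sequence of child element names against a content model."""
    subject = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(subject) is not None


MODEL = "(desc, user_tags?, parts_list, assembly+)"
print(children_match(MODEL, ["desc", "parts_list", "assembly", "assembly"]))
print(children_match(MODEL, ["parts_list", "desc", "assembly"]))
```

Appending a space to every name keeps one element name from matching as a prefix of another, which is why the subject string is built the same way.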

Attribute List Declaration

<!ATTLIST element_name [attribute_name attribute_type default_decl]*>

In a valid XML document, the attribute names, types, and default values used with each element type must be declared.

The attribute name must obey the rules for XML identifiers, and no duplicate attribute names may exist within a single declaration.

Attributes are declared as having a specific type. Depending on the declared type, a validating XML parser will constrain the values that appear in instances of those attributes within a document. The following table lists the various attribute types and their meanings:

Attribute type       Meaning
CDATA                Simple character data.
ID                   A unique ID value within the current XML document. No two ID attribute values within a document can have the same value, and no element can have two attributes of type ID.
IDREF, IDREFS        A single reference to an element ID (IDREF) or a list of IDs (IDREFS), separated by spaces. Every referenced ID must match the value of an ID-type attribute somewhere within the document.
ENTITY, ENTITIES     A single reference to a declared unparsed external entity (ENTITY) or a list of references (ENTITIES), separated by spaces.
NMTOKEN, NMTOKENS    A single name token value (NMTOKEN) or a list of name tokens (NMTOKENS), separated by spaces.
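The ID and IDREF rules can be approximated by hand when a validating parser is not available. The sketch below is a rough illustration only: Python's standard parser does not validate or expose declared attribute types, so the attribute names id and ref are assumed to be the ID-type and IDREFS-type attributes, and the sample document is invented.

```python
import xml.etree.ElementTree as ET


def check_ids(root, id_attr="id", idref_attr="ref"):
    """Return (duplicate IDs, dangling references) for an element tree.

    id_attr and idref_attr are assumed to be declared as ID and IDREFS in
    the DTD; this function does not read the DTD itself.
    """
    seen, duplicates, refs = set(), [], []
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            if value in seen:
                duplicates.append(value)   # ID values must be unique
            seen.add(value)
        ref = elem.get(idref_attr)
        if ref is not None:
            refs.extend(ref.split())       # IDREFS: space-separated list
    dangling = [r for r in refs if r not in seen]
    return duplicates, dangling


# Hypothetical document with one duplicate ID and one dangling reference.
root = ET.fromstring(
    '<parts><part id="p1"/><part id="p2"/>'
    '<use ref="p1 p3"/><part id="p1"/></parts>'
)
duplicates, dangling = check_ids(root)
print(duplicates, dangling)
```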

NOTATION Attribute Type

... NOTATION (notation [| notation]*) ...

The NOTATION attribute mechanism lets XML document authors indicate that the character content of some elements obeys the rules of some formal language other than XML. The following short sample document shows how notations might be used to specify the type of programming language stored in the code_fragment element:

<?xml version="1.0"?>
<!DOCTYPE code_fragment
[
<!NOTATION java_code PUBLIC "Java source code">
<!NOTATION c_code PUBLIC "C source code">
<!NOTATION perl_code PUBLIC "Perl source code">
<!ELEMENT code_fragment (#PCDATA)>
<!ATTLIST code_fragment
    code_lang NOTATION (java_code | c_code | perl_code) #REQUIRED>
]>
<code_fragment code_lang="c_code">
    main( ) { printf("Hello, world."); }
</code_fragment>

Enumeration Attribute Type

... (name_token [| name_token]*) ...

This syntax limits the possible values of the given attribute to one of the name tokens from the provided list:

<!ELEMENT door EMPTY>
<!ATTLIST door state (open | closed | missing) "open">
. . .
<door state="closed"/>

Default Values

If an optional attribute is not present on a given element, a default value may be provided to be passed by the XML parser to the client application. The following table shows various forms of the attribute default value clause and their meanings:

Default value clause        Explanation
#REQUIRED                   A value must be provided for this attribute.
#IMPLIED                    A value may or may not be provided for this attribute.
[#FIXED] "default value"    If this attribute has no explicit value, the XML parser substitutes the given default value. If the #FIXED token is provided, this attribute's value must match the given default value. In either case, the parent element always has an attribute with this name.

The #FIXED modifier indicates that the attribute may contain only the value given in the attribute declaration. Although redundant, it is possible to provide an explicit attribute value on an element when the attribute was declared as #FIXED. The only restriction is that the attribute value must exactly match the value given in the #FIXED declaration.
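Default substitution can be observed with Python's standard ElementTree parser, which reads attribute defaults declared in the internal DTD subset. This is a sketch under stated assumptions: the room wrapper element is invented so that two door elements can appear in one document, and because the underlying expat parser does not validate, it supplies defaults but does not reject values outside the enumeration or mismatched #FIXED values.

```python
import xml.etree.ElementTree as ET

# 'room' is a hypothetical wrapper; the door declarations follow the
# enumeration example, with "open" as the declared default.
DTD = """<!DOCTYPE room [
<!ELEMENT room (door*)>
<!ELEMENT door EMPTY>
<!ATTLIST door state (open | closed | missing) "open">
]>"""

root = ET.fromstring(DTD + "<room><door/><door state='closed'/></room>")

# The first door has no explicit state, so the parser fills in the default.
states = [door.get("state") for door in root]
print(states)
```

From the application's point of view, both door elements simply have a state attribute; nothing in the tree records which value was defaulted.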

Special Attributes

Some attributes are significant to XML and must be declared and implemented in a particular way:

xml:space

The xml:space attribute tells an XML application whether the whitespace within the specified element is significant:

<!ATTLIST element_name xml:space (default|preserve) default_decl>
<!ATTLIST element_name xml:space (default) #FIXED 'default' >
<!ATTLIST element_name xml:space (preserve) #FIXED 'preserve' >

xml:lang

The xml:lang attribute allows a document author to specify the human language of an element's character content. If used in a valid XML document, the document type definition must include an attribute type declaration with the xml:lang attribute name. See Chapter 5 for an explanation of language support in XML.
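On the application side, reading these attributes might look like the following Python sketch. ElementTree reports attributes with the reserved xml prefix under the namespace name http://www.w3.org/XML/1998/namespace; the para element and its content are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace name that the reserved 'xml' prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical element carrying both special attributes.
root = ET.fromstring(
    '<para xml:lang="en-US" xml:space="preserve">  Hello  </para>'
)

lang = root.get("{%s}lang" % XML_NS)
space = root.get("{%s}space" % XML_NS)

# The parser passes whitespace through unchanged; acting on xml:space
# (for example, deciding whether to strip) is left to the application.
text = root.text if space == "preserve" else root.text.strip()
print(lang, space, repr(text))
```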

Notation Declaration

<!NOTATION notation_name SYSTEM "system-literal">
<!NOTATION notation_name PUBLIC "pubid-literal">
<!NOTATION notation_name PUBLIC "pubid-literal" "system-literal">

Notation declarations are used to provide information to an XML application about the format of the document's unparsed content. Notations are used by unparsed external entities, processing instructions, and some attribute values.

Notation information is not significant to the XML parser, but it is preserved for use by the client application. The public and system identifiers are made available to the client application so that it may correctly interpret non-XML data and processing instructions.

Conditional Sections

The conditional section markup provides support for conditionally including and excluding content at parse time within an XML document's external subset. Conditional sections are not allowed within a document's internal subset. The following example illustrates a likely application of conditional sections:

<!ENTITY % debug 'IGNORE' >
<!ENTITY % release 'INCLUDE' >

<!ELEMENT addend (#PCDATA)>
<!ELEMENT result (#PCDATA)>

<![%debug;[
<!ELEMENT sum (addend+, result)>
]]>
<![%release;[
<!ELEMENT sum (result)>
]]>
