
2.5. Entities

For your authoring convenience, XML has another feature called entities. An entity is useful when you need a placeholder for text or markup that would be inconvenient or impossible to just type in. It's a piece of XML set aside from your document;[8] you use an entity reference to stand in for it. An XML processor must resolve all entity references with their replacement text at the time of parsing. Therefore, every referenced entity must be declared somewhere so that the processor knows how to resolve it.

[8]Technically, the whole document is one entity, called the document entity. However, people usually use the term "entity" to refer to a subset of the document.

The document type declaration is the place to declare an entity. The declarations it gathers make up the document type definition (DTD), which has two parts: the internal subset that is part of your document, and the external subset that lives in another document. (Often, people talk about the external subset as "the DTD" and call the internal subset "the internal subset," even though both subsets together make up the whole DTD.) In both places, the method for declaring entities is the same. The document in Example 2-3 shows how this feature works.

Example 2-3. A document with entity declarations

<!DOCTYPE memo
  SYSTEM "/xml-dtds/memo.dtd"
[
  <!ENTITY companyname "Willy Wonka's Chocolate Factory">
  <!ENTITY healthplan  SYSTEM "hp.txt">
]>

<memo>
  <to>All Oompa-loompas</to>
  <para>
    &companyname; has a new owner and CEO, Charlie Bucket. Since
    our name, &companyname;, has considerable brand recognition,
    the board has decided not to change it. However, at Charlie's
    request, we will be changing our healthcare provider to the
    more comprehensive &Uuml;mpacare, which has better facilities
    for 'Loompas (text of the plan to follow). Thank you for working
    at &companyname;!
  </para>
  &healthplan;
</memo>

Let's examine the new material in this example. At the top is the document type declaration, a special markup instruction that contains a lot of important information, including the internal subset and a path to the external subset. Like all declarative markup (i.e., markup that defines something new), it starts with an opening angle bracket and an exclamation point (<!), followed by a keyword, DOCTYPE. After that keyword is the name of the element that will contain the whole document. We call that element the root element or document element. The element name is followed by a path to the external subset, given by SYSTEM "/xml-dtds/memo.dtd", and the internal subset of declarations, enclosed in square brackets ([ ]).

The external subset is used for declarations that will be used in many documents, so it naturally resides in another file. The internal subset is best used for declarations that are local to the document. They may override declarations in the external subset or contain new ones. As you see in the example, two entities are declared in the internal subset. An entity declaration has two parameters: the entity name and its replacement text. The entities are named companyname and healthplan.

These entities are called general entities and are distinguished from other kinds of entities because they are declared by you, the author. Replacement text for general entities can come from two different places. The first entity declaration defines the text within the declaration itself. The second points to another file where the text resides. It uses a system identifier to specify the file's location, acting much like a URL used by a web browser to find a page to load. In this case, the file is loaded by an XML processor and inserted verbatim wherever an entity is referenced. Such an entity is called an external entity.
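As a rough illustration of how a processor might pull in an external entity, here is a Python sketch that uses the expat bindings from Python's standard library. The FILES dictionary is a hypothetical stand-in for the filesystem, and the contents given for hp.txt are invented for the example, since the real file isn't shown. The external subset is left out to keep the sketch self-contained.

```python
import xml.parsers.expat

# Hypothetical stand-in for the filesystem; the real hp.txt is not shown.
FILES = {"hp.txt": "<plan>Full dental for all 'Loompas</plan>"}

DOC = """<!DOCTYPE memo [
  <!ENTITY healthplan SYSTEM "hp.txt">
]>
<memo>&healthplan;</memo>"""

def parse_with_external_entities(doc, files):
    """Return (element names, text) seen while parsing doc."""
    names, chunks = [], []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.CharacterDataHandler = chunks.append

    def load_entity(context, base, system_id, public_id):
        # Parse the entity's content with a child parser, which
        # inherits the handlers already set on the parent.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(files[system_id], True)
        return 1  # nonzero tells expat the entity was handled

    parser.ExternalEntityRefHandler = load_entity
    parser.Parse(doc, True)
    return names, "".join(chunks)

names, text = parse_with_external_entities(DOC, FILES)
print(names, text)
```

The system identifier arrives in the handler exactly as written in the declaration, so it is up to the handler to decide where to look for it. Here that is a dictionary lookup. A real processor would more likely resolve it relative to the document's location.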

If you look closely at the example, you'll see markup instructions of the form &name;. The ampersand (&) indicates an entity reference, where name is the name of the entity being referenced. The same reference can be used repeatedly, making it a convenient way to insert repetitive text or markup, as we do with the entity companyname.
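To watch a parser do this substitution, you can feed it a cut-down version of the memo. The sketch below uses Python's standard xml.etree.ElementTree module, which expands entities declared in the internal subset. The shortened memo text is our own, not part of Example 2-3.

```python
import xml.etree.ElementTree as ET

# A trimmed memo with one internal general entity, referenced twice.
DOC = """<!DOCTYPE memo [
  <!ENTITY companyname "Willy Wonka's Chocolate Factory">
]>
<memo><para>Our name, &companyname;, stays. Thank you for working at &companyname;!</para></memo>"""

para_text = ET.fromstring(DOC).find("para").text
print(para_text)
```

By the time your program sees the text, both references have been replaced. Nothing in the parsed result tells you that an entity was ever involved.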

An entity can contain markup as well as text, as is the case with healthplan (actually, we don't know what's in that entity because it's in another file, but since it's going to be a large document, you can assume it will have markup as well as text). An entity can even contain other entities, to any nesting level you want. The only restriction is that entities can't contain themselves, at any level, lest you create a circular definition that can never be constructed by the XML processor. Some XML technologies, such as XSLT, do let you have fun with recursive logic, but think of entity references as code constants -- playing with circular references here will make any parser very unhappy.
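Both points can be tried out with a short Python sketch, again using the standard xml.etree.ElementTree module. The entity names ceo, signature, chicken, and egg, and the signed element, are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# An entity that holds markup and refers to another entity.
NESTED = """<!DOCTYPE memo [
  <!ENTITY ceo "Charlie Bucket">
  <!ENTITY signature "<signed>&ceo;, CEO</signed>">
]>
<memo>&signature;</memo>"""

# Two entities that refer to each other: a circular definition.
CIRCULAR = """<!DOCTYPE memo [
  <!ENTITY chicken "&egg;">
  <!ENTITY egg "&chicken;">
]>
<memo>&chicken;</memo>"""

signed = ET.fromstring(NESTED).find("signed")
print(signed.text)

try:
    ET.fromstring(CIRCULAR)
    circular_accepted = True
except ET.ParseError as err:
    circular_accepted = False
    print("rejected:", err)
```

The nested case produces a real signed element in the tree, as if its tags had been typed into the memo directly. The circular case is refused as soon as the parser tries to expand the first reference.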

Finally, the &Uuml; entity reference is declared somewhere in the external subset to fill in for a character that the chocolate factory's ancient text editor programs have trouble rendering -- in this case, a capital "U" with an umlaut over it: Ü. Since the referenced entity is one character wide, the reference in this case is almost more of an alias than a pointer. The usual way to handle unusual characters (the way that's built into the XML specification) involves using a numeric character reference, which, in this case, would be &#x00DC;. The x marks the number as hexadecimal; 0x00DC is the hexadecimal equivalent of the number 220, so the decimal form &#220; works just as well. That number is the position of the U-umlaut character in Unicode (the character set used natively by XML, which we cover in more detail in the next section).
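If you want to check that the hexadecimal and decimal forms of a numeric character reference name the same character, a few lines of Python will do it. This is only a quick sketch using the standard xml.etree.ElementTree module, and the name element is invented for the test.

```python
import xml.etree.ElementTree as ET

# The same character, written as a hexadecimal and a decimal reference.
hex_form = ET.fromstring("<name>&#x00DC;mpacare</name>").text
dec_form = ET.fromstring("<name>&#220;mpacare</name>").text

print(hex_form, dec_form)
print(hex(ord(hex_form[0])), ord(dec_form[0]))
```

Numeric character references need no declaration, which is why this works without a document type declaration.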

However, since an abbreviated descriptive name like Uuml is generally easier to remember than the arcane 00DC, some XML users prefer to use these types of aliases by placing lines such as this into their documents' DTDs:

<!ENTITY Uuml "&#x00DC;">
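To see such an alias at work, and to see what happens when the declaration is missing, consider the following Python sketch. It uses the standard xml.etree.ElementTree module and places the declaration in the internal subset for convenience. The para element is just a small test harness of our own.

```python
import xml.etree.ElementTree as ET

# The alias is declared as a general entity with a quoted value.
DECLARED = """<!DOCTYPE para [
  <!ENTITY Uuml "&#x00DC;">
]>
<para>&Uuml;mpacare</para>"""

# The same reference with no declaration anywhere.
UNDECLARED = "<para>&Uuml;mpacare</para>"

declared_text = ET.fromstring(DECLARED).text
print(declared_text)

try:
    ET.fromstring(UNDECLARED)
    undeclared_error = None
except ET.ParseError as err:
    undeclared_error = err
    print("rejected:", err)
```

Note that a general entity declaration has no percent sign after the ENTITY keyword (that syntax is reserved for parameter entities) and that the replacement text must be quoted.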

XML recognizes only five built-in, named entity references, shown in Table 2-1. They're not actually references, but are escapes for five punctuation marks that have special meaning for XML.

Table 2-1. XML entity references

Character	Entity
`<`	`&lt;`
`>`	`&gt;`
`&`	`&amp;`
`"`	`&quot;`
`'`	`&apos;`

The only two of these references that must be used throughout any XML document are &lt; and &amp;. Element tags and entity references can appear at any point in a document. No parser could guess, for example, whether a < character is used as a less-than math symbol or as a genuine XML token; it will always assume the latter and will report a malformed document if this assumption proves false.
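When a program generates XML, it is safer to let a library do this escaping than to do it by hand. As a minimal sketch, Python's standard xml.sax.saxutils.escape function replaces the troublesome characters, and the result parses back to the original text. The sample sentence and the para wrapper are invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "cocoa < sugar & cream"

# escape() replaces &, <, and > with their entity references.
escaped = escape(raw)
print(escaped)

# The escaped text parses cleanly and comes back unchanged.
round_trip = ET.fromstring("<para>" + escaped + "</para>").text

# The unescaped text is taken for broken markup.
try:
    ET.fromstring("<para>" + raw + "</para>")
    raw_is_well_formed = True
except ET.ParseError as err:
    raw_is_well_formed = False
    print("rejected:", err)
```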