Creating Links and Cross-References (XSLT)

If you're creating a web site, publishing a book, or creating an XML transaction, chances are many pieces of information will refer to other things. This chapter discusses several ways to link XML elements. It reviews three techniques:

5.1. Generating Links with the id() Function

Our first attempt at linking will be with the XPath id() function.

5.1.1. The ID, IDREF, and IDREFS Datatypes

Three of the basic datatypes supported by XML Document Type Definitions (DTDs) are ID, IDREF, and IDREFS. Here's a simple DTD that illustrates these datatypes:

<!--glossary.dtd-->
<!--The containing tag for the entire glossary-->
<!ELEMENT glossary  (glentry+) >

<!--A glossary entry-->
<!ELEMENT glentry  (term,defn+) >

<!--The word being defined-->
<!ELEMENT term  (#PCDATA) >

<!--The id is used for cross-referencing, and the 
    xreftext is the text used by cross-references.-->
<!ATTLIST term
               id  ID    #REQUIRED 
               xreftext  CDATA    #IMPLIED  >

<!--The definition of the term-->
<!ELEMENT defn  (#PCDATA | xref | seealso)* >

<!--A cross-reference to another term-->
<!ELEMENT xref   EMPTY  >

<!--refid is the ID of the referenced term-->
<!ATTLIST xref
               refid  IDREF    #REQUIRED >

<!--seealso refers to one or more other definitions-->
<!ELEMENT seealso EMPTY>
<!ATTLIST seealso
                  refids   IDREFS  #REQUIRED >

In this DTD, each <term> element is required to have an id attribute, and each <xref> element must have a refid attribute. The ID and IDREF datatypes work according to two rules:

Each value of the id attribute must be unique.
Each value of the refid attribute must match a value of an id attribute elsewhere in the document.

To round out our example, the <seealso> element contains an attribute of type IDREFS. This datatype contains one or more values, each of which must match a value of an ID elsewhere in the document. Multiple values, if present, are separated by whitespace.
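
As a quick illustration of these rules, here is a small fragment that is hypothetical and not part of the sample glossary used in this chapter (the terms "cache" and "buffer" are made up). The comments mark the places where a validating parser would be expected to report errors against glossary.dtd:

```xml
<glossary>
  <glentry>
    <term id="cache">cache</term>
    <!-- Valid: refid matches the id declared on the term above -->
    <defn>Compare with <xref refid="cache"/>.</defn>
  </glentry>
  <glentry>
    <!-- Invalid: the ID value "cache" is already in use -->
    <term id="cache">cache memory</term>
    <!-- Invalid: no element in the document has the ID "buffer" -->
    <defn><seealso refids="cache buffer"/></defn>
  </glentry>
</glossary>
```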

There are some complications of ID and related datatypes, but we'll discuss them later. For now, we'll focus on how the id() function works.

5.1.2. An XML Document in Need of Links

To illustrate the value of linking, we'll use a small glossary written in XML. The glossary contains some <glentry> elements, each of which contains a single <term> and one or more <defn> elements. In addition, a definition is allowed to contain a cross-reference (<xref>) to another <term>. Here's a short sample document:

<?xml version="1.0" ?>
<!DOCTYPE glossary SYSTEM "glossary.dtd">
<glossary>
  <glentry>
    <term id="applet">applet</term>
    <defn>
      An application program,
      written in the Java programming language, that can be 
      retrieved from a web server and executed by a web browser. 
      A reference to an applet appears in the markup for a web 
      page, in the same way that a reference to a graphics
      file appears; a browser retrieves an applet in the same 
      way that it retrieves a graphics file. 
      For security reasons, an applet's access rights are limited
      in two ways: the applet cannot access the file system of the 
      client upon which it is executing, and the applet's 
      communication across the network is limited to the server 
      from which it was downloaded. 
      Contrast with <xref refid="servlet"/>.
      <seealso refids="wildcard-char DMZlong pattern-matching"/>
    </defn>
  </glentry>

  <glentry>
    <term id="DMZlong" xreftext="demilitarized zone">demilitarized 
      zone (DMZ)</term>
    <defn>
      In network security, a network that is isolated from, and 
      serves as a neutral zone between, a trusted network (for example, 
      a private intranet) and an untrusted network (for example, the
      Internet). One or more secure gateways usually control access 
      to the DMZ from the trusted or the untrusted network.
    </defn>
  </glentry>

  <glentry>
    <term id="DMZ">DMZ</term>
    <defn>
      See <xref refid="DMZlong"/>.
    </defn>
  </glentry>

  <glentry>
    <term id="pattern-matching">pattern-matching character</term>
    <defn>
      A special character such as an asterisk (*) or a question mark 
      (?) that can be used to represent zero or more characters. 
      Any character or set of characters can replace a pattern-matching 
      character.
    </defn>
  </glentry>

  <glentry>
    <term id="servlet">servlet</term>
    <defn>
      An application program, written in the Java programming language, 
      that is executed on a web server. A reference to a servlet 
      appears in the markup for a web page, in the same way that a 
      reference to a graphics file appears. The web server executes
      the servlet and sends the results of the execution (if there are
      any) to the web browser. Contrast with <xref refid="applet" />.
    </defn>
  </glentry>

  <glentry>
    <term id="wildcard-char">wildcard character</term>
    <defn>
      See <xref refid="pattern-matching"/>.
    </defn>
  </glentry>
</glossary>

In this XML listing, each <term> element has an id attribute that identifies it uniquely. Many <xref> elements also refer to other terms in the listing. Notice that each time we refer to another term, we don't use the actual text of the referenced term. When we write our stylesheet, we'll use the XPath id() function to retrieve the text of the referenced term; if the name of a term changes (as buzzwords go in and out of fashion, some marketing genius might want to rename the "pattern-matching character," for example), we can rerun our stylesheet and be confident that all references to the new term contain the correct text.

Finally, some <term> elements have an xreftext attribute because some of the actual terms are longer than we'd like to use in a cross-reference. If we had an <xref> to a term such as ASCII (American Standard Code for Information Interchange), it would get pretty tedious if the entire text of the term appeared throughout our document. For such a term, we'll use the xreftext attribute's value, ensuring that the cross-reference contains the less-intimidating text ASCII.

5.1.3. A Stylesheet That Uses the id() Function

Let's look at our desired output. What we want is an HTML document, such as that shown in Figure 5-1, that displays the various definitions in an easy-to-read format, with the cross-references formatted as hyperlinks.

In the HTML document, we'll need to address several things in our stylesheet:

The <title> and the <h1> contain the first and last terms in the glossary. We can use XPath expressions to generate that information.
The <xref> elements have been replaced with the xreftext attribute of the referenced <term> element, if there is one. If that attribute doesn't exist, <xref> is replaced by the text of the <term> element. We'll use the id() function to find the referenced <term>, and we'll use XSLT's control elements to check if the xreftext attribute exists.
The hyperlinks generated from the <xref> elements refer to a named anchor point elsewhere in the HTML document. If <xref> elements refer to a given <term>, we have to create a named anchor (<a name="...">) at the location of the referenced <term>. To simplify things, we'll generate a named anchor for each term automatically, using the id attribute (required to be unique by our DTD) as the name of the anchor.
We need to process any <seealso> elements, as well. These elements are handled similarly to the <xref> elements, the main difference being that the refids attribute of the <seealso> element can refer to more than one glossary entry.

Figure 5-1. HTML document with generated cross-references

Here's the template that takes care of our first task, generating the HTML <title> and the <h1>:

<xsl:template match="glossary">
  <html>
    <head>
      <title>
        <xsl:text>Glossary Listing: </xsl:text>
        <xsl:value-of select="glentry[1]/term"/>
        <xsl:text> - </xsl:text>
        <xsl:value-of select="glentry[last()]/term"/>
      </title>
    </head>
    <body>
      <h1>
        <xsl:text>Glossary Listing: </xsl:text>
        <xsl:value-of select="glentry[1]/term"/>
        <xsl:text> - </xsl:text>
        <xsl:value-of select="glentry[last()]/term"/>
      </h1>
      <xsl:apply-templates select="glentry"/>
    </body>
  </html>
</xsl:template>

We generate the <title> and <h1> using the XPath expressions glentry[1]/term for the first <term> in the document, and using glentry[last()]/term for the last term.
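
The template above builds the same heading text twice. One way to avoid the repetition, sketched here under the assumption that we're using XSLT 1.0, is to build the text once in a variable and output it in both places. The variable name heading is our own invention, not something the stylesheet requires:

```xml
<xsl:template match="glossary">
  <!-- Hypothetical variable name; holds the heading text, built once -->
  <xsl:variable name="heading">
    <xsl:text>Glossary Listing: </xsl:text>
    <xsl:value-of select="glentry[1]/term"/>
    <xsl:text> - </xsl:text>
    <xsl:value-of select="glentry[last()]/term"/>
  </xsl:variable>
  <html>
    <head>
      <title><xsl:value-of select="$heading"/></title>
    </head>
    <body>
      <h1><xsl:value-of select="$heading"/></h1>
      <xsl:apply-templates select="glentry"/>
    </body>
  </html>
</xsl:template>
```

Whether this is an improvement is a matter of taste; for a heading this short, the duplicated version is arguably easier to read.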

Our next step is to process all the <glentry> elements. We'll generate an HTML paragraph for each one, and then we'll generate a named anchor point, using the id attribute as the name of the anchor. Here's the template:

<xsl:template match="glentry">
  <p>
    <b>
      <a name="{term/@id}"/>
      <xsl:value-of select="term"/>
      <xsl:text>: </xsl:text>
    </b>
    <xsl:apply-templates select="defn"/>
  </p>
</xsl:template>

In this template, we're using an attribute value template to generate the name attribute of the HTML <a> element. Because our DTD declares the id attribute on <term>, not on <glentry>, the XPath expression term/@id retrieves the id attribute of the <term> child of the <glentry> element we're currently processing. We use this attribute to generate a named anchor. We then write the term itself in bold and apply the template for the <defn> element. In our output document, each glossary entry contains a paragraph with the highlighted term and its definition.

The name attribute of this HTML <a> element is generated with an attribute value template. See Section 3.3, "Attribute Value Templates" for more information.
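
If you prefer not to use an attribute value template, the same anchor can be generated with the <xsl:attribute> element. The following sketch should produce equivalent output; note that it reads the id from the <term> child (term/@id), because glossary.dtd declares the id attribute on <term> rather than on <glentry>:

```xml
<xsl:template match="glentry">
  <p>
    <b>
      <a>
        <!-- Same result as writing name="{term/@id}" on the <a> element -->
        <xsl:attribute name="name">
          <xsl:value-of select="term/@id"/>
        </xsl:attribute>
      </a>
      <xsl:value-of select="term"/>
      <xsl:text>: </xsl:text>
    </b>
    <xsl:apply-templates select="defn"/>
  </p>
</xsl:template>
```

The attribute value template is more compact; <xsl:attribute> tends to be useful when the attribute's value needs conditional logic.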

Our next step is to process the cross-reference. Here's the template for the <xref> element:

<xsl:template match="xref">
  <a href="#{@refid}">
    <xsl:choose>
      <xsl:when test="id(@refid)/@xreftext">
        <xsl:value-of select="id(@refid)/@xreftext"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="id(@refid)"/>
      </xsl:otherwise>
    </xsl:choose>
  </a>
</xsl:template>

We create the <a> element in two steps:

Create the href attribute. It must refer to the correctly named anchor in the HTML document.
Create the text of the link. This text is the word or phrase that appears in the browser; clicking on the link should take the user to the referenced term.

For the first step, we know that the href attribute must contain a hash mark (#) followed by the name of the anchor point. Because we generated all the named anchors from the id attributes of the various <term> elements, we know the name of the anchor point is the same as the id.

Now all that's left is for us to retrieve the text. This retrieval is the most complicated part of the process (relatively speaking, anyway). Remember that we want to use the xreftext attribute of the <term> element, if there is one, and use the text of the <term> element, otherwise. To implement an if-then-else statement, we use the <xsl:choose> element. In the previous sample, we used a test expression of id(@refid)/@xreftext to see if the xreftext attribute exists. (Remember, an empty node-set is considered false. If the attribute doesn't exist, the node-set will be empty and the <xsl:otherwise> element will be evaluated.) If the test is true, we use id(@refid)/@xreftext to retrieve the cross-reference text. The first part of the XPath expression (id(@refid)) returns the node that has an ID that matches the value @refid; the second part (@xreftext) retrieves the xreftext attribute of that node. We insert the text of the xreftext attribute inside the <a> element.
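
The template for <xref> calls id(@refid) in three places. As a possible variation, we could look up the referenced <term> once and store it in a variable. This is only a sketch; the variable name target is our own choice:

```xml
<xsl:template match="xref">
  <!-- Hypothetical variable name; the referenced <term>, looked up once -->
  <xsl:variable name="target" select="id(@refid)"/>
  <a href="#{@refid}">
    <xsl:choose>
      <!-- An empty node-set is false, so this tests whether xreftext exists -->
      <xsl:when test="$target/@xreftext">
        <xsl:value-of select="$target/@xreftext"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$target"/>
      </xsl:otherwise>
    </xsl:choose>
  </a>
</xsl:template>
```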

Finally, we handle any <seealso> elements. The difference here is that the refids attribute can reference any number of glossary terms, so we'll use the id() function differently. Here's the template for <seealso>:

<xsl:template match="seealso">
  <b>
    <xsl:text>See also: </xsl:text>
  </b>
  <xsl:for-each select="id(@refids)">
    <a href="#{@id}">
      <xsl:choose>
        <xsl:when test="@xreftext">
          <xsl:value-of select="@xreftext"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="."/>
        </xsl:otherwise>
      </xsl:choose>
    </a>
    <xsl:if test="not(position()=last())">
      <xsl:text>, </xsl:text>
    </xsl:if>
  </xsl:for-each>
  <xsl:text>. </xsl:text>
</xsl:template>

There are a couple of important differences here. First, we call the id() function in an <xsl:for-each> element. Calling the id() function with an attribute of type IDREFS returns a node-set; each node in the node-set is the match for one of the IDs in the attribute.

The second difference is that referencing the correctly named anchor is more difficult. When we processed the <xref> element, we knew that the correct anchor name was the value of the refid attribute. When processing <seealso>, the refids attribute doesn't do us any good because it may contain any number of IDs. All is not lost, however. What we did previously was use the id attribute of each node returned by the id() function -- a minor inconvenience, but another difference in processing an attribute of type IDREFS instead of IDREF.

The final difference is that we want to add commas after all items except the last. The <xsl:if> element shown previously does just this. If the position() of the current item is the last, we don't output the comma and space (defined here with the <xsl:text> element). We formatted all references here as a sentence; as an exercise, feel free to process the items in a more sophisticated way. For example, you could generate an HTML list from the IDREFS, or maybe format things differently if the refids attribute only contains a single ID.
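
The templates for <xref> and <seealso> contain nearly identical logic for choosing the link text. One way to share it, sketched here with a mode name (link) that we made up, is to move the logic into a template that matches <term> in that mode and to apply it from both places:

```xml
<!-- Emits one hyperlink for a <term>; shared by xref and seealso.
     The mode name "link" is hypothetical. -->
<xsl:template match="term" mode="link">
  <a href="#{@id}">
    <xsl:choose>
      <xsl:when test="@xreftext">
        <xsl:value-of select="@xreftext"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="."/>
      </xsl:otherwise>
    </xsl:choose>
  </a>
</xsl:template>

<xsl:template match="xref">
  <xsl:apply-templates select="id(@refid)" mode="link"/>
</xsl:template>

<xsl:template match="seealso">
  <b>
    <xsl:text>See also: </xsl:text>
  </b>
  <xsl:for-each select="id(@refids)">
    <xsl:apply-templates select="." mode="link"/>
    <!-- Comma after every item except the last -->
    <xsl:if test="position() != last()">
      <xsl:text>, </xsl:text>
    </xsl:if>
  </xsl:for-each>
  <xsl:text>. </xsl:text>
</xsl:template>
```

One thing to keep in mind with either version: <xsl:for-each> processes the nodes returned by id() in document order, not in the order in which the IDs are listed in the refids attribute.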

We've done several useful things with the id() function. We've been able to use attributes of type ID to discover the links between related pieces of information, and we've converted the XML into HTML links, renderable in an ordinary household browser. If this is the only kind of linking and referencing you need to do, that's great. Unfortunately, there are times when we need to do more, and on those occasions, the id() function doesn't quite cut it. We'll mention the limitations of the id() function briefly, then we'll discuss XSLT functions that let us overcome them.
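
The templates in this section are shown without their enclosing stylesheet. A minimal wrapper, assuming XSLT 1.0 and HTML output, might look like the following sketch:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Serialize the result as HTML rather than XML -->
  <xsl:output method="html"/>

  <!-- Place the templates for glossary, glentry, xref, and seealso here -->

</xsl:stylesheet>
```

No template is needed for the document root or for <defn>; XSLT's built-in template rules process their children and copy text nodes to the output.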

5.1.4. Limitations of IDs

To this point, we've been able to generate cross-references easily. There are some limitations of the ID datatype and the id() function, though:

If you want to use the ID datatype, you have to declare the attributes that use that datatype in your DTD or schema. Unfortunately, if your DTD is defined externally to your XML document, a nonvalidating XML parser isn't required to read it. If the DTD isn't read, then the parser has no idea that a given attribute is of type ID, and the id() function won't find the element.
You must define the ID and IDREF relationship in the XML document. It would be nice to have the XML document define the data only, with the relationships between parts of the document defined externally (say, in a stylesheet). That way, if you needed to define a new relationship between parts of the document, you could do it by creating a new stylesheet, and you wouldn't have to modify your XML document. Requiring the XML document structure to change every time you need to define a new relationship between parts of the document will become unwieldy quickly.
An element can have at most one attribute of type ID. If you'd like to refer to the same element in more than one way, you can't use the id() function.
Any given ID value can be found on at most one element. If you'd like to refer to more than one element with a single value, you can't use the id() function for that, either.
Only one set of IDs exists for the entire document. In other words, if you declare the attributes customer_number, part_number, and order_number to be of type ID, the value of a customer_number must be unique across all the attributes of type ID. It is illegal in this case for a customer_number to be the same as a part_number, even though those attributes might belong to different elements.
An ID can only be an attribute of an XML element. The only way you can use the id() function to refer to another element is through its attribute of type ID. If you want to find another element based on an attribute that isn't an ID, based on the element's content, based on the element's children, etc., the id() function is of no use whatsoever.
The value of an ID must be an XML name. In other words, it can't contain spaces, it can't start with a digit, and it's subject to the other restrictions of XML names. (Section 2.3 of the XML Recommendation defines these restrictions; see http://www.w3.org/TR/REC-xml if you'd like more information.)
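
To make the first limitation concrete: if the DTD isn't read, id(@refid) returns an empty node-set and our links come out with no text. One workaround, which is our own sketch and not one of the techniques this chapter covers, is to compare attribute values directly in an XPath predicate, which doesn't depend on the attribute being declared as type ID:

```xml
<xsl:template match="xref">
  <!-- Compare attribute values directly; no ID typing is needed.
       The variable name "target" is hypothetical. -->
  <xsl:variable name="target" select="//term[@id = current()/@refid]"/>
  <a href="#{@refid}">
    <xsl:choose>
      <xsl:when test="$target/@xreftext">
        <xsl:value-of select="$target/@xreftext"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$target"/>
      </xsl:otherwise>
    </xsl:choose>
  </a>
</xsl:template>
```

The cost is that the expression //term[...] examines every <term> in the document each time an <xref> is processed, which can be slow for large documents.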

To get around all of these limitations, XSLT defines the key() function. We'll discuss that function in the next section.