Common XML Processing Issues (XML in a Nutshell, 2nd Edition)

17.2. Common XML Processing Issues

As with any technology, there are several ways to accomplish most design goals when developing a new XML application, as well as a few potential problems worth knowing about ahead of time. An understanding of the intended uses for these features can help ensure that new applications will be compatible not only with their intended target audience, but also with other XML processing systems that may not even exist yet.

17.2.1. What You Get Is Not What You Saw

The XML specification provides several loopholes that permit XML parsers to play fast and loose with your document's literal contents, while retaining the semantic meaning. Comments can be omitted and entity references silently replaced by the parser without any warning to the client application. Nonvalidating parsers aren't required to retrieve external DTDs or entities, though the parser should at least warn applications that this is happening. While reconstructing an XML document with exactly the same logical structure and content is possible, guaranteeing that it will match the original in a byte-by-byte comparison is not.

TIP: XML Canonicalization defines a form of XML and a process for getting there that permits a much higher degree of predictability in reconstructing documents from their logical model. For details, see http://www.w3.org/TR/xml-c14n.

Authors of simple XML processing tools that act on data without storing or modifying it might not consider these constraints particularly restrictive. The ability to reconstruct an XML document precisely from in-memory data structures, however, becomes more critical for authors of XML editing tools and content-management solutions. While no parser is required to make all comments, whitespace, and entity references available from the parse stream, many do or can be made to do so with the proper configuration options.

The only real option to ensure that your parser reports documents as you want, and not just the minimum required by the XML 1.0 specification, is to check its documentation and configure (or choose) your parser accordingly.

17.2.2. Comments

Despite a long history in HTML of using comments for tasks like Server-Side Includes (SSI) and for hiding JavaScript code and Cascading Style Sheets, using comments for anything other than human-readable notes is generally a bad idea in XML. XML parsers may (and frequently do) discard comments entirely, keeping them from reaching an application at all. Transformations generally discard comments as well.

17.2.3. Processing Instructions

XML parsers are required to provide client applications access to XML processing instructions. Processing instructions provide a mechanism for document authors to communicate with XML-aware applications behind the scenes in a way that doesn't interfere with the content of the documentation. DTD and schema validation both ignore processing instructions, making it possible to use them anywhere in a document structure without changing the DTD or schema. The processing instruction's most widely recognized application is its ability to embed stylesheet references inside XML documents. The following XML fragment shows a stylesheet reference:

<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="test.css"?>

An XML-aware application, such as Internet Explorer 5.5, would be capable of recognizing the XML author's intention to display the document using the test.css stylesheet. This processing instruction can also be used for XSLT stylesheets or other kinds of stylesheets not yet developed, though the application needs to understand how to process them to make this work. Applications that do not understand the processing instructions can still parse and use the information in the XML document while ignoring the unfamiliar processing instruction.

TIP: For more information on how to use processing instructions to designate a stylesheet(s), see the W3C Recommendation Associating Style Sheets with XML Documents at http://www.w3.org/TR/xml-stylesheet/.

The furniture example from Chapter 20 (see Figure 20-1) gives a hypothetical application of processing instructions. A processing instruction in the bookcase.xml file signals the furniture example's processor to verify the parts list from the document against the true list of parts required to build the furniture item:

    <parts_list>
        <part_name id="A" count="1">END PANEL</part_name>
        <part_name id="B" count="2">SIDE PANEL</part_name>
        <part_name id="C" count="1">BACK PANEL</part_name>
        <part_name id="D" count="4">SHELF</part_name>
        <part_name id="E" count="8">HIDDEN CONNECTORS</part_name>
        <part_name id="F" count="8">CONNECTOR SCREWS</part_name>
        <part_name id="G" count="22">7/16" TACKS</part_name>
        <part_name id="H" count="16">SHELF PEGS</part_name>
    </parts_list>
  
<?furniture_app    verify_parts_list?>

This processing instruction is meaningless unless the parsing application understands the given type of processing instruction.

The XML specification also permits the association of the processing instruction's target--the XML name immediately after the <? with a notation, as described in the next section--but this is not required and is rarely used in XML.

17.2.4. Notations

The notation syntax of XML provides a way for the document author to specify an external unparsed entity's type within the XML document's framework. If an application requires access to external data that cannot be represented in XML, consider declaring a notation name and using it where appropriate when declaring external unparsed entities. For example, if an XML application were an annotated Java source-code format, the compiled bytecode could then be referenced as an external unparsed entity.

Notations effectively provide metadata, identifiers that applications may apply to information. Using notations requires making declarations in the DTD, as described in Chapter 3. One use of notations is with NOTATION-type attributes. For example, if a document contained various scripts designed for different environments, it might declare some notations and then use an attribute on a containing element to identify what kind of script it contained:

<!NOTATION DOS PUBLIC "-//MS/DOS Batch File/">
<!NOTATION BASH PUBLIC "-//UNIX/BASH Shell Script/">
<!ELEMENT batch_code (#PCDATA)*>
<!ATTLIST batch_code 
    lang NOTATION (DOS | BASH)>
. . . 
<batch_code lang="DOS">
  echo Hello, world!
</batch_code>

Applications that read this document and recognized the public identifier could interpret the foreign element data correctly, based on its type. (Notations can also have system identifiers, and applications can use either approach.)

Categorizing processing instructions is the other use of notations important to custom XML applications. For instance, the previous furniture_app processing-instruction example could have been declared as a notation in the DTD:

<!NOTATION furniture_app SYSTEM "http://namespaces.example.com/furniture">

Then the furniture-document processing application could verify that the processing instruction was actually intended for itself and not for another application that used a processing instruction with the same name.

17.2.5. Unparsed Entities

Unparsed entities combine attribute and notation declarations to define references to content that will require further (unspecified) processing by the application. Unparsed entities are described in more detail in Chapter 3, but though they are a feature available to applications, they are also rarely used and not generally considered interoperable among XML processors. The linking and referencing tools described in the next section are more commonly used instead.

17.2.6. Links and References

The ability to create links between and within documents is important to XML's long-term success, both on the World Wide Web and for other applications concerned about the relationships between information. The XLink specification, described in Chapter 10, defines the semantics of how these links can be created. Unlike simple HTML links, XLinks can express sophisticated relationships between the source and target elements of a link.

If an XML application requires the ability to encode relationships between various parts of an XML document, or between different documents, implementing this functionality using the XLinks recommendation should be considered. Not only would it save the effort of defining a new (and incompatible) linking scheme, the resulting documents would be intelligible to new XML authoring tools and browsers as XLinks support becomes more widespread. RDDL, described in Chapter 14, makes extensive use of XLink for machine-readable linking.