Processing Instructions and Other Markup (Perl and XML)

2.8. Processing Instructions and Other Markup

Besides elements, you can use several other syntactic objects to make XML easier to manage. Processing instructions (PIs) are used to convey information to a particular XML processor. They specify the intended processor with a target parameter, which is followed by an optional data parameter. Any program that doesn't recognize the target simply skips the PI and pretends it never existed. Here is an example based on an actual behind-the-scenes O'Reilly book hacking experience:

<?file-breaker start chap04.xml?><chapter>
<title>The very long title<?lb?>that seemed to go on forever and ever</title>
<?xml2pdf vspace 10pt?>

The first PI has a target called file-breaker, and its data is start chap04.xml. A program written to handle this document looks for PIs with that target and acts on their data. In this case, the goal is to create a new file and save the XML that follows into it.
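To make the split between target and data concrete, here is a minimal sketch. It is written in Python only because its standard library ships an expat binding, not because the tools described here worked this way. The handler name on_pi and the closing </chapter> tag (added so the fragment is well formed) are assumptions for illustration.

```python
import xml.parsers.expat

# The fragment from the text, with a closing </chapter> added (an assumption)
# so that the parser sees a complete, well-formed document.
DOC = """<?file-breaker start chap04.xml?><chapter>
<title>The very long title<?lb?>that seemed to go on forever and ever</title>
<?xml2pdf vspace 10pt?>
</chapter>"""


def collect_pis(xml_text):
    """Return a list of (target, data) pairs, one per PI, in document order."""
    found = []

    def on_pi(target, data):  # hypothetical handler name
        found.append((target, data))

    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = on_pi
    parser.Parse(xml_text, True)
    return found


pis = collect_pis(DOC)
for target, data in pis:
    print(f"target={target!r} data={data!r}")
```

The parser hands back everything after the target as a single data string. Splitting that string into, say, a command and a filename is left to the program that claims the target.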

The second PI has only a target, lb. We have actually seen this example used in documents to tell an XML processor to create a line break at that point. This example has two problems. First, the PI is a replacement for a space character; that's bad because any program that doesn't recognize the PI will not know that a space should be between the two words. It would be better to place a space after the PI and let the target processor remove any following space itself. Second, the target is an instruction, not the name of a program. A more distinctive name like the one in the next PI, xml2pdf, would be better (with the lb appearing as data instead).
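The missing-space problem is easy to reproduce. The sketch below uses Python's xml.etree.ElementTree, whose default parser discards PIs, as a stand-in for "any program that doesn't recognize the PI". The variable names and the helper visible_text are hypothetical.

```python
import xml.etree.ElementTree as ET

# Title as written in the text: the PI stands in for the space.
pi_only = "<title>The very long title<?lb?>that seemed to go on forever and ever</title>"
# Suggested fix: keep a real space after the PI.
pi_plus_space = "<title>The very long title<?lb?> that seemed to go on forever and ever</title>"


def visible_text(xml_text):
    """Text as seen by a processor that skips PIs it does not recognize."""
    return "".join(ET.fromstring(xml_text).itertext())


print(visible_text(pi_only))        # the two words run together
print(visible_text(pi_plus_space))  # the space survives
```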

PIs are convenient for developers. They have no solid rules that specify how to name a target or what kind of data to use, but in general, target names ought to be very specific and data should be very short.

Those who have written documents using Perl's built-in Plain Old Documentation (POD) mini-markup language[9] may note a similarity between PIs and certain POD directives, particularly the =for paragraphs and =begin/=end blocks. In these paragraphs and blocks, you can leave little messages for a POD processor with a target and some arguments (or any string of text).

[9] The gory details of which lie in Chapter 26 of Programming Perl, Third Edition or in the perlpod manpage.

Another useful markup object is the XML comment. Comments are regions of text that any XML processor ignores. They are meant to hold information for human eyes only, such as notes written by authors to themselves and their collaborators. They are also useful for turning "off" regions of markup -- perhaps if you want to debug the document or you're afraid to delete something altogether. Here's an example:

<!-- this is invisible to the parser -->
This is perfectly visible XML content.
<!--
  <para>This paragraph is no longer part of the document.</para>
-->

Note that these comments look and work exactly like their HTML counterparts.

The only thing you can't put inside a comment is another comment. You can't even feint at nesting comments; the string "--" (two consecutive hyphens), for example, is illegal inside a comment, no matter how you use it.
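Both behaviors can be checked with a few lines of code. This is a sketch using Python's xml.etree.ElementTree, which drops comments by default; the <doc> wrapper element and the helper try_parse are assumptions added so that the samples are complete documents.

```python
import xml.etree.ElementTree as ET

# The comment example from the text, wrapped in a hypothetical <doc> element.
ok = """<doc><!-- this is invisible to the parser -->
This is perfectly visible XML content.
<!--
  <para>This paragraph is no longer part of the document.</para>
--></doc>"""

# A double hyphen inside a comment makes the document not well formed.
bad = "<doc><!-- a comment -- with a double hyphen inside --></doc>"


def try_parse(xml_text):
    """Return the root element, or None if the text is not well-formed XML."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError:
        return None


root = try_parse(ok)
print(len(root))                # number of child elements; the <para> is commented out
print(repr(root.text.strip()))  # only the visible text remains
print(try_parse(bad))           # None: the parser rejects the document
```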

The last syntactic convenience we will discuss is the CDATA section. CDATA stands for character data, which in XML parlance means unparsed content. In other words, the XML processor treats an entire CDATA section as though it contains no markup at all -- even things that look like markup. This is useful if you want to include a large region of text full of markup characters like <, >, and & that would be tedious to convert into entity references one by one.

For example:

<codelisting>
<![CDATA[if( $val > 3 && @lines ) {
  $input = <FILE>;
}]]>
</codelisting>

Everything after <![CDATA[ and before the ]]> is treated as nonmarkup data, so the markup symbols are perfectly fine. We rarely use CDATA sections because they are kind of unsightly, in our humble opinion, and make writing XML processing code a little harder. But they're there if you need them.[10]
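One way to see what "nonmarkup data" means is to compare a CDATA section with the same text escaped by hand. The sketch below uses Python's xml.etree.ElementTree; the variable names are hypothetical, and the whitespace around the CDATA section in the listing above is dropped so the two strings can be compared directly.

```python
import xml.etree.ElementTree as ET

# The code listing protected by a CDATA section.
cdata_doc = """<codelisting><![CDATA[if( $val > 3 && @lines ) {
  $input = <FILE>;
}]]></codelisting>"""

# The same listing with every markup character escaped by hand.
escaped_doc = """<codelisting>if( $val &gt; 3 &amp;&amp; @lines ) {
  $input = &lt;FILE&gt;;
}</codelisting>"""

cdata_text = ET.fromstring(cdata_doc).text
escaped_text = ET.fromstring(escaped_doc).text

print(cdata_text)
print(cdata_text == escaped_text)
```

Once parsed, the two forms are indistinguishable to the application; the CDATA section only changes how the text has to be written in the source file.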

[10] We use CDATA throughout the DocBook-flavored XML that makes up this book. We wrapped all the code listings and sample XML documents in it so we didn't have to suffer the bother of escaping every < and & that appears in them.