XML (Perl Cookbook, 2nd Edition)

22.0. Introduction

The Extensible Markup Language (XML) standard was released in 1998. It quickly became the standard way to represent and exchange almost every kind of data, from books to genes to function calls.

XML succeeded where other past "standard" data formats failed (including XML's ancestor, SGML—the Standard Generalized Markup Language). There are three reasons for XML's success: it is text-based instead of binary, it is simple rather than complex, and it has a superficial resemblance to HTML.

Text: Unix realized nearly 30 years before XML that humans primarily interact with computers through text. Thus text files are the only files any system is guaranteed to be able to read and write. Because XML is text, programmers can easily make legacy systems emit XML reports.
Simplicity: As we'll see, a lot of complexity has arisen around XML, but the XML standard itself is very simple. There are very few things that can appear in an XML document, but from those basic building blocks you can build extremely complex systems.
HTML: XML is not HTML, but XML and HTML share a common ancestor: SGML. The superficial resemblance meant that the millions of programmers who had to learn HTML to put data on the web were able to learn (and accept) XML more easily.

22.0.1. Syntax

Example 22-1 shows a simple XML document.

Example 22-1. Simple XML document

<?xml version="1.0" encoding="UTF-8"?>
<books>
  <!-- Programming Perl 3ed -->
  <book id="1">
    <title>Programming Perl</title>
    <edition>3</edition>
    <authors>
      <author>
        <firstname>Larry</firstname>
        <lastname>Wall</lastname>
      </author>
      <author>
        <firstname>Tom</firstname>
        <lastname>Christiansen</lastname>
      </author>
      <author>
        <firstname>Jon</firstname>
        <lastname>Orwant</lastname>
      </author>
    </authors>
    <isbn>0-596-00027-8</isbn>
  </book>
  <!-- Perl & LWP -->
  <book id="2">
    <title>Perl &amp; LWP</title>
    <edition>1</edition>
    <authors>
      <author>
        <firstname>Sean</firstname>
        <lastname>Burke</lastname>
      </author>
    </authors>
    <isbn>0-596-00178-9</isbn>
  </book>
  <book id="3">
    <!-- Anonymous Perl -->
    <title>Anonymous Perl</title>
    <edition>1</edition>
    <authors />
    <isbn>0-555-00178-0</isbn>
  </book>
</books>

At first glance, XML looks a lot like HTML: there are elements (e.g., <book> </book>), entities (e.g., &amp; and &lt;), and comments (e.g., <!-- Perl & LWP -->). Unlike HTML, XML doesn't define a standard set of elements, and defines only a minimum set of entities (for single quotes, double quotes, less-than, greater-than, and ampersand). The XML standard specifies only syntactic building blocks like the < and > around elements. It's up to you to create the vocabulary, that is, the element and attribute names like books, authors, etc., and how they nest.

XML's opening and closing tags are familiar from HTML:

<book>
</book>

XML adds a variation for empty elements (those with no text or other elements between the opening and closing tags):

<author />

Elements may have attributes, as in:

<book id="1">

Unlike HTML, the case of XML elements, entities, and attributes matters: <Book> and <book> start two different elements. All attribute values must be quoted, either with single or double quotes (id='1' versus id="1"). Unicode letters, underscores, hyphens, periods, and digits are all acceptable in element and attribute names, but the first character of a name must be a letter or an underscore. Colons are allowed only in namespaces (see Namespaces, later in this chapter).
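The following is a minimal sketch of these rules in action, assuming the XML::LibXML module (discussed later in this chapter) is installed. The sample document and variable names are ours, invented for illustration. It reads an attribute, shows that the parser decodes &amp; in content, and shows that asking for book does not match an element named Book.

```perl
use strict;
use warnings;
use XML::LibXML;

# A small made-up document in the style of Example 22-1.
my $xml = <<'XML';
<books>
  <book id='2'><title>Perl &amp; LWP</title></book>
  <Book id="3"><title>Anonymous Perl</title></Book>
</books>
XML

my $doc  = XML::LibXML->load_xml(string => $xml);
my $root = $doc->documentElement;

# Names are case-sensitive: only the lowercase <book> matches.
my @books = $root->getChildrenByTagName('book');

# Single or double quotes around the value make no difference here.
my $id = $books[0]->getAttribute('id');

# The parser decodes entity references in element content.
my ($title_el) = $books[0]->getChildrenByTagName('title');
my $title      = $title_el->textContent;

printf "%d book element(s); id=%s; title=%s\n", scalar @books, $id, $title;
```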

Whitespace is surprisingly tricky. The XML specification says anything that's not a markup character is content. So (in theory) the newlines and whitespace indents between tags in Example 22-1 are text data. Most XML parsers offer the choice of retaining whitespace or sensibly folding it (e.g., to ignore newlines and indents).
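As a rough illustration, and assuming XML::LibXML as the parser, the sketch below parses the same text twice: once with the defaults, and once with the parser's no_blanks option, which asks it to drop whitespace-only text nodes. The sample data is invented for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $xml = "<authors>\n  <author>Larry</author>\n  <author>Tom</author>\n</authors>\n";

# Default: the newlines and indents between tags are kept as text nodes.
my $kept    = XML::LibXML->load_xml(string => $xml);
my @with_ws = $kept->documentElement->childNodes;

# no_blanks => 1 asks the parser to discard whitespace-only text nodes.
my $folded     = XML::LibXML->load_xml(string => $xml, no_blanks => 1);
my @without_ws = $folded->documentElement->childNodes;

printf "default: %d child nodes; no_blanks: %d child nodes\n",
    scalar @with_ws, scalar @without_ws;
```

With the defaults, the whitespace runs show up as children of authors alongside the two author elements; with no_blanks, only the elements remain.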

22.0.2. XML Declaration

The first line of Example 22-1 is the XML declaration:

<?xml version="1.0" encoding="UTF-8" ?>

This declaration is optional—Version 1.0 of XML and UTF-8 encoded text are the defaults. The encoding attribute specifies the Unicode encoding of the document. Some XML parsers can cope with arbitrary Unicode encodings, but others are limited to ASCII and UTF-8. For maximum portability, create XML data as UTF-8.
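A parser normally exposes what the declaration said. Here is a small sketch, again assuming XML::LibXML, that compares a document with a declaration to one without:

```perl
use strict;
use warnings;
use XML::LibXML;

my $declared = XML::LibXML->load_xml(
    string => qq{<?xml version="1.0" encoding="UTF-8"?>\n<books/>\n});
my $bare = XML::LibXML->load_xml(string => "<books/>\n");

# Values taken from the XML declaration.
my $version  = $declared->version;
my $encoding = $declared->encoding;

# With no declaration, the parser falls back to its default version.
my $default_version = $bare->version;

print "declared:   version=$version encoding=$encoding\n";
print "undeclared: version=$default_version\n";
```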

22.0.3. Processing Instructions

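A processing instruction passes information to a particular program that processes the document. For example, this one is aimed at a program that converts the document to PDF:

<title><?pdf font Helvetica 18pt?>XML in Perl</title>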

Processing instructions have the general structure:

<?target data ... ?>

When an XML processor encounters a processing instruction, it checks the target. Processors should ignore targets they don't recognize. This lets one XML file contain instructions for many different processors. For example, the XML source for this book might have separate instructions for programs that convert to HTML and to PDF.
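Here is one way a processor might pick out its own instructions, sketched with XML::LibXML. The second instruction (target html) and its data are invented for this example. The code stands in for a hypothetical PDF converter that acts only on the pdf target.

```perl
use strict;
use warnings;
use XML::LibXML;

my $xml = <<'XML';
<title><?pdf font Helvetica 18pt?><?html class="big"?>XML in Perl</title>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Keep the data of instructions aimed at "pdf"; ignore every other target.
my @for_pdf;
for my $pi ($doc->findnodes('//processing-instruction()')) {
    next unless $pi->nodeName eq 'pdf';   # nodeName is the target
    push @for_pdf, $pi->nodeValue;        # nodeValue is the data after it
}

print "pdf instruction: $_\n" for @for_pdf;
```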

22.0.4. Comments

XML comments have the same syntax as HTML comments:

<!-- ... -->

The comment text can't contain --, so comments don't nest.

22.0.5. CDATA

Sometimes you want to put text in an XML document without having to worry about encoding entities. Such a literal block is called CDATA in XML, written:

<![CDATA[literal text here]]>

The ugly syntax betrays XML's origins in SGML. Everything after the initial <![CDATA[ and up to the ]]> is literal data in which XML markup characters such as < and & have no special meaning.

For example, you might put sample code that contains a lot of XML markup characters in a CDATA block:

<para>The code to do this is as follows:</para>
<code><![CDATA[$x = $y << 8 & $z]]></code>
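To see the effect from Perl, this sketch (assuming XML::LibXML) reads the text back out of a CDATA section, then stores the same text in an ordinary text node to show the escaping that CDATA lets you avoid:

```perl
use strict;
use warnings;
use XML::LibXML;

my $xml = <<'XML';
<code><![CDATA[$x = $y << 8 & $z]]></code>
XML

# The parser hands back the literal characters from the CDATA section.
my $doc  = XML::LibXML->load_xml(string => $xml);
my $text = $doc->documentElement->textContent;
print "$text\n";

# Stored as ordinary text, the same characters are escaped on output.
my $plain = XML::LibXML::Document->new('1.0', 'UTF-8');
my $code  = $plain->createElement('code');
$code->appendText($text);
my $escaped = $code->toString;
print "$escaped\n";
```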

22.0.6. Well-Formed XML

To ensure that all XML documents are parsable, there are some minimum requirements expected of an XML document. The following list is adapted from the list in Perl & XML, by Erik T. Ray and Jason McIntosh (O'Reilly):

The document must have one and only one top-level element (e.g., books in Example 22-1).
Every element with content must have both a start and an end tag.
All attributes must have values, and those values must be quoted.
Elements must not overlap.
Markup characters (<, >, and &) must be used to indicate markup only. In other words, you can't have <title>Perl & XML</title> because the & can only indicate an entity reference. CDATA sections are the only exception to this rule.

If an XML document meets these rules, it's said to be "well-formed." Any XML parser that conforms to the XML standard should be able to parse a well-formed document.
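A quick way to check well-formedness from Perl is to try parsing the document and trap the failure. This is a minimal sketch assuming XML::LibXML. The helper name is_well_formed and the sample strings are ours.

```perl
use strict;
use warnings;
use XML::LibXML;

# Returns 1 if the string parses as well-formed XML, 0 otherwise.
# XML::LibXML dies on a parse error, so eval traps the failure.
sub is_well_formed {
    my ($xml) = @_;
    return eval { XML::LibXML->load_xml(string => $xml); 1 } ? 1 : 0;
}

my %samples = (
    'escaped ampersand'  => '<title>Perl &amp; XML</title>',
    'bare ampersand'     => '<title>Perl & XML</title>',
    'overlapping'        => '<b><i>text</b></i>',
    'unquoted attribute' => '<book id=1/>',
    'two top-level'      => '<book/><book/>',
);

for my $name (sort keys %samples) {
    printf "%-20s %s\n", $name,
        is_well_formed($samples{$name}) ? 'well-formed' : 'NOT well-formed';
}
```

Only the first sample passes. Each of the others breaks one of the rules listed above.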

22.0.6. Schemas

There are two parts to any program that processes an XML document: the XML parser, which manipulates the XML markup, and the program's logic, which identifies text, the elements, and their structure. Well-formedness ensures that the XML parser can work with the document, but it doesn't guarantee that the elements have the correct names and are nested correctly.

For example, these two XML fragments encode the same information in different ways:

<book>
  <title>Programming Perl</title>
  <edition>3</edition>
  <authors>
    <author>
      <firstname>Larry</firstname>
      <lastname>Wall</lastname>
    </author>
    <author>
      <firstname>Tom</firstname>
      <lastname>Christiansen</lastname>
    </author>
    <author>
      <firstname>Jon</firstname>
      <lastname>Orwant</lastname>
    </author>
  </authors>
</book>

<work>
  <writers>Larry Wall, Tom Christiansen, and Jon Orwant</writers>
  <name edition="3">Programming Perl</name>
</work>

The structure is different, and if you wrote code to extract the title from one ("get the contents of the book element, then find the contents of the title element within that") it would fail completely on the other. For this reason, it is common to write a specification for the elements, attributes, entities, and the ways to use them. Such a specification lets you be confident that your program will never be confronted with XML it cannot deal with. The two formats for such specifications are DTDs and schemas.
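To make the failure concrete, here is a sketch of that title-extraction logic, assuming XML::LibXML. The function name book_title is ours. It works on the first vocabulary and finds nothing in the second, even though both documents are well-formed.

```perl
use strict;
use warnings;
use XML::LibXML;

# "Get the book element, then find the title element within that."
# Returns the title text, or undef if the structure is not as expected.
sub book_title {
    my ($xml) = @_;
    my $doc  = XML::LibXML->load_xml(string => $xml);
    my $root = $doc->documentElement;
    return unless $root->nodeName eq 'book';
    my ($title) = $root->getChildrenByTagName('title');
    return defined $title ? $title->textContent : undef;
}

my $book = '<book><title>Programming Perl</title><edition>3</edition></book>';
my $work = '<work><name edition="3">Programming Perl</name></work>';

for my $xml ($book, $work) {
    my $title = book_title($xml);
    print defined $title ? "title: $title\n" : "no title found\n";
}
```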

DTDs are the older and more limited format, acquired by way of XML's SGML past. DTDs are not written in XML, so you need a custom (complex) parser to work with them. Additionally, they aren't suitable for many uses—simply saying "the book element must contain one each of the title, edition, author, and isbn elements in any order" is remarkably difficult.

For these reasons, most modern content specifications take the form of schemas. The World Wide Web Consortium (W3C), the folks responsible for XML and a host of related standards, have a standard called XML Schema (http://www.w3.org/TR/xmlschema-0/). This is the most common schema language in use today, but it is complex and problematic. An emerging rival for XML Schema is the OASIS group's RelaxNG; see http://www.oasis-open.org/committees/relax-ng/spec-20011203.html for more information.

There are Perl modules for working with schemas. The most important thing you do with a schema, however, is validate an XML document against it. Recipe 22.5 shows how to use XML::LibXML to do this. XML::Parser does not support validation.
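As a taste of what validation looks like, here is a minimal sketch using the XML::LibXML::RelaxNG class that ships with XML::LibXML. The tiny RELAX NG schema is hypothetical, written only for this example: it says that a book contains a title followed by an edition. The helper name is_valid is also ours.

```perl
use strict;
use warnings;
use XML::LibXML;

# A deliberately tiny, made-up RELAX NG schema.
my $rng = XML::LibXML::RelaxNG->new(string => <<'RNG');
<element name="book" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="title"><text/></element>
  <element name="edition"><text/></element>
</element>
RNG

# Returns 1 if the document matches the schema, 0 otherwise.
sub is_valid {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    # validate() dies with a message when the document doesn't match.
    return eval { $rng->validate($doc); 1 } ? 1 : 0;
}

my $book = '<book><title>Programming Perl</title><edition>3</edition></book>';
my $work = '<work><name edition="3">Programming Perl</name></work>';

printf "book: %s\n", is_valid($book) ? 'valid' : 'invalid';
printf "work: %s\n", is_valid($work) ? 'valid' : 'invalid';
```

Both documents are well-formed, but only the first one is valid against this schema.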

22.0.7. Namespaces

One especially handy property of XML is nested elements. This lets one document encapsulate another. For example, suppose you want to send a purchase order document in a mail message. Here's how you'd do that:

<mail>
  <header>
    <from>me@example.com</from>
    <to>you@example.com</to>
    <subject>PO for my trip</subject>
  </header>
  <body>
    <purchaseorder>
      <for>Airfare</for>
      <bill_to>Editorial</bill_to>
      <amount>349.50</amount>
    </purchaseorder>
  </body>
</mail>

This works, but we can easily run into problems. For example, if the purchase order used <to> instead of <bill_to> to indicate the department to be charged, we'd have two elements named <to>. The resulting document is sketched here:

<mail>
  <header>
    <to>you@example.com</to>
  </header>
  <body>
    <to>Editorial</to>
  </body>
</mail>

This document uses to for two different purposes. This is similar to the problem in programming where a global variable in one module has the same name as a global variable in another module. Programmers can't be expected to avoid variable names from other modules, because that would require them to know every module's variables.

The solution to the XML problem is similar to the programming problem's solution: namespaces. A namespace is a unique prefix for the elements and attributes in an XML vocabulary, and is used to avoid clashes with elements from other vocabularies. If you rewrote your purchase-order email example with namespaces, it might look like this:

<mail xmlns:email="http://example.com/dtds/mailspec/">
  <email:from>me@example.com</email:from>
  <email:to>you@example.com</email:to>
  <email:subject>PO for my trip</email:subject>
  <email:body>
    <purchaseorder xmlns:po="http://example.com/dtd/purch/">
      <po:for>Airfare</po:for>
      <po:to>Editorial</po:to>
      <po:amount>349.50</po:amount>
    </purchaseorder>
  </email:body>
</mail>

An attribute like xmlns:prefix="URL" identifies the namespace for the contents of the element that the attribute is attached to. In this example, there are two namespaces, identified by the prefixes email and po. The email:to element is different from the po:to element, and processing software can avoid confusion.

Most of the XML parsers in Perl support namespaces, including XML::Parser and XML::LibXML.
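For instance, with XML::LibXML you can ask for an element by its namespace URL and local name rather than by its prefix. This sketch uses a cut-down version of the mail document above:

```perl
use strict;
use warnings;
use XML::LibXML;

my $xml = <<'XML';
<mail xmlns:email="http://example.com/dtds/mailspec/">
  <email:to>you@example.com</email:to>
  <email:body>
    <purchaseorder xmlns:po="http://example.com/dtd/purch/">
      <po:to>Editorial</po:to>
    </purchaseorder>
  </email:body>
</mail>
XML

my $root = XML::LibXML->load_xml(string => $xml)->documentElement;

# Look up "to" by namespace URL, so the two vocabularies never collide.
my ($mail_to) = $root->getElementsByTagNameNS('http://example.com/dtds/mailspec/', 'to');
my ($po_to)   = $root->getElementsByTagNameNS('http://example.com/dtd/purch/', 'to');

printf "%s (%s) => %s\n", $_->nodeName, $_->namespaceURI, $_->textContent
    for $mail_to, $po_to;
```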

22.0.8. Transformations

One of the favorite pastimes of XML hackers is turning XML into something else. In the old days, this was accomplished with a program that knew a specific XML vocabulary and could intelligently turn an XML file that used that vocabulary into something else, like a different type of XML, or an entirely different file format, such as HTML or PDF. This was such a common task that people began to separate the transformation engine from the specific transformation, resulting in a new specification: Extensible Stylesheet Language Transformations (XSLT).

Turning XML into something else with XSLT involves writing a stylesheet. A stylesheet says "when you see this in the input XML, emit that." You can encode loops and branches, and identify elements (e.g., "when you see the book element, print only the contents of the enclosed title element").

Transformations in Perl are best accomplished through the XML::LibXSLT module, although XML::Sablotron and XML::XSLT are sometimes also used. We show how to use XML::LibXSLT in Recipe 22.7.
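To give a feel for a stylesheet before the full recipe, here is a minimal sketch assuming both XML::LibXML and XML::LibXSLT are installed. The stylesheet is ours, written for this example: for each book element it emits only the contents of the enclosed title element, one per line, as plain text.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

my $source = XML::LibXML->load_xml(string => <<'XML');
<books>
  <book id="1"><title>Programming Perl</title><edition>3</edition></book>
  <book id="2"><title>Perl &amp; LWP</title><edition>1</edition></book>
</books>
XML

# "When you see a book, print only the contents of its title."
my $style_doc = XML::LibXML->load_xml(string => <<'XSL');
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/books">
    <xsl:for-each select="book">
      <xsl:value-of select="title"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
XSL

my $xslt       = XML::LibXSLT->new;
my $stylesheet = $xslt->parse_stylesheet($style_doc);
my $result     = $stylesheet->transform($source);
my $titles     = $stylesheet->output_as_chars($result);

print $titles;
```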

22.0.9. Paths

Of the new vocabularies and tools for XML, possibly the most useful is XPath. Think of it as regular expressions for XML structure—you specify the elements you're looking for ("the title within a book"), and the XPath processor returns a pointer to the matching elements.

An XPath expression looks like:

/books/book/title

Slashes separate tests. XPath has syntax for testing attributes, elements, and text, and for identifying parents and siblings of nodes.
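Here are a few such expressions at work, sketched with XML::LibXML's findnodes and findvalue methods against a shortened version of Example 22-1:

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(string => <<'XML');
<books>
  <book id="1"><title>Programming Perl</title><edition>3</edition></book>
  <book id="2"><title>Perl &amp; LWP</title><edition>1</edition></book>
  <book id="3"><title>Anonymous Perl</title><edition>1</edition></book>
</books>
XML

# Element tests: every title within a book within books.
my @titles = map { $_->textContent } $doc->findnodes('/books/book/title');

# Attribute test: the title of the book whose id is 2.
my $second = $doc->findvalue('/books/book[@id="2"]/title');

# Text test, then a step up to the parent: ids of the first editions.
my @first_ed = map { $_->value }
    $doc->findnodes('/books/book/edition[text()="1"]/../@id');

print "titles: ", join(' | ', @titles), "\n";
print "book 2: $second\n";
print "first editions: @first_ed\n";
```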

The XML::LibXML module has strong support for XPath, and we show how to use it in Recipe 22.6. XPath also crops up in the XML::Twig module shown in Recipe 22.8.

22.0.10. History of Perl and XML

Initially, Perl had only one way to parse XML: regular expressions. This was prone to error and often failed to deal with well-formed XML (e.g., CDATA sections). The first real XML parser in Perl was XML::Parser, Larry Wall's Perl interface to James Clark's expat C library. Most other languages (notably Python and PHP) also had an expat wrapper as their first correct XML parser.

XML::Parser was a prototype—the mechanism for passing components of XML documents to Perl was experimental and intended to evolve over the years. But because XML::Parser was the only XML parser for Perl, people quickly wrote applications using it, and it became impossible for the interface to evolve. Because XML::Parser has a proprietary API, you shouldn't use it directly.

XML::Parser is an event-based parser. You register callbacks for events like "start of an element," "text," and "end of an element." As XML::Parser parses an XML file, it calls the callbacks to tell your code what it's found. Event-based parsing is quite common in the XML world, but XML::Parser has its own events and doesn't use the standard Simple API for XML (SAX) events. This is why we recommend you don't use XML::Parser directly.

The XML::SAX modules provide a SAX wrapper around XML::Parser and several other XML parsers. XML::Parser parses the document, but you write code to work with XML::SAX, and XML::SAX translates between XML::Parser events and SAX events. XML::SAX also includes a pure Perl parser, so a program for XML::SAX works on any Perl system, even those that can't compile XS modules. XML::SAX supports the full level 2 SAX API (where the backend parser supports features such as namespaces).
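The event-based style looks something like the following sketch, which assumes XML::SAX is installed with at least one parser registered (the pure Perl one will do). The handler class TitleCollector is hypothetical, invented for this example: it counts elements as they start and collects the text inside each title.

```perl
use strict;
use warnings;
use XML::SAX::ParserFactory;

{
    package TitleCollector;          # hypothetical handler for this sketch
    use parent 'XML::SAX::Base';

    # Called at each start tag.
    sub start_element {
        my ($self, $el) = @_;
        $self->{count}{ $el->{LocalName} }++;
        if ($el->{LocalName} eq 'title') {
            $self->{in_title} = 1;
            push @{ $self->{titles} }, '';
        }
    }

    # Called for text; it may arrive in several chunks, so append.
    sub characters {
        my ($self, $chars) = @_;
        $self->{titles}[-1] .= $chars->{Data} if $self->{in_title};
    }

    # Called at each end tag.
    sub end_element {
        my ($self, $el) = @_;
        $self->{in_title} = 0 if $el->{LocalName} eq 'title';
    }
}

my $xml = <<'XML';
<books>
  <book id="1"><title>Programming Perl</title></book>
  <book id="2"><title>Perl &amp; LWP</title></book>
</books>
XML

my $handler = TitleCollector->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string($xml);

print "books seen: $handler->{count}{book}\n";
print "title: $_\n" for @{ $handler->{titles} };
```

Because the handler uses only SAX events, the same code should run unchanged whichever backend parser XML::SAX selects.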

The other common way to parse XML is to build a tree data structure: element A is a child of element B in the tree if element A is inside element B in the XML document. There is a standard API for working with such a tree data structure: the Document Object Model (DOM). The XML::LibXML module uses the GNOME project's libxml2 library to quickly and efficiently build a DOM tree. It also supports XPath and validation. The XML::DOM module was an attempt to build a DOM tree using XML::Parser as the backend, but most programmers prefer the speed of XML::LibXML. In Recipe 22.2 we show XML::LibXML, not XML::DOM.
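The tree view is easy to see from code. This sketch, assuming XML::LibXML, walks the tree recursively and indents each element name by its depth, then asks one element for its parent. The function name outline is ours.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(string => <<'XML');
<book id="1">
  <title>Programming Perl</title>
  <authors>
    <author><lastname>Wall</lastname></author>
    <author><lastname>Christiansen</lastname></author>
  </authors>
</book>
XML

# Recursively collect element names, indented by depth in the tree.
sub outline {
    my ($node, $depth) = @_;
    my @lines = (('  ' x $depth) . $node->nodeName);
    # Skip text nodes (such as whitespace); descend into elements only.
    for my $child (grep { $_->isa('XML::LibXML::Element') } $node->childNodes) {
        push @lines, outline($child, $depth + 1);
    }
    return @lines;
}

my @outline = outline($doc->documentElement, 0);
print "$_\n" for @outline;

# Each node knows its parent: lastname sits inside author.
my ($lastname) = $doc->findnodes('//lastname');
my $parent     = $lastname->parentNode->nodeName;
print "parent of lastname: $parent\n";
```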

So, in short: for events, use XML::SAX with XML::Parser or XML::LibXML behind it; for DOM trees, use XML::LibXML; for validation, use XML::LibXML.

22.0.11. Further Reading

While the XML specification itself is simple, the specifications for namespaces, schemas, stylesheets, and so on are not. There are many good books to help you learn and use these technologies:

For help with all of the nuances of XML, try Learning XML, by Erik T. Ray (O'Reilly), and XML in a Nutshell, Second Edition, by Elliotte Rusty Harold and W. Scott Means (O'Reilly).
For help with XML Schemas, try XML Schema, by Eric van der Vlist (O'Reilly).
For examples of stylesheets and transformations, and help with the many non-trivial aspects of XSLT, see XSLT, by Doug Tidwell (O'Reilly), and XSLT Cookbook, by Sal Mangano (O'Reilly).
For help with XPath, try XPath and XPointer, by John E. Simpson (O'Reilly).