XML Basics (XSLT)

<?xml version="1.0"?>
<postalcodes>
  <title>Most-used postal codes in November 2000</title>
  <item>
    <city>Schenectady</city>
    <postalcode>12304</postalcode>
    <usage-count>2039</usage-count>
  </item>
  <item>
    <city>Kuala Lumpur</city>
    <postalcode>57000</postalcode>
    <usage-count>1983</usage-count>
  </item>
  <item>
    <city>London</city>
    <postalcode>SW1P 4RG</postalcode>
    <usage-count>1722</usage-count>
  </item>
  ...
</postalcodes>

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT po (customer-id , item-ordered+ , order-date)>
<!ELEMENT customer-id (#PCDATA)>
<!ELEMENT item-ordered EMPTY>
<!ATTLIST item-ordered
  part-number CDATA #REQUIRED
  quantity    CDATA #REQUIRED
>
<!ELEMENT order-date EMPTY>
<!ATTLIST order-date
  day   CDATA #REQUIRED
  month CDATA #REQUIRED
  year  CDATA #REQUIRED
>

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2000/10/XMLSchema">
  <xsd:element name="po">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="customer-id"/>
        <xsd:element ref="item-ordered" maxOccurs="unbounded"/>
        <xsd:element ref="order-date"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="customer-id" type="xsd:string"/>
  <xsd:element name="item-ordered">
    <xsd:complexType>
      <xsd:attribute name="part-number" use="required">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:pattern value="[0-9]{5}-[0-9]{4}-[0-9]{5}"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
      <xsd:attribute name="quantity" use="required" type="xsd:integer"/>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="order-date">
    <xsd:complexType>
      <xsd:attribute name="day" use="required">
        <xsd:simpleType>
          <xsd:restriction base="xsd:integer">
            <xsd:maxInclusive value="31"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
      <xsd:attribute name="month" use="required">
        <xsd:simpleType>
          <xsd:restriction base="xsd:integer">
            <xsd:maxInclusive value="12"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
      <xsd:attribute name="year" use="required">
        <xsd:simpleType>
          <xsd:restriction base="xsd:integer">
            <xsd:maxInclusive value="2100"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

1.2.3. DOM and SAX

The two most popular APIs used to parse XML documents are the Document Object Model (DOM) and the Simple API for XML (SAX). DOM is an official recommendation of the W3C (available at http://www.w3.org/TR/REC-DOM-Level-1), while SAX is a de facto standard created by David Megginson and others on the XML-DEV mailing list (http://lists.xml.org/archives). We'll discuss these two APIs briefly here. We won't use them much in this book, but discussing them will give you some insight into how most XSLT processors work.

TIP: See http://www.megginson.com/SAX/ for the SAX standard. (Make sure the letters SAX are in uppercase.) If you'd like to learn more about the XML-DEV mailing list, send email with "subscribe xml-dev" in the body of the message to majordomo@xml.org. You can also check out http://www.lists.ic.ac.uk/hypermail/xml-dev to see the XML-DEV mailing list archives.

1.2.3.1. DOM

DOM is designed to build a tree view of your document. Remember that all XML documents must be contained in a single element; that single element becomes the root of the tree. The DOM specification defines several language-neutral interfaces, described here:

Node

This interface is the base datatype of the DOM. Element, Document, Text, Comment, and Attr all extend the Node interface.

Document

This object contains the DOM representation of the XML document. Given a Document object, you can get the root of the tree (the document element); from the root, you can move through the tree to find all elements, attributes, text, comments, processing instructions, etc., in the XML document.

Element

This interface represents an element in an XML document.

Attr

This interface represents an attribute of an element in an XML document.

Text

This interface represents a piece of text from the XML document. Any text in your XML document becomes a Text node. This means that the text of a DOM object is a child of the object, not a property of it. The text of an Element is represented as a Text child of an Element object; the text of an Attr is also represented that way.

Comment

This interface represents a comment in the XML document. A comment begins with <!-- and ends with -->. The only restriction on its contents is that two consecutive hyphens (--) can appear only at the start or end of the comment. Other than that, a comment can include angle brackets (< >), ampersands (&), single or double quotation marks (' "), and anything else.

ProcessingInstruction

This interface represents a processing instruction in the XML document. Processing instructions look like this:

<?xml-stylesheet href="case-study.xsl" type="text/xsl"?>
<?cocoon-process type="xslt"?>

Processing instructions contain processor-specific information. The first of the two PIs (PI is XML jargon -- feel free to drop this into casual conversations to impress your friends) is the standard way to associate an XSLT stylesheet with an XML document (more on this in a minute). The second PI is used by Cocoon, an XML publishing framework from the Apache Software Foundation. (If you're not familiar with Cocoon, look at the Cocoon home page at http://xml.apache.org/cocoon.)
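To make the interfaces above a little more concrete, here is a minimal sketch using Python's standard xml.dom.minidom module, which is only one of many DOM implementations. The sample document and the id attribute are invented for illustration; they are not part of the postal codes example.

```python
from xml.dom import minidom, Node

# A small, made-up document containing a PI, a comment, an attribute, and text.
SOURCE = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet href="case-study.xsl" type="text/xsl"?>'
    '<!-- a comment -->'
    '<item id="a1"><city>London</city></item>'
)

doc = minidom.parseString(SOURCE)

# The Document node's children: the PI, the comment, and the document element.
# (The XML declaration itself is not exposed as a node.)
top_level = [child.nodeType for child in doc.childNodes]

pi = doc.childNodes[0]            # ProcessingInstruction
comment = doc.childNodes[1]       # Comment
root = doc.documentElement        # Element <item>
id_attr = root.getAttributeNode("id")  # Attr

# The text of <city> is a Text *child* of the Element, not a property of it.
city = root.getElementsByTagName("city")[0]
text_node = city.firstChild

print(pi.target)          # name of the processing instruction
print(comment.data)       # contents of the comment
print(id_attr.value)      # value of the id attribute
print(text_node.data)     # text held by the Text node
```

Every object here is a Node, so the same nodeType, childNodes, and parentNode properties are available on all of them.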

When you parse an XML document with a DOM parser, it:

Creates objects (Element, Attr, Text, Comment, and so on) representing the contents of the document. These objects implement the interfaces defined in the DOM specification.
Arranges these objects in a tree. Each Element in the XML document has some properties (such as the element's name), and may also have some children.
Parses the entire document before control returns to your code. This means that for large documents, there is a significant delay while the document is parsed.

The most significant thing about the DOM is that it is based on a tree view of your document. An XSLT processor uses a very similar tree view (with some slight differences, such as the fact that not everything we deal with in XPath and XSLT has the same root element). Understanding how a DOM parser works makes it easier to understand how an XSLT processor views your document.

1.2.3.1.1. A sample DOM tree

DOM, XSLT, and XPath all use tree structures to represent data from an XML document. For this reason, it's important to have at least a casual knowledge of how DOM builds a tree structure. Our earlier <postalcodes> document is shown as a DOM tree in Figure 1-2.

Figure 1-2. DOM tree representation of an XML document

NOTE: The image in Figure 1-2 was produced by the DOMit servlet, an XML validation service available at http://www-106.ibm.com/developerworks/features/xmlvalidatorform.html.

If we want to find different parts of our XML document, sort the subtrees based on the first character of the text of the <postalcode> element, or select only the subtrees in which the text of the <usage-count> element has a numeric value greater than 500, we have to start at the top of the DOM tree and work our way down through the root element's descendants. When we write XSLT stylesheets, we also start at the root of the tree and work our way down.
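The top-down walk described above might look like the following sketch, again assuming Python's xml.dom.minidom. The child_text helper is hypothetical, and the sample contains only the three items shown in the listing (the elided items are left out).

```python
from xml.dom import minidom, Node

SOURCE = """<?xml version="1.0"?>
<postalcodes>
  <title>Most-used postal codes in November 2000</title>
  <item><city>Schenectady</city><postalcode>12304</postalcode><usage-count>2039</usage-count></item>
  <item><city>Kuala Lumpur</city><postalcode>57000</postalcode><usage-count>1983</usage-count></item>
  <item><city>London</city><postalcode>SW1P 4RG</postalcode><usage-count>1722</usage-count></item>
</postalcodes>"""


def child_text(element, tag_name):
    """Hypothetical helper: text of the first <tag_name> descendant."""
    node = element.getElementsByTagName(tag_name)[0]
    return "".join(
        c.data for c in node.childNodes if c.nodeType == Node.TEXT_NODE
    )


doc = minidom.parseString(SOURCE)
root = doc.documentElement  # start at the top of the tree

# Work down from the root: keep only the <item> element children.
items = [
    n for n in root.childNodes
    if n.nodeType == Node.ELEMENT_NODE and n.tagName == "item"
]

# Select the subtrees whose <usage-count> is greater than 500 ...
popular = [i for i in items if int(child_text(i, "usage-count")) > 500]

# ... and sort them on the first character of the <postalcode> text.
by_first_char = sorted(popular, key=lambda i: child_text(i, "postalcode")[0])

cities = [child_text(i, "city") for i in by_first_char]
print(cities)
```

In a stylesheet, the same selection and sorting are expressed declaratively rather than with explicit loops, but the starting point at the root of the tree is the same.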

WARNING: To be honest, the DOM tree built for our document is more complicated than our beautiful picture indicates. The whitespace characters in our document (carriage return/line feed, tabs, spaces, etc.) become Text nodes. Normally it's a good idea to remove this whitespace so the DOM tree won't be littered with these useless Text nodes, but I included them here to give you a sense of the XML document's structure.
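The effect of whitespace is easy to see with a small experiment. The sketch below assumes Python's xml.dom.minidom; strip_whitespace_nodes is a hypothetical helper written for this example, not part of the DOM specification.

```python
from xml.dom import minidom, Node

# Indented source: the line breaks and spaces between tags are real content.
SOURCE = """<item>
  <city>London</city>
  <postalcode>SW1P 4RG</postalcode>
</item>"""

doc = minidom.parseString(SOURCE)
item = doc.documentElement

# Whitespace between the child elements shows up as '#text' nodes.
before = [n.nodeName for n in item.childNodes]


def strip_whitespace_nodes(node):
    """Remove Text children that contain nothing but whitespace."""
    for child in list(node.childNodes):  # copy: we mutate while iterating
        if child.nodeType == Node.TEXT_NODE and child.data.strip() == "":
            node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_whitespace_nodes(child)


strip_whitespace_nodes(item)
after = [n.nodeName for n in item.childNodes]

print(before)  # ['#text', 'city', '#text', 'postalcode', '#text']
print(after)   # ['city', 'postalcode']
```

Text nodes that contain real content, such as the space inside "SW1P 4RG", are left alone because they are not whitespace-only.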

1.2.3.2. SAX

The Simple API for XML was developed by David Megginson and others on the XML-DEV mailing list. It has several important differences from DOM:

The SAX API is interactive. As a SAX parser processes your document, it sends events to your code. You don't have to wait for the parser to finish the entire document as you do with the DOM; you get events from the parser immediately. These events let you know when the parser finds the start of the document, the start of an element, some text, the end of an element, a processing instruction, the end of the document, etc.
SAX is designed to avoid the large memory footprint of DOM. In the SAX world, you're told when the parser finds things in the XML document; it's up to you to save those things. If you don't do anything to store the data found by the parser, it goes into the bit bucket.
SAX doesn't provide the hierarchical view of the document that DOM does. If you need to know a lot about the structure of an XML document and the context of a given element, SAX isn't much help. Each SAX event is stateless; that is, a SAX event won't tell you, "Here's some text for the <postalcode> element I mentioned earlier." A SAX parser only tells you, "Here's some text." If you need to know about an XML document's structure, you have to keep track of that information yourself.
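The points above can be sketched with Python's standard xml.sax module. PostalCodeHandler and its attributes are hypothetical names chosen for this example. Note that the handler has to maintain its own element stack and text buffer, because the parser does not say which element a piece of text belongs to.

```python
import xml.sax

SOURCE = b"""<postalcodes>
  <title>Most-used postal codes in November 2000</title>
  <item><city>Schenectady</city><postalcode>12304</postalcode><usage-count>2039</usage-count></item>
  <item><city>Kuala Lumpur</city><postalcode>57000</postalcode><usage-count>1983</usage-count></item>
  <item><city>London</city><postalcode>SW1P 4RG</postalcode><usage-count>1722</usage-count></item>
</postalcodes>"""


class PostalCodeHandler(xml.sax.ContentHandler):
    """Collects <postalcode> text; everything else is discarded."""

    def __init__(self):
        super().__init__()
        self.stack = []        # our own record of the open elements
        self.events = []       # log of the events the parser sent us
        self.postalcodes = []  # the only data we choose to keep
        self._buffer = []      # text seen since the last start tag

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.events.append(("start", name))
        self._buffer = []

    def characters(self, content):
        # The parser may deliver one run of text in several pieces.
        self._buffer.append(content)

    def endElement(self, name):
        if name == "postalcode":
            self.postalcodes.append("".join(self._buffer))
        self.events.append(("end", name))
        self.stack.pop()
        self._buffer = []


handler = PostalCodeHandler()
xml.sax.parseString(SOURCE, handler)
print(handler.postalcodes)
```

Nothing here builds a tree. Once the parse ends, the only data left in memory is what the handler decided to store.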

The best thing about SAX is that it is interactive. Most of the transformations currently done with XSLT take place on the server. As of this writing, most XSLT processors are based on DOM parsers. In the near future, however, we'll see XSLT processors based on SAX parsers. This means that the processor can start generating results almost as soon as the parse of the source document begins, resulting in better throughput and creating the perception of faster service. Because DOM, XPath, and XSLT all use trees to represent XML documents, DOM is more relevant to our discussions here. Nevertheless, it's useful to know how SAX parsers work, especially as SAX-based XSLT processors begin to rear their speedy little heads.
