Nuts and Bolts (Java & XML, 2nd Edition)

With the introductions behind us, let's get to it. Before heading straight into Java, though, some basic structures must be laid down. These address a fundamental understanding of the concepts in XML and how the extensible markup language works. In other words, you need an XML primer. If you are already an XML expert, skim through this chapter to make sure you're comfortable with the topics addressed. If you're completely new to XML, on the other hand, this chapter can get you ready for the rest of the book without hours, days, or weeks of study.

You can use this chapter as a glossary while you read the rest of the book. I won't spend time in future chapters explaining XML concepts, in order to deal strictly with Java and get to some more advanced concepts. So if you hit something that completely befuddles you, check this chapter for information. And if you are still a little lost, I highly recommended that this book be read with a copy of Elliotte Harold and Scott Means' excellent book XML in a Nutshell (O'Reilly) open. That will give you all the information you need on XML concepts, and then I can focus on Java ones.

Finally, I'm big on examples. I'm going to load the rest of the chapters as full of them as possible. I'd rather give you too much information than barely engage you. To get started along those lines, I'll introduce several XML and related documents in this chapter to illustrate the concepts in this primer. You might want to take the time to either type these into your editor or download them from the book's web site (http://www.newInstance.com), as they will be used in this chapter and throughout the rest of the book. It will save you time later on.

2.1. The Basics

It all begins with the XML 1.0 Recommendation, which you can read in its entirety at http://www.w3.org/TR/REC-xml. Example 2-1 shows a simple XML document that conforms to this specification. It's a portion of the XML table of contents for this book (I've only included part of it because it's long!). The complete file is included with the samples for the book, available online at http://www.oreilly.com/catalog/javaxml2 and http://www.newInstance.com. I'll use it to illustrate several important concepts.

Example 2-1. The contents.xml document

<?xml version="1.0"?>
<!DOCTYPE book SYSTEM "DTD/JavaXML.dtd">

<!-- Java and XML Contents -->
<book xmlns="http://www.oreilly.com/javaxml2"
      xmlns:ora="http://www.oreilly.com"
>
  <title ora:series="Java">Java and XML</title>

  <!-- Chapter List -->
  <contents>
    <chapter title="Introduction" number="1">
      <topic name="XML Matters" />
      <topic name="What's Important" />
      <topic name="The Essentials" />
      <topic name="What&apos;s Next?" />
    </chapter>
    <chapter title="Nuts and Bolts" number="2">
      <topic name="The Basics" />
      <topic name="Constraints" />
      <topic name="Transformations" />
      <topic name="And More..." />
      <topic name="What&apos;s Next?" />
    </chapter>
    <chapter title="SAX" number="3">
      <topic name="Getting Prepared" />
      <topic name="SAX Readers" />
      <topic name="Content Handlers" />
      <topic name="Gotcha!" />
      <topic name="What&apos;s Next?" />
    </chapter> 
    <chapter title="Advanced SAX" number="4">
      <topic name="Properties and Features" />
      <topic name="More Handlers" />
      <topic name="Filters and Writers" />
      <topic name="Even More Handlers" />
      <topic name="Gotcha!" />
      <topic name="What&apos;s Next?" />
    </chapter>
    <chapter title="DOM" number="5">
      <topic name="The Document Object Model" />
      <topic name="Serialization" />
      <topic name="Mutability" />
      <topic name="Gotcha!" />
      <topic name="What&apos;s Next?" />
    </chapter>           

    <!-- And so on... -->

  </contents>

  <ora:copyright>&OReillyCopyright;</ora:copyright>
</book>

2.1.1. XML 1.0

A lot of this specification describes what is mostly intuitive. If you've done any HTML authoring, or SGML, you're already familiar with the concept of elements (such as contents and chapter in the example) and attributes (such as title and name). In XML, there's little more than definition of how to use these items, and how a document must be structured. XML spends more time defining tricky issues like whitespace than introducing any concepts that you're not at least somewhat familiar with.

An XML document can be broken into two basic pieces: the header, which gives an XML parser and XML applications information about how to handle the document; and the content, which is the XML data itself. Although this is a fairly loose division, it helps us differentiate the instructions to applications within an XML document from the XML content itself, and is an important distinction to understand. The header is simply the XML declaration, in this format:

<?xml version="1.0"?>

The header can also include an encoding, and whether the document is a standalone document or requires other documents to be referenced for a complete understanding of its meaning:

<?xml version="1.0" encoding="UTF8" standalone="no"?>

The rest of the header is made up of items like the DOCTYPE declaration:

<!DOCTYPE Book SYSTEM "DTD/JavaXML.dtd">

In this case, I've referred to a file on my local system, in the directory DTD/ called JavaXML.dtd. Any time you use a relative or absolute file path or a URL, you want to use the SYSTEM keyword. The other option is using the PUBLIC keyword, and following it with a public identifier. This means that the W3C or another consortium has defined a standard DTD that is associated with that public identifier. As an example, take the DTD statement for XHTML 1.0:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Here, a public identifier is supplied (the funny little string starting with "-//"), followed by a system identifier (the URL). If the public identifier cannot be resolved, the system identifier is used instead.

You may also see processing instructions at the top of a file, and they are generally considered part of a document's header, rather than its content. They look like this:

<?xml-stylesheet href="XSL\JavaXML.html.xsl" type="text/xsl"?>
<?xml-stylesheet href="XSL\JavaXML.wml.xsl" type="text/xsl" 
                 media="wap"?>
<?cocoon-process type="xslt"?>

Each is considered to have a target (the first word, like xml-stylesheet or cocoon-process), and data (the rest). More often than not, the data is in the form of name-value pairs, which can really help readability. This is only a good practice, though, and not required, so don't depend on it.

Other than that, the bulk of your XML document should be content; in other words, elements, attributes, and data that you have put into it.

2.1.1.1. The root element

The root element is the highest-level element in the XML document, and must be the first opening tag and the last closing tag within the document. It provides a reference point that enables an XML parser or XML-aware application to recognize a beginning and end to an XML document. In our example, the root element is book:

<book xmlns="http://www.oreilly.com/javaxml2"
      xmlns:ora="http://www.oreilly.com"
>
    <!-- Document content -->
</book>

This tag and its matching closing tag surround all other data content within the XML document. XML specifies that there may be only one root element in a document. In other words, the root element must enclose all other elements within the document. Aside from this requirement, a root element does not differ from any other XML element. It's important to understand this, because XML documents can reference and include other XML documents. In these cases, the root element of the referenced document becomes an enclosed element in the referring document, and must be handled normally by an XML parser. Defining root elements as standard XML elements without special properties or behavior allows document inclusion to work seamlessly.

2.1.1.2. Elements

So far I have glossed over defining an actual element. Let's take an in-depth look at elements, which are represented by arbitrary names and must be enclosed in angle brackets. There are several different variations of elements in the sample document, as shown here:

 <!-- Standard element opening tag -->
 <contents>

  <!-- Standard element with attribute -->
  <chapter title="Nuts and Bolts" number="2">

  <!-- Element with textual data -->
  <title ora:series="Java">Java and XML</title>

  <!-- Empty element -->
  <sectionBreak />

 <!-- Standard element closing tag -->
 </contents>

The first rule in creating elements is that their names must start with a letter or underscore, and then may contain any amount of letters, numbers, underscores, hyphens, or periods. They may not contain embedded spaces:

<!-- Embedded spaces are not allowed -->
<my element name>

XML element names are also case-sensitive. Generally, using the same rules that govern Java variable naming will result in sound XML element naming. Using an element named tcbo to represent Telecommunications Business Object is not a good idea because it is cryptic, while an overly verbose tag name like beginningOfNewChapter just clutters up a document. Keep in mind that your XML documents will probably be seen by other developers and content authors, so clear documentation through good naming is essential.

Every opened element must in turn be closed. There are no exceptions to this rule as there are in many other markup languages, like HTML. An ending element tag consists of the forward slash and then the element name: </content>. Between an opening and closing tag, there can be any number of additional elements or textual data. However, you cannot mix the order of nested tags: the first opened element must always be the last closed element. If any of the rules for XML syntax are not followed in an XML document, the document is not well-formed. A well-formed document is one in which all XML syntax rules are followed, and all elements and attributes are correctly positioned. However, a well-formed document is not necessarily valid, which means that it follows the constraints set upon a document by its DTD or schema. There is a significant difference between a well-formed document and a valid one; the rules I discuss in this section ensure that your document is well-formed, while the rules discussed in the constraints section allow your document to be valid.

As an example of a document that is not well-formed, consider this XML fragment:

<tag1>
 <tag2>
</tag1>
 </tag2>

The order of nesting of tags is incorrect, as the opened <tag2> is not followed by a closing </tag2> within the surrounding tag1 element. However, if these syntax errors are corrected, there is still no guarantee that the document will be valid.

While this example of a document that is not well-formed may seem trivial, remember that this would be acceptable HTML, and commonly occurs in large tables within an HTML document. In other words, HTML and many other markup languages do not require well-formed XML documents. XML's strict adherence to ordering and nesting rules allows data to be parsed and handled much more quickly than when using markup languages without these constraints.

The last rule I'll look at is the case of empty elements. I already said that XML tags must always be paired; an opening tag and a closing tag constitute a complete XML element. There are cases where an element is used purely by itself, like a flag stating a chapter is incomplete, or where an element has attributes but no textual data, like an image declaration in HTML. These would have to be represented as:

<chapterIncomplete></chapterIncomplete>
<img src="/images/xml.gif"></img>

This is obviously a bit silly, and adds clutter to what can often be very large XML documents. The XML specification provides a means to signify both an opening and closing element tag within one element:

<chapterIncomplete />
<img src="/images/xml.gif" />

What's with the Space Before Your End-Slash, Brett?

Well, let me tell you. I've had the unfortunate pleasure of working with Java and XML since late 1998, when things were rough, at best. And some web browsers at that time (and some today, to be honest) would only accept XHTML (HTML that is well-formed) in very specific formats. Most notably, tags like <br> that are never closed in HTML must be closed in XHTML, resulting in <br/>. Some of these browsers would completely ignore a tag like this; however, oddly enough, they would happily process <br /> (note the space before the end-slash). I got used to making my XML not only well-formed, but consumable by these browsers. I've never had a good reason to change these habits, so you get to see them in action here.

This nicely solves the problem of unnecessary clutter, and still follows the rule that every XML element must have a matching end tag; it simply consolidates both start and end tag into a single tag.

2.1.1.3. Attributes

In addition to text contained within an element's tags, an element can also have attributes. Attributes are included with their respective values within the element's opening declaration (which can also be its closing declaration!). For example, in the chapter tag, the title of the chapter was part of what was noted in an attribute:

<chapter title="Advanced SAX" number="4">
  <topic name="Properties and Features" />
  <topic name="More Handlers" />
  <topic name="Filters and Writers" />
  <topic name="Even More Handlers" />
  <topic name="Gotcha!" />
  <topic name="What&apos;s Next?" />
</chapter>

In this example, title is the attribute name; the value is the title of the chapter, "Advanced SAX." Attribute names must follow the same rules as XML element names, and attribute values must be within quotation marks. Although both single and double quotes are allowed, double quotes are a widely used standard and result in XML documents that model Java programming practices. Additionally, single and double quotation marks may be used in attribute values; surrounding the value in double quotes allows single quotes to be used as part of the value, and surrounding the value in single quotes allows double quotes to be used as part of the value. This is not good practice, though, as XML parsers and processors often uniformly convert the quotes around an attribute's value to all double (or all single) quotes, possibly introducing unexpected results.

In addition to learning how to use attributes, there is an issue of when to use attributes. Because XML allows such a variety of data formatting, it is rare that an attribute cannot be represented by an element, or that an element could not easily be converted to an attribute. Although there's no specification or widely accepted standard for determining when to use an attribute and when to use an element, there is a good rule of thumb: use elements for multiple-valued data and attributes for single-valued data. If data can have multiple values, or is very lengthy, the data most likely belongs in an element. It can then be treated primarily as textual data, and is easily searchable and usable. Examples are the description of a book's chapters, or URLs detailing related links from a site. However, if the data is primarily represented as a single value, it is best represented by an attribute. A good candidate for an attribute is the section of a chapter; while the section item itself might be an element and have its own title, the grouping of chapters within a section could be easily represented by a section attribute within the chapter element. This attribute would allow easy grouping and indexing of chapters, but would never be directly displayed to the user. Another good example of a piece of data that could be represented in XML as an attribute is if a particular table or chair is on layaway. This instruction could let an XML application used to generate a brochure or flier know not to include items on layaway in current stock; obviously this is a true or false value, and has only a singular value at any time. Again, the application client would never directly see this information, but the data would be used in processing and handling the XML document. If after all of this analysis you are still unsure, you can always play it safe and use an element.

You may have already come up with alternate ways to represent these various examples, using different approaches. For example, rather than using a title attribute, it might make sense to nest title elements within a chapter element. Perhaps an empty tag, <layaway />, might be more useful to mark furniture on layaway. In XML, there is rarely only one way to perform data representation, and often several good ways to accomplish the same task. Most often the application and use of the data dictates what makes the most sense. Rather than tell you how to write XML, which would be difficult, I show you how to use XML so you gain insight into how different data formats can be handled and used. This gives you the knowledge to make your own decisions about formatting XML documents.

2.1.1.4. Entity references and constants

One item I have not discussed is escaping characters, or referring to other constant type data values. For example, a common way to represent a path to an installation directory is <path-to-Cocoon>. Here, the user would replace the text with the appropriate choice of installation directory. In this example, the chapter that discusses web applications must give some details on installing and using Apache Cocoon, and might need to represent this data within an element:

<topic>
 <heading>Installing Cocoon</heading>
 <content>
  Locate the Cocoon.properties file in the <path-to-Cocoon>/bin 
  directory.
 </content>
</topic>

The problem is that XML parsers attempt to handle this data as an XML tag, and then generate an error because there is no closing tag. This is a common problem, as any use of angle brackets results in this behavior. Entity references provide a way to overcome this problem. An entity reference is a special data type in XML used to refer to another piece of data. The entity reference consists of a unique name, preceded by an ampersand and followed by a semicolon: &[entity name];. When an XML parser sees an entity reference, the specified substitution value is inserted and no processing of that value occurs. XML defines five entities to address the problem discussed in the example: < for the less-than bracket, > for the greater-than bracket, & for the ampersand sign itself, " for a double quotation mark, and ' for a single quotation mark or apostrophe. Using these special references, you can accurately represent the installation directory reference as:

<topic>
 <heading>Installing Cocoon</heading>
 <content>
  Locate the Cocoon.properties file in the 
  &lt;path-to-Cocoon&gt;/bin directory.
 </content>
</topic>

Once this document is parsed, the data is interpreted as "<path-to-Cocoon>" and the document is still considered well-formed.

Also be aware that entity references are user-definable. This allows a sort of shortcut markup; in the XML example I have been walking through, I reference an external shared copyright text. Because the copyright is used for multiple O'Reilly books, I don't want to include the text within this XML document; however, if the copyright is changed, the XML document should reflect the changes. You may notice that the syntax used in the XML document looks like the predefined XML entity references:

<ora:copyright>&OReillyCopyright;</ora:copyright>

Although you won't see how the XML parser is told what to reference when it sees &OReillyCopyright; until the section on DTDs, you should see that there are more uses for entity references than just representing difficult or unusual characters within data.

2.1.1.5. Unparsed data

The last XML construct to look at is the CDATA section marker. A CDATA section is used when a significant amount of data should be passed on to the calling application without any XML parsing. It is used when an unusually large number of characters would have to be escaped using entity references, or when spacing must be preserved. In an XML document, a CDATA section looks like this:

<unparsed-data>
  <![CDATA[Diagram:
       <Step 1>Install Cocoon to "/usr/lib/cocoon"
       <Step 2>Locate the correct properties file.
       <Step 3>Download Ant from "http://jakarta.apache.org"
                                  -----> Use CVS for this <----
  ]]>
</unparsed-data>

In this example, the information within the CDATA section does not have to use entity references or other mechanisms to alert the parser that reserved characters are being used; instead, the XML parser passes them unchanged to the wrapping program or application.

At this point, you have seen the major components of XML documents. Although each has only been briefly described, this should give you enough information to recognize XML tags when you see them and know their general purpose. With existing resources like O'Reilly's XML in a Nutshell by your side, you are ready to look at some of the more advanced XML specifications.

2.1.2. Namespaces

Although I will not delve too deeply into XML namespaces here, note the use of a namespace in the root element of Example 2-1. An XML namespace is a means of associating one or more elements in an XML document with a particular URI. This effectively means that the element is identified by both its name and its namespace URI. In this XML example, it may be necessary later to include portions of other O'Reilly books. Because each of these books may also have Chapter, Heading, or Topic elements, the document must be designed and constructed in a way to avoid namespace collision problems with other documents. The XML namespaces specification nicely solves this problem. Because the XML document represents a specific book, and no other XML document should represent the same book, using a namespace associated with a URI like http://www.oreilly.com/javaxml2 can create a unique namespace. The namespace specification requires that a unique URI be associated with a prefix to distinguish the elements in the namespace from elements in other namespaces. A URL is recommended, and supplied here:

<book xmlns="http://www.oreilly.com/javaxml2"
      xmlns:ora="http://www.oreilly.com"
>

In fact, I've defined two namespaces. The first is considered the default namespace, because no prefix is supplied. Any element without a prefix is associated with this namespace. As a result, all of the elements in the XML document except the copyright element, prefixed with ora, are in this default namespace. The second defines a prefix, which allows the tag <ora:copyright> to be associated with this second namespace.

A final interesting (and somewhat confusing) point: XML Schema, which I will talk about more in a later section, requires the schema of an XML document to be specified in a manner that looks very similar to a set of namespace declarations; see Example 2-2.

Example 2-2. Referencing an XML Schema

<?xml version="1.0"?>
<addressBook xmlns:xsi="http://www.w3.org/1999/XMLSchema/instance"
             xmlns="http://www.oreilly.com/catalog/javaxml"
             xsi:schemaLocation="http://www.oreilly.com/catalog/javaxml
                                 mySchema.xsd"
>
  <person>
    <name>
      <firstName>Brett</firstName>
      <lastName>McLaughlin</lastName>
    </name>
    <email>brettmclaughlin@earthlink.net</email>
  </person>
  <person>
    <name>
      <firstName>Eddie</firstName>
      <lastName>Balucci</lastName>
    </name>
    <email>eddieb@freeworld.net</email>
  </person>
</addressBook>

Several things happen here, and it is important to understand them all. First, the XML Schema instance namespace is defined and associated with a URL. This namespace, abbreviated xsi, is used for specifying information in XML documents about a schema, exactly as is being done here. Thus, the first line makes the elements in the XML Schema instance available to the document for use. The next line defines the namespace for the XML document itself. Because the document does not use an explicit namespace, like the one associated with http://www.oreilly.com/javaxml2 in earlier examples, the default namespace is declared. This means that all elements without an explicit namespace and associated prefix (all of them, in this example) will be associated with this default namespace.

With both the document and XML Schema instance namespaces defined like this, we can then actually do what we want, which is to associate a schema with this document. The schemaLocation attribute, which belongs to the XML Schema instance namespace, is used to accomplish this. I've prefaced this attribute with its namespace (xsi), which was just defined. The argument to this attribute is actually two URIs: the first specifies the namespace associated with a schema, and the second the URI of the schema to refer to. In the example, this results in the first URI being the default namespace just declared, and the second a file on the local filesystem called mySchema.xsd. Like any other XML attribute, the entire pair is enclosed in a single set of quotation marks. And as simple as that, you have referenced a schema in your XML document!

Seriously, it's not simple, and is to date one of the most misunderstood portions of using namespaces and XML Schema. I look more at the mechanics used here as we continue. For now, keep in mind how namespaces allow elements from various groupings to be used, yet remain identified as a part of them specific grouping.

Chapter 2. Nuts and Bolts

Contents:

Where Did All the Chapters Go?