Chapter 17. Programming Models
This chapter briefly explains the most popular
programming models for parsing and manipulating XML data in use
today. XML processing includes a diverse set
of tools, which require different approaches but offer distinct
advantages and disadvantages.
TIP:
XML processors of all kinds are available in a wide variety of
languages, including C, C#, C++, COBOL, Haskell, Java, JavaScript
(ECMAScript/JScript), Pascal, Perl, Python, Ruby, Smalltalk, Tcl, and
Visual Basic. If you can't find XML support built
into your programming environment, a quick search will likely locate
a library. XML.com maintains a list of XML resources at
http://www.xml.com/resourceguide/ that may be a good place to start.
17.1. Common XML Processing Models
XML's structured and labeled text can be
processed by developers in several ways. Programs can look at XML
as text, as a stream of events, as a tree, or as a serialization of
some other structure. Tools supporting all of these options are
widely available.
17.1.1. Treating XML as Text
At their
foundation, XML documents are text. The content and markup are both
represented as text, and text-editing tools can be extremely useful
for XML document inspection, creation, and modification.
XML's textual foundations make it possible for
developers to work with XML directly, using XML-specific tools only
when they choose.
Despite this textual nature, however, XML presents some serious
limitations for programs that attempt to process XML documents as
text documents. It is possible to process extremely simple XML
documents reliably using basic textual tools like
regular expressions, but this
becomes much more difficult as features such as attribute defaulting,
entity processing, and namespaces are added to documents. Using these
features is extremely difficult when treating a document purely as
text.
Textual tools are a key part of the
XML toolset, however. Many developers use text editors such as
vi, Emacs, Notepad, WordPad, BBEdit, and
UltraEdit to create or modify XML documents. Regular expressions
-- in environments such as sed, grep, Perl, and Python --
can be used for search and replace or for tweaking documents prior to
XML parsing or XSLT processing. These tools can also be very useful
for searching and querying the information in XML documents, even
without an understanding of the surrounding structure.
Textual tools may also be applied to the results of an XML
parse. Regular expressions and similar text-processing tools can then
work on the document's content after its XML-specific nature
has already been resolved. The W3C's XML Schema, for
instance, includes regular-expression matching as one mechanism for
validating data types, as discussed in Chapter 16.
A smart search and replace or spell checker might process only the
contents of elements (and perhaps attributes), not the markup that
defines the structures.
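As a rough illustration of applying a regular expression only to parsed content, the following Python sketch walks a parsed tree and substitutes text in character data while leaving element names alone. It uses the standard library's xml.etree.ElementTree; the helper name replace_in_text and the sample document are assumptions made up for this example.

```python
import re
import xml.etree.ElementTree as ET


def replace_in_text(root, pattern, replacement):
    """Apply a regex substitution to character data only, never to markup.

    replace_in_text is a hypothetical helper written for this sketch.
    """
    regex = re.compile(pattern)
    for elem in root.iter():
        # .text holds content before the first child element;
        # .tail holds content that follows the element's end tag.
        if elem.text:
            elem.text = regex.sub(replacement, elem.text)
        if elem.tail:
            elem.tail = regex.sub(replacement, elem.tail)
    return root


# The word "name" appears both as an element name and as content.
doc = ET.fromstring("<note><name>name of the game</name></note>")
replace_in_text(doc, r"\bname\b", "title")
result = ET.tostring(doc, encoding="unicode")
print(result)  # <note><name>title of the game</name></note>
```

The same pattern run over the raw document text would also have rewritten the name start and end tags.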
Text-based processing can be performed in conjunction with other XML
processing. Parsing and then reserializing XML documents after other
processing has taken place doesn't always produce
the desired results. XSLT, for instance, will remove entity
references and replace them with entity content. Preserving entities
requires replacing them in the original document with unique
placeholders, and then replacing the placeholder as it appears in the
result. With regular expressions, this is quite easy to do.
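The placeholder technique just described might look something like the following Python sketch. The placeholder format [[ENTITY:name]] is an assumption chosen for this example and must not otherwise occur in the document. The sketch leaves the five predefined entities alone and, for brevity, makes no attempt to skip CDATA sections or comments. ElementTree stands in for whatever parse-and-serialize step sits in the middle.

```python
import re
import xml.etree.ElementTree as ET

# Matches general entity references such as &copy; (simplified name syntax).
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")
# Hypothetical placeholder format; assumed not to occur in real content.
PLACEHOLDER = re.compile(r"\[\[ENTITY:([A-Za-z_][\w.-]*)\]\]")
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}


def protect_entities(xml_text):
    """Swap entity references for placeholders before parsing."""
    def swap(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)  # the parser must still see these
        return "[[ENTITY:%s]]" % name
    return ENTITY_REF.sub(swap, xml_text)


def restore_entities(result_text):
    """Turn placeholders back into entity references after serializing."""
    return PLACEHOLDER.sub(r"&\1;", result_text)


source = "<p>&copy; 2002 O'Reilly &amp; Associates</p>"
protected = protect_entities(source)
# Stand-in for the real processing step: parse, then reserialize.
serialized = ET.tostring(ET.fromstring(protected), encoding="unicode")
restored = restore_entities(serialized)
print(restored)
```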
Developers may also need to replace particular characters with
references to images; this approach can be very useful where an
obscure or nonstandard glyph is needed in XHTML.
WARNING:
XML's dependence on Unicode means that developers
need to be careful about the text-processing tools they choose. Many
development environments have been upgraded to support Unicode, but
there are still tools available that don't. Before
using text-processing tools on the results of an XML parse, make sure
they support Unicode. Text-processing tools being applied to raw XML
documents must support the character encoding used for the document.
17.1.2. Treating XML as Events
As an XML parser reads a document, it moves from the
beginning of the document to the
end. It may pause to retrieve external resources--for a DTD or
an external entity, for instance--but it builds an
understanding of the document as it moves along. Enforcing
well-formedness and validity constraints and applying namespaces
requires keeping track of context; applying attribute defaults and
entities requires keeping a list of appropriate content to insert;
but the end result is a complete
"reading" of the XML document.
Event-based parsers report this reading as it happens, in a stream of
events representing the information in the document. The
"events" are, for example, the
start of an element, the content of an element, and the end of an
element. For example, given this document:
<name><given>Keith</given><family>Johnson</family></name>
an event-based parser might report events such as these:
startElement:name
startElement:given
content:Keith
endElement:given
startElement:family
content:Johnson
endElement:family
endElement:name
The list and structure of events can become much more complex as
features such as namespaces, attributes, whitespace between
elements, comments, processing instructions, and entities are added,
but the basic mechanism is quite simple and generally very efficient.
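The event listing above can be reproduced, more or less, with Python's standard xml.sax module. This is only a sketch: the EventLogger class is a name invented here, and the labels simply mirror the listing rather than any official event names. Because a parser is allowed to deliver one run of text in several pieces, the handler merges adjacent content events.

```python
import io
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each SAX callback as a line of text (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append("startElement:" + name)

    def characters(self, content):
        # Text may arrive in chunks; merge consecutive chunks into one event.
        if self.events and self.events[-1].startswith("content:"):
            self.events[-1] += content
        else:
            self.events.append("content:" + content)

    def endElement(self, name):
        self.events.append("endElement:" + name)


document = b"<name><given>Keith</given><family>Johnson</family></name>"
logger = EventLogger()
xml.sax.parse(io.BytesIO(document), logger)
print("\n".join(logger.events))
```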
Event-based parsers only have to keep track of a limited amount of
information. They need to understand the contents of DTDs (and
possibly schemas), if the documents use them, and they need to
maintain context stacks for element names and namespace declarations.
They don't need to build a complete record of the
document as they parse it, which minimizes the amount of memory
needed for the parse.
Event-based parsers require the consumer of the events to do a lot
more work, however. Processing events typically means the creation of
a state machine, i.e., code that understands current context and can
route the information in the events to the proper consumer. Because
events occur as the document is read, applications must be prepared
to discard results should a fatal error occur partway through the
document. Applications can't depend on information
that occurs later in a document to interpret the current event,
either, making it hard to use some kinds of XPath expressions, for instance, in
an event-based environment. These factors can make it difficult to
work directly with event-based parsers.
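To make the extra work concrete, here is a minimal Python sketch of such a consumer built on the standard xml.sax module. It keeps a stack of open element names as its context, routes text according to that context, and throws away partial results if the parser reports a fatal error. The class and function names, and the sample documents, are assumptions made up for this example.

```python
import io
import xml.sax


class NameHandler(xml.sax.ContentHandler):
    """Collects given and family names using a context stack."""

    def __init__(self):
        super().__init__()
        self.path = []    # stack of currently open element names
        self.buffer = []  # text chunks seen in the current element
        self.people = []  # one dict per <name> element

    def startElement(self, name, attrs):
        self.path.append(name)
        self.buffer = []
        if name == "name":
            self.people.append({})

    def characters(self, content):
        self.buffer.append(content)

    def endElement(self, name):
        # Route the text by context: only <given> and <family>
        # directly inside <name> are of interest.
        if self.path[-2:] in (["name", "given"], ["name", "family"]):
            self.people[-1][name] = "".join(self.buffer)
        self.path.pop()
        self.buffer = []


def read_names(data):
    """Return the collected names, or nothing if the parse fails."""
    handler = NameHandler()
    try:
        xml.sax.parse(io.BytesIO(data), handler)
    except xml.sax.SAXParseException:
        return []  # discard partial results after a fatal error
    return handler.people


good = (b"<people><name><given>Keith</given><family>Johnson</family></name>"
        b"<given>stray</given></people>")
bad = b"<name><given>Keith</family></name>"
print(read_names(good))  # [{'given': 'Keith', 'family': 'Johnson'}]
print(read_names(bad))   # []
```

The stray given element is ignored because its parent is not a name element, which is exactly the kind of decision the context stack exists to support.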
Despite the potential difficulty, event-based parsers are very useful
for a wide variety of tasks.
Filters can process and modify events
before passing them to another processor, efficiently performing a
wide range of transformations. Filters can be stacked, providing a
relatively simple means of building XML processing pipelines, where
the information from one processor flows directly into another.
Applications that want to feed information directly from XML
documents into their own internal structures may find events to be
the most efficient means of doing that. Even parsers that report XML
documents as complete trees, as described in the next section,
typically build those trees from a stream of events.
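A stack of filters can be sketched in Python with xml.sax.saxutils.XMLFilterBase, which forwards every event to the next handler unless a method is overridden. The two filter classes below are invented for this example. One renames an element and the other uppercases character data, and XMLGenerator at the end of the pipeline turns the events back into text.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class RenameFilter(XMLFilterBase):
    """Renames one element; all other events pass through unchanged."""

    def __init__(self, parent, old, new):
        super().__init__(parent)
        self.old, self.new = old, new

    def _map(self, name):
        return self.new if name == self.old else name

    def startElement(self, name, attrs):
        super().startElement(self._map(name), attrs)

    def endElement(self, name):
        super().endElement(self._map(name))


class UppercaseFilter(XMLFilterBase):
    """Uppercases character data on its way down the pipeline."""

    def characters(self, content):
        super().characters(content.upper())


document = b"<name><given>Keith</given><family>Johnson</family></name>"
out = io.StringIO()

# parser -> RenameFilter -> UppercaseFilter -> XMLGenerator
pipeline = UppercaseFilter(
    RenameFilter(xml.sax.make_parser(), "family", "surname"))
pipeline.setContentHandler(XMLGenerator(out, encoding="utf-8"))
pipeline.parse(io.BytesIO(document))
print(out.getvalue())
```

Neither filter builds a tree or knows about the other; each sees only the events handed to it.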
TIP:
The Simple API for XML (SAX), described in Chapter 19 and Chapter 25, is the most
commonly used event-based API. SAX2, the current version of SAX, is
hosted at http://www.saxproject.org. Expat, a widely used XML parser written in C,
also uses an event-based API. For more information on the Expat
parser and its API, see http://www.jclark.com/xml/expat.html.
17.1.3. Treating XML as Tree Models
XML documents,
because of the requirements for
well-formedness, describe tree structures. Documents typically
contain an element that then contains text, attributes, and other
elements, and these may contain elements, text, and attributes, and
so on. Declarations, comments, and processing instructions enrich the
mix, but all basically hold positions in the overall tree.
There are a wide variety of tree models for XML documents. XPath
(described in Chapter 9), used in XSLT
transformations, has a slightly different set of expectations than
does the Document Object Model (DOM) API, which is also different
from the XML Information Set (Infoset), another W3C project. XML
Schema (described in Chapter 16 and Chapter 21) defines a Post-Schema Validation Infoset
(PSVI), which has more information in it (derived from the XML
Schema) than any of the others.
Developers who want to manipulate documents from their programs
typically use APIs that provide access to an object model
representing the XML document. Tree-based APIs typically present a
model of an entire document to an application once parsing has
successfully concluded. Applications don't have to
worry about figuring out context or dealing with rollback when an
error is encountered, since the tree model and parsing already
address those issues. Rather than following a stream of events, an
application can just navigate a tree to find the desired pieces of a
document. Browsers and editors can present or modify the tree in
conformance with user or script requests, using the tree as a
persistent reference to the current content of the document.
Working with a tree model of a document isn't very
different conceptually from working with a document as text. The
entire document is always available, and moving around well-formed
portions of a document or modifying them is fairly easy. The complete
set of context for any given part of the document is always
available. Developers can use XPath expressions to locate content and
make decisions based on content anywhere in the document where APIs
support XPath. (DOM Level 3 adds formal support for XPath, and
various implementations provide their own support.)
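As a small illustration of navigating a tree with path expressions, the sketch below uses Python's xml.etree.ElementTree rather than the DOM itself. ElementTree supports only a limited subset of XPath, and the sample data and id values are made up for this example.

```python
import xml.etree.ElementTree as ET

# Illustrative data only.
tree = ET.fromstring(
    "<people>"
    "<name id='p1'><given>Keith</given><family>Johnson</family></name>"
    "<name id='p2'><given>Ada</given><family>Lovelace</family></name>"
    "</people>"
)

# Find every <family> element anywhere below the root.
families = [elem.text for elem in tree.findall(".//family")]

# Select a <given> element based on a sibling that appears later in
# the document, something a streaming consumer could not do directly.
given = tree.find("./name[family='Lovelace']/given").text

# The tree stays available for modification after the parse.
tree.find("./name[@id='p1']/given").text = "K."

print(families)  # ['Johnson', 'Lovelace']
print(given)     # Ada
```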
Tree models of documents have a few drawbacks. They can take up large
chunks of memory, typically multiplying the original
document's size. Navigating documents can require
additional processing after the parse, as developers have more
options available to them. (Tree models don't impose
the same kinds of discipline as event-based processing.) Both of
these issues can make it difficult to scale and share applications
that rely on tree models, though they may still be appropriate where
small numbers of documents or small documents are being used.
TIP:
The Document Object Model (DOM), described in Chapter 18 and Chapter 24, is the most
common tree-based API. JDOM (http://jdom.org/) and DOM4J (http://dom4j.org/) are Java-only alternatives.
17.1.4. Transformations
Another facility available to the XML
programmer is the XML transformation library. The
Extensible Stylesheet Language Transformations (XSLT) language,
covered in Chapter 8, is the most popular tool
currently available for transforming XML to HTML, XML, or any other
text format that XSLT can produce. In some cases, using
a transformation to perform pre- or post-processing on XML data when
processing it with either DOM or SAX might be simpler or more
efficient. For instance, XSLT could be used as a preprocessor for a
screen-scraping application that starts from XHTML documents. A
script could extract the meaningful features from the XHTML document
and pour them into an application-specific XML format.
Transformations may be used by themselves, in browsers, or at the
command line, but many XSLT implementations and other transformation
tools offer SAX or DOM interfaces, simplifying the task of using them
to build pipelines.
17.1.6. Standards and Extensions
The SAX
and DOM specifications, along with the various core XML
specifications, provide a foundation for XML processing.
Implementations of these standards, especially implementations of the
DOM, sometimes vary from the specification. Some extensions are
themselves formally specified--Scalable Vector Graphics (SVG),
for instance, specifies extensions to the DOM that are specific to
working with SVG. Others are simply tacked on, adding
functionality that a programmer or vendor felt was important but
wasn't in the original specification. The multiple
levels and modules of the DOM have also led to developers claiming
support for the DOM, but actually supporting particular subsets (or
extensions) of the available specifications.
Porting standards also leads to variations. SAX was developed for
Java, and the core SAX project only defines a Java API. The DOM uses
Interface Definition Language (IDL) to define its API, but different
implementations have interpreted the IDL slightly differently. SAX2
and the DOM are somewhat portable, but moving between environments
may require some unlearning and relearning.
Some environments also offer libraries well outside the SAX and DOM
interfaces. Perl and Python both offer libraries that combine event
and tree processing--for instance, permitting applications to
work on partial trees rather than SAX events or full DOM trees.
Microsoft .NET's XmlReader offers similarly flexible
processing. These approaches do not make moving between environments
easy, but they can be very useful.
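One example of this hybrid style in Python's standard library is xml.etree.ElementTree.iterparse, which reports events while building small subtrees that can be discarded once they have been used. It is offered here only as an illustration and is not necessarily one of the libraries the text has in mind. The function name family_names and the sample data are assumptions for this sketch.

```python
import io
import xml.etree.ElementTree as ET


def family_names(stream):
    """Yield each family name, one <name> subtree at a time."""
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "name":
            # The <name> subtree is complete here, so it can be
            # queried like any small tree.
            yield elem.findtext("family")
            # Drop its children so memory does not grow with the document.
            elem.clear()


data = io.BytesIO(
    b"<people>"
    b"<name><given>Keith</given><family>Johnson</family></name>"
    b"<name><given>Ada</given><family>Lovelace</family></name>"
    b"</people>"
)
names = list(family_names(data))
print(names)  # ['Johnson', 'Lovelace']
```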
Copyright © 2002 O'Reilly & Associates. All rights reserved.