We'll first look at how to turn SAX into generic
DOM (and DOM-like) data structures. If you're
working with such data structures, you may find it's
advantageous to build them using SAX.
With SAX, you can easily filter out data you
don't need, so you never pay the cost of
storing it.
Afterward we'll look briefly at some of the
concerns associated with working with data structures
that are more specialized to your application.
4.4.4. Turning SAX Events into Custom Data Structures
If your application data structure or interchange
syntax is already defined, you may not be able to unmarshal
it using software based on the numerous schema-oriented tools.
However, lots of software uses SAX to do this efficiently.
Once you understand how SAX models data in XML documents, you
can treat unmarshaling much like any other parsing problem.
It's closely associated with marshaling your data structures
to XML.
Here we'll look at some of the issues you may want to consider
when transforming XML into your data structures.
You may find that some individual data items, such as integers
and dates, use the low-level encoding rules that are specified in Part 2
of the W3C XML Schema specification (http://www.w3.org/TR/xmlschema-2/).
Those encodings are low-level policy decisions, and they're
conceptually independent of the rest of the W3C Schema;
you can use them even if you don't buy the W3C approach to
those schemas. Some other schema systems, such as
RELAX NG, incorporate those low-level encoding policies
without adopting more problematic parts of the W3C XML Schema specification.
Your application might likewise want to use these policies.
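As a minimal sketch of reusing those low-level encodings without a schema validator, the following helper converts element or attribute text into Java values. The class and method names are hypothetical, and only the timezone-free form of an xs:date value is handled; a real application would need to decide how much of each lexical space it accepts.

```java
import java.math.BigInteger;
import java.time.LocalDate;

// Hypothetical helper (not part of SAX): converts text that uses
// XML Schema Part 2 lexical forms into Java values.
final class SchemaValues {
    private SchemaValues() { }

    // xs:integer has no size limit, so BigInteger is the safe target type.
    // An optional leading '+' or '-' sign is accepted.
    static BigInteger parseInteger(String text) {
        return new BigInteger(text.trim());
    }

    // Assumption: only the timezone-free form (such as 2002-01-31) is
    // supported; a value with a timezone suffix is rejected with a
    // DateTimeParseException.
    static LocalDate parseDate(String text) {
        return LocalDate.parse(text.trim());
    }
}
```

The calls to trim() matter because character data delivered by SAX often carries surrounding whitespace from pretty-printed documents.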
One basic high-level encoding
issue is how closely the XML structures
and application structures should match. For example, an element
will be easier to unmarshal by mapping its attributes
(or child elements) directly to properties of a single
application object rather than by mapping them to properties
of several different objects.
The latter design could be much more appropriate
for many purposes, but it is more complex, and
its unmarshaling code needs to track more state.
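The simple case, where one element maps directly to one application object, might look like the sketch below. The Customer class and the element and attribute names are invented for illustration, and the parser is left in its default non-namespace-aware mode, so names are matched against qName.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical application object.
final class Customer {
    final String id;
    final String name;

    Customer(String id, String name) {
        this.id = id;
        this.name = name;
    }
}

// Each <customer> element becomes exactly one Customer, so the handler
// needs no parsing state beyond the result list.
final class CustomerHandler extends DefaultHandler {
    final List<Customer> customers = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if ("customer".equals(qName)) {
            // getValue returns null when the attribute is absent.
            customers.add(new Customer(atts.getValue("id"),
                                       atts.getValue("name")));
        }
    }

    // Convenience wrapper for the sketch; checked exceptions are rethrown
    // unchecked to keep callers short.
    static List<Customer> parse(String xml) {
        CustomerHandler handler = new CustomerHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return handler.customers;
    }
}
```

If the same attributes had to be spread across several objects, the handler would have to remember which objects were still incomplete, which is the extra state the text warns about.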
Regularity of the various structures is another issue.
It's usually less work to handle regular structures,
since it's easy to create general methods and reuse them.
Bugs are less frequent and more easily found than when
every transformation involves yet another special case.
You'll need to figure out how much state you need
to track and what techniques you will use.
You might be able to use extremely simple parsing
state machines; one of these is shown later,
in Example 6-2.
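To give a rough idea of what such a state machine can look like (this is an illustrative sketch, not the example referred to above), the handler below uses three states to collect the text of each item element inside an order element. The element names, class name, and states are all assumptions made for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers the text of every <item> in an <order>.
final class ItemHandler extends DefaultHandler {
    private enum State { OUTSIDE, IN_ORDER, IN_ITEM }

    private State state = State.OUTSIDE;
    private final StringBuilder text = new StringBuilder();
    final List<String> items = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        switch (state) {
            case OUTSIDE:
                if ("order".equals(qName)) state = State.IN_ORDER;
                break;
            case IN_ORDER:
                // Other children of <order> are simply ignored.
                if ("item".equals(qName)) {
                    text.setLength(0);
                    state = State.IN_ITEM;
                }
                break;
            case IN_ITEM:
                throw new SAXException("unexpected <" + qName + "> in <item>");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // One run of text may arrive in several calls, so accumulate it.
        if (state == State.IN_ITEM) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (state == State.IN_ITEM) {
            // Nested elements are rejected above, so this ends the <item>.
            items.add(text.toString());
            state = State.IN_ORDER;
        } else if (state == State.IN_ORDER && "order".equals(qName)) {
            state = State.OUTSIDE;
        }
    }

    static List<String> parse(String xml) {
        ItemHandler handler = new ItemHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return handler.items;
    }
}
```

A single state variable is enough here only because the structure is flat; deeper or recursive structures usually call for a stack.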
Sometimes it might be easier to unmarshal fragments into an
intermediate form (as in the DOM subtrees example earlier),
and map that form to your application structure before
discarding it.
Often some sort of recursive-descent parsing algorithm
that explicitly tracks the state of your parsing activities
will be useful.
It will often be helpful to keep a stack of pending elements
and attributes, as shown later
(in Example 5-1).
But since the XML structures might not map directly
to your application structures, you might also need to stack
the objects that are in various stages of being unmarshaled.
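A stack of partially built objects could be sketched as follows. The Folder class and the folder and file element names are hypothetical; the point is only that each open element's object stays on the stack until its end tag arrives, so children can be attached to the right parent.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical application object: a folder holding files and subfolders.
final class Folder {
    final String name;
    final List<Folder> folders = new ArrayList<>();
    final List<String> files = new ArrayList<>();

    Folder(String name) {
        this.name = name;
    }
}

final class FolderHandler extends DefaultHandler {
    // Folders whose end tags have not been seen yet, innermost first.
    private final Deque<Folder> pending = new ArrayDeque<>();
    Folder root;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if ("folder".equals(qName)) {
            Folder folder = new Folder(atts.getValue("name"));
            if (pending.isEmpty()) {
                root = folder;
            } else {
                pending.peek().folders.add(folder);
            }
            pending.push(folder);
        } else if ("file".equals(qName)) {
            if (pending.isEmpty()) {
                throw new SAXException("<file> outside any <folder>");
            }
            pending.peek().files.add(atts.getValue("name"));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The folder is complete once its end tag is reported.
        if ("folder".equals(qName)) pending.pop();
    }

    static Folder parse(String xml) {
        FolderHandler handler = new FolderHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return handler.root;
    }
}
```

Here the object stack happens to mirror the element stack exactly. When one element feeds several objects, or several elements feed one, the two stacks diverge and need to be tracked separately.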
The worst scenario is when neither the XML text
nor the application data structures are very regular.
Software to work with that kind of system quickly gets
fragile as it grows, and you'll probably want to change
some of your application constraints.