Where can you get this babelfish to put in your
program's ear? An XML parser
is a program or code library that translates XML data into either a
stream of events or a data object, giving your program direct access
to structured data. The XML can come from one or more files or
filehandles, a character stream, or a static string. It could be
peppered with entity references that may or may not need to be
resolved. Some of the parts could come from outside your computer
system, living in some far corner of the Internet. It could be
encoded in a Latin character set, or perhaps in a Japanese set.
Fortunately for you, the developer, none of these details have to be
accounted for in your program because they are all taken care of by
the parser, an abstract tunnel between the physical state of data and
the crystallized representation seen by your subroutines.
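
The two kinds of output mentioned above, a stream of events and a data object, can be sketched side by side. This is a minimal illustration, not a recommendation of any particular module: it uses Python's standard-library parsers for brevity, and the helper names `as_events` and `as_object` and the sample string are invented for the example.

```python
import xml.parsers.expat
import xml.etree.ElementTree as ET

# Sample input (an assumption for this sketch): a small, well-formed document.
SOURCE = '<decree effective="now">All motorbikes shall be painted red.</decree>'

def as_events(source):
    """Parse `source` and return the parser's output as a list of events."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver each run of text as a single event
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name, attrs))
    parser.CharacterDataHandler = lambda data: events.append(("text", data))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.Parse(source, True)  # True marks this as the final chunk of input
    return events

def as_object(source):
    """Parse `source` and return the parser's output as one data object."""
    return ET.fromstring(source)
```

Here `as_events(SOURCE)` yields a start event, one text event, and an end event for the `decree` element, while `as_object(SOURCE)` returns a single element whose `tag`, `attrib`, and `text` fields hold the same information.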
An XML parser acts as a bridge between marked-up data (data packaged
with embedded XML instructions) and some predigested form your
program can work with. In Perl's case, we mean
hashes, arrays, scalars, and objects made of references to these old
friends. XML can be complex, residing in many files or streams, and
can contain unresolved regions (entities) that may need to be patched
up. Also, a parser usually accepts only good XML, rejecting any
document that contains well-formedness errors. Its output has to reflect
the structure (order, containment, associative data) while ignoring
irrelevant details such as what files the data came from and what
character set was used. That's a lot of work. To
itemize these points, an XML parser:
- Reads a stream of characters and distinguishes between markup and data
- Optionally replaces entity references with their values
- Assembles a complete, logical document from many disparate sources
- Reports syntax errors and optionally reports grammatical (validation) errors
- Serves data and structural information to a client program
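
The second item in the list, replacing entity references with their values, can be seen in a small sketch. It assumes Python's standard-library `xml.etree.ElementTree`, which expands entities declared in a document's internal subset; the `company` entity, the `memo` element, and the helper `memo_text` are invented for the example.

```python
import xml.etree.ElementTree as ET

# A document that declares one internal entity (hypothetical name: company)
# and also uses the predefined &amp; entity.
DOC = """<!DOCTYPE memo [
  <!ENTITY company "Acme Motor Works">
]>
<memo>From &company; &amp; partners</memo>"""

def memo_text(source):
    """Return the character data of the root element after parsing."""
    return ET.fromstring(source).text
```

The client code never sees `&company;` or `&amp;`; `memo_text(DOC)` returns the text with both references already replaced by their values.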
That leads to the third task. If you allow the parser to resolve
external entities, it will fetch all the documents, local or remote,
that contain parts of the larger XML document. As it does so, the
parser smushes all these entities into one unbroken document. Since
your program usually doesn't need to know how the document is
distributed physically, information about the physical origin of any
piece of data goes away once the parser knits the whole document
together.
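
To make the knitting step concrete, here is a rough sketch under several assumptions. It uses Python's standard-library expat binding, and instead of fetching anything from disk or the network it looks up external entities in a dictionary. The system identifiers `chapter1.xml` and `chapter2.xml`, the `SOURCES` table, and the `knit` helper are all hypothetical.

```python
import xml.parsers.expat

# Stand-ins for files that might live elsewhere (hypothetical names and content).
SOURCES = {
    "chapter1.xml": "<chapter>One</chapter>",
    "chapter2.xml": "<chapter>Two</chapter>",
}

MAIN = """<!DOCTYPE book [
  <!ENTITY ch1 SYSTEM "chapter1.xml">
  <!ENTITY ch2 SYSTEM "chapter2.xml">
]>
<book>&ch1;&ch2;</book>"""

def knit(main, sources):
    """Parse `main`, resolving external entities from `sources`,
    and return the logical document rebuilt as one string."""
    out = []
    parser = xml.parsers.expat.ParserCreate()

    def external(context, base, system_id, public_id):
        # Parse the referenced piece with a child parser; it inherits
        # the handlers set on the parent, so its events land in `out` too.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(sources[system_id], True)
        return 1  # a nonzero return tells expat the entity was handled

    parser.StartElementHandler = lambda name, attrs: out.append("<%s>" % name)
    parser.CharacterDataHandler = out.append
    parser.EndElementHandler = lambda name: out.append("</%s>" % name)
    parser.ExternalEntityRefHandler = external
    parser.Parse(main, True)
    return "".join(out)
```

The string returned by `knit(MAIN, SOURCES)` contains both chapters inside the `book` element, with no trace of which system identifier each one came from.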
While interpreting the markup, the parser may trip over a syntactic
error. XML was designed to make it very easy to spot such errors.
Everything from attributes to empty element tags has rigid rules for
its construction, so a parser doesn't have to think
very hard about it. For example, the following piece of XML has an
obvious error. The start tag for the
<decree> element contains an attribute with
a defective value assignment. The value
"now" is missing a second quote
character, and there's another error, somewhere in
the end tag. Can you see it?
<decree effective="now>All motorbikes
shall be painted red.</decree<
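
A parser will flag this fragment without any help from your program. The sketch below assumes Python's standard-library `xml.etree.ElementTree`; the `check` helper and the repaired copy of the fragment are additions made for the example.

```python
import xml.etree.ElementTree as ET

# The defective fragment from the text, and a repaired copy for comparison.
BROKEN = '<decree effective="now>All motorbikes\nshall be painted red.</decree<'
FIXED = '<decree effective="now">All motorbikes\nshall be painted red.</decree>'

def check(source):
    """Report whether `source` parses as well-formed XML."""
    try:
        ET.fromstring(source)
    except ET.ParseError as err:
        line, column = err.position  # where the parser gave up
        return "not well-formed at line %d, column %d: %s" % (line, column, err)
    return "well-formed"
```

Calling `check(BROKEN)` returns a message that starts with "not well-formed", while `check(FIXED)` returns "well-formed". This parser stops at the first error it meets, so the second problem in the fragment is never reported on its own.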