1.6. XML Gotchas
This section introduces topics we think you should keep in mind as
you read the book. They are the source of many of the problems
you'll encounter when working with XML.
- Well-formedness
XML has built-in quality control. A document has to pass some minimal
syntax rules in order to be blessed as well-formed XML. Most parsers
refuse to process a document that breaks any of these rules, so you
should make sure any data you input is well-formed.
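A minimal sketch of that quality control in action. It uses Python's standard library rather than a Perl module, because the behavior belongs to XML and not to any one language. The helper name `is_well_formed` and the sample memo documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message).

    Hypothetical helper for illustration; it only checks well-formedness,
    not validity against a DTD or schema.
    """
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)

# Made-up sample documents.
good = "<memo><to>Larry</to><body>Ship it.</body></memo>"
bad = "<memo><to>Larry</body></memo>"  # end tag does not match <to>

ok_good, _ = is_well_formed(good)
ok_bad, why = is_well_formed(bad)

print(ok_good)       # True
print(ok_bad, why)   # False, followed by the parser's error message
```

Running a check like this before handing data to the rest of your program turns a mysterious crash deep in your code into an early, readable error.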
- Character encodings
Now that we're in the 21st century, we have to pay
attention to things like character encodings. Gone are the days when
you could be content knowing only about ASCII, the little character
set that could. Unicode is the new king, presiding over
all major character sets of the world. XML prefers to work with
Unicode, but there are many ways to represent it, including
Perl's favorite Unicode encoding, UTF-8. You usually
won't have to think about it, but you should still
be aware of the potential for trouble when a document's encoding is
not the one your program expects.
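To make the encoding issue concrete, here is a small sketch in Python's standard library (chosen for illustration, not a Perl module). The sample word and variable names are assumptions. It shows the same document stored as two different byte sequences, and what happens when the encoding declaration lies about the bytes.

```python
import xml.etree.ElementTree as ET

# Made-up one-element document; {enc} is filled in below.
template = "<?xml version='1.0' encoding='{enc}'?><word>caf\u00e9</word>"

# The same document serialized as bytes in two different encodings.
latin1_bytes = template.format(enc="ISO-8859-1").encode("iso-8859-1")
utf8_bytes = template.format(enc="UTF-8").encode("utf-8")

# The parser reads the encoding declaration and returns Unicode text,
# so both byte sequences yield the same string.
word_from_latin1 = ET.fromstring(latin1_bytes).text
word_from_utf8 = ET.fromstring(utf8_bytes).text
print(word_from_latin1 == word_from_utf8)  # True

# Trouble starts when the declaration does not match the actual bytes:
# here the bytes are Latin-1 but the declaration claims UTF-8.
mislabeled = template.format(enc="UTF-8").encode("iso-8859-1")
try:
    ET.fromstring(mislabeled)
    mislabeled_ok = True
except ET.ParseError:
    mislabeled_ok = False
print(mislabeled_ok)  # False
```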
- Namespaces
Not everyone works with or even knows about namespaces.
They're a feature of XML whose usefulness is not
immediately obvious, yet they are creeping into our reality slowly but
surely. Namespaces categorize markup and declare tags to be from
different places. With them, you can mix and match document types,
blurring the distinctions between them. Equations in HTML? Markup as
data in XSLT? Yes, and namespaces are the reason. Older modules
don't have special support for namespaces, but the
newer generation will. Keep it in mind.
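As a rough illustration of mixing document types, the sketch below embeds a MathML fragment in an XHTML document and looks elements up by namespace. It is written against Python's standard library for illustration only; the sample document is made up, and the `m` prefix is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
MATHML = "http://www.w3.org/1998/Math/MathML"

# Made-up document: an equation (MathML) inside HTML.
doc = """
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:m="http://www.w3.org/1998/Math/MathML">
  <body>
    <p>Radius:</p>
    <m:math><m:mi>r</m:mi></m:math>
  </body>
</html>
"""
root = ET.fromstring(doc)

# A namespace-aware parser reports each tag together with its namespace.
print(root.tag)  # {http://www.w3.org/1999/xhtml}html

# Looking for a plain "body" finds nothing, because the element really
# lives in the XHTML namespace.
plain_body = root.find("body")
real_body = root.find("{%s}body" % XHTML)

# A prefix map lets us search by namespace; the prefix used in the query
# need not match the one used in the document.
math = root.find(".//mml:math", {"mml": MATHML})
variable = math.find("mml:mi", {"mml": MATHML}).text
print(variable)  # r
```

The lookup that returns nothing is the classic symptom of namespace-unaware code meeting a namespaced document.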
- Declarations
Declarations aren't part
of the document per se; they just define pieces of it. That makes
them weird, and something you might not pay enough attention to.
Remember that documents often use DTDs and have declarations for such
things as entities and attributes. If you forget them, your code may
mishandle documents that rely on them.
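Here is a small sketch of a declaration quietly changing what a program sees. It relies on Python's standard library parser, used for illustration, which reads declarations in the internal DTD subset. The memo element and its `status` attribute are invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up document whose internal DTD subset declares a default value
# for an attribute that never appears in the markup itself.
with_decl = """<!DOCTYPE memo [
  <!ATTLIST memo status CDATA "draft">
]>
<memo>Ship it.</memo>"""

# The same element without the declaration.
without_decl = "<memo>Ship it.</memo>"

attrs_with = dict(ET.fromstring(with_decl).attrib)
attrs_without = dict(ET.fromstring(without_decl).attrib)

print(attrs_with)     # {'status': 'draft'}
print(attrs_without)  # {}
```

The two `<memo>` elements look identical in the markup, yet the parser reports different attributes. Declarations kept in an external DTD are a separate matter, and whether they get read at all depends on the parser and how it is configured.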
- Entities
Entities and entity
references seem simple enough: they stand in for content that
you'd rather not type in at that moment. Maybe the
content is in another file, or maybe it contains characters that are
difficult to type. The concept is simple, but the execution can be a
royal pain. Sometimes you want to resolve references and sometimes
you'd rather keep them there. Sometimes a parser
wants to see the declarations; at other times it
doesn't care. Entities can contain other entities to
an arbitrary depth. They're tricky little beasties
and we guarantee that if you don't give careful
thought to how you're going to handle them, they
will haunt you.
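The following sketch shows two of those behaviors: an entity that contains another entity, and a reference whose declaration the parser cannot see. It uses Python's standard library for illustration; the entity names `pub` and `sig` and the memo text are made up.

```python
import xml.etree.ElementTree as ET

# Made-up document: &sig; refers to &pub;, so resolving one reference
# means resolving another inside it.
doc = """<!DOCTYPE memo [
  <!ENTITY pub "O'Reilly">
  <!ENTITY sig "Regards, &pub; staff">
]>
<memo>&sig; &amp; friends</memo>"""

# This parser resolves references as it goes, so the program never sees
# "&sig;" itself, only the replacement text.
resolved = ET.fromstring(doc).text
print(resolved)  # Regards, O'Reilly staff & friends

# Without the declaration, the same reference is an error.
try:
    ET.fromstring("<memo>&sig;</memo>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)  # False
```

Note that this parser gives you no say in the matter: references are always resolved. If you need to keep them intact, for example to write the document back out unchanged, you need a parser or an option that supports it.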
- Whitespace
According to XML, anything that isn't a markup tag
is significant character data. This fact can lead to some surprising
results. For example, it isn't always clear what
should happen with whitespace. By
default, an XML processor will preserve all of
it -- even the newlines you put after tags to make them more
readable or the spaces you use to indent text. Some parsers will give
you options to ignore space in certain circumstances, but there are
no hard and fast rules.
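A short sketch of what "preserve all of it" looks like from a program's point of view, again using Python's standard library for illustration. The list document is made up, and stripping is just one possible policy, not a rule.

```python
import xml.etree.ElementTree as ET

# Made-up document, indented the way a person would write it.
doc = "<list>\n  <item>one</item>\n  <item> two </item>\n</list>"
root = ET.fromstring(doc)
items = root.findall("item")

# The newline and indentation before the first <item> are kept as text.
print(repr(root.text))      # '\n  '

# So is the whitespace that follows each <item> end tag ...
print(repr(items[0].tail))  # '\n  '
print(repr(items[1].tail))  # '\n'

# ... and the spaces inside an element's own content.
print(repr(items[1].text))  # ' two '

# If your application treats this space as noise, it has to say so itself.
cleaned = [item.text.strip() for item in items]
print(cleaned)              # ['one', 'two']
```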
In the end, Perl and XML are well suited for each other. There may be
a few traps and pitfalls along the way, but with the generosity of
various module developers, your path toward Perl/XML enlightenment
should be well lit.
Copyright © 2002 O'Reilly & Associates. All rights reserved.