5.2. DTD Handlers
XML::Parser::PerlSAX supports another group of handlers used to
process DTD events.
These handlers take care of anything that appears before the root
element, such as the XML declaration, the doctype declaration, and the
internal subset of entity and element declarations, which are
collectively called the document prolog. If you want to output
the document literally as you read it (e.g., in a filter program),
you need to define some of these handlers to reproduce the document
prolog. Defining these handlers is just what we needed in the
previous example.
You can use these handlers for other purposes. For example, you may
need to pre-load entity definitions for special processing rather
than rely on the parser to do its default substitution for you. These
handlers are listed in Table 5-2.
Table 5-2. PerlSAX DTD handlers
Method name | Event | Properties
entity_decl | The parser sees an entity declaration (internal or external, parsed or unparsed). | Name, Value, PublicId, SystemId, Notation
notation_decl | The parser found a notation declaration. | Name, PublicId, SystemId, Base
unparsed_entity_decl | The parser found a declaration for an unparsed entity (e.g., a binary data entity). | Name, PublicId, SystemId, Base
element_decl | An element declaration was found. | Name, Model
attlist_decl | An element's attribute list declaration was encountered. | ElementName, AttributeName, Type, Fixed
doctype_decl | The parser found the document type declaration. | Name, SystemId, PublicId, Internal
xml_decl | The XML declaration was encountered. | Version, Encoding, Standalone
The entity_decl( ) handler is called for all
kinds of entity declarations unless a more specific handler is
defined. Thus, unparsed entity declarations trigger the
entity_decl( ) handler unless you've defined an
unparsed_entity_decl( ) handler, which will take precedence.
entity_decl( )'s parameters
vary depending on the entity type. The Value
parameter is set for internal entities, but not external ones.
Likewise, PublicId and SystemId, parameters that tell an XML
processor where to find the file containing the entity's
value, are set only for external entities, not internal ones.
Base tells the processor what to use for a base URL
if the SystemId contains a relative location.
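The way these parameters vary can be sketched with a small handler package that sorts declarations by kind. This is only an illustration under stated assumptions: the EntityCatalog package and its Kind field are hypothetical names, and the handler is driven by hand-built property hashes rather than by a real parser run, so the exact values a parser would supply are assumed.

```perl
use strict;
use warnings;

package EntityCatalog;    # hypothetical handler package, not part of PerlSAX

sub new { return bless { entities => {} }, shift }

# Called once per entity declaration with a hash ref of properties.
sub entity_decl {
    my( $self, $properties ) = @_;
    my $name = $properties->{Name};
    my $kind;
    if( defined $properties->{Value} ) {
        $kind = 'internal';     # the value is given inline
    } elsif( defined $properties->{Notation} ) {
        $kind = 'unparsed';     # external, with a notation attached
    } else {
        $kind = 'external';     # located through SystemId/PublicId
    }
    $self->{entities}{$name} = { %$properties, Kind => $kind };
}

package main;

my $catalog = EntityCatalog->new;

# Simulated events; a real parser would make these calls for us.
$catalog->entity_decl( { Name => 'thingy', Value => 'hoo hah blah blah' } );
$catalog->entity_decl( { Name => 'chap1', SystemId => 'chapters/ch01.xml' } );
$catalog->entity_decl( { Name => 'logo', SystemId => 'logo.gif',
                         Notation => 'gif' } );

for my $name ( sort keys %{ $catalog->{entities} } ) {
    print $name, ': ', $catalog->{entities}{$name}{Kind}, "\n";
}
```

A catalog like this is one way to pre-load entity definitions for special processing instead of relying on the parser's default substitution.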
Notation declarations are a special feature of DTDs that allow you to
assign a special type identifier to an entity. For example, you could
declare an entity to be of type
"date" to tell the XML processor
that the entity should be treated as that kind of data.
This feature isn't used very often in XML, so we
won't go into it further.
The Model property of the element_decl( ) handler
contains the content model, or grammar, for an element.
This property describes what is allowed to go inside an element
according to the DTD.
An attribute list declaration in a DTD can contain more than one
attribute description. Fortunately, the parser breaks these
descriptions up into individual calls to the
attlist_decl( ) handler, one for each attribute.
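Because each attribute arrives in its own call, a handler can rebuild the per-element attribute list with a nested hash. The following is a minimal sketch, not a definitive implementation: the AttlistCollector package is a hypothetical name, and the calls are simulated with hand-built property hashes whose Type values are assumed rather than captured from a parser.

```perl
use strict;
use warnings;

package AttlistCollector;    # hypothetical handler package

sub new { return bless { attributes => {} }, shift }

# Called once per attribute, even when a single <!ATTLIST ...>
# declaration describes several attributes.
sub attlist_decl {
    my( $self, $properties ) = @_;
    my $element   = $properties->{ElementName};
    my $attribute = $properties->{AttributeName};
    $self->{attributes}{$element}{$attribute} = $properties->{Type};
}

package main;

my $collector = AttlistCollector->new;

# Simulates the calls for a declaration such as
# <!ATTLIST book id ID #REQUIRED lang CDATA #IMPLIED>
$collector->attlist_decl(
    { ElementName => 'book', AttributeName => 'id', Type => 'ID' } );
$collector->attlist_decl(
    { ElementName => 'book', AttributeName => 'lang', Type => 'CDATA' } );

for my $element ( sort keys %{ $collector->{attributes} } ) {
    my $attrs = $collector->{attributes}{$element};
    my @described = map { $_ . ' (' . $attrs->{$_} . ')' } sort keys %$attrs;
    print $element, ': ', join( ', ', @described ), "\n";
}
```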
The document type declaration is an optional part of the document at
the top, just under the XML declaration. The parameter
Name is the name of the root element in your
document. PublicId and SystemId
tell the processor where to find the external DTD. Finally, the
Internal parameter contains the whole internal
subset as a string, in case you want to skip the individual entity
and element declaration handling.
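If you do want to skip the individual declaration handlers, the Internal parameter lets you copy the subset wholesale. Here is a hedged sketch of that idea: doctype_string( ) is a hypothetical helper, and it assumes that Internal holds the subset text without the enclosing square brackets, which you should verify against your parser before relying on it.

```perl
use strict;
use warnings;

# Build a doctype declaration from a properties hash, copying the
# internal subset verbatim instead of handling each declaration.
# Assumption: Internal does not include the surrounding [ and ].
sub doctype_string {
    my( $properties ) = @_;
    my $text  = '<!DOCTYPE ' . $properties->{Name};
    my $pubid = $properties->{PublicId};
    my $sysid = $properties->{SystemId};
    if( $pubid ) {
        $text .= ' PUBLIC "' . $pubid . '" "' . $sysid . '"';
    } elsif( $sysid ) {
        $text .= ' SYSTEM "' . $sysid . '"';
    }
    my $intset = $properties->{Internal};
    $text .= ' [' . $intset . ']' if $intset;
    return $text . '>';
}

# Simulated properties; a real parser would pass these to doctype_decl( ).
my $doctype = doctype_string( {
    Name     => 'book',
    SystemId => '/usr/local/prod/sgml/db.dtd',
    Internal => "\n<!ENTITY thingy \"hoo hah blah blah\">\n",
} );
print $doctype, "\n";
```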
As an example, let's say you wanted to extend the
filter example to output the document prolog exactly as it was
encountered by the parser. You'd need to define
handlers like those in Example 5-4.
Example 5-4. A better filter
#
# handle xml declaration
#
sub xml_decl {
    my( $self, $properties ) = @_;
    output( "<?xml version=\"" . $properties->{'Version'} . "\"" );
    my $encoding = $properties->{'Encoding'};
    output( " encoding=\"$encoding\"" ) if( $encoding );
    my $standalone = $properties->{'Standalone'};
    output( " standalone=\"$standalone\"" ) if( $standalone );
    output( "?>\n" );
}

#
# handle doctype declaration:
# try to duplicate the original
#
sub doctype_decl {
    my( $self, $properties ) = @_;
    output( "\n<!DOCTYPE " . $properties->{'Name'} . "\n" );
    my $pubid = $properties->{'PublicId'};
    my $sysid = $properties->{'SystemId'};
    if( $pubid ) {
        output( " PUBLIC \"$pubid\"\n" );
        output( " \"$sysid\"\n" );
    } elsif( $sysid ) {
        # no external identifier at all if only an internal subset exists
        output( " SYSTEM \"$sysid\"\n" );
    }
    my $intset = $properties->{'Internal'};
    if( $intset ) {
        $in_intset = 1;    # remember that we are inside the internal subset
        output( "[\n" );
    } else {
        output( ">\n" );
    }
}

#
# handle entity declaration in internal subset:
# recreate the original declaration as it was
#
sub entity_decl {
    my( $self, $properties ) = @_;
    my $name = $properties->{'Name'};
    output( "<!ENTITY $name " );
    my $pubid = $properties->{'PublicId'};
    my $sysid = $properties->{'SystemId'};
    if( $pubid ) {
        output( "PUBLIC \"$pubid\" \"$sysid\"" );
    } elsif( $sysid ) {
        output( "SYSTEM \"$sysid\"" );
    } else {
        output( "\"" . $properties->{'Value'} . "\"" );
    }
    # unparsed entities carry a notation name after the external ID
    my $notation = $properties->{'Notation'};
    output( " NDATA $notation" ) if( $notation );
    output( ">\n" );
}
Now let's see how the output from our filter looks.
The result is in Example 5-5.
Example 5-5. Output from the filter
<?xml version="1.0"?>
<!DOCTYPE book
SYSTEM "/usr/local/prod/sgml/db.dtd"
[
<!ENTITY thingy "hoo hah blah blah">
]>
<book id="mybook">
<title>GRXL in a Nutshell</title>
<chapter id="intro">
<title>What is GRXL?</title>
<comment> need a better title </comment>
<para>
Yet another acronym. That was our attitude at first, but then we saw
the amazing uses of this new technology called
<literal>GRXL</literal>. Consider the following program:
</para>
<programlisting>AH aof -- %%%%
{{{{{{ let x = 0 }}}}}}
print! <lineannotation>wow</lineannotation>
or not!</programlisting>
<comment> what font should we use? </comment>
<para>
What does it do? Who cares? It's just lovely to look at. In fact,
I'd have to say, "&thingy;".
</para>
</chapter>
</book>
That's much better. Now we have a complete filter
program. The basic handlers take care of elements and everything
inside them. The DTD handlers deal with whatever happens
outside of the root element.
Copyright © 2002 O'Reilly & Associates. All rights reserved.