External Entity Resolution (Perl and XML)

5.3. External Entity Resolution

By default, the parser replaces every entity reference with its value for you. Usually that's what you want, but sometimes, as in our filter example, you'd rather keep the entity references in place. As we saw, that is easy to do: include an entity_reference( ) handler method that overrides the default behavior by outputting the references again. What we haven't seen yet is how to override the default handling of external entity references. Here, too, the parser wants to replace the references with their values, which it does by locating the files and inserting their contents into the stream. Would you ever want to change that behavior, and if so, how would you do it?

Storing documents in multiple files is convenient, especially for really large documents. For example, suppose you have a big book to write in XML and you want to store each chapter in its own file. You can do so easily with external entities. Here's an example:

<?xml version="1.0"?>
<!DOCTYPE book [
  <!ENTITY intro-chapter   SYSTEM "chapters/intro.xml">
  <!ENTITY pasta-chapter   SYSTEM "chapters/pasta.xml">
  <!ENTITY stirfry-chapter SYSTEM "chapters/stirfry.xml">
  <!ENTITY soups-chapter   SYSTEM "chapters/soups.xml"> ]>

<book>
  <title>The Bonehead Cookbook</title>
  &intro-chapter;
  &pasta-chapter;
  &stirfry-chapter;
  &soups-chapter;
</book>

The previous filter example would resolve the external entity references for you diligently and output the entire book in one piece. Your file separation scheme would be lost and you'd have to edit the resulting file to break it back into multiple files. Fortunately, we can override the resolution of external entity references using a handler called resolve_entity( ).

The handler receives a hash with four properties: Name, the entity's name; SystemId and PublicId, identifiers that help you locate the file containing the entity's text; and Base, which helps resolve relative URLs, if any exist. Unlike the other handlers, this one should return a value to tell the parser what to do. Returning undef tells the parser to load the external entity as it normally would. Otherwise, you need to return a hash describing an alternative source from which the entity should be loaded. This hash is the same kind you would pass to the parser object's parse( ) method, with keys like SystemId to give it a filename or URL, or String to give it a string of text. For example:

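sub resolve_entity {
  my( $self, $props ) = @_;
  # Treat the SystemId as a local filename. If it is missing or the file
  # can't be opened, fall through and let the parser resolve the entity.
  if( defined( $props->{ SystemId }) and
      open( my $ent, '<', $props->{ SystemId })) {
    # Mark the start of the file with a processing instruction.
    my $entval = '<?start-file ' . $props->{ SystemId } . '?>';
    while( <$ent> ) { $entval .= $_; }
    close $ent;
    # Mark the end of the file the same way.
    $entval .= '<?end-file ' . $props->{ SystemId } . '?>';
    return { String => $entval };
  } else {
    return undef;
  }
}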

This routine opens the entity resource, if it is in a file the routine can find, and hands its contents to the parser as a string. Before doing so, it attaches a processing instruction before and after the entity text to mark the boundaries of the file. Later, you can write a routine that looks for these PIs and separates the files back out again.
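One way such a separating routine might look is sketched below. It is only an illustration: the name split_marked_files and the optional map from SystemId to entity name are assumptions made for this example (the PIs record only the SystemId, so the entity names have to come from somewhere else), and it assumes the marked files are not nested inside one another.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of any parser module.
# Takes text containing <?start-file NAME?> ... <?end-file NAME?> spans and
# an optional hash ref mapping each NAME (the SystemId) to an entity name.
# Returns a hash ref of NAME => file content, plus the remaining text with
# each span replaced by an entity reference (or removed if no name is known).
sub split_marked_files {
    my ( $text, $names ) = @_;
    $names ||= {};
    my %files;
    $text =~ s{
        <\?start-file \s+ (.+?) \s* \?>   # opening PI, capture the file name
        (.*?)                             # the file's content
        <\?end-file \s+ \1 \s* \?>        # closing PI with the same name
    }{
        $files{$1} = $2;
        exists $names->{$1} ? '&' . $names->{$1} . ';' : '';
    }gsex;
    return ( \%files, $text );
}

# Example input: a filtered book with one marked chapter.
my $doc = "<book>\n"
        . "<?start-file chapters/intro.xml?>"
        . "<chapter>Hi</chapter>\n"
        . "<?end-file chapters/intro.xml?>"
        . "\n</book>\n";

my ( $files, $main ) = split_marked_files(
    $doc, { 'chapters/intro.xml' => 'intro-chapter' } );

print "$_:\n$files->{$_}" for sort keys %$files;
print "main document:\n$main";
```

From here, each entry in the returned hash could be written back to its own file, and the remaining text saved as the main document. The entity declarations in the internal subset would still need to be restored separately, since this sketch only handles the document body.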