Programming PHP

11.3. Parsing XML

Say you have a collection of books written in XML, and you want to build an index showing the document title and its author. You need to parse the XML files to recognize the title and author elements and their contents. You could do this by hand with regular expressions and string functions such as strtok( ), but it's a lot more complex than it seems. The easiest and quickest solution is to use the XML parser that ships with PHP.

PHP's XML parser is based on the Expat C library, which lets you parse but not validate XML documents. This means you can find out which XML tags are present and what they surround, but you can't find out if they're the right XML tags in the right structure for this type of document. In practice, this isn't generally a big problem.

PHP's XML parser is event-based, meaning that as the parser reads the document, it calls various handler functions you provide as certain events occur, such as the beginning or end of an element.
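
To make the event model concrete, here is a minimal sketch that records the order in which the parser reports events. The trace_events( ) helper and the format of the event strings are hypothetical, chosen only for illustration; the handlers are closures rather than named functions. Element names come back in uppercase because of the parser's default case folding.

```php
<?php
// Hypothetical helper: returns the sequence of events the parser reports.
function trace_events(string $xml): array {
  $events = [];
  $parser = xml_parser_create();

  // Start and end handlers append a description of each element event.
  xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$events) { $events[] = "start $name"; },
    function ($p, $name) use (&$events) { $events[] = "end $name"; }
  );

  // The character data handler records the text between tags.
  xml_set_character_data_handler(
    $parser,
    function ($p, $text) use (&$events) { $events[] = "text '$text'"; }
  );

  xml_parse($parser, $xml, true);  // true: this is the only chunk of data
  return $events;
}

print_r(trace_events('<book><title>PHP</title></book>'));
```

The handlers fire in document order, so the start event for book precedes everything nested inside it and its end event comes last.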

In the following sections we discuss the handlers you can provide, the functions to set the handlers, and the events that trigger the calls to those handlers. We also provide sample functions for creating a parser to generate a map of the XML document in memory, tied together in a sample application that pretty-prints XML.

11.3.1. Element Handlers

When the parser encounters the beginning or end of an element, it calls the start and end element handlers. You set the handlers through the xml_set_element_handler( ) function:

xml_set_element_handler(parser, start_element, end_element);

The start_element and end_element parameters are the names of the handler functions.

The start element handler is called when the XML parser encounters the beginning of an element:

my_start_element_handler(parser, element, attributes);

It is passed three parameters: a reference to the XML parser calling the handler, the name of the element that was opened, and an associative array of any attributes the parser encountered for the element, keyed by attribute name. Current PHP versions pass this array by value, so the handler declares it as an ordinary parameter.

Example 11-2 contains the code for a start element handler. This handler simply prints the element name in bold and the attributes in gray.

Example 11-2. Start element handler

function start_element($inParser, $inName, $inAttributes) {
  $attributes = array( );
  // each attribute arrives as name => value
  foreach($inAttributes as $key => $value) {
    $attributes[] = "<font color=\"gray\">$key=\"$value\" </font>";
  }
  
  echo '&lt;<b>' . $inName . '</b> ' . join(' ', $attributes) . '&gt;';
}

The end element handler is called when the parser encounters the end of an element:

my_end_element_handler(parser, element);

It takes two parameters: a reference to the XML parser calling the handler, and the name of the element that is closing.

Example 11-3 shows an end element handler that formats the element.

Example 11-3. End element handler

function end_element($inParser, $inName) {
  // double quotes so that $inName is interpolated
  echo "&lt;<b>/$inName</b>&gt;";
}
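
Start and end handlers often cooperate through shared state. As a rough sketch, the hypothetical outline( ) helper below (not part of the book's examples) keeps a depth counter that the start handler increments and the end handler decrements, and uses it to indent each element name. Names appear in uppercase because the parser folds case by default.

```php
<?php
// Hypothetical helper: renders an indented outline of element names.
function outline(string $xml): string {
  $depth = 0;
  $out = '';
  $parser = xml_parser_create();

  xml_set_element_handler(
    $parser,
    // Start handler: print the name at the current depth, then go one level deeper.
    function ($p, $name, $attrs) use (&$depth, &$out) {
      $out .= str_repeat('  ', $depth) . $name . "\n";
      $depth++;
    },
    // End handler: come back up one level.
    function ($p, $name) use (&$depth) {
      $depth--;
    }
  );

  xml_parse($parser, $xml, true);
  return $out;
}

echo outline('<library><book><title>Programming PHP</title></book></library>');
```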

11.3.4. Entity Handlers

Entities in XML are placeholders. XML provides five standard entities (&amp;, &gt;, &lt;, &quot;, and &apos;), but XML documents can define their own entities. Most entity definitions do not trigger events, and the XML parser expands most entities in documents before calling the other handlers.

Two types of entities, external and unparsed, have special support in PHP's XML library. An external entity is one whose replacement text is identified by a filename or URL rather than explicitly given in the XML file. You can define a handler to be called for occurrences of external entities in character data, but it's up to you to parse the contents of the file or URL yourself if that's what you want.

An unparsed entity must be accompanied by a notation declaration, and while you can define handlers for declarations of unparsed entities and notations, occurrences of unparsed entities are deleted from the text before the character data handler is called.

11.3.4.1. External entities

External entity references allow XML documents to include other XML documents. Typically, an external entity reference handler opens the referenced file, parses the file, and includes the results in the current document. Set the handler with xml_set_external_entity_ref_handler( ), which takes in a reference to the XML parser and the name of the handler function:

xml_set_external_entity_ref_handler(parser, handler);

The external entity reference handler takes five parameters: the parser triggering the handler, the entity's name, the base URI for resolving the identifier of the entity (which is currently always empty), the system identifier (such as the filename), and the public identifier for the entity, as defined in the entity's declaration:

$ok = my_ext_entity_handler(parser, entity, base, system, public);

If your external entity reference handler returns a false value (which it will if it returns no value), XML parsing stops with an XML_ERROR_EXTERNAL_ENTITY_HANDLING error. If it returns true, parsing continues.

Example 11-6 shows how you would parse externally referenced XML documents. Define two functions, create_parser( ) and parse( ), to do the actual work of creating and feeding the XML parser. You can use them both to parse the top-level document and any documents included via external references. Such functions are described later, in Section 11.3.7. The external entity reference handler simply identifies the right file to send to those functions.

Example 11-6. External entity reference handler

function external_entity_reference($inParser, $inNames, $inBase,
                                   $inSystemID, $inPublicID) {
  if($inSystemID) {
    if(!list($parser, $fp) = create_parser($inSystemID)) {
      echo "Error opening external entity $inSystemID \n";
      return false;
    }
    return parse($parser, $fp);
  }
  return false;
}

11.3.6. Options

The XML parser has several options you can set to control the source and target encodings and case folding. Use xml_parser_set_option( ) to set an option:

xml_parser_set_option(parser, option, value);

Similarly, use xml_parser_get_option( ) to interrogate a parser about its options:

$value = xml_parser_get_option(parser, option);

11.3.6.2. Case folding

By default, element and attribute names in XML documents are converted to all uppercase. You can turn off this behavior (and get case-sensitive element names) by setting the XML_OPTION_CASE_FOLDING option to false with the xml_parser_set_option( ) function:

xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
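
The effect of case folding is easiest to see by parsing the same document twice. In this sketch, the element_names( ) helper and its $fold parameter are hypothetical names used only for illustration; the helper collects element names exactly as the start element handler receives them.

```php
<?php
// Hypothetical helper: collects element names as the start handler sees them.
function element_names(string $xml, bool $fold): array {
  $names = [];
  $parser = xml_parser_create();

  // Turn case folding on or off before parsing begins.
  xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $fold);

  xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$names) { $names[] = $name; },
    function ($p, $name) { /* nothing to do at the end of an element */ }
  );

  xml_parse($parser, $xml, true);
  return $names;
}

$xml = '<Library><Book/></Library>';
print_r(element_names($xml, true));   // names folded to uppercase
print_r(element_names($xml, false));  // names kept as written
```

With folding enabled, handlers that compare element names need to compare against uppercase strings or normalize the name first, as the sample application later in this section does with strtolower( ).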

11.3.7. Using the Parser

To use the XML parser, create a parser with xml_parser_create( ), set handlers and options on the parser, then hand chunks of data to the parser with the xml_parse( ) function until either the data runs out or the parser returns an error. Once the processing is complete, free the parser by calling xml_parser_free( ).

The xml_parser_create( ) function returns an XML parser:

$parser = xml_parser_create([encoding]);

The optional encoding parameter specifies the text encoding ("ISO-8859-1", "US-ASCII", or "UTF-8") of the file being parsed.

The xml_parse( ) function returns TRUE if the parse was successful or FALSE if it was not:

$success = xml_parse(parser, data [, final ]);

The data argument is a string of XML to process. The optional final parameter should be true for the last piece of data to be parsed.
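
The following sketch shows how the final parameter fits into a loop that feeds the parser piece by piece. The parse_in_chunks( ) helper and the chunk size are assumptions made for the example; real code would more likely read chunks from a file, as Example 11-8 does.

```php
<?php
// Hypothetical helper: feeds a document to the parser in fixed-size pieces.
function parse_in_chunks(string $xml, int $chunkSize): bool {
  $parser = xml_parser_create();
  $chunks = str_split($xml, $chunkSize);
  $last = count($chunks) - 1;

  foreach ($chunks as $i => $chunk) {
    // final is true only for the last piece of data
    if (!xml_parse($parser, $chunk, $i === $last)) {
      return false;  // stop at the first error
    }
  }
  return true;
}

var_dump(parse_in_chunks('<a><b>text</b></a>', 5));
```

Chunk boundaries may fall in the middle of a tag; the parser buffers incomplete input until the next call.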

To easily deal with nested documents, write functions that create the parser and set its options and handlers for you. This puts the options and handler settings in one place, rather than duplicating them in the external entity reference handler. Example 11-8 has such a function.

Example 11-8. Creating a parser

function create_parser ($filename) {
  $fp = fopen($filename, 'r');
  if (!$fp) {
    return false;  // let the caller report the failure
  }
  $parser = xml_parser_create( );
  
  xml_set_element_handler($parser, 'start_element', 'end_element');
  xml_set_character_data_handler($parser, 'character_data');
  xml_set_processing_instruction_handler($parser, 'processing_instruction');
  xml_set_default_handler($parser, 'default');
  
  return array($parser, $fp);
}
  
function parse ($parser, $fp) {
  $blockSize = 4 * 1024;  // read in 4 KB chunks
  
  while($data = fread($fp, $blockSize)) {
    if(!xml_parse($parser, $data, feof($fp))) {
      // an error occurred; tell the user where
      echo 'Parse error: ' . xml_error_string(xml_get_error_code($parser)) .
           " at line " . xml_get_current_line_number($parser);
  
      return FALSE;
    }
  }
  
  return TRUE;
}
  
if (list($parser, $fp) = create_parser('test.xml')) {
  parse($parser, $fp);
  fclose($fp);
  xml_parser_free($parser);
}
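
When a parse fails, the error code and current line number together say what went wrong and where. The sketch below isolates that step; the parse_error( ) helper and its message format are hypothetical, and the exact wording of the message depends on the underlying XML library.

```php
<?php
// Hypothetical helper: returns null on success, or a message describing the failure.
function parse_error(string $xml): ?string {
  $parser = xml_parser_create();

  if (xml_parse($parser, $xml, true)) {
    return null;  // the document was well-formed
  }

  // xml_error_string() expects an error code, not the parser itself.
  return sprintf(
    '%s at line %d',
    xml_error_string(xml_get_error_code($parser)),
    xml_get_current_line_number($parser)
  );
}

var_dump(parse_error('<a><b></b></a>'));  // well-formed document
echo parse_error("<a>\n<b></a>"), "\n";   // mismatched closing tag
```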

11.3.10. Sample Parsing Application

Let's develop a program to parse an XML file and display different types of information from it. The XML file, given in Example 11-9, contains information on a set of books.

Example 11-9. books.xml file

<?xml version="1.0" ?>
<library>
  <book>
    <title>Programming PHP</title>
    <authors>
      <author>Rasmus Lerdorf</author>
      <author>Kevin Tatroe</author>
    </authors>
    <isbn>1-56592-610-2</isbn>
    <comment>A great book!</comment>
  </book>
  <book>
    <title>PHP Pocket Reference</title>
    <authors>
      <author>Rasmus Lerdorf</author>
    </authors>
    <isbn>1-56592-769-9</isbn>
    <comment>It really does fit in your pocket</comment>
  </book>
  <book>
    <title>Perl Cookbook</title>
    <authors>
      <author>Tom Christiansen</author>
      <author>Nathan Torkington</author>
    </authors>
    <isbn>1-56592-243-3</isbn>
    <comment>Hundreds of useful techniques, most just as applicable to
             PHP as to Perl
    </comment>
  </book>
</library>

The PHP application parses the file and presents the user with a list of books, showing just the titles and authors. This menu is shown in Figure 11-1. The titles are links to a page showing the complete information for a book. A page of detailed information for Programming PHP is shown in Figure 11-2.

Figure 11-1. Book menu

Figure 11-2. Book details

We define a class, BookList, whose constructor parses the XML file and builds a list of records. There are two methods on a BookList that generate output from that list of records. The show_menu( ) method generates the book menu, and the show_book( ) method displays detailed information on a particular book.

Parsing the file involves keeping track of the record, which element we're in, and which elements correspond to records (book) and fields (title, author, isbn, and comment). The $record property holds the current record as it's being built, and $current_field holds the name of the field we're currently processing (e.g., 'title'). The $records property is an array of all the records we've read so far.

Two associative arrays, $field_type and $ends_record, tell us which elements correspond to fields in a record and which closing element signals the end of a record. Values in $field_type are either 1 or 2, corresponding to a simple scalar field (e.g., title) or an array of values (e.g., author) respectively. We initialize those arrays in the constructor.

The handlers themselves are fairly straightforward. When we see the start of an element, we work out whether it corresponds to a field we're interested in. If it does, we set the $current_field property to that field name, so that when we see the character data (e.g., the title of the book) we know which field it belongs to. When we get character data, we add it to the appropriate field of the current record if $current_field says we're in a field. When we see the end of an element, we check whether it's the end of a record; if so, we add the current record to the array of completed records.

One PHP script, given in Example 11-10, handles both the book menu and book details pages. The entries in the book menu link back to the URL for the menu, with a GET parameter identifying the ISBN of the book whose details are to be displayed.

Example 11-10. bookparse.php

<html>
<head><title>My Library</title></head>
<body>
<?php
 class BookList {
   var $parser;
   var $record;
   var $current_field = '';
   var $field_type;
   var $ends_record;
   var $records;
  
   function __construct ($filename) {
     $this->record = array( );
     $this->records = array( );
  
     $this->parser = xml_parser_create( );
     xml_set_object($this->parser, $this);
     xml_set_element_handler($this->parser, 'start_element', 'end_element');
     xml_set_character_data_handler($this->parser, 'cdata');
  
     // 1 = single-valued field, 2 = array field
     $this->field_type = array('title' => 1,
                               'author' => 2,
                               'isbn' => 1,
                               'comment' => 1);
     $this->ends_record = array('book' => true);
  
     $x = join("", file($filename));
     xml_parse($this->parser, $x, true);  // true: the whole file is one chunk
     xml_parser_free($this->parser);
   }
  
   function start_element ($p, $element, $attributes) {
     $element = strtolower($element);
     // only elements listed in $field_type are fields we track
     if (!empty($this->field_type[$element])) {
       $this->current_field = $element;
     } else {
       $this->current_field = '';
     }
   }
  
   function end_element ($p, $element) {
     $element = strtolower($element);
     // a closing </book> completes the current record
     if (!empty($this->ends_record[$element])) {
       $this->records[] = $this->record;
       $this->record = array( );
     }
     $this->current_field = '';
   }
  
   function cdata ($p, $text) {
     $field = $this->current_field;
     if ($field === '') {
       return;  // text outside any field we track
     }
     if ($this->field_type[$field] === 2) {
       $this->record[$field][] = $text;
     } elseif ($this->field_type[$field] === 1) {
       if (!isset($this->record[$field])) {
         $this->record[$field] = '';
       }
       $this->record[$field] .= $text;
     }
   }
  
   function show_menu( ) {
     echo "<table border=1>\n";
     foreach ($this->records as $book) {
       $authors = join(', ', $book['author']);
       printf("<tr><th><a href='%s'>%s</a></th><td>%s</td></tr>\n",
              $_SERVER['PHP_SELF'] . '?isbn=' . $book['isbn'],
              $book['title'],
              $authors);
     }
     echo "</table>\n";
   }
  
   function show_book ($isbn) {
     foreach ($this->records as $book) {
       if ($book['isbn'] !== $isbn) {
         continue;
       }
  
       $authors = join(', ', $book['author']);
       printf("<b>%s</b> by %s.<br>", $book['title'], $authors);
       printf("ISBN: %s<br>", $book['isbn']);
       printf("Comment: %s<p>\n", $book['comment']);
     }
?>
Back to the <a href="<?= $_SERVER['PHP_SELF'] ?>">list of books</a>.<p>
<?php
   }
 }; // main program code
  
 $my_library = new BookList ("books.xml");
 if (!empty($_GET['isbn'])) {
   // return info on one book
   $my_library->show_book($_GET['isbn']);
 } else {
   // show menu of books
   $my_library->show_menu( );
 }
?>
</body></html>



Copyright © 2003 O'Reilly & Associates. All rights reserved.