11.3. Parsing XML

Say you have a collection of books written in XML, and you want to build an index showing the document title and its author. You need to parse the XML files to recognize the title and author elements and their contents. You could do this by hand with regular expressions and string functions such as strtok( ), but it's a lot more complex than it seems. The easiest and quickest solution is to use the XML parser that ships with PHP.

PHP's XML parser is based on the Expat C library, which lets you parse but not validate XML documents. This means you can find out which XML tags are present and what they surround, but you can't find out if they're the right XML tags in the right structure for this type of document. In practice, this isn't generally a big problem.

PHP's XML parser is event-based, meaning that as the parser reads the document, it calls various handler functions you provide as certain events occur, such as the beginning or end of an element.

In the following sections we discuss the handlers you can provide, the functions to set the handlers, and the events that trigger the calls to those handlers. We also provide sample functions for creating a parser to generate a map of the XML document in memory, tied together in a sample application that pretty-prints XML.

11.3.1. Element Handlers

When the parser encounters the beginning or end of an element, it calls the start and end element handlers. You set the handlers through the xml_set_element_handler( ) function:

    xml_set_element_handler(parser, start_element, end_element);

The start_element and end_element parameters are the names of the handler functions.
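
As a rough sketch of how handler registration fits into a complete parse, the following records each element event in order. The trace_elements( ) helper is a hypothetical name introduced for this illustration, and anonymous functions are used here in place of named handler functions.

```php
<?php
// Hypothetical helper: returns the sequence of element events seen while parsing.
function trace_elements(string $xml): array
{
    $events = [];
    $parser = xml_parser_create();

    xml_set_element_handler(
        $parser,
        // Start handler: receives the parser, the element name, and its attributes.
        function ($parser, $name, $attributes) use (&$events) {
            $events[] = "start $name";
        },
        // End handler: receives the parser and the element name.
        function ($parser, $name) use (&$events) {
            $events[] = "end $name";
        }
    );

    // The whole document is in one string, so mark this chunk as the final one.
    xml_parse($parser, $xml, true);

    return $events;
}

print_r(trace_elements('<book><title>Programming PHP</title></book>'));
```

With the parser's default options, the recorded names arrive in uppercase (BOOK, TITLE) because case folding is enabled unless you turn it off.
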
The start element handler is called when the XML parser encounters the beginning of an element:

    my_start_element_handler(parser, element, attributes);

It is passed three parameters: a reference to the XML parser calling the handler, the name of the element that was opened, and an array containing any attributes the parser encountered for the element. The examples here take the attribute array by value; PHP copies an array only when it is modified, so declaring the parameter as a reference brings no speed benefit.

Example 11-2 contains the code for a start element handler. This handler simply prints the element name in bold and the attributes in gray.

Example 11-2. Start element handler

    function start_element($inParser, $inName, $inAttributes) {
        $attributes = array( );
        // each attribute arrives as a name => value pair
        foreach($inAttributes as $key => $value) {
            $attributes[] = "<font color=\"gray\">$key=\"$value\" </font>";
        }
        echo '&lt;<b>' . $inName . '</b> ' . join(' ', $attributes) . '&gt;';
    }

The end element handler is called when the parser encounters the end of an element:

    my_end_element_handler(parser, element);

It takes two parameters: a reference to the XML parser calling the handler, and the name of the element that is closing.

Example 11-3 shows an end element handler that formats the element.

Example 11-3. End element handler

    function end_element($inParser, $inName) {
        // double quotes so that $inName is interpolated
        echo "&lt;<b>/$inName</b>&gt;";
    }

11.3.2. Character Data Handler

All of the text between elements (character data, or CDATA in XML terminology) is handled by the character data handler. The handler you set with the xml_set_character_data_handler( ) function is called after each block of character data:

    xml_set_character_data_handler(parser, handler);

The character data handler takes in a reference to the XML parser that triggered the handler and a string containing the character data itself:

    my_character_data_handler(parser, cdata);

Example 11-4 shows a character data handler that simply prints the data.

Example 11-4. Character data handler

    function character_data($inParser, $inData) {
        echo $inData;
    }

11.3.3. Processing Instructions

Processing instructions are used in XML to embed scripts or other code into a document. PHP code itself can be seen as a processing instruction and, with the <?php ... ?> tag style, follows the XML format for demarcating the code. The XML parser calls the processing instruction handler when it encounters a processing instruction. Set the handler with the xml_set_processing_instruction_handler( ) function:

    xml_set_processing_instruction_handler(parser, handler);

A processing instruction looks like:

    <?target instructions ?>

The processing instruction handler takes in a reference to the XML parser that triggered the handler, the name of the target (for example, "php"), and the processing instructions:

    my_processing_instruction_handler(parser, target, instructions);

What you do with a processing instruction is up to you. One trick is to embed PHP code in an XML document and, as you parse that document, execute the PHP code with the eval( ) function. Example 11-5 does just that. Of course, you have to trust the documents you're processing if you eval( ) code in them. eval( ) will run any code given to it, even code that destroys files or mails passwords to a hacker.

Example 11-5. Processing instruction handler

    function processing_instruction($inParser, $inTarget, $inCode) {
        // only run instructions explicitly targeted at PHP
        if ($inTarget === 'php') {
            eval($inCode);
        }
    }

11.3.4. Entity Handlers

Entities in XML are placeholders. XML provides five standard entities (&amp;, &gt;, &lt;, &quot;, and &apos;), but XML documents can define their own entities. Most entity definitions do not trigger events, and the XML parser expands most entities in documents before calling the other handlers.

Two types of entities, external and unparsed, have special support in PHP's XML library. An external entity is one whose replacement text is identified by a filename or URL rather than explicitly given in the XML file.
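
To illustrate the point that standard entities are expanded before the handlers see them, here is a minimal sketch that gathers all character data in a document into one string. The collect_text( ) helper is a hypothetical name used only for this example. It appends on every call rather than assigning, on the assumption that the handler can be invoked more than once for a single run of text.

```php
<?php
// Hypothetical helper: returns all character data in a document as one string.
function collect_text(string $xml): string
{
    $text = '';
    $parser = xml_parser_create();

    xml_set_character_data_handler(
        $parser,
        function ($parser, $data) use (&$text) {
            // Append rather than assign: one run of text may arrive in pieces.
            $text .= $data;
        }
    );

    // The whole document is in one string, so mark this chunk as the final one.
    xml_parse($parser, $xml, true);

    return $text;
}

echo collect_text('<note>AT&amp;T &lt;research&gt;</note>'), "\n";
```

The handler never sees the &amp;, &lt;, or &gt; references themselves, only the characters they stand for.
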
You can define a handler to be called for occurrences of external entities in character data, but it's up to you to parse the contents of the file or URL yourself if that's what you want. An unparsed entity must be accompanied by a notation declaration, and while you can define handlers for declarations of unparsed entities and notations, occurrences of unparsed entities are deleted from the text before the character data handler is called.

11.3.4.1. External entities

External entity references allow XML documents to include other XML documents. Typically, an external entity reference handler opens the referenced file, parses the file, and includes the results in the current document. Set the handler with xml_set_external_entity_ref_handler( ), which takes in a reference to the XML parser and the name of the handler function:

    xml_set_external_entity_ref_handler(parser, handler);

The external entity reference handler takes five parameters: the parser triggering the handler, the entity's name, the base URI for resolving the identifier of the entity (which is currently always empty), the system identifier (such as the filename), and the public identifier for the entity, as defined in the entity's declaration:

    $ok = my_ext_entity_handler(parser, entity, base, system, public);

If your external entity reference handler returns a false value (which it will if it returns no value), XML parsing stops with an XML_ERROR_EXTERNAL_ENTITY_HANDLING error. If it returns true, parsing continues.

Example 11-6 shows how you would parse externally referenced XML documents. Define two functions, create_parser( ) and parse( ), to do the actual work of creating and feeding the XML parser. You can use them both to parse the top-level document and any documents included via external references. Such functions are described later, in Section 11.3.7. The external entity reference handler simply identifies the right file to send to those functions.

Example 11-6. External entity reference handler

    function external_entity_reference($inParser, $inNames, $inBase,
                                       $inSystemID, $inPublicID) {
        if($inSystemID) {
            // create_parser( ) returns false if the file cannot be opened
            if(!list($parser, $fp) = create_parser($inSystemID)) {
                echo "Error opening external entity $inSystemID \n";
                return false;
            }
            return parse($parser, $fp);
        }
        return false;
    }

11.3.4.2. Unparsed entities

An unparsed entity declaration must be accompanied by a notation declaration:

    <!DOCTYPE doc [
      <!NOTATION jpeg SYSTEM "image/jpeg">
      <!ENTITY logo SYSTEM "php-tiny.jpg" NDATA jpeg>
    ]>

Register a notation declaration handler with xml_set_notation_decl_handler( ):

    xml_set_notation_decl_handler(parser, handler);

The handler will be called with five parameters:

    my_notation_handler(parser, notation, base, system, public);

The base parameter is the base URI for resolving the identifier of the notation (which is currently always empty). Either the system identifier or the public identifier for the notation will be set, but not both.

Register an unparsed entity declaration handler with the xml_set_unparsed_entity_decl_handler( ) function:

    xml_set_unparsed_entity_decl_handler(parser, handler);

The handler will be called with six parameters:

    my_unp_entity_handler(parser, entity, base, system, public, notation);

The notation parameter identifies the notation declaration with which this unparsed entity is associated.

11.3.5. Default Handler

For any other event, such as the XML declaration and the XML document type, the default handler is called. To set the default handler, call the xml_set_default_handler( ) function:

    xml_set_default_handler(parser, handler);

The handler will be called with two parameters:

    my_default_handler(parser, text);

The text parameter will have different values depending on the kind of event triggering the default handler. Example 11-7 just prints out the given string when the default handler is called.

Example 11-7. Default handler

    // "default" is a reserved word in PHP, so the function needs another name
    function default_handler($inParser, $inData) {
        echo "<font color=\"red\">XML: Default handler called with '$inData'</font>\n";
    }

11.3.6. Options

The XML parser has several options you can set to control the source and target encodings and case folding. Use xml_parser_set_option( ) to set an option:

    xml_parser_set_option(parser, option, value);

Similarly, use xml_parser_get_option( ) to interrogate a parser about its options:

    $value = xml_parser_get_option(parser, option);

11.3.6.1. Character encoding

The XML parser used by PHP supports Unicode data in a number of different character encodings. Internally, the parser always represents the document in UTF-8, but documents parsed by the XML parser can be in ISO-8859-1, US-ASCII, or UTF-8. UTF-16 is not supported.

When creating an XML parser, you can give it an encoding to use for the file to be parsed. If omitted, the source is assumed to be in ISO-8859-1. (Later PHP releases, from 5.0.2 onward, detect the source encoding automatically and default to UTF-8 for the target encoding.) If a character outside the range possible in the source encoding is encountered, the XML parser will return an error and immediately stop processing the document.

The target encoding for the parser is the encoding in which the XML parser passes data to the handler functions; normally, this is the same as the source encoding. At any time during the XML parser's lifetime, the target encoding can be changed. Any characters outside the target encoding's character range are demoted by replacing them with a question mark character (?).

Use the constant XML_OPTION_TARGET_ENCODING to get or set the encoding of the text passed to callbacks. Allowable values are: "ISO-8859-1" (the default in PHP 4), "US-ASCII", and "UTF-8".

11.3.6.2. Case folding

By default, element and attribute names in XML documents are converted to all uppercase. You can turn off this behavior (and get case-sensitive element names) by setting the XML_OPTION_CASE_FOLDING option to false with the xml_parser_set_option( ) function:

    xml_parser_set_option(parser, XML_OPTION_CASE_FOLDING, false);

11.3.7. Using the Parser

To use the XML parser, create a parser with xml_parser_create( ), set handlers and options on the parser, then hand chunks of data to the parser with the xml_parse( ) function until either the data runs out or the parser returns an error. Once the processing is complete, free the parser by calling xml_parser_free( ).

The xml_parser_create( ) function returns an XML parser:

    $parser = xml_parser_create([encoding]);

The optional encoding parameter specifies the text encoding ("ISO-8859-1", "US-ASCII", or "UTF-8") of the file being parsed.

The xml_parse( ) function returns TRUE if the parse was successful or FALSE if it was not:

    $success = xml_parse(parser, data [, final ]);

The data argument is a string of XML to process. The optional final parameter should be true for the last piece of data to be parsed.

To easily deal with nested documents, write functions that create the parser and set its options and handlers for you. This puts the options and handler settings in one place, rather than duplicating them in the external entity reference handler. Example 11-8 has such a function.

Example 11-8. Creating a parser

    function create_parser ($filename) {
        $fp = fopen($filename, 'r');
        if (!$fp) {
            return false;   // lets callers detect a file that cannot be opened
        }
        $parser = xml_parser_create( );

        xml_set_element_handler($parser, 'start_element', 'end_element');
        xml_set_character_data_handler($parser, 'character_data');
        xml_set_processing_instruction_handler($parser, 'processing_instruction');
        xml_set_default_handler($parser, 'default_handler');

        return array($parser, $fp);
    }

    function parse ($parser, $fp) {
        $blockSize = 4 * 1024;   // read in 4 KB chunks

        while($data = fread($fp, $blockSize)) {
            // feof( ) tells the parser whether this is the final chunk
            if(!xml_parse($parser, $data, feof($fp))) {
                // an error occurred; tell the user where
                echo 'Parse error: ' . xml_error_string(xml_get_error_code($parser)) .
                     ' at line ' . xml_get_current_line_number($parser);
                return FALSE;
            }
        }

        return TRUE;
    }

    if (list($parser, $fp) = create_parser('test.xml')) {
        parse($parser, $fp);
        fclose($fp);
        xml_parser_free($parser);
    }

11.3.8. Errors

The xml_parse( ) function will return true if the parse completed successfully or false if there was an error. If something did go wrong, use xml_get_error_code( ) to fetch a code identifying the error:

    $err = xml_get_error_code(parser);

The error code will correspond to one of these error constants:

    XML_ERROR_NONE
    XML_ERROR_NO_MEMORY
    XML_ERROR_SYNTAX
    XML_ERROR_NO_ELEMENTS
    XML_ERROR_INVALID_TOKEN
    XML_ERROR_UNCLOSED_TOKEN
    XML_ERROR_PARTIAL_CHAR
    XML_ERROR_TAG_MISMATCH
    XML_ERROR_DUPLICATE_ATTRIBUTE
    XML_ERROR_JUNK_AFTER_DOC_ELEMENT
    XML_ERROR_PARAM_ENTITY_REF
    XML_ERROR_UNDEFINED_ENTITY
    XML_ERROR_RECURSIVE_ENTITY_REF
    XML_ERROR_ASYNC_ENTITY
    XML_ERROR_BAD_CHAR_REF
    XML_ERROR_BINARY_ENTITY_REF
    XML_ERROR_ATTRIBUTE_EXTERNAL_ENTITY_REF
    XML_ERROR_MISPLACED_XML_PI
    XML_ERROR_UNKNOWN_ENCODING
    XML_ERROR_INCORRECT_ENCODING
    XML_ERROR_UNCLOSED_CDATA_SECTION
    XML_ERROR_EXTERNAL_ENTITY_HANDLING

The constants generally aren't much use. Use xml_error_string( ) to turn an error code into a string that you can use when you report the error:

    $message = xml_error_string(code);

For example:

    $err = xml_get_error_code($parser);
    if ($err != XML_ERROR_NONE) die(xml_error_string($err));

11.3.9. Methods as Handlers

Because functions and variables are global in PHP, any component of an application that requires several functions and variables is a candidate for object orientation. XML parsing typically requires you to keep track of where you are in the parsing (e.g., "just saw an opening title element, so keep track of character data until you see a closing title element") with variables, and of course you must write several handler functions to manipulate the state and actually do something.
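
The kind of state tracking described here can be sketched without a class, which also shows how many separate variables the handlers end up sharing. The collect_titles( ) helper and its variable names are hypothetical, chosen for this illustration, and the error check follows the xml_get_error_code( ) and xml_error_string( ) pattern from Section 11.3.8.

```php
<?php
// Hypothetical helper: returns the text of every <title> element in a document.
function collect_titles(string $xml): array
{
    $titles = [];
    $inTitle = false;   // state: are we currently inside a title element?
    $buffer = '';       // character data gathered for the current title

    $parser = xml_parser_create();
    // Turn off case folding so element names keep their original lowercase form.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attributes) use (&$inTitle, &$buffer) {
            if ($name === 'title') {
                $inTitle = true;
                $buffer = '';
            }
        },
        function ($parser, $name) use (&$inTitle, &$buffer, &$titles) {
            if ($name === 'title') {
                $titles[] = $buffer;
                $inTitle = false;
            }
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($parser, $data) use (&$inTitle, &$buffer) {
            if ($inTitle) {
                $buffer .= $data;
            }
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        // Report the failure using the error functions described above.
        throw new RuntimeException((string) xml_error_string(xml_get_error_code($parser)));
    }

    return $titles;
}

print_r(collect_titles('<library><book><title>Programming PHP</title></book></library>'));
```

Three handlers share three pieces of state here, which is manageable for one field but grows quickly as more elements are tracked.
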
Wrapping these functions and variables into a class provides a way to keep them separate from the rest of your program and easily reuse the functionality later. Use the xml_set_object( ) function to register an object with a parser. After you do so, the XML parser looks for the handlers as methods on that object, rather than as global functions:

    xml_set_object(parser, object);

11.3.10. Sample Parsing Application

Let's develop a program to parse an XML file and display different types of information from it. The XML file, given in Example 11-9, contains information on a set of books.

Example 11-9. books.xml file

    <?xml version="1.0" ?>
    <library>
      <book>
        <title>Programming PHP</title>
        <authors>
          <author>Rasmus Lerdorf</author>
          <author>Kevin Tatroe</author>
        </authors>
        <isbn>1-56592-610-2</isbn>
        <comment>A great book!</comment>
      </book>
      <book>
        <title>PHP Pocket Reference</title>
        <authors>
          <author>Rasmus Lerdorf</author>
        </authors>
        <isbn>1-56592-769-9</isbn>
        <comment>It really does fit in your pocket</comment>
      </book>
      <book>
        <title>Perl Cookbook</title>
        <authors>
          <author>Tom Christiansen</author>
          <author>Nathan Torkington</author>
        </authors>
        <isbn>1-56592-243-3</isbn>
        <comment>Hundreds of useful techniques, most just as applicable to PHP as to Perl
        </comment>
      </book>
    </library>

The PHP application parses the file and presents the user with a list of books, showing just the titles and authors. This menu is shown in Figure 11-1. The titles are links to a page showing the complete information for a book. A page of detailed information for Programming PHP is shown in Figure 11-2.

Figure 11-1. Book menu

Figure 11-2. Book details

We define a class, BookList, whose constructor parses the XML file and builds a list of records. There are two methods on a BookList that generate output from that list of records. The show_menu( ) method generates the book menu, and the show_book( ) method displays detailed information on a particular book.
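
Before the details of BookList, a much smaller class may help show the general pattern of methods as handlers. This sketch is not part of the sample application: the OutlineBuilder class and its method names are hypothetical, and it binds the handlers by passing array callables to xml_set_element_handler( ) instead of calling xml_set_object( ), which is an alternative way to attach methods to a parser.

```php
<?php
// Hypothetical class: builds an indented outline of the element names in a document.
class OutlineBuilder
{
    private $depth = 0;     // current nesting level, kept as object state
    private $lines = [];    // outline collected so far

    public function build(string $xml): array
    {
        $this->depth = 0;
        $this->lines = [];

        $parser = xml_parser_create();
        xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
        // Array callables bind each handler to a method of this object.
        xml_set_element_handler($parser, [$this, 'startElement'], [$this, 'endElement']);
        xml_parse($parser, $xml, true);

        return $this->lines;
    }

    public function startElement($parser, $name, $attributes)
    {
        // Indent by two spaces per nesting level, then go one level deeper.
        $this->lines[] = str_repeat('  ', $this->depth) . $name;
        $this->depth++;
    }

    public function endElement($parser, $name)
    {
        $this->depth--;
    }
}

$builder = new OutlineBuilder();
echo implode("\n", $builder->build('<library><book><title>PHP</title></book></library>')), "\n";
```

The parsing state lives in properties of the object rather than in global variables, so several builders can exist side by side without interfering with one another.
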
Parsing the file involves keeping track of the record, which element we're in, and which elements correspond to records (book) and fields (title, author, isbn, and comment). The $record property holds the current record as it's being built, and $current_field holds the name of the field we're currently processing (e.g., 'title'). The $records property is an array of all the records we've read so far.

Two associative arrays, $field_type and $ends_record, tell us which elements correspond to fields in a record and which closing element signals the end of a record. Values in $field_type are either 1 or 2, corresponding to a simple scalar field (e.g., title) or an array of values (e.g., author) respectively. We initialize those arrays in the constructor.

The handlers themselves are fairly straightforward. When we see the start of an element, we work out whether it corresponds to a field we're interested in. If it does, we set the $current_field property to be that field name so when we see the character data (e.g., the title of the book) we know which field it's the value for. When we get character data, we add it to the appropriate field of the current record if $current_field says we're in a field. When we see the end of an element, we check to see if it's the end of a record; if so, we add the current record to the array of completed records.

One PHP script, given in Example 11-10, handles both the book menu and book details pages. The entries in the book menu link back to the URL for the menu, with a GET parameter identifying the ISBN of the book whose details are to be displayed.

Example 11-10. bookparse.php

    <html>
    <head><title>My Library</title></head>
    <body>
    <?php
    class BookList {
        var $parser;
        var $record = array( );
        var $current_field = '';
        var $field_type;
        var $ends_record;
        var $records = array( );

        // constructor: parses the file and fills $records
        function __construct ($filename) {
            $this->parser = xml_parser_create( );
            xml_set_object($this->parser, $this);
            xml_set_element_handler($this->parser, 'start_element', 'end_element');
            xml_set_character_data_handler($this->parser, 'cdata');

            // 1 = single field, 2 = array field
            $this->field_type = array('title' => 1,
                                      'author' => 2,
                                      'isbn' => 1,
                                      'comment' => 1);
            $this->ends_record = array('book' => true);

            $x = join("", file($filename));
            xml_parse($this->parser, $x, true);   // whole file at once, so final is true
            xml_parser_free($this->parser);
        }

        function start_element ($p, $element, $attributes) {
            $element = strtolower($element);
            // remember the field only if it is one we collect
            if (isset($this->field_type[$element])) {
                $this->current_field = $element;
            } else {
                $this->current_field = '';
            }
        }

        function end_element ($p, $element) {
            $element = strtolower($element);
            if (isset($this->ends_record[$element])) {
                $this->records[] = $this->record;
                $this->record = array( );
            }
            $this->current_field = '';
        }

        function cdata ($p, $text) {
            $field = $this->current_field;
            if (!isset($this->field_type[$field])) {
                return;   // not inside a field we collect
            }
            if ($this->field_type[$field] === 2) {
                $this->record[$field][] = $text;
            } else {
                if (!isset($this->record[$field])) {
                    $this->record[$field] = '';
                }
                $this->record[$field] .= $text;
            }
        }

        function show_menu( ) {
            echo "<table border=1>\n";
            foreach ($this->records as $book) {
                $authors = join(', ', $book['author']);
                printf("<tr><th><a href='%s'>%s</a></th><td>%s</td></tr>\n",
                       $_SERVER['PHP_SELF'] . '?isbn=' . $book['isbn'],
                       $book['title'],
                       $authors);
            }
            echo "</table>\n";
        }

        function show_book ($isbn) {
            foreach ($this->records as $book) {
                if ($book['isbn'] !== $isbn) {
                    continue;
                }
                $authors = join(', ', $book['author']);
                printf("<b>%s</b> by %s.<br>", $book['title'], $authors);
                printf("ISBN: %s<br>", $book['isbn']);
                printf("Comment: %s<p>\n", $book['comment']);
            }
    ?>
    Back to the <a href="<?= $_SERVER['PHP_SELF'] ?>">list of books</a>.<p>
    <?php
        }
    }

    // main program code
    $my_library = new BookList ("books.xml");
    if (!empty($_GET['isbn'])) {
        // return info on one book
        $my_library->show_book($_GET['isbn']);
    } else {
        // show menu of books
        $my_library->show_menu( );
    }
    ?>
    </body></html>

Copyright © 2003 O'Reilly & Associates. All rights reserved.