14.4. Writing an XML Parser
The XML parser example builds on the XML::Parser library available
on CPAN. XML::Parser is an interface to expat, a library written in
C by James Clark. Larry Wall wrote the first XML::Parser prototype
for Perl; since then, Clark Cooper has continued to develop and
maintain it. In this section, we will write a simple middleware
application using XML.
The latest versions of Netscape have a feature called
"What's Related".
When the user clicks on the What's Related button, the Netscape
browser takes the URL that the user is currently viewing and looks up
related URLs in a search engine. Most users don't know that the
Netscape browser is
actually doing this through an XML-based search engine. Dave Winer
originally wrote an article with accompanying Frontier code to access
the What's Related search engine at http://nirvana.userland.com/whatsRelated/.
Netscape maintains a server that takes URLs and returns the related
URL information in an XML format. Netscape wisely chose XML because
they did not intend for users to interact directly with this server
using HTML forms. Instead, they expected users to choose
"What's Related" as a menu item and then have the
Netscape browser do the XML parsing.
In other words, the Netscape "What's Related" web
server is actually serving as a middleware layer between the search
engine database and the Netscape browser itself. To demonstrate the
XML parser, we will write a CGI frontend to the Netscape application
that serves up this XML. We will also go one step further and
automatically reissue the "What's Related" query for each URL
returned.
Before we jump into the Perl code, we need to take a look at the
XML that is typically returned from the
Netscape server. In this example, we did a search on What's
Related to http://www.eff.org/,
the web site that houses the Electronic Frontier
Foundation. Here is the returned XML:
<RDF:RDF>
<RelatedLinks>
<aboutPage href="http://www.eff.org:80/"/>
<child href="http://www.privacy.org/ipc" name="Internet Privacy Coalition"/>
<child href="http://epic.org/" name="Electronic Privacy Information Center"/>
<child href="http://www.ciec.org/" name="Citizens Internet Empowerment Coalition"/>
<child href="http://www.cdt.org/" name="The Center for Democracy and Technology"/>
<child href="http://www.freedomforum.org/" name="FREE! The Freedom Forum Online. News about free press"/>
<child href="http://www.vtw.org/speech" name="VTW Focus on Internet Censorship legislation"/>
<child href="http://www.privacyrights.org/" name="Privacy Rights Clearinghouse"/>
<child href="http://www.privacy.org/pi" name="Privacy International Home Page"/>
<child href="http://www.epic.org/" name="Electronic Privacy Information Center"/>
<child href="http://www.anonymizer.com/" name="Anonymizer, Inc."/>
</RelatedLinks>
</RDF:RDF>
This example is a little different from our plain XML example
earlier. First, there is no DTD. Also, notice that the document is
surrounded with an unusual tag, RDF:RDF.
This document is actually in an XML-based format called Resource
Description Framework, or RDF. RDF describes resource data, such as
the data from search engines, in a way that is
standard across data domains.
This XML is relatively straightforward. The <aboutPage> tag
contains a reference to the original URL we were searching. Each
<child> tag contains a reference to one related URL and its
title. The <RelatedLinks> tag encloses all of these elements as the
root data structure inside the RDF:RDF wrapper.
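As a minimal sketch of how this structure maps onto XML::Parser, the following script registers a Start handler and collects the href and name attributes described above. The handler name handle_start and the variables $about and @related are illustrative assumptions, not names from the original, and the XML is an abbreviated copy of the listing above rather than a live response from the Netscape server.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Abbreviated copy of the sample response shown above (not a live query).
my $xml = <<'END_XML';
<RDF:RDF>
<RelatedLinks>
<aboutPage href="http://www.eff.org:80/"/>
<child href="http://www.privacy.org/ipc" name="Internet Privacy Coalition"/>
<child href="http://epic.org/" name="Electronic Privacy Information Center"/>
</RelatedLinks>
</RDF:RDF>
END_XML

my $about;      # href of the page the query was about (illustrative name)
my @related;    # one { href, name } hash per <child> (illustrative name)

# XML::Parser calls the Start handler once for each opening tag,
# passing the element name followed by its attribute name/value pairs.
sub handle_start {
    my ( $expat, $element, %attr ) = @_;

    if ( $element eq "aboutPage" ) {
        $about = $attr{href};
    }
    elsif ( $element eq "child" ) {
        push @related, { href => $attr{href}, name => $attr{name} };
    }
}

my $parser = XML::Parser->new( Handlers => { Start => \&handle_start } );
$parser->parse( $xml );

print "Related to $about\n";
print "  $_->{name} ($_->{href})\n" foreach @related;
```

Because <aboutPage> and <child> are empty elements that carry all of their data in attributes, a Start handler alone is enough for this format; no Char or End handlers are needed.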
Copyright © 2001 O'Reilly & Associates. All rights reserved.