Writing an XML Parser (CGI Programming with Perl)

14.4. Writing an XML Parser

The XML parser example builds on the XML::Parser library, which is available on CPAN. XML::Parser is a Perl interface to expat, a C library written by James Clark. Larry Wall wrote the first XML::Parser prototype for Perl; since then, Clark Cooper has continued to develop and maintain it. In this section, we will write a simple middleware application using XML.

The latest versions of Netscape have a feature called "What's Related". When the user clicks on the What's Related button, the Netscape browser takes the URL that the user is currently viewing and looks up related URLs in a search engine. Most users don't know that the Netscape browser is actually doing this through an XML-based search engine. Dave Winer originally wrote an article with accompanying Frontier code to access the What's Related search engine at http://nirvana.userland.com/whatsRelated/.

Netscape maintains a server that takes URLs and returns the related URL information in an XML format. Netscape wisely chose XML because they did not intend for users to interact directly with this server using HTML forms. Instead, they expected users to choose "What's Related" as a menu item and then have the Netscape browser do the XML parsing.

In other words, the Netscape "What's Related" web server is actually serving as a middleware layer between the search engine database and the Netscape browser itself. To demonstrate the XML parser, we will write a CGI frontend to the Netscape application that serves up this XML. We will also go one step further and automatically reissue the "What's Related" query for each URL returned.

Before we jump into the Perl code, we need to take a look at the XML that is typically returned from the Netscape server. In this example, we did a search on What's Related to http://www.eff.org/, the web site that houses the Electronic Frontier Foundation. Here is the returned XML:

<RDF:RDF>
<RelatedLinks>
<aboutPage href="http://www.eff.org:80/"/>
<child href="http://www.privacy.org/ipc" name="Internet Privacy Coalition"/>
<child href="http://epic.org/" name="Electronic Privacy Information Center"/>
<child href="http://www.ciec.org/" name="Citizens Internet Empowerment Coalition"/>
<child href="http://www.cdt.org/" name="The Center for Democracy and Technology"/>
<child href="http://www.freedomforum.org/" name="FREE! The Freedom Forum Online. News about free press"/>
<child href="http://www.vtw.org/speech" name="VTW Focus on Internet Censorship legislation"/>
<child href="http://www.privacyrights.org/" name="Privacy Rights Clearinghouse"/>
<child href="http://www.privacy.org/pi" name="Privacy International Home Page"/>
<child href="http://www.epic.org/" name="Electronic Privacy Information Center"/>
<child href="http://www.anonymizer.com/" name="Anonymizer, Inc."/>
</RelatedLinks>
</RDF:RDF>

This example is a little different from our plain XML example earlier. First, there is no DTD. Also, notice that the document is surrounded with an unusual tag, <RDF:RDF>. This document is actually in an XML-based format called Resource Description Framework, or RDF. RDF describes resource data, such as the data from search engines, in a way that is standard across data domains.

This XML is relatively straightforward. The <aboutPage> tag contains a reference to the original URL we were searching. Each <child> tag contains a reference to one related URL, in its href attribute, and that page's title, in its name attribute. The <RelatedLinks> tag sandwiches all of this data as the root data structure, inside the outer <RDF:RDF> wrapper.
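
To make this structure concrete, here is a minimal sketch of how XML::Parser's Start handler could pull the href and name attributes out of such a response. It is an illustration only, not the application developed in this chapter: the shortened sample document, the handle_start subroutine, and the $about_page and @related variables are names assumed for this example.

```perl
use strict;
use warnings;
use XML::Parser;

# Abbreviated copy of the response shown above (two <child> entries only).
my $xml = <<'END_XML';
<RDF:RDF>
<RelatedLinks>
<aboutPage href="http://www.eff.org:80/"/>
<child href="http://epic.org/" name="Electronic Privacy Information Center"/>
<child href="http://www.cdt.org/" name="The Center for Democracy and Technology"/>
</RelatedLinks>
</RDF:RDF>
END_XML

my $about_page;    # href of the page the query was about
my @related;       # one { href, name } hash per <child> element

# XML::Parser calls the Start handler for every opening tag. The arguments
# are the expat object, the element name, and then the attributes as a flat
# list of name/value pairs, which we collect into a hash.
sub handle_start {
    my ( $expat, $element, %attr ) = @_;

    if ( $element eq 'aboutPage' ) {
        $about_page = $attr{href};
    }
    elsif ( $element eq 'child' ) {
        push @related, { href => $attr{href}, name => $attr{name} };
    }
}

my $parser = XML::Parser->new( Handlers => { Start => \&handle_start } );
$parser->parse($xml);

print "Related to $about_page:\n";
print "  $_->{name} <$_->{href}>\n" for @related;
```

Because all of the useful data lives in attributes, a Start handler alone is enough here; no End or Char handlers are needed. The <RDF:RDF> and <RelatedLinks> wrappers simply fall through the if/elsif chain and are ignored.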