7.4. Prospects for Improved Web-Search Methods
Part of the hype of XML has been that web search engines will finally understand what a document means by looking at its markup. For instance, you can search for the movie Sneakers and just get back hits about the movie without having to sort through "Internet Wide Area `Tiger Teamers' mailing list," "Children's Side Zip Sneakers Recalled by Reebok," "Infant's `Little Air Jordan' Sneakers Recalled by NIKE," "Sneakers.com - Athletic shoes from Nike, Reebok, Adidas, Fila, New," and the 32,395 other results that Google pulled up on this search that had nothing to do with the movie.
In practice, this is still vapor, mostly because few web pages are available on the frontend in XML, even though more and more backends are XML. The search-engine robots only see the frontend HTML. As this slowly changes, and as the search engines get smarter, we should see more and more useful results. Meanwhile, you can add some XML hints to your HTML pages that knowledgeable search engines can take advantage of. Three such mechanisms are the Resource Description Framework (RDF), the Dublin Core, and the robots processing instruction.
7.4.1. RDF
The Resource Description Framework (RDF, http://www.w3.org/RDF/) can be understood as an XML encoding for a particularly simple data model. An RDF document describes resources. Each resource has zero or more properties. Each property has a name and a value. The value may itself be another resource.
The root element of an RDF document is an RDF element. Each resource the RDF element describes is represented as a Description element whose about attribute contains a URI or other identifier pointing to the resource described. Each child element of the Description element represents a property of the resource. The contents of that child element are the value of that property. All RDF elements like RDF and Description are placed in the http://www.w3.org/1999/02/22-rdf-syntax-ns# namespace. Property elements generally come from other namespaces.
For example, suppose we want to say that the book XML in a Nutshell has the authors W. Scott Means and Elliotte Rusty Harold. In other words, we want to say that the resource identified by the URI urn:isbn:0596002920 has one author property with the value "W. Scott Means" and another author property with the value "Elliotte Rusty Harold." Example 7-10 does this.
Example 7-10. A simple RDF document saying that W. Scott Means and Elliotte Rusty Harold are the authors of XML in a Nutshell
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description about="urn:isbn:0596002920">
    <author>Elliotte Rusty Harold</author>
    <author>W. Scott Means</author>
  </rdf:Description>
</rdf:RDF>
In this simple example the values of the author properties are merely text. However, they could be XML as well. Indeed, they could be other RDF elements.
There's more to RDF, including containers, schemas, and nested properties. However, what has been described so far is sufficient for encoding web metadata.
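To make the data model concrete, the following Python sketch reads the RDF document from Example 7-10 and lists each statement as a (resource, property, value) tuple. It assumes the standard library's xml.etree.ElementTree module; the function name statements is our own invention and is not part of RDF or of any RDF toolkit. It handles only the simple case described above, in which each property value is plain text.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The document from Example 7-10
EXAMPLE = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description about="urn:isbn:0596002920">
    <author>Elliotte Rusty Harold</author>
    <author>W. Scott Means</author>
  </rdf:Description>
</rdf:RDF>
"""

def statements(rdf_text):
    """Return a (resource, property, value) tuple for each property element.

    Hypothetical helper: it covers only text-valued properties.
    """
    root = ET.fromstring(rdf_text)
    result = []
    # Each Description element describes one resource
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        resource = description.get("about")
        # Each child element is one property of that resource
        for prop in description:
            result.append((resource, prop.tag, (prop.text or "").strip()))
    return result

for statement in statements(EXAMPLE):
    print(statement)
```

Because a resource may have several properties with the same name, the sketch returns a list of tuples rather than a dictionary keyed by property name.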
7.4.2. Dublin Core
The Dublin Core, http://purl.org/dc/, is a standard set of 15 information items with specified semantics that reflect the sort of data you'd be likely to find in a card catalog or annotated bibliography. These are Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, and Rights.
Dublin Core can be encoded in a variety of forms including HTML META tags and RDF. Here we concentrate on its encoding in RDF. Typically, each resource is described with an rdf:Description element. This element contains child elements for as many of the Dublin Core information items as are known about the resource. The name of each of these elements matches the name of one of the 15 Dublin Core properties. These are placed in the http://purl.org/dc/elements/1.1/ namespace. Example 7-11 shows an RDF-encoded Dublin Core description of this book.
Example 7-11. An RDF-encoded Dublin Core description for XML in a Nutshell
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description about="urn:isbn:0596002920">
    <dc:Title>XML in a Nutshell</dc:Title>
    <dc:Creator>W. Scott Means</dc:Creator>
    <dc:Creator>Elliotte Rusty Harold</dc:Creator>
    <dc:Subject>XML (Document markup language)</dc:Subject>
    <dc:Description>
      A brief tutorial on and quick reference to XML and related
      technologies and specifications
    </dc:Description>
    <dc:Publisher>O'Reilly &amp; Associates</dc:Publisher>
    <dc:Contributor>Laurie Petrycki</dc:Contributor>
    <dc:Contributor>Simon St. Laurent</dc:Contributor>
    <dc:Contributor>Jeni Tennison</dc:Contributor>
    <dc:Contributor>Susan Hart</dc:Contributor>
    <dc:Date>2002-04-23</dc:Date>
    <dc:Type>text</dc:Type>
    <dc:Format>6" x 9"</dc:Format>
    <dc:Identifier>0596002920</dc:Identifier>
    <dc:Language>en-US</dc:Language>
    <dc:Relation>http://www.oreilly.com/catalog/xmlnut/</dc:Relation>
    <dc:Coverage>US UK ZA CA AU NZ</dc:Coverage>
    <dc:Rights>Copyright 2002 O'Reilly &amp; Associates</dc:Rights>
  </rdf:Description>
</rdf:RDF>
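A search engine that understood this format might collect the Dublin Core items for each resource roughly as follows. This is only a sketch under stated assumptions: it uses Python's standard xml.etree.ElementTree module, the function name dublin_core is hypothetical, and the input is an abbreviated version of Example 7-11 rather than the whole document. Note that the ampersand in the publisher's name must be escaped as &amp;amp; for the document to be well-formed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# An abbreviated version of Example 7-11
EXCERPT = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description about="urn:isbn:0596002920">
    <dc:Title>XML in a Nutshell</dc:Title>
    <dc:Creator>W. Scott Means</dc:Creator>
    <dc:Creator>Elliotte Rusty Harold</dc:Creator>
    <dc:Publisher>O'Reilly &amp; Associates</dc:Publisher>
    <dc:Date>2002-04-23</dc:Date>
  </rdf:Description>
</rdf:RDF>
"""

def dublin_core(rdf_text):
    """Map each described resource to a dict of Dublin Core item lists.

    Hypothetical helper: child elements outside the Dublin Core
    namespace are ignored.
    """
    root = ET.fromstring(rdf_text)
    prefix = "{" + DC_NS + "}"
    records = {}
    for description in root.findall("{" + RDF_NS + "}Description"):
        items = defaultdict(list)
        for child in description:
            if child.tag.startswith(prefix):
                name = child.tag[len(prefix):]
                # Collapse runs of whitespace inside the value
                items[name].append(" ".join((child.text or "").split()))
        records[description.get("about")] = dict(items)
    return records

book = dublin_core(EXCERPT)["urn:isbn:0596002920"]
print(book["Title"], book["Creator"])
```

Each item maps to a list because an information item such as Creator or Contributor may appear more than once for the same resource.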
There is as yet no standard for how an RDF document should be associated with the XML document it describes. One possibility is for the rdf:RDF element to be embedded in the document it describes, for instance, as a child of the BookInfo element of the DocBook source for this book. Another possibility is that servers provide this meta information through an extra-document channel. For instance, a standard protocol could be defined that would allow search engines to request this information for any page on the site. A convention could be adopted so that for any URL xyz on a given web site, the URL xyz/meta.rdf would contain the RDF-encoded Dublin Core metadata for that URL.
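If the xyz/meta.rdf convention suggested above were adopted, a robot could compute the metadata URL for any page with a few lines of code. The following Python sketch shows one way to do it. Both the convention and the function name metadata_url are hypothetical, and the decision to drop any query string or fragment identifier is our own assumption, since the convention as stated says nothing about them.

```python
from urllib.parse import urlsplit, urlunsplit

def metadata_url(page_url):
    """Return the URL where a page's RDF metadata would live.

    Assumes the hypothetical convention that the metadata for URL xyz
    is found at xyz/meta.rdf. Query strings and fragment identifiers
    are dropped, which is an assumption of this sketch.
    """
    parts = urlsplit(page_url)
    # Avoid a doubled slash when the page URL already ends with one
    path = parts.path.rstrip("/") + "/meta.rdf"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))

print(metadata_url("http://www.oreilly.com/catalog/xmlnut/"))
```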
7.4.3. Robots
In HTML the robots META tag tells search engines and other robots whether they're allowed to index a page. Walter Underwood has proposed the following processing instruction as an equivalent for XML documents:
<?robots index="yes" follow="no"?>
Robots will look for this in the prolog of any XML document they encounter. The syntax of this particular processing instruction is two pseudoattributes, one named index and one named follow, whose values are either yes or no. If the index pseudoattribute has the value yes, then this page will be indexed by a search-engine robot. If index has the value no, then it won't be. Similarly, if the follow pseudoattribute has the value yes, then links from this document will be followed. If follow has the value no, then they won't be.
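A robot might read this processing instruction along the following lines. The sketch uses Python's standard xml.dom.minidom module, which keeps processing instructions from the prolog as children of the document node. The function name robots_policy and the regular expression for pseudoattributes are our own and are not part of the proposal. Since the description above does not say what a robot should do when the instruction is absent, the function returns None in that case rather than guessing at a default.

```python
import re
from xml.dom import Node, minidom

# Matches name="value" or name='value' inside the instruction's data
PSEUDO_ATTR = re.compile(r"""(\w+)\s*=\s*(["'])(.*?)\2""")

def robots_policy(xml_text):
    """Return the robots pseudoattributes found in the prolog as booleans.

    Hypothetical helper: returns a dict such as
    {"index": True, "follow": False}, or None if the prolog
    contains no robots processing instruction.
    """
    document = minidom.parseString(xml_text)
    for node in document.childNodes:
        if node.nodeType == Node.ELEMENT_NODE:
            break  # the prolog ends where the root element begins
        if (node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
                and node.target == "robots"):
            return {
                name: value == "yes"
                for name, _quote, value in PSEUDO_ATTR.findall(node.data)
                if name in ("index", "follow")
            }
    return None

SAMPLE = """<?xml version="1.0"?>
<?robots index="yes" follow="no"?>
<book><title>XML in a Nutshell</title></book>
"""

print(robots_policy(SAMPLE))
```

The pseudoattributes have to be picked apart by hand because, to an XML parser, everything after the target of a processing instruction is just a string of text.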
Copyright © 2002 O'Reilly & Associates. All rights reserved.