7.4. Prospects for Improved Web-Search Methods
Part of the hype surrounding XML has been that web search engines
will finally understand what a document means by looking at its
markup. For instance, you could search for the movie Sneakers and get
back only hits about the movie, without having to sort through
"Internet Wide Area `Tiger Teamers' mailing list," "Children's Side
Zip Sneakers Recalled by Reebok," "Infant's `Little Air Jordan'
Sneakers Recalled by NIKE," "Sneakers.com - Athletic shoes from Nike,
Reebok, Adidas, Fila, New," and the 32,395 other results that Google
pulled up on this search that had nothing to do with the movie.[6]
In practice, this is still vapor, mostly because few web pages are
available on the frontend in XML, even though more and more backends
are XML. The search-engine robots only see the frontend HTML. As this
slowly changes, and as the search engines get smarter, we should see
more and more useful results. Meanwhile, it's possible to add some
XML hints to your HTML pages that knowledgeable search engines can
take advantage of, using the Resource Description Framework (RDF),
the Dublin Core, and the robots processing instruction.
7.4.2. Dublin Core
The Dublin Core, http://purl.org/dc/, is a standard set of fifteen
information items with specified semantics that reflect the sort of
data you'd be likely to find in a card catalog or annotated
bibliography. These are:
- Title
-
Fairly self-explanatory; this is the name by which the resource is
known. For instance, the title of this book is "XML
in a Nutshell."
- Creator
-
The person or organization who created the resource, e.g., a painter,
author, illustrator, composer, and so on. For instance, the creators
of this book are W. Scott Means and Elliotte Rusty Harold.
- Subject
-
A list of keywords, very likely from some other vocabulary such as
the Dewey Decimal System or Yahoo categories, identifying the topics
of the resource. For instance, using the Library of Congress Subject
Headings vocabulary, the subject of this book is
"XML (Document markup language)."
- Description
-
Typically, a brief amount of text describing the content of the
resource in prose, but it may also include a picture, a table of
contents, or any other description of the resource. For instance, a
description of this book might be "A brief tutorial
on and quick reference to XML and related technologies and
specifications."
- Publisher
-
The name of the person, company, or organization who makes the
resource available. For instance, the publisher of this book is
"O'Reilly &
Associates."
- Contributor
-
A person or organization who made some contribution to the resource
but is not the primary creator of the resource. For example, the
editors of this book, Laurie Petrycki, Simon St.Laurent, and Jeni
Tennison, might be identified as contributors, as would Susan Hart,
the artist who drew the picture on the cover.
- Date
-
The date when the resource was created or published, normally given
in the form YYYY-MM-DD. For instance, this book's date might be
2002-05-23.
- Type
-
The abstract kind of resource such as image, text, sound, or
software. For instance, a description of this book would have the
type text.
- Format
-
For hard objects like books, the physical dimensions of the resource.
For instance, the paper version of XML in a
Nutshell has the dimensions 6" x 9". For
digital objects like web pages, this is possibly the MIME media type.
For instance, an online version of this book would have the Format
text/html.
- Identifier
-
A formal identifier for the resource, such as an ISBN number, a URI,
or a Social Security number. This book's identifier
is "0596002920."
- Source
-
The resource from which the present resource was derived. For
instance, the French translation of this book might reference the
original English edition as its source.
- Language
-
The language in which this resource is written, typically an ISO-639
language code, optionally suffixed with a hyphen and an ISO-3166
country code. For instance, the language for this book is en-US. The
language for the French translation of this book might be fr-FR.
- Relation
-
A reference to a resource that is in some way related to the current
one, generally using a formal identifier, such as a URI or an ISBN
number. For instance, this might refer to the web page for this book.
- Coverage
-
The location, time, or jurisdiction the resource covers. For
instance, the coverage of this book might be the U.S., Canada,
Australia, the U.K., and Ireland. The coverage of the French
translation of this book might be France, Canada, Haiti, Belgium, and
Switzerland. Generally these will be listed in some formal syntax
such as country codes.
- Rights
-
Information about copyright, patent, trademark and other restrictions
on the content of the resource. For instance, a rights statement
about this book may say "Copyright 2002
O'Reilly & Associates."
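The Date and Language items above are the two with a mechanically checkable syntax. The following is a minimal Python sketch of such a check, not part of any Dublin Core tooling. The function names `is_dc_date` and `is_dc_language` are hypothetical, and the language pattern assumes a lowercase two- or three-letter ISO-639 code with an optional uppercase two-letter ISO-3166 country code, which is stricter than the description above requires.

```python
import re
from datetime import date

# Assumed shapes, for illustration only.
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
LANGUAGE_RE = re.compile(r"[a-z]{2,3}(-[A-Z]{2})?")


def is_dc_date(value):
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    match = DATE_RE.fullmatch(value)
    if match is None:
        return False
    try:
        # date() rejects impossible values such as month 13 or day 31 in June.
        date(int(match[1]), int(match[2]), int(match[3]))
    except ValueError:
        return False
    return True


def is_dc_language(value):
    """Return True if value looks like 'en' or 'en-US'."""
    return LANGUAGE_RE.fullmatch(value) is not None


print(is_dc_date("2002-05-23"))
print(is_dc_language("fr-FR"))
```

A check like this only confirms the shape of the value. It does not confirm that the language or country code is actually registered.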
Dublin Core can be encoded in a variety of forms, including HTML META
tags and RDF. Here we concentrate on its encoding in RDF. Typically,
each resource is described with an rdf:Description element. This
element contains child elements for as many of the Dublin Core
information items as are known about the resource. The name of each
of these elements matches the name of one of the 15 Dublin Core
properties. These are placed in the http://purl.org/dc/elements/1.1/
namespace. Example 7-11 shows an RDF-encoded Dublin Core description
of this book.
Example 7-11. An RDF-encoded Dublin Core description for XML in a Nutshell
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description about="urn:isbn:0596002920">
    <dc:Title>XML in a Nutshell</dc:Title>
    <dc:Creator>W. Scott Means</dc:Creator>
    <dc:Creator>Elliotte Rusty Harold</dc:Creator>
    <dc:Subject>XML (Document markup language)</dc:Subject>
    <dc:Description>
      A brief tutorial on and quick reference to XML and
      related technologies and specifications
    </dc:Description>
    <dc:Publisher>O'Reilly &amp; Associates</dc:Publisher>
    <dc:Contributor>Laurie Petrycki</dc:Contributor>
    <dc:Contributor>Simon St.Laurent</dc:Contributor>
    <dc:Contributor>Jeni Tennison</dc:Contributor>
    <dc:Contributor>Susan Hart</dc:Contributor>
    <dc:Date>2002-04-23</dc:Date>
    <dc:Type>text</dc:Type>
    <dc:Format>6" x 9"</dc:Format>
    <dc:Identifier>0596002920</dc:Identifier>
    <dc:Language>en-US</dc:Language>
    <dc:Relation>http://www.oreilly.com/catalog/xmlnut/</dc:Relation>
    <dc:Coverage>US UK ZA CA AU NZ</dc:Coverage>
    <dc:Rights>Copyright 2002 O'Reilly &amp; Associates</dc:Rights>
  </rdf:Description>
</rdf:RDF>
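To suggest how a search engine or other consumer might read such a description, here is a minimal Python sketch that uses the standard library's xml.etree.ElementTree to collect the Dublin Core children of each rdf:Description element. The function name `dublin_core_items` and the shortened `SAMPLE` document are illustrative assumptions. The sketch treats the file as ordinary namespaced XML rather than running a full RDF parser, and it accepts the resource identifier from either an unqualified about attribute or an rdf:about attribute.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
DC_PREFIX = "{" + DC_NS + "}"

# A shortened copy of the description of this book, for illustration.
SAMPLE = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description about="urn:isbn:0596002920">
    <dc:Title>XML in a Nutshell</dc:Title>
    <dc:Creator>W. Scott Means</dc:Creator>
    <dc:Creator>Elliotte Rusty Harold</dc:Creator>
    <dc:Publisher>O'Reilly &amp; Associates</dc:Publisher>
    <dc:Language>en-US</dc:Language>
  </rdf:Description>
</rdf:RDF>
"""


def dublin_core_items(rdf_xml):
    """Map each described resource to a dict of item name -> list of values."""
    descriptions = {}
    root = ET.fromstring(rdf_xml)
    for desc in root.iter("{" + RDF_NS + "}Description"):
        # Accept both the qualified and the unqualified attribute form.
        about = desc.get("{" + RDF_NS + "}about") or desc.get("about")
        items = {}
        for child in desc:
            if child.tag.startswith(DC_PREFIX):
                name = child.tag[len(DC_PREFIX):]
                # Collapse line breaks and indentation inside the value.
                text = " ".join((child.text or "").split())
                # Items such as Creator may repeat, so keep a list.
                items.setdefault(name, []).append(text)
        descriptions[about] = items
    return descriptions


book = dublin_core_items(SAMPLE)["urn:isbn:0596002920"]
print(book["Title"])
print(book["Creator"])
```

Values are kept in lists because an item such as Creator or Contributor can appear more than once in a single description.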
There is as yet no standard for how an RDF document should be
associated with the XML document it describes. One possibility is for
the rdf:RDF element to be embedded in the document
it describes, for instance, as a child of the
BookInfo element of the DocBook source for this
book. Another possibility is that servers provide this metadata
through a channel outside the document itself. For instance, a
standard protocol could be defined that would allow search engines to
request this information for any page on the site. Alternatively, a
convention could be adopted so that for any URL xyz on a given web
site, the URL xyz/meta.rdf would contain the RDF-encoded
Dublin Core metadata for that URL.
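The xyz/meta.rdf convention is only a suggestion, not an adopted standard, but it is simple enough to sketch. The Python function below, whose name `metadata_url` is hypothetical, derives the metadata URL from a page URL. It assumes that a trailing slash is collapsed and that any query string or fragment is dropped, neither of which the convention itself specifies.

```python
from urllib.parse import urlsplit, urlunsplit


def metadata_url(page_url):
    """Return the URL at which the hypothetical convention would place
    the RDF-encoded Dublin Core metadata for page_url."""
    parts = urlsplit(page_url)
    # Avoid a doubled slash when the page URL already ends in "/".
    path = parts.path.rstrip("/") + "/meta.rdf"
    # Assumption: the query string and fragment are not part of the key.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(metadata_url("http://www.oreilly.com/catalog/xmlnut/"))
```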
Copyright © 2002 O'Reilly & Associates. All rights reserved.