XML on the Web (XML in a Nutshell, 2nd Edition)

XML began as an effort to bring the full power and structure of SGML to the Web in a form that was simple enough for nonexperts to use. Like most great inventions, XML turned out to have uses far beyond what its creators originally envisioned. Indeed, there's a lot more XML off the Web than on it. Nonetheless, XML is still a very attractive language in which to write and serve web pages. Since XML documents must be well-formed and parsers must reject malformed documents, XML pages are less likely to have annoying cross-browser incompatibilities. Since XML documents are highly structured, they're much easier for robots to parse. Since XML tag and attribute names reflect the nature of the content they hold, search-engine spiders can more easily determine the true meaning of a page.

7.1. XHTML

XHTML is an official W3C recommendation. It defines an XML-compatible version of HTML, or rather it redefines HTML as an XML application instead of as an SGML application. Just looking at an XHTML document, you might not even realize that there's anything different about it. It still uses the same <p>, <li>, <table>, <h1>, and other tags with which you're familiar. Elements and attributes have the same, familiar names they have in HTML. The syntax is still basically the same.

The difference is not so much what's allowed but what's not allowed. <p> is a legal XHTML tag, but <P> is not. <table border="0" width="515"> is legal XHTML; <table border=0 width=515> is not. A paragraph prefixed with a <p> and suffixed with a </p> is legal XHTML, but a paragraph that omits the closing </p> tag is not. Most existing HTML documents require substantial editing before they become well-formed and valid XHTML documents. However, once they are valid XHTML documents, they are automatically valid XML documents that can be manipulated with the same editors, parsers, and other tools you use to work with any XML document.

7.1.1. Moving from HTML to XHTML

Most of the changes required to turn an existing HTML document into an XHTML document involve making the document well-formed. For instance, given a legacy HTML document, you'll probably have to make at least some of these changes to turn it into XHTML:

Add missing end-tags like </p> and </li>.
Rewrite elements so that they nest rather than overlap. For example, change <p><em>an emphasized paragraph</p></em> to <p><em>an emphasized paragraph</em></p>.
Put double or single quotes around your attribute values. For example, change <p align=center> to <p align="center">.
Add values (which are the same as the name) to all minimized Boolean attributes. For example, change <input type="checkbox" checked> to <input type="checkbox" checked="checked">.
Replace any occurrences of & or < in character data or attribute values with &amp; and &lt;. For instance, change A&P to A&amp;P and <a href="http://www.google.com/search?client=googlet&q=Java%20XML"> to <a href="http://www.google.com/search?client=googlet&amp;q=Java%20XML">.
Make sure the document has a single root html element.
Change empty elements like <hr> to <hr/> or <hr></hr>.
Add hyphens to comments so that <! this is a comment> becomes <!-- this is a comment -->.
Encode the document in UTF-8 or UTF-16, or add an XML declaration that specifies in which character set it is encoded.
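
One quick way to see which of these changes a page still needs is to hand it to an XML parser, which must reject anything that is not well-formed. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name is_well_formed and the sample snippets are our own illustrations, and the check covers well-formedness only, not validity against an XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Legacy HTML habits from the list above, each paired with its repair.
samples = [
    ("<p>missing end-tag", "<p>missing end-tag</p>"),
    ("<p><em>overlap</p></em>", "<p><em>overlap</em></p>"),
    ("<p align=center>text</p>", '<p align="center">text</p>'),
    ('<input type="checkbox" checked>',
     '<input type="checkbox" checked="checked"/>'),
    ("<p>A&P</p>", "<p>A&amp;P</p>"),
    ("<div><hr></div>", "<div><hr/></div>"),
]

for before, after in samples:
    # The first flag should be False and the second True for every pair.
    print(is_well_formed(before), is_well_formed(after), before)
```

A parser used this way reports only the first error it meets, so a real conversion usually means fixing one problem and parsing again.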

However, XHTML doesn't merely require well-formedness; it requires validity. In order to create a valid XHTML document, you'll need to make these changes as well:

Add a DOCTYPE declaration to the document pointing to one of the three XHTML DTDs.
Make all element and attribute names lowercase.
Make any other changes you have to make to your markup so that the document validates against the DTD: for example, eliminating nonstandard elements like marquee, adding required attributes like the alt attribute of img, or moving child elements out from inside elements where they're not allowed such as a blockquote inside a p.

In addition, the XHTML specification imposes several requirements that, strictly speaking, are necessary for neither well-formedness nor validity. However, they do make parsing XHTML documents a little easier. These are:

The root element of the document must be html.
There must be a DOCTYPE declaration that uses a PUBLIC ID to identify one of the three XHTML DTDs.
The root element of the document must have an xmlns attribute identifying the default namespace as http://www.w3.org/1999/xhtml.
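
These three requirements are easy to test mechanically. The sketch below uses Python's standard xml.dom.minidom module, which exposes the DOCTYPE's public identifier and the root element's namespace. The function name check_xhtml_prolog, the wording of its messages, and the two sample documents are assumptions made for illustration. The function does not validate the document against the DTD.

```python
import xml.dom.minidom

XHTML_NS = "http://www.w3.org/1999/xhtml"


def check_xhtml_prolog(markup):
    """Return a list of problems with the three requirements listed above."""
    doc = xml.dom.minidom.parseString(markup)
    root = doc.documentElement
    problems = []
    if root.tagName != "html":
        problems.append("root element is not html")
    doctype = doc.doctype
    public_id = doctype.publicId if doctype is not None else None
    if not (public_id or "").startswith("-//W3C//DTD XHTML"):
        problems.append("no DOCTYPE with an XHTML PUBLIC ID")
    if root.namespaceURI != XHTML_NS:
        problems.append("root element is not in the XHTML namespace")
    return problems


good = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    '    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Shipping</title></head><body><p>Hi</p></body></html>"
)
bad = "<html><head><title>Shipping</title></head><body><p>Hi</p></body></html>"

print(check_xhtml_prolog(good))  # no problems expected
print(check_xhtml_prolog(bad))   # missing DOCTYPE and namespace
```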

Finally, if you wish, you may--but do not have to--add an XML declaration or an xml-stylesheet processing instruction to the prolog of your document.

Example 7-1 shows an HTML document from the O'Reilly web site that exhibits many of the validity problems you'll find on the Web today. In fact, this is a much neater page than most. Nonetheless, not all attribute values are quoted. The noshade attribute of the HR element doesn't even have a value. There's no document type declaration. Tags are a mix of upper- and lowercase, mostly uppercase. The DD elements are missing end-tags, and there's some character data inside the second definition that's not part of a DT or a DD.

Example 7-1. A typical HTML document

<HTML><HEAD>
  <TITLE>O'Reilly Shipping Information</TITLE>
</HEAD>
<BODY BGCOLOR="#ffffff" VLINK="#0000CC" LINK="#990000" TEXT="#000000">
<table border=0 width=515>
<tr>
<td>
<IMG SRC="/www/graphics_new/generic_ora_header_wide.gif" BORDER=0>
<H2>U.S. Shipping Information </H2>
<HR size="1" align=left noshade>
<DL>
<DT> <B>UPS Ground Service (Continental US only -- 5-7 business
days):</B></DT>
<DD>
<PRE>
$  5.95 - $ 49.99 ......................... $ 4.50
$ 50.00 - $ 99.99 ......................... $ 6.50
$100.00 - $149.99 ......................... $ 8.50
$150.00 - $199.99 ......................... $10.50
$200.00 - $249.99 ......................... $12.50
$250.00 - $299.99 ......................... $14.50

</PRE>
<DT> <B>Federal Express:</B></DT>
(Shipping within 24 hours of receipt of order by O'Reilly)
<DD>
<PRE>
<EM>1 or 2 books</EM>:
Economy 2-day ............................. $ 8.75
Overnight Standard (Afternoon Delivery) ... $12.75
Overnight Priority (Morning Delivery) ..... $16.50
</PRE>
</DL>
<b>Alaska and Hawaii:</b> add $10 to Federal Express rates.
<P>
<A HREF="int-ship.html"><b>International Shipping Information</b></A>
<P>
<CENTER>
<HR SIZE="1" NOSHADE>
<FONT SIZE="1" FACE="Verdana, Arial, Helvetica">
<A outsideurl=/">
<B>O'Reilly Home</B></A> <B> | </B>
<A outsideurl=/sales/bookstores">
<B>O'Reilly Bookstores</B></A> <B> | </B>
<A outsideurl=/order_new/">
<B>How to Order</B></A> <B> | </B>
<A outsideurl=/oreilly/contact.html">
<B>O'Reilly Contacts<BR></B></A>
<A outsideurl=/international/">
<B>International</B></A> <B> | </B>
<A outsideurl=/oreilly/about.html">
<B>About O'Reilly</B></A> <B> | </B>
<A outsideurl=/affiliates.html">
<B>Affiliated Companies</B></A><p>
<EM>&copy; 2000, O'Reilly &amp; Associates, Inc.</EM>
</FONT>
</CENTER>
</td>
</tr>
</table>

</BODY>
</HTML>

Example 7-2 shows this document after it's been converted to XHTML. All the previously noted problems and a few more besides have been fixed. A number of deprecated presentational attributes, such as the size and noshade attributes of hr, had to be replaced with CSS styles. We've also added the necessary document type and namespace declarations. This document can now be read by both HTML and XML browsers and parsers.

Example 7-2. A valid XHTML document

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<style type="text/css">
  body      {background-color: #FFFFFF; color: #000000}
  a:visited {color: #0000CC}
  a:link    {color: #990000}
</style>
<title>O'Reilly Shipping Information</title>
</head>
<body>
<table border="0" width="515">
<tr>
<td><img src="/www/graphics_new/generic_ora_header_wide.gif"
style="border-width: 0" alt="O'Reilly"/>
<h2>U.S. Shipping Information</h2>

<hr style="height: 1px; text-align: left"/>
<dl>
<dt><b>UPS Ground Service (Continental US only -- 5-7 business
days):</b></dt>

<dd>
<pre>
$  5.95 - $ 49.99 ......................... $ 4.50
$ 50.00 - $ 99.99 ......................... $ 6.50
$100.00 - $149.99 ......................... $ 8.50
$150.00 - $199.99 ......................... $10.50
$200.00 - $249.99 ......................... $12.50
$250.00 - $299.99 ......................... $14.50
</pre>
</dd>

<dt><b>Federal Express:</b></dt>

<dd>(Shipping within 24 hours of receipt of order by O'Reilly)</dd>

<dd>
<pre>
<em>1 or 2 books</em>:
Economy 2-day ............................. $ 8.75
Overnight Standard (Afternoon Delivery) ... $12.75
Overnight Priority (Morning Delivery) ..... $16.50

</pre>
</dd>
</dl>

<b>Alaska and Hawaii:</b> add $10 to Federal Express rates.

<p><a href="int-ship.html"><b>International Shipping
Information</b></a></p>

<div style="font-size: xx-small; font-family: Verdana, Arial, Helvetica;
            text-align: center">
<hr style="height: 1px"/>
<a
outsideurl=/"><b>O'Reilly Home</b></a> <b>|</b> <a
outsideurl=/sales/bookstores"><b>O'Reilly
Bookstores</b></a> <b>|</b> <a
outsideurl=/order_new/"><b>How to Order</b></a>
<b>|</b> <a outsideurl=/oreilly/contact.html"><b>
O'Reilly Contacts<br />
</b></a> <a outsideurl=/international/"><b>
International</b></a> <b>|</b> <a
outsideurl=/oreilly/about.html"><b>About
O'Reilly</b></a> <b>|</b> <a
outsideurl=/affiliates.html"><b>Affiliated
Companies</b></a></div>

<p style="font-size: xx-small;
          font-family: Verdana, Arial, Helvetica"><em>&copy; 2000,
O'Reilly &amp; Associates, Inc.</em></p>
</td>
</tr>
</table>
</body>
</html>

TIP: Making all these changes can be quite tedious for large documents or collections of many documents. Fortunately, there's an open source tool that can do most of the work for you. Dave Raggett's Tidy, http://tidy.sourceforge.net, is a C program that has been ported to most major operating systems and can convert some pretty nasty HTML into valid XHTML. For example, to convert the file bad.html to good.xml, you would type:
% tidy --output-xhtml yes bad.html > good.xml
Tidy fixes as much as it can and warns you about what it can't fix so you can fix it manually--for instance, telling you that a required alt attribute is missing from an img element.

7.1.2. Three DTDs for XHTML

XHTML comes in three flavors, depending on which DTD you choose:

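Strict
This is the W3C's recommended form of XHTML. It excludes deprecated presentational elements and attributes such as font, center, and bgcolor, as well as frames; presentation is left to stylesheets.

Transitional
This is a looser form of XHTML that still allows the deprecated presentational elements and attributes from HTML 4. It is useful when converting legacy pages whose presentational markup can't be replaced right away.

Frameset
This is the same as the transitional DTD, except that it also allows frame-related elements such as frameset and frame, with frameset taking the place of body.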

All three DTDs use the same http://www.w3.org/1999/xhtml namespace. You should choose the strict DTD unless you've got a specific reason to use another one.
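
For reference, the public and system identifiers of the three XHTML 1.0 DTDs can be kept in a small lookup table. The sketch below builds the matching DOCTYPE declaration in Python. The identifiers are the ones published by the W3C for XHTML 1.0. The names XHTML_DTDS and doctype_for are made up for this example.

```python
# Public and system identifiers of the three XHTML 1.0 DTDs.
XHTML_DTDS = {
    "strict": (
        "-//W3C//DTD XHTML 1.0 Strict//EN",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
    ),
    "transitional": (
        "-//W3C//DTD XHTML 1.0 Transitional//EN",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd",
    ),
    "frameset": (
        "-//W3C//DTD XHTML 1.0 Frameset//EN",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd",
    ),
}


def doctype_for(flavor):
    """Build the DOCTYPE declaration for 'strict', 'transitional', or 'frameset'."""
    public_id, system_id = XHTML_DTDS[flavor]
    return f'<!DOCTYPE html PUBLIC "{public_id}"\n    "{system_id}">'


print(doctype_for("strict"))
```

Calling doctype_for("strict") reproduces the declaration at the top of Example 7-2.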

7.1.3. Browser Support for XHTML

Many current web browsers, especially Internet Explorer 5.0 and earlier and Netscape 4.79 and earlier, deal inconsistently with XHTML. Certainly they don't require it, accepting as they do such a wide variety of malformed, invalid, and out-and-out mistaken HTML. However, beyond that they do have some problems when they encounter certain common XHTML constructs.

7.1.3.1. The XML declaration and processing instructions

Some browsers display processing instructions and the XML declaration inline. These should be omitted if possible.

Few, if any, browsers recognize or respect the encoding declaration in the XML declaration. Furthermore, many browsers won't automatically recognize UTF-8 or UCS-2 Unicode text. If you use a non-ASCII character set, you should also include a meta element in the header identifying the character set. For example:

<meta http-equiv="Content-type"
      content='text/html; charset=UTF-8'></meta>

7.1.3.2. Empty elements

Browsers deal inconsistently with both forms of empty element syntax. That is, some browsers understand <hr/> but not <hr></hr> (typically rendering it as two horizontal lines rather than one), while others recognize <hr></hr> but not <hr/> (typically omitting the horizontal line completely). The most consistent rendering seems to be achieved by using an empty-element tag with an optional attribute such as class or id, for example, <hr class="empty" />. There's no real reason for the class attribute here, except that its presence keeps browsers from choking on the />. Any other attribute the DTD allows would serve equally well.

On the other hand, if a particular instance of an element happens to be empty, but not all instances of the element have to be empty--for instance, a p that doesn't contain any text--you should use two tags like <p></p> rather than one empty-element tag <p/>.
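
If you generate XHTML from a program, the serializer usually decides which empty-element syntax is written. As a rough illustration, Python's standard xml.etree.ElementTree module lets you choose per call through the short_empty_elements argument of tostring. The wrapper name serialize and the variable names are our own.

```python
import xml.etree.ElementTree as ET


def serialize(element, always_empty):
    """Serialize one element, choosing <x /> or <x></x> for empty content."""
    return ET.tostring(
        element, encoding="unicode", short_empty_elements=always_empty
    )


# hr is always empty, so use an empty-element tag plus a harmless attribute.
hr = ET.Element("hr", {"class": "empty"})
hr_markup = serialize(hr, always_empty=True)

# p merely happens to be empty here, so write a start-tag and an end-tag.
p = ET.Element("p")
p_markup = serialize(p, always_empty=False)

print(hr_markup)  # <hr class="empty" />
print(p_markup)   # <p></p>
```

ElementTree writes a space before the closing /> on its own, which matches the form recommended above.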

7.1.3.3. Entity references

Embedded scripts often contain reserved characters like & or < so the document that contains them is not well-formed. However, most JavaScript and VBScript interpreters won't recognize &amp; or &lt; in place of the operators they represent. If the script can't be rewritten without these operators (for instance, by changing a less-than comparison to a greater-than-or-equal-to comparison with the arguments flipped), then you should move to external scripts instead of embedded ones.

Furthermore, most non-XML-aware browsers don't recognize the &apos; predefined entity reference. You should avoid this if possible and just use the literal ' character instead. The only place this might be a problem is inside attribute values that are enclosed in single quotes because they contain double quotes. However, most browsers do recognize the &quot; entity reference for the " character so you can enclose the attribute value in double quotes and escape the double quotes that are part of the attribute value as &quot;.
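
When text and attribute values are produced by a program, the escaping can be left to a library. The sketch below uses escape and quoteattr from Python's standard xml.sax.saxutils module. The sample strings are illustrative only. quoteattr never emits &apos;: when a value contains both kinds of quotes, it wraps the value in double quotes and writes the embedded double quotes as &quot;, which is the approach suggested above.

```python
from xml.sax.saxutils import escape, quoteattr

# Character data: & and < become entity references.
text = escape("A&P")
comparison = escape("a < b")

# Attribute values: quoteattr escapes the value and adds the quotes.
href = quoteattr("http://www.google.com/search?client=googlet&q=Java%20XML")
title = quoteattr('O\'Reilly "Nutshell" books')

print(text)        # A&amp;P
print(comparison)  # a &lt; b
print(href)        # "http://www.google.com/search?client=googlet&amp;q=Java%20XML"
print(title)       # "O'Reilly &quot;Nutshell&quot; books"
```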

7.1.3.4. Other unsupported features

There are a few other subtle differences between how browsers handle XHTML and how XHTML expects to be handled. For instance, XHTML allows character references and CDATA sections although almost no current browsers understand these constructs. However, you're unlikely to encounter these when converting from HTML to XHTML, and you can generally do without them if you're writing XHTML from scratch.
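
To see what an XML parser, as opposed to a legacy browser, makes of these constructs, consider the following sketch. It uses Python's standard xml.etree.ElementTree module on two made-up fragments. A CDATA section lets & and < appear unescaped, and a character reference is replaced by the character it names.

```python
import xml.etree.ElementTree as ET

# A CDATA section protects the & and < operators in an embedded script.
script = ET.fromstring(
    "<script><![CDATA[if (a < b && c) { go(); }]]></script>"
)
print(script.text)

# Decimal and hexadecimal character references resolve to single characters.
para = ET.fromstring("<p>caf&#233; &#xA9; 2000</p>")
print(para.text)
```

The parser delivers the script text and the resolved characters directly, with no trace of the CDATA delimiters or the references.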

Mozilla, Opera 5.0 and later, Internet Explorer 5.5 and later, and Netscape 6.0 and later can parse and display valid XHTML without any difficulties and without requiring page authors to jump through these hoops. However, since many users have not upgraded their browsers to the level XHTML requires, user-friendly web designers will be jumping through these hoops for some years to come.