The problem with HTML is that its tags were designed for the interaction between humans and machines. When the Web was invented in the late 1980s, that was just fine. As the Web moved into all aspects of our lives, HTML was asked to do lots of strange things. We've all built HTML pages with awkward table structures, 1-pixel GIFs, and other nonsense just to get the page to look right in the browser. XML is designed to get us out of this rut and back into the world of structured documents.
Whatever its limitations, HTML is the most popular markup language ever created. Given its popularity, why do we need XML? Consider this extremely informative HTML element:
<td>12304</td>
What does this fascinating piece of content represent?
- Is it the postal code for Schenectady, New York?
- Is it the number of light bulbs replaced each month in Las Vegas?
- Is it the number of Volkswagens sold in Hong Kong last year?
- Is it the number of tons of steel in the Sydney Harbour Bridge?
The answer: maybe, maybe not. The point of this silly example is that there's no structure to this data. Even if we included the entire table, it takes intelligence (real, live intelligence, the kind between your ears) to make sense of this data. If you saw this cell in a table next to another cell that contained the text "Schenectady," and the heading above the table read "Postal Codes for the State of New York," as a human being, you could interpret the contents of this cell correctly. On the other hand, if you wanted to write a piece of code that took any HTML table and attempted to determine whether any of the cells in the table contained postal codes, you'd find that difficult, to say the least.
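To make that difficulty concrete, here is a minimal Python sketch of the kind of guesswork such code is reduced to. The `CellCollector` class, the `guess_postal_codes` function, the five-digit rule, and both sample tables are assumptions invented for this illustration; nothing here is a recommended technique.

```python
import re
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text content of every <td> element it sees."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data


def guess_postal_codes(html):
    """Naive heuristic (an assumption): a cell of exactly five digits
    'looks like' a postal code."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    cells = [cell.strip() for cell in collector.cells]
    return [cell for cell in cells if re.fullmatch(r"\d{5}", cell)]


# Two hypothetical tables: one holds a postal code, the other a monthly count.
postal_table = "<table><tr><td>Schenectady</td><td>12304</td></tr></table>"
bulbs_table = "<table><tr><td>Las Vegas</td><td>12304</td></tr></table>"

print(guess_postal_codes(postal_table))  # ['12304']
print(guess_postal_codes(bulbs_table))   # ['12304'], although this is not a postal code
```

The heuristic reports the same answer for both tables, because the markup says nothing about what the number means. It would also miss a postal code that is not five digits long.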
Most HTML pages have one goal in mind: the appearance of the document. Veterans of the markup industry know that this is definitely not the way to create content. The separation of content and presentation is a long-established tenet of the publishing industry; unfortunately, most HTML pages aren't even close to approaching this ideal. An XML document should contain information, marked up with tags that describe what all the pieces of information are, as well as the relationship between those items. Presenting the document (also known as rendering) involves rules and decisions separate from the document itself. As we work through dozens of sample documents and applications, you'll see how delaying the rendering decisions as long as possible has significant advantages.
Let's look at another marked-up document. Consider this:
<?xml version="1.0"?>
<postalcodes>
  <title>Most-used postal codes in November 2000</title>
  <item>
    <city>Schenectady</city>
    <postalcode>12304</postalcode>
    <usage-count>2039</usage-count>
  </item>
  <item>
    <city>Kuala Lumpur</city>
    <postalcode>57000</postalcode>
    <usage-count>1983</usage-count>
  </item>
  <item>
    <city>London</city>
    <postalcode>SW1P 4RG</postalcode>
    <usage-count>1722</usage-count>
  </item>
  ...
</postalcodes>
Although we're still in the realm of contrived examples, it would be fairly easy to write a piece of code to find the postal codes in any document that used this set of tags (as opposed to HTML's <table>, <tr>, <td>, etc.). Our code would look for the contents of any <postalcode> elements in the document. (Not to get ahead of ourselves here, but writing an XSLT stylesheet to do this might take all of 30 minutes, including a 25-minute nap.) A well-designed XML document identifies each piece of data in the document and models the relationships between those pieces of data. This means we can be confident that we're processing an XML document correctly.
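As a rough illustration of how little code this takes, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree` module (a stand-in chosen for brevity, not the XSLT stylesheet mentioned above). The `find_postal_codes` function name is an assumption made for this example, and the sample document leaves out the elided `...` entries.

```python
import xml.etree.ElementTree as ET

# The sample document from the text, without the elided "..." entries.
SAMPLE = """<?xml version="1.0"?>
<postalcodes>
  <title>Most-used postal codes in November 2000</title>
  <item>
    <city>Schenectady</city>
    <postalcode>12304</postalcode>
    <usage-count>2039</usage-count>
  </item>
  <item>
    <city>Kuala Lumpur</city>
    <postalcode>57000</postalcode>
    <usage-count>1983</usage-count>
  </item>
  <item>
    <city>London</city>
    <postalcode>SW1P 4RG</postalcode>
    <usage-count>1722</usage-count>
  </item>
</postalcodes>"""


def find_postal_codes(xml_text):
    """Return the text of every <postalcode> element, in document order."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter("postalcode")]


print(find_postal_codes(SAMPLE))  # ['12304', '57000', 'SW1P 4RG']
```

No guessing is involved: the code asks for `<postalcode>` elements by name, so it finds "SW1P 4RG" as readily as "12304" and ignores the usage counts entirely.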
Again, the key idea here is that we're separating content from presentation. Our XML document clearly delineates the pieces of data and puts them into a format we can parse easily. In this book, we illustrate a number of techniques for transforming this XML document into a variety of formats. Among other things, we can transform the item <postalcode>12304</postalcode> into <td>12304</td>.
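To give a feel for that last transformation, here is a small Python sketch that renders the items as HTML table rows. It assumes a shortened two-item version of the sample document, and the `to_html_table` function and the two-column layout are choices made for this illustration only; the rendering decision lives in the code, not in the XML.

```python
import xml.etree.ElementTree as ET
from html import escape

# A shortened version of the sample document (two items only).
SAMPLE = """<?xml version="1.0"?>
<postalcodes>
  <title>Most-used postal codes in November 2000</title>
  <item>
    <city>Schenectady</city>
    <postalcode>12304</postalcode>
    <usage-count>2039</usage-count>
  </item>
  <item>
    <city>London</city>
    <postalcode>SW1P 4RG</postalcode>
    <usage-count>1722</usage-count>
  </item>
</postalcodes>"""


def to_html_table(xml_text):
    """Render each <item> as a table row of city and postal code."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        # escape() keeps characters such as & and < safe in the HTML output.
        city = escape(item.findtext("city", default=""))
        code = escape(item.findtext("postalcode", default=""))
        rows.append(f"<tr><td>{city}</td><td>{code}</td></tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"


print(to_html_table(SAMPLE))
```

Because the XML is untouched by this step, the same document could just as easily be rendered in some other layout or format by swapping in a different function.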