An Introduction to XML (CGI Programming with Perl)

14.2. An Introduction to XML

XML is useful because it provides an industry-standard way of describing data. In addition, XML does so in a style similar to HTML, with which thousands of developers are already familiar. CGI programs that speak XML will be able to deliver data to, and retrieve data from, any XML-compliant Perl script or Java applet.

It is possible to use CGI as middleware without a data description language such as XML; the success of libraries such as LWP for Perl demonstrates this. However, most web pages still deliver data as plain HTML, and using LWP to grab these pages and HTML::Parser to parse them leaves much to be desired. Even when XML is used, HTML still has to be produced for a web browser to display the data, but that HTML is likely to change whenever the web designer changes how the page looks, while the data described in XML remains the same. A parser written against an HTML page is therefore fragile: it breaks as soon as the layout used to display the data changes.

On the client side, projects that require the sophisticated data-display capabilities of Java need some way of obtaining their data. Enabling Java applets to talk to CGI programs provides a lightweight and easy way to gather the data for presentation.

For the most part, HTML has served its purpose well. Web browsers have successfully used HTML markup tags to display content to users for years. However, while human readers can absorb the data in the context of their own language, machines find it difficult to interpret the ambiguity of data written in a natural language such as English inside an HTML document. This problem led to the recognition that the Web needed a language that could mark up content in a way that is easily machine-readable.

XML was designed to make up for many of HTML's limitations in this area. The following is a list of features XML provides that make it useful as a mechanism for transporting data from program to program:

New tags and tag hierarchies can be defined to represent data specific to your application. For instance, a quiz can contain <QUESTION> and <ANSWER> tags.
Document type definitions can be defined for data validation. You can require, for instance, that every <QUESTION> be associated with exactly one <ANSWER>.
Data transport is Unicode-compliant, which is important for non-ASCII character sets.
Data is provided in a way that makes it easily transportable via HTTP.
Syntax is simple, allowing parsers to be simple.

As an example, let's look at a sample XML document that might contain the data for an online quiz. At the most superficial level, a quiz has to be represented as a collection of questions and their answers. The XML looks like this:

<?xml version="1.0"?>
<!DOCTYPE QUIZ SYSTEM "quiz.dtd">
<QUIZ>
  <QUESTION TYPE="Multiple">
    <ASK>
      All of the following players won the regular season MVP and playoff
      MVP in the same year, except for:
    </ASK>
    <CHOICE VALUE="A" TEXT="Larry Bird"/>
    <CHOICE VALUE="B" TEXT="Jerry West"/>
    <CHOICE VALUE="C" TEXT="Earvin Magic Johnson"/>
    <CHOICE VALUE="D" TEXT="Hakeem Olajuwon"/>
    <CHOICE VALUE="E" TEXT="Michael Jordan"/>
    
    <ANSWER>B</ANSWER>
    <RESPONSE VALUE="B">
      West was awesome, but they did not have a playoff 
      MVP in his day.
    </RESPONSE>
    <RESPONSE STATUS="WRONG">
      How could you choose Bird, Magic, Michael, or Hakeem?
    </RESPONSE>
  </QUESTION>
  
  <QUESTION TYPE="Text">
    <ASK>
      Who is the only NBA player to get a triple-double by halftime?
    </ASK>
    
    <ANSWER>Larry Bird</ANSWER>
    <RESPONSE VALUE="Larry Bird">
      You got it! He was quite awesome!
    </RESPONSE>
    <RESPONSE VALUE="Magic Johnson">
      Sorry. Magic was just as awesome as Larry, but he never got a
      triple-double by halftime.
    </RESPONSE>
    <RESPONSE STATUS="WRONG">
      I guess you are not a Celtics Fan.
    </RESPONSE>
  </QUESTION>
</QUIZ>

You can see from the above document that XML is actually very simple, and that it is very similar to HTML. This is no accident. One of XML's primary design goals was to be straightforward to use over the Internet. Another was to keep the language simple enough that writing an XML parser is relatively easy.

From the structure of the sample XML document, you can see that the root data structure is a quiz surrounded by <QUIZ> tags. Every XML document must enclose its data in exactly one root element that surrounds the rest of the document.

Within the quiz structure shown here, there are two questions. Within those questions are descriptions of the question itself, an answer to the question, and a host of possible responses.
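
The sample document refers to a quiz.dtd that is not shown here. As a rough sketch only, a DTD matching the structure just described might look like the following. The content models and attribute lists are assumptions inferred from the sample, not the authors' actual file.

```xml
<!-- quiz.dtd: hypothetical sketch inferred from the sample document -->

<!-- A quiz is one or more questions -->
<!ELEMENT QUIZ (QUESTION+)>

<!-- Each question: one ASK, optional CHOICEs, exactly one ANSWER, then responses -->
<!ELEMENT QUESTION (ASK, CHOICE*, ANSWER, RESPONSE+)>
<!ATTLIST QUESTION TYPE (Multiple | Text) #REQUIRED>

<!ELEMENT ASK      (#PCDATA)>
<!ELEMENT ANSWER   (#PCDATA)>
<!ELEMENT RESPONSE (#PCDATA)>

<!-- CHOICE carries its data in attributes, so it is declared EMPTY -->
<!ELEMENT CHOICE EMPTY>
<!ATTLIST CHOICE
  VALUE CDATA #REQUIRED
  TEXT  CDATA #REQUIRED>

<!-- A response is keyed either by a specific VALUE or by a STATUS -->
<!ATTLIST RESPONSE
  VALUE  CDATA #IMPLIED
  STATUS CDATA #IMPLIED>
```

Because ANSWER appears in the QUESTION content model without a "*", "+", or "?" suffix, a validating parser using this sketch would reject any question that has no answer or more than one.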

Obviously, this input has to be accompanied by a style sheet or some other guide to the browser, so that the browser knows basic things like not displaying the answers with the questions. Later in this chapter, we will write a Perl program to translate an XML document into standard HTML.

The question tags are written with an opening and a closing tag because multiple pieces of data (ask, answer, response) are placed between them. The choices for a multiple-choice question, on the other hand, are written as single, empty tags. XML makes this explicit by requiring a "/" at the end of an empty tag.

This is one of the main areas where XML differs from HTML. HTML would leave a single empty tag as is, with no closing marker. The designers of XML felt that a parser is easier to write if it knows, as soon as it sees a tag end with "/>" instead of ">" by itself, that it does not have to look for a closing tag to match the start tag.
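
To make the distinction concrete, the fragment below shows the two forms side by side, reusing tags from the sample quiz. It is only an illustration; the long form is not something the sample itself uses.

```xml
<QUESTION TYPE="Multiple">
  <!-- Empty-element tag: the trailing "/" tells the parser no end tag follows -->
  <CHOICE VALUE="A" TEXT="Larry Bird"/>

  <!-- Equivalent long form: a start tag and an end tag with nothing between -->
  <CHOICE VALUE="B" TEXT="Jerry West"></CHOICE>
</QUESTION>
```

An XML parser treats both CHOICE elements as empty elements. Writing the first one as <CHOICE VALUE="A" TEXT="Larry Bird"> with no "/" and no end tag, as HTML would allow for some of its tags, would make the document ill-formed.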

The structure of the above XML document is only one of many possibilities. We could have presented the same information in different ways.

For example, we could have made <CHOICE> a container tag instead of an empty one, so that a choice could hold further definitions inside itself. A container tag could, for instance, hold a list of possible choices to present in round-robin fashion, so that the choices do not appear the same every time. This illustrates an important point about XML: it was designed to accommodate any data structure.
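
As a sketch of that alternative, a <CHOICE> written as a container might look like this. The <OPTION> child element is a hypothetical name introduced only for illustration, and the program displaying the quiz would be responsible for picking which option to show each time.

```xml
<!-- Hypothetical alternative: CHOICE as a container rather than an empty tag.
     The OPTION element is an assumption made for this sketch. -->
<CHOICE VALUE="A">
  <!-- The application would rotate through these on successive displays -->
  <OPTION TEXT="Larry Bird"/>
  <OPTION TEXT="Hakeem Olajuwon"/>
</CHOICE>
```

Adopting this layout would also mean changing the document type definition, since CHOICE would no longer be an empty element.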