Chapter 15. XML as a Data Format
Despite its document roots, the most common applications of XML today involve the storage and transmission of information for use by different software applications and systems. New technologies and frameworks (such as Web Services) depend heavily on XML content to communicate and negotiate between dissimilar applications.
The appropriate techniques used to design, build, and maintain a data-centric XML application vary greatly, depending on the required functionality and intended audience. This chapter discusses the different concerns, techniques, and technologies that should be considered when designing a new data-centric XML application.
15.1. Why Use XML for Data?
Before XML, individual programmers had to determine how data would be formatted whenever they needed to store or transmit program data. In most cases, the data was never intended for use outside the original program, so programmers would store it in the most convenient format they could devise. A few de facto file formats evolved over the years (RTF, CSV, and the ubiquitous Windows .ini file format), but the data written by one program could usually be read only by that same program. In fact, it was often possible for only that specific version of the same program to read the data.
The rapid proliferation of XML and free XML tools throughout the programming community has given developers an obvious choice when the time comes to select a data-storage or transmission format for their application. For all but the most trivial applications, the benefits of using XML to store and retrieve data far outweigh the additional overhead of including an XML parser in your application. The unique strengths of using XML as a software data format include:
Building on these basic strengths, XML can make possible new types of applications that would have been previously impossible (or very costly) to implement.
TIP: There are a few technologies that seek to achieve similar cross-program compatibility but use binary formats. Abstract Syntax Notation One (ASN.1) is probably the most prominent of these. ISO and ITU-T are developing standards for working with XML and ASN.1 in various combinations; more information on these developments is available from http://asn1.elibel.tm.fr/en/xml/.
15.1.1. Mixed Environments
Modern enterprise applications often involve software running on different computer systems under various operating systems. Choosing a communication protocol involves finding the lowest common denominator available on each system. With the large number of XML parsers that can be freely integrated with your application, XML is becoming a popular format for enterprise data sharing.
Imagine a typical enterprise application that needs to display data from a mainframe to users connected to a corporate web site. In this case, XML acts as the "glue" to connect a web server with a legacy application on a mainframe. The simple XML interface application accepts requests from the web server, calls the legacy application, and converts the result to XML. Using a technology like XSLT, the web server can then transform the XML into a number of acceptable web formats. By adopting XML as the common language of your enterprise, it becomes easier to reuse existing data in new ways.
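The conversion step performed by the interface application can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular mainframe product: the fixed-width record layout, the field names, and the sample record are all assumptions invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fixed-width layout of one legacy record: (field name, start, end).
ACCOUNT_LAYOUT = [("id", 0, 6), ("name", 6, 26), ("balance", 26, 36)]


def record_to_xml(record: str) -> str:
    """Convert one fixed-width legacy record into an <account> element."""
    account = ET.Element("account")
    for field_name, start, end in ACCOUNT_LAYOUT:
        # Slice the column range out of the record and drop the padding.
        ET.SubElement(account, field_name).text = record[start:end].strip()
    return ET.tostring(account, encoding="unicode")


# Made-up sample record standing in for the legacy application's output.
legacy_record = "000042" + "Jane Doe".ljust(20) + "1250.00".rjust(10)
xml_text = record_to_xml(legacy_record)
print(xml_text)
```

Once the data is in this form, the web server no longer needs to know anything about column offsets; it can hand the XML to a stylesheet or any other XML-aware tool.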
Even on smaller systems, XML can be useful for sharing information between applications written in different languages or running in different environments. If a Perl program and a Java program need to communicate, generating and processing XML can be simpler than the alternatives. The XML can also serve as a record of their communications or provide a gateway to other systems that need to join the conversation.
15.1.2. Communications Protocols
Building flexible communications protocols that link disparate systems has always been a difficult area in computing. With the proliferation of computer networking and the Internet, building distributed systems has become even more important.
While XML itself is only a data format, not a protocol, XML's flexibility and cross-platform usability have inspired some new developments on the protocol front. XML messaging started even before the XML specification was finished, and various forms of XML messaging have continued to evolve.
One of the earliest approaches, and still a common one, was transmitting XML over HTTP POST requests. The sender would assemble an XML document and send it much like HTML form data, and the recipient would process the XML and send back a response, also in XML. Some developers create custom vocabularies for these transactions, while others have moved to standardized vocabularies such as XML-RPC and SOAP.
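As a rough sketch of the sending side, the following Python code assembles an XML document and wraps it in an HTTP POST request using only the standard library. The URL, the order vocabulary, and the helper name build_xml_post are hypothetical. The request is built but not sent, so the example runs without a network connection.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_xml_post(url: str, root: ET.Element) -> urllib.request.Request:
    """Serialize an element tree and wrap it in an HTTP POST request."""
    body = ET.tostring(root, encoding="utf-8", xml_declaration=True)
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/xml; charset=utf-8"},
        method="POST",
    )


# A custom (made-up) vocabulary for the transaction.
order = ET.Element("order", id="1001")
ET.SubElement(order, "item", sku="A-100", quantity="2")

# http://example.com/orders is a placeholder endpoint, not a real service.
request = build_xml_post("http://example.com/orders", order)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen() would transmit it; the recipient's XML reply could then be parsed with ET.fromstring() in the same way.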
XML-RPC is a very simple protocol, which uses XML messages traveling on HTTP to represent client-server remote procedure calls (RPC). The XML messages identify methods, parameters, and the results of calling the methods. The XML documents use a simple but effective set of data types (including arrays and structs) to pass information between computers. For more information on XML-RPC, see http://www.xmlrpc.com/.
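To see what these messages look like, Python's standard xmlrpc.client module can encode and decode them without contacting a server. In this sketch the method name inventory.reserve and its parameters are invented for illustration; only the dumps() and loads() functions come from the library.

```python
import xmlrpc.client

# Encode a call: one array parameter and one struct parameter.
request_xml = xmlrpc.client.dumps(
    (["A-100", "B-200"], {"customer": "Jane Doe", "priority": 2}),
    methodname="inventory.reserve",  # hypothetical method name
)
print(request_xml)

# A server would decode the same message into native values.
params, method_name = xmlrpc.client.loads(request_xml)

# Encode and decode the reply; a response carries exactly one value.
response_xml = xmlrpc.client.dumps((True,), methodresponse=True)
(result,), _ = xmlrpc.client.loads(response_xml)
```

Printing request_xml shows the methodCall, methodName, and params elements, with the list encoded as an array and the dictionary as a struct.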
SOAP offers much more flexibility than XML-RPC, but is much more complex as well. SOAP (formerly the Simple Object Access Protocol, but now an acronym without meaning) uses XML to encapsulate information being sent between programs. SOAP is no longer bound to an HTTP transport, but HTTP is commonly used. It offers both an RPC approach and a document-oriented approach and uses XML Schema data types (with some of its own extensions for things like arrays) to identify type information. SOAP is often grouped with Web Services Description Language (WSDL) and Universal Description, Discovery, and Integration (UDDI) in discussions of "Web Services." For information on SOAP and Web Services, see http://www.w3.org/2002/ws/.
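The encapsulation SOAP provides can be illustrated by building a bare envelope by hand. The sketch below uses the SOAP 1.1 envelope namespace; the application namespace, the GetQuote element, and the build_envelope helper are hypothetical, and a real service would normally define its message structure in WSDL and rely on a SOAP toolkit rather than hand-built trees.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "http://example.com/quotes"  # hypothetical application namespace


def build_envelope(symbol: str) -> str:
    """Wrap a document-style request in a SOAP Envelope and Body."""
    ET.register_namespace("soap", SOAP_ENV)
    ET.register_namespace("q", APP_NS)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The payload lives in the application's own namespace.
    request = ET.SubElement(body, f"{{{APP_NS}}}GetQuote")
    ET.SubElement(request, f"{{{APP_NS}}}symbol").text = symbol
    return ET.tostring(envelope, encoding="unicode")


envelope_xml = build_envelope("XYZ")
print(envelope_xml)
```

The envelope is independent of the transport: the same string could be posted over HTTP or carried by another protocol.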
TIP: Some developers are promoting the use of HTTP-based alternatives to SOAP and XML-RPC, under the banner of Representational State Transfer (REST). For more information on this architectural approach and the perspective it offers, see http://internet.conveyor.com/RESTwiki/moin.cgi.
The Blocks Extensible Exchange Protocol (BEEP) takes a very different approach from SOAP and XML-RPC. Rather than building documents that travel over existing protocols, BEEP uses XML to build protocols on TCP sockets. BEEP supports HTTP-style message-and-reply, as well as more complex synchronous and asynchronous modes of communication. SOAP messages can be transmitted over BEEP, and so can a wide variety of other XML and binary information. More information on BEEP is available at http://www.beepcore.org.
15.1.3. Object Serialization
Like the issue of communications, the question of where and how to store the state of persistent objects has been answered in various ways over the years. With the popular adoption of object-oriented languages, such as C++ and Java, the language and runtime environment frequently handle object-serialization mechanics. Unfortunately, most of these technologies predate XML.
Most existing serialization methods are highly language- and architecture-specific. The serialized object is most often stored in a binary format that is not human readable. These files break easily if corrupted, and maintaining compatibility as the object's structure changes frequently requires custom work on the part of the programmer.
The features that make XML popular as a communications protocol also make it popular as a format for serializing object contents. Viewing the object's contents, making manual modifications, and even repairing damaged files is easy. XML's flexible nature allows the file format to expand ad infinitum while maintaining backward compatibility with older file versions. XML's labeled hierarchies are a clean fit for nested object structures, and conversions from objects to XML and back can be reasonably transparent. (Mapping arbitrary XML to object structures is a much harder problem.)
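A small Python sketch shows how nested objects map onto labeled hierarchies and how a reader can tolerate later additions to the format. The Customer and Address classes and the element names are assumptions made up for this example; it is not the output format of any particular serialization tool.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Address:
    street: str
    city: str


@dataclass
class Customer:
    name: str
    addresses: list = field(default_factory=list)


def customer_to_xml(customer: Customer) -> str:
    """Write the object graph as nested elements."""
    root = ET.Element("customer")
    ET.SubElement(root, "name").text = customer.name
    for address in customer.addresses:
        node = ET.SubElement(root, "address")
        ET.SubElement(node, "street").text = address.street
        ET.SubElement(node, "city").text = address.city
    return ET.tostring(root, encoding="unicode")


def customer_from_xml(xml_text: str) -> Customer:
    """Rebuild the object, reading only the elements this version knows about."""
    root = ET.fromstring(xml_text)
    addresses = [
        Address(node.findtext("street", ""), node.findtext("city", ""))
        for node in root.findall("address")
    ]
    return Customer(root.findtext("name", ""), addresses)


original = Customer("Jane Doe", [Address("1 Main St", "Springfield")])
restored = customer_from_xml(customer_to_xml(original))

# A file written by a later version with an extra <email> element still loads.
newer_file = "<customer><name>Jane Doe</name><email>jane@example.com</email></customer>"
from_newer = customer_from_xml(newer_file)
```

Because the reader looks elements up by name instead of by position, the unknown email element is simply ignored, which is the kind of backward compatibility a positional binary format rarely offers for free.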
A number of tools serialize objects written in various environments as XML documents and can recreate the objects from the XML. Java 1.4, for example, adds an "API for Long-Term Persistence" to its java.beans package, giving developers an alternative to its existing (and still supported) opaque binary serialization format. The XML vocabulary looks a lot like Java and is clearly designed for use within a Java framework, though other environments may import and export the serialization. For more information on this API and the XML it produces, see http://java.sun.com/j2se/1.4/docs/guide/beans/changes14.html#ltp. Microsoft's .NET framework includes similar capabilities but uses an XML Schema-based approach.
15.1.4. Data Storage/Retrieval
The line between an XML file and a database can be blurred. Though XML documents are too verbose and searching is too inefficient for high-performance large-scale database applications, they may be used as a simple, self-contained data store for small sets of information.
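For a small data set, the whole "database" can be one document loaded into memory. The following sketch, using made-up contact data and element names, shows simple lookups and an update with Python's standard ElementTree module and its limited XPath support.

```python
import xml.etree.ElementTree as ET

# Made-up sample data; in practice this would be read from a file.
STORE = """<contacts>
  <contact id="1" group="staff"><name>Ana</name><phone>555-0100</phone></contact>
  <contact id="2" group="vendor"><name>Ben</name><phone>555-0101</phone></contact>
  <contact id="3" group="staff"><name>Cy</name><phone>555-0102</phone></contact>
</contacts>"""

root = ET.fromstring(STORE)

# "Query" by attribute value: every contact in the staff group.
staff_names = [c.findtext("name") for c in root.findall("contact[@group='staff']")]

# "Query" by child element text: the phone number of one contact.
ben_phone = root.find("contact[name='Ben']").findtext("phone")

# "Insert" a record and serialize the store again.
new_contact = ET.SubElement(root, "contact", id="4", group="vendor")
ET.SubElement(new_contact, "name").text = "Di"
ET.SubElement(new_contact, "phone").text = "555-0103"
updated_store = ET.tostring(root, encoding="unicode")
```

Every lookup here scans the document, which is acceptable for a few hundred records but is exactly the inefficiency that rules this approach out for large data sets.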
XML can play a role in the communications between databases and other software, providing usable chunks of information in a form more easily reused than a typical query response. On the client side, XML data files can be used to offload some nontransactional data-search and -retrieval applications from busy web servers down to the desktop web browser. On the server side, XML can be used as an alternate delivery mechanism for query results.
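Delivering query results as XML can be sketched with Python's built-in sqlite3 module. The parts table, its columns, and the results/row element names are assumptions for the example, and the helper assumes that every column name is also a legal XML element name.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(cursor, root_tag="results", row_tag="row") -> str:
    """Render the rows of an executed query as one XML document."""
    root = ET.Element(root_tag)
    # cursor.description holds one tuple per column; the name comes first.
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        node = ET.SubElement(root, row_tag)
        for column, value in zip(columns, row):
            ET.SubElement(node, column).text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# An in-memory database with made-up data stands in for the real server.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE parts (sku TEXT, qty INTEGER)")
connection.executemany("INSERT INTO parts VALUES (?, ?)", [("A-100", 4), ("B-200", 0)])

cursor = connection.execute("SELECT sku, qty FROM parts ORDER BY sku")
xml_text = rows_to_xml(cursor)
print(xml_text)
```

A client that receives this document can search and re-sort it locally without sending another request to the server.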
XML is also finding use as a supplement to information stored in relational databases, and more and more relational databases include native support for XML both as a data-retrieval format and a data type. Native XML databases, which store XML documents and provide querying and retrieval tools, are also becoming more widely available. For more information on the wide variety of XML and data-management tools available, see http://www.rpbourret.com/xml/XMLDatabaseProds.htm.
Copyright © 2002 O'Reilly & Associates. All rights reserved.