

XML in a Nutshell

Chapter 15. XML as a Data Format

Despite its document roots, the most common applications of XML today involve the storage and transmission of information for use by different software applications and systems. New technologies and frameworks (such as Web Services) depend heavily on XML content to communicate and negotiate between dissimilar applications.

The appropriate techniques used to design, build, and maintain a data-centric XML application vary greatly, depending on the required functionality and intended audience. This chapter discusses the different concerns, techniques, and technologies that should be considered when designing a new data-centric XML application.

15.1. Why Use XML for Data?

Before XML, individual programmers had to determine how data would be formatted whenever they needed to store or transmit program data. In most cases, the data was never intended for use outside the original program, so programmers would store it in the most convenient format they could devise. A few de facto file formats evolved over the years (RTF, CSV, and the ubiquitous Windows .ini file format), but the data written by one program could usually be read only by that same program. In fact, often only one specific version of that program could read the data.

The rapid proliferation of XML and free XML tools throughout the programming community has given developers an obvious choice when the time comes to select a data-storage or transmission format for their application. For all but the most trivial applications, the benefits of using XML to store and retrieve data far outweigh the additional overhead of including an XML parser in your application. The unique strengths of using XML as a software data format include:

Simple syntax
Easy to generate and parse.

Support for nesting
Tags easily allow programs to represent structures with nested elements.

Easy to debug
Human-readable data format is easy to explore and create with a basic text editor.

Language and platform independent
XML and Unicode guarantee that your datafile will be portable across virtually every popular computer architecture and language combination in use today.
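The strengths listed above can be illustrated with a minimal sketch that uses Python's standard xml.etree.ElementTree module. The element and attribute names (order, customer, item, sku) are hypothetical and chosen only for illustration. The sketch builds a nested structure, writes it out as UTF-8, and reads it back.

```python
import xml.etree.ElementTree as ET


def build_order():
    """Build a small nested document (all names here are hypothetical)."""
    order = ET.Element("order", {"id": "1001"})
    customer = ET.SubElement(order, "customer")
    # Non-ASCII text survives the round trip because the output is UTF-8.
    ET.SubElement(customer, "name").text = "Zoë Müller"
    items = ET.SubElement(order, "items")
    for sku, quantity in [("A-1", 2), ("B-7", 1)]:
        item = ET.SubElement(items, "item", {"sku": sku})
        item.text = str(quantity)
    return order


def serialize(root):
    """Return the document as UTF-8 encoded bytes."""
    return ET.tostring(root, encoding="utf-8")


def read_quantities(data):
    """Parse the bytes and map each item's sku to its quantity."""
    root = ET.fromstring(data)
    return {item.get("sku"): int(item.text) for item in root.iter("item")}


if __name__ == "__main__":
    data = serialize(build_order())
    print(data.decode("utf-8"))
    print(read_quantities(data))
```

Because the result is plain text, the serialized order can also be opened and corrected in any text editor before it is parsed again.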

Building on these basic strengths, XML can make possible new types of applications that would have been previously impossible (or very costly) to implement.

TIP: There are a few technologies that seek to achieve similar cross-program compatibility but use binary formats. Abstract Syntax Notation One (ASN.1) is probably the most prominent of these. ISO and ITU-T are developing standards for working with XML and ASN.1 in various combinations; more information on these developments is available from http://asn1.elibel.tm.fr/en/xml/.

15.1.2. Communications

Building flexible communications protocols that link disparate systems has always been a difficult area in computing. With the proliferation of computer networking and the Internet, building distributed systems has become even more important.

While XML itself is only a data format, not a protocol, XML's flexibility and cross-platform usability have inspired some new developments on the protocol front. XML messaging started even before the XML specification was finished, and various forms of XML messaging have continued to evolve.

One of the earliest approaches, and still a common one, was transmitting XML over HTTP POST requests. The sender would assemble an XML document and send it much like HTML form data, and the recipient would process the XML and send back a response, also in XML. Some developers create custom vocabularies for these transactions, while others have moved to standardized vocabularies such as XML-RPC and SOAP.
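As a rough sketch of this request-and-reply pattern, the following Python code prepares an HTTP POST whose body is an XML document and parses an XML reply. The URL, the stock-query vocabulary, and the status reply format are all assumptions made up for this example; a real service would define its own.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_xml_post(url, root):
    """Wrap an XML document in an HTTP POST request (nothing is sent yet)."""
    body = ET.tostring(root, encoding="utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/xml; charset=utf-8"},
        method="POST",
    )


def parse_xml_reply(raw):
    """Read a reply in a hypothetical <status code="...">text</status> format."""
    reply = ET.fromstring(raw)
    return reply.get("code"), (reply.text or "").strip()


def send(request):
    """Perform the exchange. This does real network I/O and is not called here."""
    with urllib.request.urlopen(request) as response:
        return parse_xml_reply(response.read())


if __name__ == "__main__":
    query = ET.Element("stock-query")          # hypothetical custom vocabulary
    ET.SubElement(query, "symbol").text = "A-1"
    request = build_xml_post("http://example.com/orders", query)  # placeholder URL
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

The sender and the recipient only have to agree on the XML vocabulary; HTTP itself needs no changes.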

XML-RPC is a very simple protocol, which uses XML messages traveling on HTTP to represent client-server remote procedure calls (RPC). The XML messages identify methods, parameters, and the results of calling the methods. The XML documents use a simple but effective set of data types (including arrays and structs) to pass information between computers. For more information on XML-RPC, see http://www.xmlrpc.com/.
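To make the message format concrete, here is a small sketch using the xmlrpc.client module from Python's standard library. It only encodes and decodes messages, and no server is contacted. The method name inventory.lookup and its parameters are hypothetical.

```python
import xmlrpc.client


def encode_call(method, *params):
    """Marshal a method name and its parameters into an XML-RPC request."""
    return xmlrpc.client.dumps(params, methodname=method)


def decode_call(xml_text):
    """Recover (method, params) from an XML-RPC request document."""
    params, method = xmlrpc.client.loads(xml_text)
    return method, params


def encode_result(value):
    """Marshal a single return value into an XML-RPC response."""
    return xmlrpc.client.dumps((value,), methodresponse=True)


if __name__ == "__main__":
    # A dict is sent as an XML-RPC struct and a list as an XML-RPC array.
    request = encode_call(
        "inventory.lookup", "A-1", {"warehouse": "east", "bins": [3, 7]}
    )
    print(request)
    print(decode_call(request))
    print(encode_result(42))
```

Printing the request shows the methodCall document that would travel in the body of an HTTP POST.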

SOAP offers much more flexibility than XML-RPC, but is much more complex as well. SOAP (formerly the Simple Object Access Protocol, but now an acronym without meaning) uses XML to encapsulate information being sent between programs. SOAP is no longer bound to an HTTP transport, but HTTP is commonly used. It offers both an RPC approach and a document-oriented approach and uses XML Schema data types (with some of its own extensions for things like arrays) to identify type information. SOAP is often grouped with Web Services Description Language (WSDL) and Universal Description, Discovery, and Integration (UDDI) in discussions of "Web Services." For information on SOAP and Web Services, see http://www.w3.org/2002/ws/.
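The encapsulation idea can be sketched without any SOAP toolkit. The following Python code wraps an application payload in a SOAP 1.1 Envelope and Body and then unwraps it again. The payload namespace urn:example:inventory and the GetStockLevel element are hypothetical, and the sketch deliberately ignores headers, faults, and type information.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:inventory"                        # hypothetical payload namespace


def wrap_in_envelope(payload):
    """Place an application element inside a SOAP Envelope/Body pair."""
    ET.register_namespace("soap", SOAP_ENV)  # use a readable prefix when serializing
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    body.append(payload)
    return envelope


def unwrap(envelope_bytes):
    """Return the first child of the SOAP Body, which is the payload."""
    envelope = ET.fromstring(envelope_bytes)
    body = envelope.find(f"{{{SOAP_ENV}}}Body")
    if body is None or len(body) == 0:
        raise ValueError("no SOAP Body payload found")
    return body[0]


if __name__ == "__main__":
    payload = ET.Element(f"{{{APP_NS}}}GetStockLevel")
    ET.SubElement(payload, f"{{{APP_NS}}}sku").text = "A-1"
    message = ET.tostring(wrap_in_envelope(payload), encoding="utf-8")
    print(message.decode("utf-8"))
    print(unwrap(message).tag)
```

The envelope carries no transport details, which is consistent with SOAP not being tied to HTTP.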

TIP: Some developers are promoting the use of HTTP-based alternatives to SOAP and XML-RPC, under the banner of Representational State Transfer (REST). For more information on this architectural approach and the perspective it offers, see http://internet.conveyor.com/RESTwiki/moin.cgi.

The Blocks Extensible Exchange Protocol (BEEP) takes a very different approach from SOAP and XML-RPC. Rather than building documents that travel over existing protocols, BEEP uses XML to build protocols on TCP sockets. BEEP supports HTTP-style message-and-reply, as well as more complex synchronous and asynchronous modes of communication. SOAP messages can be transmitted over BEEP, and so can a wide variety of other XML and binary information. More information on BEEP is available at http://www.beepcore.org.

15.1.3. Object Serialization

Like the issue of communications, the question of where and how to store the state of persistent objects has been answered in various ways over the years. With the popular adoption of object-oriented languages, such as C++ and Java, the language and runtime environment frequently handle object-serialization mechanics. Unfortunately, most of these technologies predate XML.

Most existing serialization methods are highly language- and architecture-specific. The serialized object is most often stored in a binary format that is not human readable. These files break easily if corrupted, and maintaining compatibility as the object's structure changes frequently requires custom work on the part of the programmer.

The features that make XML popular as a format for communications also make it popular as a format for serializing object contents. Viewing the object's contents, making manual modifications, and even repairing damaged files is easy. XML's flexible nature allows the file format to expand ad infinitum while maintaining backward compatibility with older file versions. XML's labeled hierarchies are a clean fit for nested object structures, and conversions from objects to XML and back can be reasonably transparent. (Mapping arbitrary XML to object structures is a much harder problem.)
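A minimal sketch of this kind of serialization, written in Python with hypothetical Shape and Point classes, might look like the following. It is not the Java or .NET mechanism, only an illustration of the idea. The reader supplies a default for an element that older files lack and ignores elements it does not recognize, which is one simple way to keep old and new file versions compatible.

```python
from dataclasses import dataclass, field
from typing import List
import xml.etree.ElementTree as ET


@dataclass
class Point:
    x: float
    y: float


@dataclass
class Shape:
    name: str
    points: List[Point] = field(default_factory=list)
    color: str = "black"  # imagined as added in a later version of the format


def shape_to_xml(shape):
    """Serialize a Shape and its nested Points as an XML string."""
    root = ET.Element("shape", {"name": shape.name})
    ET.SubElement(root, "color").text = shape.color
    for point in shape.points:
        ET.SubElement(root, "point", {"x": repr(point.x), "y": repr(point.y)})
    return ET.tostring(root, encoding="unicode")


def shape_from_xml(text):
    """Rebuild a Shape; missing <color> gets a default, unknown elements are skipped."""
    root = ET.fromstring(text)
    points = [
        Point(float(element.get("x")), float(element.get("y")))
        for element in root.findall("point")
    ]
    color = root.findtext("color", default="black")
    return Shape(name=root.get("name"), points=points, color=color)


if __name__ == "__main__":
    triangle = Shape("triangle", [Point(0.0, 0.0), Point(1.0, 0.0), Point(0.5, 1.0)], "red")
    text = shape_to_xml(triangle)
    print(text)
    print(shape_from_xml(text) == triangle)
```

Going in this direction is easy because the program controls the XML it writes. Accepting XML written by someone else would need far more validation.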

A number of tools serialize objects written in various environments as XML documents and can recreate the objects from the XML. Java 1.4, for example, adds an "API for Long-Term Persistence" to its java.beans package, giving developers an alternative to its existing (and still supported) opaque binary serialization format. The XML vocabulary looks a lot like Java and is clearly designed for use within a Java framework, though other environments may import and export the serialization. For more information on this API and the XML it produces, see http://java.sun.com/j2se/1.4/docs/guide/beans/changes14.html#ltp. Microsoft's .NET framework includes similar capabilities but uses an XML Schema-based approach.




Copyright © 2002 O'Reilly & Associates. All rights reserved.