URLs, URLConnections, and ContentHandlers (Java Distributed Computing)

In addition to object-oriented representations of IP sockets, the java.net package provides objects that support HTTP for accessing data in the form of addressable documents. HTTP is an application-level protocol layered on top of the socket-based IP communication we discussed earlier, designed specifically to provide a way to address different kinds of documents, or pieces of data, distributed on the network. In the rest of this book, we'll see numerous examples of distributed applications whose agents use customized or standard communications protocols to talk to each other. If there is an HTTP server "agent" available on one of the hosts in our distributed application, then we can use the classes discussed in this section to ask it for data documents using standard HTTP.

To address a specific document or data object, we use a Uniform Resource Locator (URL), which includes four address elements: the protocol, host, port, and document. The Java representation for a URL is the URL class, which can be constructed from a given protocol, host, port, and document filename. Once the URL object is constructed, it lets the user connect to the HTTP server that holds the data object, query for information about the object, and download the object. The content of the object can be accessed using the getContent(), openConnection(), or openStream() methods on the URL object. Of these three, openStream() is the simplest: it returns an InputStream that can be used to read the data contents directly.
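As a minimal sketch of the steps just described, the class below builds a URL from its four address elements and reads a document through openStream(). The class name UrlReadSketch, its helper methods, and the host www.example.com are illustrative assumptions, not part of the book's example code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, used here only for illustration.
public class UrlReadSketch {

  // Build a URL from its four address elements.
  // (Recent JDKs deprecate the URL constructors in favor of URI.toURL(),
  // but they still work and mirror the description in the text.)
  static URL build(String protocol, String host, int port, String file)
      throws IOException {
    return new URL(protocol, host, port, file);
  }

  // Read the whole document behind a URL as UTF-8 text via openStream().
  static String readAll(URL url) throws IOException {
    StringBuilder text = new StringBuilder();
    try (InputStream in = url.openStream();
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        text.append(line).append('\n');
      }
    }
    return text.toString();
  }

  public static void main(String[] args) throws IOException {
    // The host and document names here are placeholders.
    URL url = build("http", "www.example.com", 80, "/index.html");
    System.out.println("protocol: " + url.getProtocol());
    System.out.println("host:     " + url.getHost());
    System.out.println("port:     " + url.getPort());
    System.out.println("document: " + url.getFile());

    // Only touch the network if a URL is given on the command line.
    if (args.length > 0) {
      System.out.print(readAll(new URL(args[0])));
    }
  }
}
```

The same readAll() call works for any protocol the JVM has a handler for, so a file: URL can stand in for an HTTP server during local testing.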

When you call openConnection() on a URL object, you get a URLConnection in return. You can use the URLConnection to query the data connection's header information for the data object's length, the type of data it contains, the data encoding, etc. You can also control aspects of the data connection: whether the data object can be pulled from a local cache, whether input or output is to be done over the connection, and whether the data should be fetched only if it has been modified since a given time.
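The header queries and connection settings mentioned above might look like the following sketch. The class name ConnectionInfoSketch and its describe() method are hypothetical; the URLConnection calls themselves are standard java.net methods.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper class, used here only for illustration.
public class ConnectionInfoSketch {

  // Open a connection, adjust its settings, and summarize its headers.
  static String describe(URL url) throws IOException {
    URLConnection conn = url.openConnection();

    // Settings must be made before the connection is actually opened.
    conn.setUseCaches(false);     // don't accept a locally cached copy
    conn.setDoInput(true);        // we intend to read from the connection
    conn.setDoOutput(false);      // we do not intend to write to it
    conn.setIfModifiedSince(0L);  // 0 means "fetch regardless of age"

    // Asking for the input stream opens the connection; closing the
    // stream afterwards releases it.
    try (InputStream in = conn.getInputStream()) {
      return "type=" + conn.getContentType()
          + " length=" + conn.getContentLengthLong()
          + " encoding=" + conn.getContentEncoding();
    }
  }

  public static void main(String[] args) throws IOException {
    if (args.length > 0) {
      System.out.println(describe(new URL(args[0])));
    }
  }
}
```

Passing a nonzero time to setIfModifiedSince() asks the server to send the data only if it has changed since that time; the value 0 used here is the default.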

The getContent() method downloads the data object and returns an Object containing the data. Using this method relies upon having a content handler that supports the object's data format and is capable of converting it into a Java object. The java.net package allows you to extend the available content handlers using the ContentHandler class and the ContentHandlerFactory interface. A ContentHandler accepts a URLConnection, reads the data from the associated data object, and constructs an appropriate Object instance to represent the data object in the Java environment.

It is the job of the system-wide ContentHandlerFactory to associate the proper ContentHandler with each data object referenced by a URL. When getContent() is called on a URL or URLConnection object, the ContentHandlerFactory is queried for a ContentHandler that can read the format of the data at the other end of the connection. The factory is given the MIME type of the data object, as reported by the connection, and returns a ContentHandler for that MIME type. The ContentHandler that's returned is then asked for an Object representing the data by calling its getContent() method with the URLConnection. Typically, the ContentHandler reads the raw data from the URLConnection's InputStream, formats the data into an appropriate object representation, and returns the object to the caller.

Suppose we want to connect to an HTTP server containing computational fluid dynamics (CFD) data files stored in a proprietary format. Suppose these data files have a ".cfd" suffix, and we decide to reserve the MIME type "application/cfd" for these data files. Now, assuming that the HTTP server has been properly configured to export this MIME type in the content headers it transmits, we can use Java's HTTP support to access these data files from our application by creating our own ContentHandler subclass that is capable of reading the data stream and converting it to an appropriate Java object. Example 2-7 shows a CFDContentHandler that does just this. Its getContent() method creates a CFDDataSet object from the data read from the input stream of the URLConnection argument. It assumes that the incoming data is of the expected type and format for the CFDDataSet; a more robust implementation would check the MIME type of the URLConnection and warn the user if the type doesn't match.

Example 2-7. A ContentHandler for CFD Files

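package dcj.examples.Networking;

import java.io.*;
import java.net.*;
import java.util.Arrays;

// CFDDataSet lives in this same package, so it needs no import.
public class CFDContentHandler extends ContentHandler {
  public Object getContent(URLConnection u) {
    CFDDataSet d = new CFDDataSet();
    try {
      InputStream in = u.getInputStream();
      byte[] buffer = new byte[1024];
      int n;
      // read() returns the number of bytes read, or -1 at end of stream
      while ((n = in.read(buffer)) != -1) {
        // Hand over only the bytes actually read, not the whole buffer
        d.addData(Arrays.copyOf(buffer, n));
      }
    }
    catch (Exception e) {
      e.printStackTrace();
    }

    return d;
  }
}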
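The more robust variant mentioned before Example 2-7 could be sketched as follows. Because CFDDataSet is not shown in this section, this version simply returns the raw bytes; the class name CheckedContentHandler and the typeMatches() helper are assumptions made for illustration, and a mismatch is reported by throwing an IOException rather than by any mechanism the book prescribes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.ContentHandler;
import java.net.URLConnection;

// Hypothetical handler that refuses data of an unexpected MIME type.
public class CheckedContentHandler extends ContentHandler {
  static final String EXPECTED_TYPE = "application/cfd";

  // Compare the base MIME type, ignoring parameters such as "; charset=...".
  static boolean typeMatches(String contentType) {
    if (contentType == null) {
      return false;
    }
    int semi = contentType.indexOf(';');
    String base = (semi >= 0) ? contentType.substring(0, semi) : contentType;
    return base.trim().equalsIgnoreCase(EXPECTED_TYPE);
  }

  @Override
  public Object getContent(URLConnection u) throws IOException {
    String type = u.getContentType();
    if (!typeMatches(type)) {
      throw new IOException(
          "Expected " + EXPECTED_TYPE + " but connection reports " + type);
    }
    // Stand-in for building a CFDDataSet: collect the raw bytes.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (InputStream in = u.getInputStream()) {
      byte[] buffer = new byte[1024];
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
    }
    return out.toByteArray();
  }
}
```

Throwing an exception is one design choice among several; logging a warning and attempting the conversion anyway would be closer to the "warn the user" wording in the text.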

To use our CFDContentHandler to read CFD files, we still need to register a new ContentHandlerFactory that knows about the CFDContentHandler. The CFDContentHandlerFactory in Example 2-8 creates CFDContentHandlers for the application/cfd MIME type. It ignores any other MIME types, but we could also implement it with a reference to a default ContentHandlerFactory that can handle other MIME types.

Example 2-8. A Specialized ContentHandlerFactory for CFD Data Files

package dcj.examples.Networking;

import java.net.*;

public class CFDContentHandlerFactory
    implements ContentHandlerFactory {
  public ContentHandler createContentHandler(String mimetype) {
    // equals() on the literal is null-safe and clearer than compareTo()
    if ("application/cfd".equals(mimetype)) {
      return new CFDContentHandler();
    }
    // We don't handle any other MIME type
    return null;
  }
}
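The alternative mentioned before Example 2-8, a factory that holds a reference to a default ContentHandlerFactory for other MIME types, might be sketched like this. The class name FallbackFactorySketch and its constructor parameters are hypothetical; in the CFD case the handler passed in would be a CFDContentHandler.

```java
import java.net.ContentHandler;
import java.net.ContentHandlerFactory;

// Hypothetical factory that handles one MIME type itself and
// delegates every other type to an optional fallback factory.
public class FallbackFactorySketch implements ContentHandlerFactory {
  private final String mimeType;
  private final ContentHandler handler;
  private final ContentHandlerFactory fallback;  // may be null

  public FallbackFactorySketch(String mimeType, ContentHandler handler,
                               ContentHandlerFactory fallback) {
    this.mimeType = mimeType;
    this.handler = handler;
    this.fallback = fallback;
  }

  @Override
  public ContentHandler createContentHandler(String mimetype) {
    if (mimeType.equalsIgnoreCase(mimetype)) {
      return handler;
    }
    // Not our type: ask the fallback, if there is one.
    return (fallback != null) ? fallback.createContentHandler(mimetype) : null;
  }
}
```

With the classes from Examples 2-7 and 2-8 on the classpath, an instance could be created as new FallbackFactorySketch("application/cfd", new CFDContentHandler(), someOtherFactory), where someOtherFactory is whatever default factory the application already has.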

Finally, our application can read CFD data files from an HTTP server by first registering the specialized ContentHandlerFactory with URLConnection.setContentHandlerFactory(), and then requesting a CFD file from the HTTP server on which it lives by calling getContent() on the file's URL.
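The register-then-request sequence can be tried end to end with the self-contained sketch below. It is not the book's listing: RawData and RawHandler are hypothetical stand-ins for CFDDataSet and CFDContentHandler, and the throwaway loopback server (built with the JDK's com.sun.net.httpserver package) and the path /data/test.cfd are assumptions added only so the example runs without an external HTTP server.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ContentHandler;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.URL;
import java.net.URLConnection;

public class CfdFetchSketch {
  static final String CFD_TYPE = "application/cfd";
  private static boolean registered = false;

  // Hypothetical stand-in for CFDDataSet: just holds the raw bytes.
  public static class RawData {
    public final byte[] bytes;
    RawData(byte[] bytes) { this.bytes = bytes; }
  }

  // Hypothetical stand-in for CFDContentHandler.
  static class RawHandler extends ContentHandler {
    @Override
    public Object getContent(URLConnection u) throws IOException {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      try (InputStream in = u.getInputStream()) {
        byte[] buffer = new byte[1024];
        int n;
        while ((n = in.read(buffer)) != -1) {
          out.write(buffer, 0, n);
        }
      }
      return new RawData(out.toByteArray());
    }
  }

  // The factory can be installed only once per JVM, so guard the call.
  public static synchronized void register() {
    if (!registered) {
      URLConnection.setContentHandlerFactory(
          mimetype -> CFD_TYPE.equals(mimetype) ? new RawHandler() : null);
      registered = true;
    }
  }

  // Serve a non-empty payload from a loopback server, then fetch it by URL.
  public static Object fetch(byte[] payload) throws IOException {
    HttpServer server = HttpServer.create(
        new InetSocketAddress(InetAddress.getByName("127.0.0.1"), 0), 0);
    server.createContext("/data/test.cfd", exchange -> {
      // The server must announce the MIME type for the factory to match.
      exchange.getResponseHeaders().set("Content-Type", CFD_TYPE);
      exchange.sendResponseHeaders(200, payload.length);
      try (OutputStream body = exchange.getResponseBody()) {
        body.write(payload);
      }
    });
    server.start();
    try {
      URL url = new URL("http", "127.0.0.1",
                        server.getAddress().getPort(), "/data/test.cfd");
      return url.getContent();  // dispatches to RawHandler via the factory
    } finally {
      server.stop(0);
    }
  }

  public static void main(String[] args) throws IOException {
    register();
    RawData data = (RawData) fetch(new byte[] {1, 2, 3});
    System.out.println("bytes received: " + data.bytes.length);
  }
}
```

In an application built on Examples 2-7 and 2-8, the two essential steps reduce to installing a CFDContentHandlerFactory with URLConnection.setContentHandlerFactory() and casting the result of getContent() to CFDDataSet.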

2.2.1. When and Where Are URLs Practical?

As we've seen in earlier sections of this chapter, we can transmit data around a distributed system using sockets and streams. This method has the advantage of being efficient, since we are using basic IP sockets with minimal protocol overhead getting between us and our data. The downside is that it is our responsibility to know the type and format of the data we're transmitting and receiving. The communication protocol must be mutually agreed upon by all participating computing agents, or we have to establish our own means for communicating metadata about the kind of information with which we are dealing.

Java's HTTP support classes, on the other hand, provide a standard means for serving and accessing data objects, and for easily identifying the type and format of these objects. To make a piece of data available from a URL, we need to install it in the content section of an HTTP server, and configure the server to transmit the appropriate MIME type when the data is accessed. On the receiving end, we simply need to use the data object's URL to access the document, ask the corresponding URLConnection for the type and encoding of the data, and respond accordingly. The downside is that HTTP imposes plenty of protocol overhead on the data stream, which reduces our net data bandwidth between computing agents. Our data is now sharing space in network packets with both IP and HTTP protocol information. Another downside is the relatively basic and simplistic resource naming facility that HTTP provides, compared to formal directory naming services like NIS and LDAP.

The simple conclusion is that, for distributed applications that are severely bandwidth-limited, or that need to support complicated resource hierarchies, using HTTP to access data is probably not the appropriate method. On the other hand, if you have the luxury of some extra communications bandwidth, and the CPU time to use it, and your resource groupings are relatively simple, then using URLs to access data is a possibility you should consider.
