The EntityResolver Interface (SAX2)

3.4. The EntityResolver Interface

As mentioned earlier, this interface is used when a parser needs to access and parse external entities in the DTD or document content. It is not used to access the document entity itself. Cases where an EntityResolver should be used include:

When "more local" copies of entity data should be used. Such copies might be from a local filesystem or from a smart caching proxy. A normal web server may be unavailable or may only be accessible through a slow or congested network link; such remote access can cause application slowdowns and failures. This is generically called catalog or cache processing.

When the entity's systemId uses a URI scheme that is not understood by the underlying JVM. Built-in schemes usually include http://, file://, ftp://, and increasingly https://. Schemes not supported by the JVM include urn: and application-specific schemes. (You may need to put such URI schemes into publicId values, in order to prevent problems resolving relative URIs.)

When entities need to be constructed dynamically, or not through the standard URI resolution scheme. For example, entity text might be the result of a query through some user interface or another computation.

When the XML source text doesn't provide usable URIs. SGML-style systems sometimes use system identifiers that aren't really URIs; they might be relative to some base URI other than the base URI of the appropriate entity (document or DTD). Avoid this practice for XML-based systems; it's not very interoperable because most XML processors strongly expect system IDs in XML documents to be valid URIs, relative to the actual base URI of their declaration.

Applications that handle documents with DTDs should plan to use an EntityResolver so they work robustly in the face of partial network failures, and so they avoid placing excessive loads on remote servers. That is, they should try to access local copies of DTD data even when the document specifies a remote one. There are many examples of sloppily written applications that broke when a remote system administrator moved a DTD file. Examples range from purely informative services like most RSS feeds to fee-based services like some news syndication protocols.

You can implement a useful resolver with a data structure as simple as a hash table that maps identifiers to URIs. There is normally no reason to have different parsers use different entity resolvers; documents shouldn't use the same public or (absolute) system identifiers to denote different entities. You'll normally just have one resolver, and it could adaptively cache entities if you like.
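The table-driven approach just described can be sketched as follows. This is a minimal illustration, not a reference implementation: the class name TableResolver, its register() method, and the identifiers used with it are assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical class: maps public or absolute system identifiers to
// preferred (typically local) URIs.
class TableResolver implements EntityResolver {
    private final Map<String, String> table = new ConcurrentHashMap<>();

    // Record the URI to use whenever the given identifier is requested.
    void register(String identifier, String localUri) {
        table.put(identifier, localUri);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // Try the public identifier first, then the system identifier.
        String uri = (publicId != null) ? table.get(publicId) : null;
        if (uri == null && systemId != null) {
            uri = table.get(systemId);
        }
        if (uri == null) {
            return null;   // unknown: the parser resolves systemId itself
        }
        InputSource source = new InputSource(uri);
        source.setPublicId(publicId);
        return source;
    }
}
```

Because the table is a concurrent map, a single instance can be shared by every parser in the application, which matches the "one resolver" advice above.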

More complex catalog facilities may be used by applications that follow the SGML convention that public identifiers are Formal Public Identifiers (FPIs). FPIs serve the role that Uniform Resource Names (URNs) serve for Internet-oriented systems. Such mappings can also be used with URIs, if the entity text associated with URIs is as stable as an FPI. (Such stability is one of the goals of URNs.)

Applications pass objects that implement the EntityResolver interface to the XMLReader.setEntityResolver() method. The parser will then use the resolver with all external parsed entities. The EntityResolver interface has only one method, which can throw a java.io.IOException as well as the org.xml.sax.SAXException most other callbacks throw.
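As a rough sketch of the registration step, the following code obtains an XMLReader through JAXP (one common way to get a SAX2 parser, assumed here rather than taken from this section), installs a resolver, and records which system identifiers the parser asks about. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class ResolverRegistration {
    // Parses the XML text and returns the system IDs that the parser
    // passed to the resolver while doing so.
    static List<String> parseAndRecord(String xml) throws Exception {
        List<String> requested = new ArrayList<>();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        // EntityResolver has a single method, so a lambda can implement it.
        reader.setEntityResolver((publicId, systemId) -> {
            requested.add(systemId);
            // Supply empty entity text so no remote access takes place.
            return new InputSource(new StringReader(""));
        });

        reader.parse(new InputSource(new StringReader(xml)));
        return requested;
    }
}
```

Calling parseAndRecord() on a document whose DOCTYPE names an external DTD should show that DTD's system identifier in the returned list, assuming the parser is configured to load external DTD subsets.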

InputSource resolveEntity(String publicId, String systemId)

Parsers invoke this method to map entity identifiers either to other identifiers or to data that they will parse. See the discussion in Section 3.1.2, "The InputSource Class", earlier in this chapter, for information about how the InputSource class is used. If null is returned, then the parser will resolve the systemId without additional assistance. To avoid parsing an entity, return a value that encapsulates a zero-length text entity.
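The two return conventions can be illustrated with a small resolver that suppresses selected entities and defers to the parser for everything else. This is only a sketch; SkippingResolver and the set of identifiers it takes are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.Set;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical class: returns empty text for listed entities, null otherwise.
class SkippingResolver implements EntityResolver {
    private final Set<String> skipped;

    SkippingResolver(Set<String> systemIdsToSkip) {
        this.skipped = systemIdsToSkip;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (skipped.contains(systemId)) {
            // A zero-length text entity: the parser reads nothing from it.
            InputSource empty = new InputSource(new StringReader(""));
            empty.setPublicId(publicId);
            empty.setSystemId(systemId);
            return empty;
        }
        // null means "no opinion": the parser dereferences systemId itself.
        return null;
    }
}
```

Keep in mind that skipping a DTD also skips any entity declarations and attribute defaults it contains, so a document that relies on them may no longer parse the same way.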

The systemId will always be present and will be a fully resolved URI. The publicId may be null. If it's not null, it will have been normalized by mapping sequences of consecutive whitespace characters to a single space character.

Example 3-3 shows a simple resolver that does three things: it substitutes for a web-based time service running on the local machine, it interprets a private URI scheme, and it maps public identifiers to alternative URIs using a dictionary that's externally maintained somehow. (For example, you might prime a hashtable with the public IDs for the XHTML 1.0, XHTML 1.1, and DocBook 4.0 XML DTDs to point to local files.) It delegates to another resolver for other cases.

Example 3-3. Entity resolver, with chaining

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Date;
import java.util.Dictionary;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class MyResolver implements EntityResolver
{
    private EntityResolver              next;
    private Dictionary<String,String>   map;

    // n -- optional resolver to consult on failure
    // m -- mapping public ids to preferred URLs
    public MyResolver (EntityResolver n, Dictionary<String,String> m)
        { next = n; map = m; }

    // interface methods must be declared public by implementations
    public InputSource resolveEntity (String publicId, String systemId)
    throws SAXException, IOException
    {
        // magic URL?  answer with the current date instead of asking a server
        if ("http://localhost/xml/date".equals (systemId)) {
            InputSource   retval = new InputSource (systemId);
            Reader        date;

            date = new StringReader (new Date ().toString ());
            retval.setCharacterStream (date);
            return retval;
        }

        // nonstandard URI scheme?
        if (systemId.startsWith ("blob:")) {
            InputSource   retval = new InputSource (systemId);
            String        key = systemId.substring (5);
            // Storage is an application-specific class, not shown here
            byte          data [] = Storage.keyToBlob (key);

            retval.setByteStream (new ByteArrayInputStream (data));
            return retval;
        }

        // use table to map public id to local URL?
        if (map != null && publicId != null) {
            String url = map.get (publicId);
            if (url != null)
                return new InputSource (url);
        }

        // chain to next resolver?
        if (next != null)
            return next.resolveEntity (publicId, systemId);

        // no match: let the parser resolve systemId on its own
        return null;
    }
}

Traditionally, public identifiers are mainly used as keys to find local copies of entities. In SGML, system identifiers were optional and system-specific, so public identifiers were sometimes the only ones available. (XML changed this: system identifiers are mandatory and are URIs.) In essence, public identifiers were used in SGML to serve the role that URNs serve in web-oriented architectures. An ISO standard for FPIs exists, and now RFC 3151 (available at http://www.ietf.org/rfc/rfc3151.txt) defines a mapping from FPIs to URNs. (The FPI is normalized and transformed, then gets a urn:publicid: prefix.) When public identifiers are used with XML systems, it's largely by adopting FPI policies to interoperate with such SGML systems; however, XML public identifiers don't need to be FPIs. You may prefer to use URN schemes in newer systems. If so, be aware that some XML processing engines support only URLs as system identifiers. By letting applications interpret public IDs as URNs, SAX offers more power than some other XML APIs do.
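The FPI-to-URN transformation mentioned above can be sketched in a few lines. The substitution table in this code reflects one reading of the RFC 3151 transcription rules and should be checked against the RFC before you rely on it; the class name PublicIdUrn and the method toUrn() are hypothetical.

```java
// Hypothetical helper: turns a public identifier into a urn:publicid: URN.
class PublicIdUrn {
    static String toUrn(String publicId) {
        // Normalize first: trim, and collapse whitespace runs to one space.
        String fpi = publicId.trim().replaceAll("[ \\t\\r\\n]+", " ");
        StringBuilder urn = new StringBuilder("urn:publicid:");

        for (int i = 0; i < fpi.length(); i++) {
            char c = fpi.charAt(i);
            boolean doubled = i + 1 < fpi.length() && fpi.charAt(i + 1) == c;
            switch (c) {
                case '/':   // "//" becomes ":", a lone "/" is escaped
                    if (doubled) { urn.append(':'); i++; }
                    else urn.append("%2F");
                    break;
                case ':':   // "::" becomes ";", a lone ":" is escaped
                    if (doubled) { urn.append(';'); i++; }
                    else urn.append("%3A");
                    break;
                case ' ':  urn.append('+');   break;
                case '+':  urn.append("%2B"); break;
                case ';':  urn.append("%3B"); break;
                case '\'': urn.append("%27"); break;
                case '?':  urn.append("%3F"); break;
                case '#':  urn.append("%23"); break;
                case '%':  urn.append("%25"); break;
                default:   urn.append(c);
            }
        }
        return urn.toString();
    }
}
```

A resolver could apply the inverse mapping to a urn:publicid: system identifier and then look up the recovered public identifier in its table, so documents may use either form.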

If you want richer catalog-style functionality than the table mapping shown earlier, look for open source implementations of the XML version of the OASIS SGML/Open Catalog (SOCAT). At this time, a specification for such a catalog is a stable draft, still in development; see http://www.oasis.org/committees/entity/ for more information. This specification defines an XML text representation of mappings; the mappings can be significantly more complex than the tabular one shown earlier.
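Java runtimes from version 9 onward bundle a javax.xml.catalog package whose CatalogResolver implements EntityResolver, so it can be one source of such functionality. The sketch below assumes that package is available; the class CatalogSetup and the catalog location passed to it are hypothetical.

```java
import java.net.URI;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import org.xml.sax.XMLReader;

class CatalogSetup {
    // catalogUri names an OASIS XML Catalog file (hypothetical location).
    static CatalogResolver install(XMLReader reader, URI catalogUri) {
        CatalogResolver resolver =
            CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalogUri);
        // CatalogResolver is an EntityResolver, so it plugs straight in.
        reader.setEntityResolver(resolver);
        return resolver;
    }
}
```

With the default features, identifiers that the catalog doesn't mention are expected to raise an exception rather than fall through to the parser; the catalog feature settings control that behavior.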

Copyright © 2002 O'Reilly & Associates. All rights reserved.