Internationalization with XSLT (Java and XSLT)

8.6.3. Putting It All Together

Getting all of the pieces to work together is often the trickiest aspect of i18n. To demonstrate the concepts, we will now look at XML datafiles, XSLT stylesheets, and a servlet that work together to support any combination of English, Chinese, and Spanish. A basic HTML form makes it possible for users to select which XML file and XSLT stylesheet will be used to perform a transformation. The screen shot in Figure 8-9 shows what this web page looks like.

Figure 8-9. XML and XSLT language selection

As you can see, there are three versions of the XML data, one for each language. Other than the language, the three files are identical. There are also three versions of the XSLT stylesheet, and the user can select any combination of XML and XSLT language. The character encoding for the resulting transformation is also configurable. UTF-8 and UTF-16 are Unicode encodings and can represent the Spanish and Chinese characters directly. ISO-8859-1, however, can represent characters outside its own repertoire only through character entities such as &#25991;.

In this example, users explicitly specify their language preference. It is also possible to write a servlet that uses the Accept-Language HTTP header, which may contain a list of preferred languages:

en, es, ja

From this list, the application can attempt to select the appropriate language and character encoding without prompting the user. Chapter 13 of Java Servlet Programming, Second Edition by Jason Hunter (O'Reilly) presents a detailed discussion of this technique along with a class called LocaleNegotiator that maps more than 30 language codes to their appropriate character encodings.
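
As a rough illustration of this kind of negotiation, the sketch below uses the JDK's own Locale.LanguageRange and Locale.lookup classes (Java 8 and later) rather than the LocaleNegotiator class mentioned above. The class name LanguageNegotiator, the choose method, and the fallback to English are assumptions made for this example; the return values match the language names used by the form in this section.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class LanguageNegotiator {
    // Languages for which this example has XML and XSLT files.
    private static final List<Locale> SUPPORTED = Arrays.asList(
            Locale.forLanguageTag("en"),
            Locale.forLanguageTag("es"),
            Locale.forLanguageTag("zh"));

    /** Maps an Accept-Language value to "english", "spanish", or "chinese". */
    public static String choose(String acceptLanguage) {
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return "english"; // assumed default when the header is absent
        }
        try {
            // parse() orders the ranges by their q-values, highest first
            List<Locale.LanguageRange> ranges =
                    Locale.LanguageRange.parse(acceptLanguage);
            Locale best = Locale.lookup(ranges, SUPPORTED);
            if (best == null) {
                return "english"; // none of the preferred languages is supported
            }
            switch (best.getLanguage()) {
                case "es": return "spanish";
                case "zh": return "chinese";
                default:   return "english";
            }
        } catch (IllegalArgumentException malformedHeader) {
            return "english"; // ignore headers that cannot be parsed
        }
    }

    public static void main(String[] args) {
        System.out.println(choose("ja, es;q=0.8, en;q=0.5"));
    }
}
```

In a servlet, the header value would come from req.getHeader("Accept-Language"), and the result could stand in for the xmlLanguage and xsltLanguage form parameters.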

In Figure 8-10, the results of three different transformations are displayed. In the first window, a Chinese XSLT stylesheet is applied to a Chinese XML datafile. In the second window, the English version of the XSLT stylesheet is applied to the Spanish XML data. Finally, the Spanish XSLT stylesheet is applied to the Chinese XML data.

Figure 8-10. Several language combinations

The character encoding is generally transparent to the user. Switching to a different encoding makes no difference to the output displayed in Figure 8-10. However, it does make a difference when the page source is viewed. For example, when the output is UTF-8, the actual Chinese or Spanish characters are displayed in the source of the HTML page. When using ISO-8859-1, however, the source code looks something like this:

<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>&#20013;&#25991;XSLT</title>
</head>
<body>
<h1>&#20013;&#25991;XSLT</h1>
...remainder of page omitted

As you can see, the Chinese characters are replaced by their corresponding character entities, such as &#20013;. The XSLT processor creates these entities automatically when the output encoding cannot represent the characters directly.
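
This behavior can be observed outside of a servlet. The following sketch, which assumes the XSLT processor bundled with the JDK, runs an identity transformation (a Transformer created without a stylesheet) and serializes the result with a chosen output encoding. The class name EntityDemo and the serialize method are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class EntityDemo {
    /** Copies the XML unchanged, serializing it with the given encoding. */
    public static String serialize(String xml, String encoding) {
        try {
            // newTransformer() without a stylesheet is an identity transform
            Transformer identity =
                    TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.ENCODING, encoding);
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            identity.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(bytes));

            // decode the bytes with the same encoding that produced them
            return new String(bytes.toByteArray(), Charset.forName(encoding));
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<title>\u4e2d\u6587XSLT</title>";
        System.out.println(serialize(xml, "ISO-8859-1"));
    }
}
```

With ISO-8859-1 the two Chinese characters should come back as &#20013;&#25991;, while with UTF-8 the same call returns the characters themselves.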

Browser Fonts

Recent versions of any major web browser can display UTF-8 and UTF-16 encoded characters without problems. Font configuration is the primary concern. If you are using Internet Explorer, be sure to select the View → Encoding → Auto Select menu option. Under Netscape 6, the View → Character Coding → Auto Detect menu option is comparable. If you run the examples and see question marks and garbled text, this is a good indication that the proper fonts are not installed on your system.

For the Chinese examples shown in this chapter, the Windows 2000 SimHei and SimSun fonts were installed. These and many other fonts are included with Windows 2000 but are not automatically installed unless the appropriate language settings are selected under the regional options window. This window can be found in the Windows 2000 Control Panel. A good source for font information on other versions of Windows is Fontboard at http://www.geocities.com/fontboard.

Sun Solaris users should start at the Sun Global Application Developer Corner web site at http://www.sun.com/developers/gadc/. This offers information on internationalization support in the latest versions of the Solaris operating system. For other versions of Unix or Linux, a good starting point is the Netscape 6 Help menu. The International Users option brings up a web page that provides numerous sources of fonts for various versions of Unix and Linux on which Netscape runs.

8.6.3.1. XML data

Each of the three XML datafiles used by this example follows the format shown in Example 8-20. As you can see, the XML data merely lists translations from English to another language. All three files follow the same naming convention: numbers_english.xml, numbers_spanish.xml, and numbers_chinese.xml.

Example 8-20. numbers_spanish.xml

<?xml version="1.0" encoding="UTF-8"?>
<numbers>
  <language>Español (Spanish)</language>
  <number english="one">uno</number>
  <number english="two">dos</number>
  <number english="three">tres</number>
  <number english="four">cuatro</number>
  <number english="five">cinco</number>
  <number english="six">seis</number>
  <number english="seven">siete</number>
  <number english="eight">ocho</number>
  <number english="nine">nueve</number>
  <number english="ten">diez</number>
</numbers>

8.6.3.2. XSLT stylesheets

The numbers_english.xslt stylesheet is shown in Example 8-21 and follows the same pattern that was introduced earlier in this chapter. Specifically, it isolates locale-specific data as a series of variables.

Example 8-21. numbers_english.xslt

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="UTF-8"/>
  
  <xsl:variable name="lang.pageTitle">XSLT in English</xsl:variable>
  <xsl:variable name="lang.tableCaption">
    Here is a table of numbers:
  </xsl:variable>
  <xsl:variable name="lang.englishHeading">English</xsl:variable>
  
  <xsl:template match="/">
    <html>
      <head>
        <title><xsl:value-of select="$lang.pageTitle"/></title>
      </head>
      <body>
        <xsl:apply-templates select="numbers"/>
      </body>
    </html>
  </xsl:template>
  <xsl:template match="numbers">
    <h1><xsl:value-of select="$lang.pageTitle"/></h1>
    <xsl:value-of select="$lang.tableCaption"/>
    <table border="1">
      <tr>
        <th><xsl:value-of select="$lang.englishHeading"/></th>
        <th>
          <xsl:value-of select="language"/>
        </th>
      </tr>
      <xsl:apply-templates select="number"/>
    </table>
  </xsl:template>
  <xsl:template match="number">
    <tr>
      <td>
        <xsl:value-of select="@english"/>
      </td>
      <td>
        <xsl:value-of select="."/>
      </td>
    </tr>
  </xsl:template>
</xsl:stylesheet>

As you can see, the default output encoding of this stylesheet is UTF-8. This can (and will) be overridden by the servlet, however. The Spanish stylesheet, numbers_spanish.xslt, is shown in Example 8-22.

Example 8-22. numbers_spanish.xslt

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:import href="numbers_english.xslt"/>
  
  <xsl:variable name="lang.pageTitle">XSLT en Español</xsl:variable>
  <xsl:variable name="lang.tableCaption">
    Aquí está un vector de números:
  </xsl:variable>
  <xsl:variable name="lang.englishHeading">Inglés</xsl:variable>

</xsl:stylesheet>

The Chinese stylesheet, numbers_chinese.xslt, is not listed here because it is structured exactly like the Spanish stylesheet. In both cases, numbers_english.xslt is imported, and the three variables are overridden with language-specific text.
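
The effect of the import can be checked with a small, self-contained program. The sketch below is only an approximation of the real files: it keeps two cut-down stylesheets in strings, resolves the import with a URIResolver instead of reading from disk, and uses text output to keep the result short. The class name ImportOverrideDemo, the href base.xslt, and the run method are assumptions for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ImportOverrideDemo {
    private static final String XSL_NS = "http://www.w3.org/1999/XSL/Transform";

    // Stand-in for numbers_english.xslt: declares the variable and uses it.
    public static final String BASE =
        "<xsl:stylesheet version='1.0' xmlns:xsl='" + XSL_NS + "'>"
      + "<xsl:output method='text'/>"
      + "<xsl:variable name='lang.pageTitle'>XSLT in English</xsl:variable>"
      + "<xsl:template match='/'>"
      + "<xsl:value-of select='$lang.pageTitle'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Stand-in for numbers_spanish.xslt: imports the base, overrides the variable.
    public static final String SPANISH =
        "<xsl:stylesheet version='1.0' xmlns:xsl='" + XSL_NS + "'>"
      + "<xsl:import href='base.xslt'/>"
      + "<xsl:variable name='lang.pageTitle'>XSLT en Espa\u00f1ol</xsl:variable>"
      + "</xsl:stylesheet>";

    /** Applies the given stylesheet to an empty numbers document. */
    public static String run(String stylesheet) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            // every xsl:import is answered with the in-memory base stylesheet
            factory.setURIResolver((href, base) ->
                    new StreamSource(new StringReader(BASE)));
            Transformer trans = factory.newTransformer(
                    new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            trans.transform(new StreamSource(new StringReader("<numbers/>")),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(BASE));
        System.out.println(run(SPANISH));
    }
}
```

Because the importing stylesheet has higher import precedence, its definition of lang.pageTitle is the one the imported template sees.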

8.6.3.3. Web page and servlet

The user begins with the web page that was shown in Figure 8-9. The HTML source for this page is listed in Example 8-23. The language and encoding selections are posted to a servlet when the user clicks on the Submit button.

Example 8-23. i18n.html

<html>
<head>
<title>Internationalization Demo</title>
</head>
<body>
<form method="post" action="/chap8/languageDemo">
  <table border="1">
    <tr valign="top">
    <td>XML Language:</td>
    <td>
      <input type="radio" name="xmlLanguage" 
             checked="checked" value="english"> English<br />
      <input type="radio" name="xmlLanguage" value="spanish"> Spanish<br />
      <input type="radio" name="xmlLanguage" value="chinese"> Chinese
    </td>
  </tr>
  
  <tr valign="top">
    <td>XSLT Language:</td>
    <td>
      <input type="radio" name="xsltLanguage" 
              checked="checked" value="english"> English<br />
      <input type="radio" name="xsltLanguage" value="spanish"> Spanish<br />
      <input type="radio" name="xsltLanguage" value="chinese"> Chinese
    </td>
  </tr>

  <tr valign="top">
    <td>Character Encoding:</td>
    <td>
      <input type="radio" name="charEnc" value="ISO-8859-1"> ISO-8859-1<br />
      <input type="radio" name="charEnc" value="UTF-8" 
               checked="checked"> UTF-8<br />
      <input type="radio" name="charEnc" value="UTF-16"> UTF-16<br />
    </td>
  </tr>
  </table>
  
  <p>
  <input type="submit" name="submitBtn" value="Submit">
  </p>
</form>
</body>
</html>

The servlet, LanguageDemo.java, is shown in Example 8-24. This servlet accepts input from the i18n.html web page and then applies the XSLT transformation.

Example 8-24. LanguageDemo.java servlet

package chap8;

import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;

/**
 * Allows any combination of English, Spanish, and Chinese XML
 * and XSLT.
 */
public class LanguageDemo extends HttpServlet {

    public void doPost(HttpServletRequest req, HttpServletResponse res)
            throws ServletException, IOException {
        ServletContext ctx = getServletContext( );

        // these are all required parameters from the HTML form
        String xmlLang = req.getParameter("xmlLanguage");
        String xsltLang = req.getParameter("xsltLanguage");
        String charEnc = req.getParameter("charEnc");

        // convert to system-dependent path names
        String xmlFileName = ctx.getRealPath(
                "/WEB-INF/xml/numbers_" + xmlLang + ".xml");
        String xsltFileName = ctx.getRealPath(
                "/WEB-INF/xslt/numbers_" + xsltLang + ".xslt");

        // do this BEFORE calling HttpServletResponse.getWriter( )
        res.setContentType("text/html; charset=" + charEnc);

        try {
            Source xmlSource = new StreamSource(new File(xmlFileName));
            Source xsltSource = new StreamSource(new File(xsltFileName));

            TransformerFactory transFact = TransformerFactory.newInstance( );
            Transformer trans = transFact.newTransformer(xsltSource);

            trans.setOutputProperty(OutputKeys.ENCODING, charEnc);

            // note: res.getWriter( ) will use the encoding type that was
            //       specified earlier in the call to res.setContentType( )
            trans.transform(xmlSource, new StreamResult(res.getWriter( )));

        } catch (TransformerConfigurationException tce) {
            throw new ServletException(tce);
        } catch (TransformerException te) {
            throw new ServletException(te);
        }
    }
}

After getting the three request parameters for XML, XSLT, and encoding, the servlet converts the XML and XSLT names to actual filenames:

String xmlFileName = ctx.getRealPath(
        "/WEB-INF/xml/numbers_" + xmlLang + ".xml");
String xsltFileName = ctx.getRealPath(
        "/WEB-INF/xslt/numbers_" + xsltLang + ".xslt");

Because the XML files and XSLT stylesheets are named consistently, it is easy to determine the filenames. The next step is to set the content type of the response:

// do this BEFORE calling HttpServletResponse.getWriter( )
res.setContentType("text/html; charset=" + charEnc);

This is a critical step that instructs the servlet container to send the response to the client using the specified encoding type. This gets inserted into the Content-Type HTTP response header, allowing the browser to determine which encoding to expect. In our example, the three possible character encodings result in the following possible content types:

Content-Type: text/html; charset=ISO-8859-1
Content-Type: text/html; charset=UTF-8
Content-Type: text/html; charset=UTF-16
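
Note that the servlet places the three request parameters directly into filenames and into the Content-Type header. A cautious variation, sketched below, first checks each value against the choices offered by i18n.html. The RequestParams class and its two methods are hypothetical helpers and are not part of the example servlet.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RequestParams {
    // the only values that the i18n.html form can submit
    private static final Set<String> LANGUAGES =
            new HashSet<>(Arrays.asList("english", "spanish", "chinese"));
    private static final Set<String> ENCODINGS =
            new HashSet<>(Arrays.asList("ISO-8859-1", "UTF-8", "UTF-16"));

    /** Returns, for example, "/WEB-INF/xml/numbers_spanish.xml". */
    public static String resourcePath(String dir, String language, String ext) {
        if (language == null || !LANGUAGES.contains(language)) {
            throw new IllegalArgumentException(
                    "Unsupported language: " + language);
        }
        return "/WEB-INF/" + dir + "/numbers_" + language + "." + ext;
    }

    /** Returns the Content-Type value for a known encoding. */
    public static String contentType(String charEnc) {
        if (charEnc == null || !ENCODINGS.contains(charEnc)) {
            throw new IllegalArgumentException(
                    "Unsupported encoding: " + charEnc);
        }
        return "text/html; charset=" + charEnc;
    }

    public static void main(String[] args) {
        System.out.println(resourcePath("xml", "spanish", "xml"));
        System.out.println(contentType("UTF-8"));
    }
}
```

The servlet could pass the result of resourcePath to ctx.getRealPath( ) and the result of contentType to res.setContentType( ), rejecting the request when an IllegalArgumentException is thrown.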

Next, the servlet uses the javax.xml.transform.Source interface and the javax.xml.transform.stream.StreamSource class to read from the XML and XSLT files:

Source xmlSource = new StreamSource(new File(xmlFileName));
Source xsltSource = new StreamSource(new File(xsltFileName));

By using java.io.File, the StreamSource will correctly determine the encoding of the XML and XSLT files by looking at the XML declaration within each of the files. The StreamSource constructor also accepts InputStream or Reader as parameters. Special precautions must be taken with the Reader constructors, because a Reader hands the parser characters that have already been decoded, so the encoding named in the XML declaration is not consulted. Convenience classes such as FileReader decode with the default Java character encoding, which is determined when the VM starts up. To specify an encoding explicitly, use an InputStreamReader as follows:

Source xmlSource = new StreamSource(new InputStreamReader(
        new FileInputStream(xmlFileName), "UTF-8"));

For more information on how Java uses encodings, see the JavaDoc package description for the java.lang package.
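
The difference between the two kinds of constructors can be seen with a short experiment. This sketch, which assumes the parser and XSLT processor bundled with the JDK, stores a small document as ISO-8859-1 bytes and reads it back through an InputStream and through a Reader. The class and method names are invented for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

public class ReaderVersusStream {
    /** A document stored as ISO-8859-1 bytes, as its XML declaration says. */
    public static byte[] sampleBytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<language>Espa\u00f1ol</language>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Parses the source into a DOM tree and returns the root element's text.
    private static String rootText(Source source) {
        try {
            DOMResult result = new DOMResult();
            TransformerFactory.newInstance().newTransformer()
                    .transform(source, result);
            Document doc = (Document) result.getNode();
            return doc.getDocumentElement().getTextContent();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    /** The parser sees raw bytes and honors the XML declaration. */
    public static String viaStream(byte[] xml) {
        return rootText(new StreamSource(new ByteArrayInputStream(xml)));
    }

    /** The Reader decodes the bytes before the parser ever sees them. */
    public static String viaReader(byte[] xml, Charset charset) {
        return rootText(new StreamSource(
                new InputStreamReader(new ByteArrayInputStream(xml), charset)));
    }

    public static void main(String[] args) {
        byte[] xml = sampleBytes();
        System.out.println(viaStream(xml).equals("Espa\u00f1ol"));
        System.out.println(viaReader(xml, StandardCharsets.UTF_8)
                .equals("Espa\u00f1ol"));
    }
}
```

Reading through the InputStream recovers the original text, whereas a Reader created with the wrong charset (UTF-8 in main) damages the accented character before parsing begins.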

Our servlet then overrides the XSLT stylesheet's output encoding as follows:

trans.setOutputProperty(OutputKeys.ENCODING, charEnc);

This takes precedence over the encoding that was specified in the <xsl:output> element shown earlier in Example 8-21.

Finally, the servlet performs the transformation, sending the result tree to a Writer obtained from HttpServletResponse:

// note: res.getWriter( ) will use the encoding type that was
//       specified earlier in the call to res.setContentType( )
trans.transform(xmlSource, new StreamResult(res.getWriter( )));

As the comment indicates, the servlet container should set up the Writer to use the correct character encoding, as specified by the Content-Type HTTP header.[43]

[43] UTF-16 works under Tomcat 3.2.x but fails under Tomcat 4.0 beta 5. Hopefully this will be addressed in later versions of Tomcat.

8.6.4. I18n Troubleshooting Checklist

Here are a few things to consider when problems occur. First, rule out obvious problems:

Visit a web site that uses the language you are trying to produce. For example, http://www.chinadaily.com.cn/ has an option to view the site in Chinese. This will confirm that your browser loads the correct fonts.
Test your application with English XML data and XSLT stylesheets to verify that the transformations are performed correctly.
Perform the XSLT transformation on the command line. Save the result to a file and view with a Unicode-compatible text editor. If all else fails, view with a binary editor to see how the characters are being encoded.
Verify that your XML parser supports the encodings you are trying to parse.[44]

[44] Encodings supported by Apache's Xerces parser are documented at http://xml.apache.org/xerces-j/faq-general.html.

If these tests do not uncover the problem, try the following:

Stick with UTF-8 encoding until problems are resolved. This is the most compatible encoding.
Verify that the servlet sets the Content-Type header to:
```
Content-Type: text/html; charset=UTF-8
```

Verify that the XSLT stylesheet sets the appropriate encoding on the <xsl:output> element or override the encoding programmatically:
```
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
```

Insert some code into the servlet that performs the transformation but sends the result to a file instead of the HttpServletResponse's Writer. Inspect this file with a Unicode-compatible text editor.
Use java.io.File or java.io.InputStream instead of java.io.Reader when reading XML and XSLT files.
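
The suggestions about writing the result to a file and inspecting the raw bytes can be combined in a few lines. The following sketch is a standalone program rather than servlet code, and the TransformToFile class, its method names, and the temporary filename are assumptions. It uses an identity transformation in place of a real stylesheet. In an actual debugging session you would keep the file and open it in an editor instead of deleting it.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TransformToFile {
    /** Formats bytes as lowercase hex pairs separated by spaces. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Writes the transformation result to a temporary file, returns its bytes. */
    public static byte[] transformToFile(String xml, String encoding) {
        try {
            Path file = Files.createTempFile("i18n-debug", ".out");
            try {
                Transformer trans =
                        TransformerFactory.newInstance().newTransformer();
                trans.setOutputProperty(OutputKeys.ENCODING, encoding);
                trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
                try (OutputStream out = Files.newOutputStream(file)) {
                    trans.transform(new StreamSource(new StringReader(xml)),
                            new StreamResult(out));
                }
                return Files.readAllBytes(file);
            } finally {
                Files.deleteIfExists(file); // keep the file when debugging for real
            }
        } catch (IOException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hex(transformToFile("<t>\u4e2d</t>", "UTF-8")));
    }
}
```

In UTF-8 the character &#20013; occupies the three bytes e4 b8 ad, so seeing that sequence in the dump confirms that the transformation itself encoded the text correctly and that any remaining problem lies in the response headers or the browser.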
