Converting Between Character Sets (XML in a Nutshell, 2nd Edition)

5.8. Converting Between Character Sets

The ultimate solution to this character set morass is to use Unicode in either UTF-16 or UTF-8 format for all your XML documents. An increasing number of tools support one of these two formats natively; even the unassuming Notepad offers an option to save files in Unicode in Windows NT 4.0 and 2000. Microsoft Word 97 and later saves the text of its documents in Unicode, though unlike XML documents, Word files are hardly pure text. Much of the binary data in a Word file is not Unicode or any other kind of text. However, Word 2000 can actually save plain text files into Unicode and is the authors' Unicode editor of choice these days when we need to type a document in several different, complex scripts. To save Word 2000 as plain Unicode text, select the format Encoded Text from the Save As Type: Choice menu in Word's Save As dialog box. Then select one of the four Unicode formats in the resulting File Conversion dialog box.

Nonetheless, most of our current tools are still adapted primarily for vendor-specific character sets that can't handle more than a few languages at one time. Thus, learning how to convert your documents from proprietary to more standard character sets is crucial.

Some of the better XML and HTML editors let you choose the character set you wish to save in and perform automatic conversions from the native character set you use for editing. On Unix, the native character set is likely one of the standard ISO character sets, and you can save into that format directly. On the Mac, you can avoid problems if you stick to pure ASCII documents. On Windows, you can go a little further and use Latin-1, if you're careful to stay away from the extra characters that aren't part of the official ISO-8859-1 specification. Otherwise, you'll have to convert your document from its native, platform-dependent encoding to one of the standard platform-independent character sets.

François Pinard has written an open source character-set conversion tool called recode for Linux and Unix, which you can download from http://www.iro.umontreal.ca/contrib/recode/, as well as the usual GNU mirror sites. Wojciech Galazka has ported recode to DOS. See ftp://ftp.simtel.net/pub/simtelnet/gnu/djgpp/v2gnu/rcode34b.zip. You can also use the Java Development Kit's native2ascii tool at http://java.sun.com/products/jdk/1.3/docs/tooldocs/win32/native2ascii.html. First convert the file from its native encoding to Java's special ASCII-encoded Unicode format, then use the same tool in reverse to convert from the Java format to the encoding you actually want. For example, to convert the file myfile.xml from the Windows Cp1252 encoding to UTF-8, execute these two commands in sequence:

% native2ascii -encoding Cp1252 myfile.xml myfile.jtx
% native2ascii -reverse -encoding UTF-8 myfile.jtx myfile.xml