9.6 Unicode
Plain strings are converted into Unicode
strings either explicitly, with the unicode
built-in, or implicitly, when you pass a plain string to a function
that expects Unicode. In either case, the conversion is done by an
auxiliary object known as a codec (for
coder-decoder). A codec can also convert Unicode strings to plain
strings either explicitly, with the encode method
of Unicode strings, or
implicitly.
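For example, here is a minimal sketch of both conversions, naming the standard 'ascii' codec explicitly in each direction:
u = unicode('ciao', 'ascii')    # plain string to Unicode string
s = u.encode('ascii')           # Unicode string back to plain string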
You
identify a codec by passing the codec name to
unicode or encode. When you pass no codec name, and for implicit
conversions, Python uses a default
encoding, normally 'ascii'. (You can change the
default encoding in the startup phase of a Python program, as covered
in Chapter 13; see also
setdefaultencoding in Chapter 8.) Every conversion has an explicit or implicit
argument errors, a string specifying how
conversion errors are to be handled. The default is
'strict', meaning any error raises an exception.
When errors is
'replace', the conversion replaces each character
causing an error with '?' in a plain-string result
or with u'\ufffd' in a Unicode result. When
errors is 'ignore', the
conversion silently skips characters that cause
errors.
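For example, the following sketch shows each setting of errors at work, decoding the byte '\xe9' (which is not valid ASCII) and encoding the corresponding Unicode character:
u = unicode('\xe9', 'ascii', 'replace')     # u is u'\ufffd'
u = unicode('\xe9', 'ascii', 'ignore')      # u is u''
s = u'\u00e9'.encode('ascii', 'replace')    # s is '?'
try: unicode('\xe9', 'ascii', 'strict')     # same as the default: raises
except UnicodeError: pass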
9.6.1 The codecs Module
The mapping of codec names to codec
objects is handled by the codecs module. This
module lets you develop your own codec objects and register them so
that they can be looked up by name, just like built-in codecs. Module
codecs also lets you look up any codec explicitly,
obtaining the functions the codec uses for encoding and decoding, as
well as factory functions to wrap file-like objects. Such advanced
facilities of module codecs are rarely used, and
are not covered further in this book.
The codecs module,
together with the encodings package, supplies
built-in codecs useful to Python developers dealing with
internationalization issues. Any supplied codec can be installed as
the default by module sitecustomize, or can be
specified by name when converting explicitly between plain and
Unicode strings. The codec normally installed by default is
'ascii', which accepts only characters with codes
between 0 and 127, the 7-bit range of the American Standard Code for
Information Interchange (ASCII) that is common to most encodings. A
popular codec is 'latin-1', a fast, built-in
implementation of the ISO 8859-1 encoding that offers a
one-byte-per-character encoding of all special characters needed for
Western European languages.
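For example, the byte '\xe9' is an error for the default 'ascii' codec, since its code, 233, is outside the 7-bit range, but 'latin-1' handles it directly (a minimal sketch, assuming the default encoding is still 'ascii'):
u = unicode('\xe9', 'latin-1')    # u is u'\xe9', the character é in ISO 8859-1
s = u.encode('latin-1')           # back to the single byte '\xe9'
try: u.encode('ascii')            # raises UnicodeError: 233 is outside 0 to 127
except UnicodeError: pass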
The
codecs module also supplies codecs implemented in
Python for most ISO 8859 encodings, with codec names from
'iso8859-1' to 'iso8859-15'. On
Windows systems only, the codec named 'mbcs' wraps
the platform's multibyte character set conversion
procedures. In Python 2.2, many codecs were added to support Asian
languages. Module codecs also supplies several
standard code pages (codec names from 'cp037' to
'cp1258'), Mac-specific encodings (codec names
from 'mac-cyrillic' to
'mac-turkish'), and Unicode standard encodings
'utf-8' and 'utf-16' (the
latter also has specific big-endian and little-endian variants
'utf-16-be' and 'utf-16-le').
For use with UTF-16, module codecs also supplies
attributes BOM_BE and BOM_LE,
byte-order marks for big-endian and little-endian machines
respectively, and BOM, the byte-order mark for the
current platform.
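For example, here is a small sketch that builds big-endian UTF-16 data carrying an explicit byte-order mark, so that a reader on any platform can detect the byte order:
import codecs
# plain 'utf-16' prepends the platform's byte-order mark automatically;
# the endian-specific variants do not, so prepend one explicitly
data = codecs.BOM_BE + u'ciao'.encode('utf-16-be')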
Module codecs also supplies two functions to make
it easier to deal with encoded text during input/output operations.
EncodedFile(file, datacodec, filecodec=None, errors='strict')
Wraps the file-like object file, returning
another file-like object ef that
implicitly and transparently applies the given encodings to all data
read from or written to the file. When you write a string
s to ef,
ef first decodes
s with the codec named by
datacodec, then encodes the result with
the codec named by filecodec, and lastly
writes it to file. When you read a string,
ef applies
filecodec first, then
datacodec. When
filecodec is None,
ef uses
datacodec for both steps in either
direction.
For example, if you want to write strings that are encoded in
latin-1 to sys.stdout and have
the strings come out in utf-8, use the following:
import sys, codecs
sys.stdout = codecs.EncodedFile(sys.stdout, 'latin-1', 'utf-8')
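Afterwards, each plain string you write to sys.stdout must be valid latin-1, and what is actually output are the corresponding utf-8 bytes. For example:
sys.stdout.write('\xe9\n')    # '\xe9' is é in latin-1; the bytes output are '\xc3\xa9\n'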
open(filename, mode='rb', encoding=None, errors='strict', buffering=1)
Uses the built-in function open (covered in Chapter 10) to supply a file-like object that accepts
and/or provides Unicode strings to/from Python client code, while the
underlying file can either be in Unicode (when
encoding is None) or
use the codec named by encoding. For
example, if you want to write Unicode strings to file
uni.txt and have the strings implicitly encoded
as latin-1 in the file, replacing with
'?' any character that cannot be encoded in
Latin-1, use the following:
import codecs
flout = codecs.open('uni.txt', 'w', 'latin-1', 'replace')
# now you can write Unicode strings directly to flout
flout.write(u'élève')
flout.close()
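To read the file back, open it the same way; method read then returns Unicode strings, implicitly decoded from latin-1:
flin = codecs.open('uni.txt', 'r', 'latin-1')
u = flin.read()    # u is the Unicode string u'\xe9l\xe8ve'
flin.close()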
9.6.2 The unicodedata Module
The unicodedata
module supplies easy access to the Unicode Character Database. Given
any Unicode character, you can use functions supplied by module
unicodedata to obtain the
character's Unicode category, official name (if
any), and other, more exotic information. You can also look up the
Unicode character (if any) corresponding to a given official name.
Such advanced facilities are rarely needed, and are not covered
further in this book.
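For example, here is a small sketch of the kind of lookups just described:
import unicodedata
print unicodedata.name(u'\xe9')        # prints: LATIN SMALL LETTER E WITH ACUTE
print unicodedata.category(u'\xe9')    # prints: Ll (i.e., letter, lowercase)
c = unicodedata.lookup('LATIN SMALL LETTER E WITH ACUTE')
print repr(c)                          # prints: u'\xe9'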