9.6 Unicode
Plain strings are converted into Unicode
strings either explicitly, with the unicode
built-in, or implicitly, when you pass a plain string to a function
that expects Unicode. In either case, the conversion is done by an
auxiliary object known as a codec (for
coder-decoder). A codec can also convert Unicode strings to plain
strings either explicitly, with the encode method
of Unicode strings, or
implicitly.
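For example, here is a minimal sketch of both conversions, naming the standard 'ascii' codec explicitly in each direction:
u = unicode('ciao', 'ascii')    # plain string to Unicode string
s = u.encode('ascii')           # Unicode string back to plain string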
You
identify a codec by passing the codec name to
unicode or encode. When you pass no codec name, and for implicit
conversions, Python uses a default
encoding, normally 'ascii'. (You can change the
default encoding in the startup phase of a Python program, as covered
in Chapter 13; see also
setdefaultencoding in Chapter 8.) Every conversion has an explicit or implicit
argument errors, a string specifying how
conversion errors are to be handled. The default is
'strict', meaning any error raises an exception.
When errors is
'replace', the conversion replaces each character
causing an error with '?' in a plain-string result
or with u'\ufffd' in a Unicode result. When
errors is 'ignore', the
conversion silently skips characters that cause
errors.
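For example, the following sketch shows each setting of errors at work, decoding the byte '\xe9' (which is not valid ASCII) and encoding the corresponding Unicode character:
u = unicode('\xe9', 'ascii', 'replace')     # u is u'\ufffd'
u = unicode('\xe9', 'ascii', 'ignore')      # u is u''
s = u'\u00e9'.encode('ascii', 'replace')    # s is '?'
try: unicode('\xe9', 'ascii', 'strict')     # same as the default: raises
except UnicodeError: pass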
9.6.1 The codecs Module
The mapping of codec names to codec
objects is handled by the codecs module. This
module lets you develop your own codec objects and register them so
that they can be looked up by name, just like built-in codecs. Module
codecs also lets you look up any codec explicitly,
obtaining the functions the codec uses for encoding and decoding, as
well as factory functions to wrap file-like objects. Such advanced
facilities of module codecs are rarely used, and
are not covered further in this book.
The codecs module,
together with the encodings package, supplies
built-in codecs useful to Python developers dealing with
internationalization issues. Any supplied codec can be installed as
the default by module sitecustomize, or can be
specified by name when converting explicitly between plain and
Unicode strings. The codec normally installed by default is
'ascii', which accepts only characters with codes
between 0 and 127, the 7-bit range of the American Standard Code for
Information Interchange (ASCII) that is common to most encodings. A
popular codec is 'latin-1', a fast, built-in
implementation of the ISO 8859-1 encoding that offers a
one-byte-per-character encoding of all special characters needed for
Western European languages.
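For example, the byte '\xe9' is an error for the default 'ascii' codec, since its code, 233, is outside the 7-bit range, but 'latin-1' handles it directly (a minimal sketch, assuming the default encoding is still 'ascii'):
u = unicode('\xe9', 'latin-1')    # u is u'\xe9', the character é in ISO 8859-1
s = u.encode('latin-1')           # back to the single byte '\xe9'
try: u.encode('ascii')            # raises UnicodeError: 233 is outside 0 to 127
except UnicodeError: pass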
The
codecs module also supplies codecs implemented in
Python for most ISO 8859 encodings, with codec names from
'iso8859-1' to 'iso8859-15'. On
Windows systems only, the codec named 'mbcs' wraps
the platform's multibyte character set conversion
procedures. In Python 2.2, many codecs were added to support Asian
languages. Module codecs also supplies several
standard code pages (codec names from 'cp037' to
'cp1258'), Mac-specific encodings (codec names
from 'mac-cyrillic' to
'mac-turkish'), and Unicode standard encodings
'utf-8' and 'utf-16' (the
latter also has specific big-endian and little-endian variants
'utf-16-be' and 'utf-16-le').
For use with UTF-16, module codecs also supplies
attributes BOM_BE and BOM_LE,
byte-order marks for big-endian and little-endian machines
respectively, and BOM, the byte-order mark for the
current platform.
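For example, here is a small sketch that builds big-endian UTF-16 data carrying an explicit byte-order mark, so that a reader on any platform can detect the byte order:
import codecs
# plain 'utf-16' prepends the platform's byte-order mark automatically;
# the endian-specific variants do not, so prepend one explicitly
data = codecs.BOM_BE + u'ciao'.encode('utf-16-be')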
Module codecs also supplies two functions to make
it easier to deal with encoded text during input/output operations.
EncodedFile(file, datacodec, filecodec=None, errors='strict')
Wraps the file-like object file, returning
another file-like object ef that
implicitly and transparently applies the given encodings to all data
read from or written to the file. When you write a string
s to ef,
ef first decodes
s with the codec named by
datacodec, then encodes the result with
the codec named by filecodec, and lastly
writes it to file. When you read a string,
ef applies
filecodec first, then
datacodec. When
filecodec is None,
ef uses
datacodec for both steps in either
direction.
For example, if you want to write strings that are encoded in
latin-1 to sys.stdout and have
the strings come out in utf-8, use the following:
import sys, codecs
sys.stdout = codecs.EncodedFile(sys.stdout, 'latin-1', 'utf-8')
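Afterwards, each plain string you write to sys.stdout must be valid latin-1, and what is actually output are the corresponding utf-8 bytes. For example:
sys.stdout.write('\xe9\n')    # '\xe9' is é in latin-1; the bytes output are '\xc3\xa9\n'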
open(filename, mode='rb', encoding=None, errors='strict', buffering=1)
Uses the built-in function open (covered in Chapter 10) to supply a file-like object that accepts
and/or provides Unicode strings to/from Python client code, while the
underlying file can either be in Unicode (when
encoding is None) or
use the codec named by encoding. For
example, if you want to write Unicode strings to file
uni.txt and have the strings implicitly encoded
as latin-1 in the file, replacing with
'?' any character that cannot be encoded in
Latin-1, use the following:
import codecs
flout = codecs.open('uni.txt', 'w', 'latin-1', 'replace')
# now you can write Unicode strings directly to flout
flout.write(u'élève')
flout.close()
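To read the file back, open it the same way; method read then returns Unicode strings, implicitly decoded from latin-1:
flin = codecs.open('uni.txt', 'r', 'latin-1')
u = flin.read()    # u is the Unicode string u'\xe9l\xe8ve'
flin.close()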
9.6.2 The unicodedata Module
The unicodedata
module supplies easy access to the Unicode Character Database. Given
any Unicode character, you can use functions supplied by module
unicodedata to obtain the
character's Unicode category, official name (if
any), and other, more exotic information. You can also look up the
Unicode character (if any) corresponding to a given official name.
Such advanced facilities are rarely needed, and are not covered
further in this book.
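For example, here is a small sketch of the kind of lookups just described:
import unicodedata
print unicodedata.name(u'\xe9')        # prints: LATIN SMALL LETTER E WITH ACUTE
print unicodedata.category(u'\xe9')    # prints: Ll (i.e., letter, lowercase)
c = unicodedata.lookup('LATIN SMALL LETTER E WITH ACUTE')
print repr(c)                          # prints: u'\xe9'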