11.1 Serialization
Python
supplies a number of modules that deal with I/O operations that
serialize (save) entire Python objects to various kinds of byte
streams, and deserialize (load and recreate) Python objects back from
such streams. Serialization is also called
marshaling.
11.1.1 The marshal Module
The marshal module
supports the specific serialization tasks needed to save and reload
compiled Python files (.pyc and
.pyo). marshal only handles
instances of fundamental built-in data types:
None, numbers (plain and long integers, float,
complex), strings (plain and Unicode), code objects, and built-in
containers (tuples, lists, dictionaries) whose items are instances of
elementary types. marshal does not handle
instances of user-defined types, nor classes and instances of
classes. marshal is faster than other
serialization modules. Code objects are supported only by
marshal, not by other serialization modules.
Module marshal supplies the following functions.
dump(value,fileobj)
dumps(value)
dumps returns a string representing object
value. dump writes the
same string to file object fileobj, which
must be opened for writing in binary mode.
dump(v,f)
is just like
f.write(dumps(v)).
fileobj cannot be a file-like object: it
must be an instance of type file.
load(fileobj)
loads(str)
loads creates and returns the object
v previously dumped to string
str, so that, for any object
v of a supported type,
v equals
loads(dumps(v)).
If str is longer than
dumps(v),
loads ignores the extra bytes.
load reads the right number of bytes from file
object fileobj, which must be opened for
reading in binary mode, and creates and returns the object
v represented by those bytes.
fileobj cannot be a file-like object: it
must be an instance of type file.
Functions load and dump are
complementary. In other words, a sequence of calls to
load(f)
deserializes the same values previously serialized when
f's contents were created
by a sequence of calls to
dump(v,f).
Objects that are dumped and loaded in this way can be instances of
any mix of supported types.
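The behavior of these four functions can be checked with a short sketch. The sample values and the temporary file are illustrative assumptions, not part of the original example set; the code relies only on marshal's dump, dumps, load, and loads.

```python
import marshal
import os
import tempfile

# sample values of supported types (chosen for illustration only)
values = [None, 42, 3.5, u'text', (1, 2), [3, 4], {'k': (5, 6)}]

# round trip through a string: v equals loads(dumps(v))
round_trips = [marshal.loads(marshal.dumps(v)) for v in values]

# loads ignores bytes that follow a complete serialized value
extra = marshal.dumps((1, 2)) + b'trailing'
from_extra = marshal.loads(extra)

# a sequence of dump calls is undone by the same sequence of load calls
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    f = open(path, 'wb')          # binary mode, as dump requires
    for v in values:
        marshal.dump(v, f)
    f.close()
    f = open(path, 'rb')          # binary mode, as load requires
    reloaded = [marshal.load(f) for _ in values]
    f.close()
finally:
    os.remove(path)
```

Each call to load consumes exactly one value, so the number of load calls must match the number of dump calls that created the file.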
Suppose you need to analyze several text files, whose names are given
as your program's arguments, and record where each
word appears in those files. The data you need to record for each
word is a list of
(filename,
line-number) pairs. The
following example uses marshal to encode lists of
(filename,
line-number) pairs as
strings and store them in a DBM-like file (as covered later in this
chapter). Since those lists contain tuples, each made up of a string
and a number, they are within
marshal's abilities to serialize.
import fileinput, marshal, anydbm
# map each word to a list of (filename, line-number) pairs
wordPos = {}
for line in fileinput.input():
    pos = fileinput.filename(), fileinput.filelineno()
    for word in line.split():
        wordPos.setdefault(word, []).append(pos)
# store each list, serialized by marshal, under its word
dbmOut = anydbm.open('indexfilem', 'n')
for word in wordPos:
    dbmOut[word] = marshal.dumps(wordPos[word])
dbmOut.close()
We also need marshal to read back the data stored
to the DBM-like file indexfilem, as shown in the
following example:
import sys, marshal, anydbm, linecache
dbmIn = anydbm.open('indexfilem')
for word in sys.argv[1:]:
    if not dbmIn.has_key(word):
        sys.stderr.write('Word %r not found in index file\n' % word)
        continue
    # rebuild the list of (filename, line-number) pairs for this word
    places = marshal.loads(dbmIn[word])
    for fname, lineno in places:
        print "Word %r occurs in line %s of file %s:" % (word, lineno, fname)
        print linecache.getline(fname, lineno),
11.1.2 The pickle and cPickle Modules
The pickle and
cPickle modules supply factory functions, named
Pickler and Unpickler, to
generate objects that wrap file-like objects and supply serialization
mechanisms. Serializing and deserializing via these modules is also
known as pickling and
unpickling. The difference between the modules
is that in pickle, Pickler and
Unpickler are classes, so you can inherit from
these classes to create customized serializer objects, overriding
methods as needed. In cPickle,
Pickler and Unpickler are
factory functions, generating instances of special-purpose types, not
classes. Performance is therefore much better with
cPickle, but inheritance is not feasible. In the
rest of this section, I'll be talking about module
pickle, but everything applies to
cPickle
too.
Note that unpickling from an untrusted data source is a security
risk: an attacker can exploit this to execute arbitrary code.
This holds for every release of Python, not just releases older than
the ones covered in this book, so never unpickle data that comes from
a source you do not trust.
Serialization shares some of the issues of deep copying, covered in
Section 8.5 in Chapter 8. Module pickle deals with
these issues in much the same way as module copy
does. Serialization, like deep copying, implies a recursive walk over
a directed graph of references. pickle preserves
the graph's shape when the same object is
encountered more than once, meaning that the object is serialized
only the first time, and other occurrences of the same object
serialize references to a single copy. pickle also
correctly serializes graphs with reference cycles. However, this
implies that if a mutable object o is
serialized more than once to the same Pickler
instance p, any changes to
o after the first serialization of
o to p are not
saved. For clarity and simplicity, I recommend you avoid altering
objects that are being serialized while serialization to a single
Pickler instance is in progress.
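A minimal sketch of both points follows: shared references and cycles survive a round trip, and a change made between two dump calls on the same Pickler is not saved. The variable names are illustrative, and io.BytesIO stands in for a file opened in binary mode (an assumption that requires a release that supplies module io).

```python
import io
import pickle

# shared references and cycles survive a round trip
shared = ['shared']
graph = [shared, shared]
graph.append(graph)                 # reference cycle: graph contains itself
copy = pickle.loads(pickle.dumps(graph, 1))
same_object = copy[0] is copy[1]    # the two slots still share one list
cycle_kept = copy[2] is copy        # the cycle is reproduced

# the trap: one Pickler remembers objects it has already saved
buf = io.BytesIO()
p = pickle.Pickler(buf, 1)
o = [1, 2]
p.dump(o)
o.append(3)      # change made after the first serialization
p.dump(o)        # saved only as a reference to the first copy
buf.seek(0)
u = pickle.Unpickler(buf)
first = u.load()
second = u.load()    # same object as first; the appended 3 is lost
```

Both loads must go through the same Unpickler instance, since the second pickle is only a reference to an object that the first load created.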
pickle can serialize in either an ASCII format or
a compact binary one. The ASCII format is the default, for
backward compatibility, but I recommend you always specify binary
format: the savings in size and speed can be substantial, and binary
format has basically no downside except loss of compatibility with
very old versions of Python. When you reload objects,
pickle transparently recognizes and uses either
format.
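The size difference is easy to measure. In this sketch the list of integers is arbitrary sample data, and the second positional argument of dumps is assumed to select the format: 0 for ASCII, 1 (a true value) for binary. Releases later than those covered here treat that argument as a protocol number, with the same meaning for 0 and 1.

```python
import pickle

data = list(range(1000))              # arbitrary sample data
text_form = pickle.dumps(data, 0)     # 0 selects the ASCII format
binary_form = pickle.dumps(data, 1)   # a true second argument selects binary

sizes = (len(text_form), len(binary_form))

# loads needs no hint about which format it is given
from_text = pickle.loads(text_form)
from_binary = pickle.loads(binary_form)
```

The exact sizes depend on the data and on the Python release, so the sketch compares them rather than quoting numbers.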
pickle serializes classes and functions by name,
not by value. pickle can therefore deserialize a
class or function only by importing it from the same module where the
class or function was found when pickle serialized
it. In particular, pickle can serialize and
deserialize classes and functions only if they are top-level names
for their module (i.e., attributes of their module). For example,
consider the following:
def adder(augend):
    def inner(addend, augend=augend): return addend+augend
    return inner
plus5 = adder(5)
This code binds a closure to name plus5 (as
covered in Section 4.10.6.2 in Chapter 4), which is a
nested function inner plus an appropriate nested
scope. Therefore, trying to pickle plus5 raises a
pickle.PicklingError exception: a function can be
pickled only when it is top-level, and function
inner, whose closure is bound to name
plus5 in this code, is not top-level, but rather
nested inside function adder. Similar issues apply
to other uses of nested functions, and also to nested classes (i.e.,
classes that are not top-level).
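The following sketch contrasts the two cases. The exception type raised for a nested function differs among Python releases, so the except clause lists several candidates as a precaution; linecache.getline is used only as a convenient example of a top-level function.

```python
import linecache
import pickle

def adder(augend):
    def inner(addend, augend=augend):
        return addend + augend
    return inner

plus5 = adder(5)

# a nested function cannot be found by name in its module, so pickling fails
try:
    pickle.dumps(plus5)
    nested_failed = False
except (pickle.PicklingError, AttributeError, TypeError):
    nested_failed = True

# a top-level function is saved as a reference (module name + function name)
data = pickle.dumps(linecache.getline)
restored = pickle.loads(data)
```

Since only the names are saved, restored is the very same function object that module linecache already holds, not a copy of its code.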
11.1.2.1 Functions of pickle and cPickle
Modules pickle
and cPickle expose the following functions.
dump(value,fileobj,bin=0)
dumps(value,bin=0)
dumps returns a string representing object
value. dump writes the
same string to file-like object fileobj,
which must be opened for writing.
dump(v,f,bin)
is like
f.write(dumps(v,bin)).
If bin is true, dump
uses binary format, so f must be open in
binary mode.
dump(v,f,bin)
is also like
Pickler(f,bin).dump(v).
load(fileobj)
loads(str)
loads creates and returns the object
v represented by string
str, so that for any object
v of a supported type,
v==loads(dumps(v)).
If str is longer than
dumps(v),
loads ignores the extra bytes.
load reads the right number of bytes from
file-like object fileobj and creates and
returns the object v represented by those
bytes. If two calls to dump are made in sequence
on the same file, two later calls to load from
that file deserialize the two objects that dump
serialized. load and loads
transparently support pickles performed in either binary or ASCII
mode. If data is pickled in binary format, the file must be open in
binary format for both dump and
load.
load(f)
is like
Unpickler(f).load().
Pickler(fileobj,bin=0)
Creates and returns an object
p such that calling
p.dump is equivalent to
calling function dump with the
fileobj and bin
argument values passed to Pickler. To serialize
many objects to a file, Pickler is more convenient
and faster than repeated calls to dump. You can
subclass pickle.Pickler to override
Pickler methods (particularly method
persistent_id) and create your own persistence
framework. However, this is an advanced issue, and is not covered
further in this book.
Unpickler(fileobj)
Creates and returns an object u such that
calling u.load is
equivalent to calling function load with the
fileobj argument value passed to
Unpickler. To deserialize many objects from a
file, Unpickler is more convenient and faster than
repeated calls to function load. You can subclass
pickle.Unpickler to override
Unpickler methods (particularly the method
persistent_load) and create your own persistence
framework. However, this is an advanced issue, and is not covered
further in this book.
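As a rough illustration of the functions above, this sketch saves several objects to one stream and reads them back, first with Pickler and Unpickler, then with the module-level dump and load. The records are made-up sample data, and io.BytesIO is an assumption standing in for a file opened in binary mode.

```python
import io
import pickle

records = [('spam', 1), {'eggs': [2, 3]}, u'bacon']

buf = io.BytesIO()               # stands in for a file opened in binary mode
p = pickle.Pickler(buf, 1)       # 1 requests the binary format
for r in records:
    p.dump(r)

buf.seek(0)
u = pickle.Unpickler(buf)
reloaded = [u.load() for _ in records]

# module-level functions follow the same pattern, one call per object
buf2 = io.BytesIO()
for r in records:
    pickle.dump(r, buf2, 1)
buf2.seek(0)
reloaded2 = [pickle.load(buf2) for _ in records]

# loads ignores bytes after the end of the pickled object
padded = pickle.dumps(records[0], 1) + b'extra bytes'
unpadded = pickle.loads(padded)
```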
11.1.2.2 A pickling example
The following example handles the same task as the
marshal example shown earlier, but uses
cPickle instead of marshal to
encode lists of
(filename,
line-number) pairs as
strings:
import fileinput, cPickle, anydbm
# map each word to a list of (filename, line-number) pairs
wordPos = {}
for line in fileinput.input():
    pos = fileinput.filename(), fileinput.filelineno()
    for word in line.split():
        wordPos.setdefault(word, []).append(pos)
dbmOut = anydbm.open('indexfilep', 'n')
for word in wordPos:
    # the second argument, 1, requests binary format
    dbmOut[word] = cPickle.dumps(wordPos[word], 1)
dbmOut.close()
We can use either cPickle or
pickle to read back the data stored to the
DBM-like file indexfilep, as shown in the
following example:
import sys, cPickle, anydbm, linecache
dbmIn = anydbm.open('indexfilep')
for word in sys.argv[1:]:
    if not dbmIn.has_key(word):
        sys.stderr.write('Word %r not found in index file\n' % word)
        continue
    # rebuild the list of (filename, line-number) pairs for this word
    places = cPickle.loads(dbmIn[word])
    for fname, lineno in places:
        print "Word %r occurs in line %s of file %s:" % (word, lineno, fname)
        print linecache.getline(fname, lineno),
11.1.2.3 Pickling instance objects
In order for pickle to reload an instance object
x, pickle must be able
to import x's class from
the same module in which the class was defined when
pickle saved the instance. By default, to save the
instance-specific state of x,
pickle saves
x.__dict__, and then,
to restore state, reloads x.__dict__.
Therefore, all instance attributes (values in
x.__dict__) must be
pickleable objects (i.e., instances of types suitable for pickling
and unpickling). A class can supply special methods to control
this process.
By
default, pickle does not call
x.__init__ to restore
instance object x. If you do want
pickle to call
x.__init__,
x's class must supply the
special method __getinitargs__. In this case,
when pickle saves x,
pickle then calls
x.__getinitargs__(),
which must return a tuple t. When
pickle later reloads x,
pickle calls
x.__init__(*t) (i.e.,
the items of tuple t are passed as
positional arguments to x.__init__).
When x.__init__ returns, pickle restores
x.__dict__,
overriding attribute values bound by
x.__init__. Method
__getinitargs__ is therefore useful only when
x.__init__ has other
tasks to perform in addition to the task of giving initial values to
x's attributes.
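The default case is easy to observe. In this sketch, class Tracker and its init_calls counter are hypothetical names introduced only for illustration, and protocol 2 is an assumption that requires Python 2.3 or later. The sketch covers only the default behavior and does not exercise __getinitargs__.

```python
import pickle

class Tracker(object):
    init_calls = 0          # counts how many times __init__ has run

    def __init__(self, name):
        Tracker.init_calls += 1
        self.name = name

x = Tracker('original')
# reloading restores the instance's state without calling __init__ again
y = pickle.loads(pickle.dumps(x, 2))
```

After the round trip, y is a distinct instance with the same attributes as x, while Tracker.init_calls still records a single call. The class must be a top-level name of the module being run, for the reasons given earlier about pickling by name.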
When x's class has a
special method __getstate__,
pickle calls
x.__getstate__(),
which normally returns a dictionary d.
pickle saves d instead
of x.__dict__. When
pickle later reloads x,
it sets x.__dict__
from d. When
x's class supplies
special method __setstate__,
pickle calls
x.__setstate__(d) for
whatever d was saved, rather than
x.__dict__.update(d).
When x's class supplies
both methods __getstate__ and
__setstate__, __getstate__ may return
any pickleable object y, not just a
dictionary, since pickle reloads
x by calling
x.__setstate__(y). A
dictionary is often the handiest type of object for this purpose. As
mentioned in "The copy Module" in
Chapter 8, special methods
__getinitargs__, __getstate__, and
__setstate__ are also used to control the way
instance objects are copied and deep-copied. If a new-style class
defines __slots__, the class should also define
__getstate__ and __setstate__,
otherwise the class's instances are not
pickleable.
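A common use of this pair of methods is to leave out an attribute that cannot be pickled and rebuild it on reload. The sketch below assumes a hypothetical class Account that holds a thread lock; the class, its attribute names, and the choice of protocol 2 are all illustrative.

```python
import pickle
import threading

class Account(object):
    def __init__(self, owner, balance):
        self.owner = owner
        self.balance = balance
        self.lock = threading.Lock()    # a lock cannot be pickled

    def __getstate__(self):
        state = self.__dict__.copy()    # copy, so x itself keeps its lock
        del state['lock']               # save everything except the lock
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.lock = threading.Lock()    # rebuild what was left out

x = Account('alice', 100)
y = pickle.loads(pickle.dumps(x, 2))
```

Copying the dictionary in __getstate__ matters: deleting the entry from self.__dict__ directly would strip the lock from the live object being saved.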
11.1.2.4 Pickling customization with the copy_reg module
You can control how
pickle serializes and deserializes objects of an
arbitrary type (not class) by registering factory and reduction
functions with module copy_reg. This is useful
when you define a type in a C-coded Python extension. Module
copy_reg supplies the following functions.
constructor(fcon)
Adds fcon to the table of safe
constructors, which lists all factory functions that
pickle may call. fcon
must be callable, and is normally a function.
pickle(type,fred,fcon=None)
Registers function fred as the reduction
function for type type, where
type must be a type object (not a class).
To save any object o of type
type, module pickle
calls
fred(o)
and saves fred's result.
fred(o)
must return a pair
(fcon,t)
or a tuple
(fcon,t,d),
where fcon is a safe constructor and
t is a tuple. To reload
o, pickle calls
o=fcon(*t).
Then, if fred returned a
d, pickle uses
d to restore
o's state, as in
"Pickling instance objects"
(o.__setstate__(d) if
o supplies __setstate__,
otherwise o.__dict__.update(d)).
If fcon is not None,
pickle also calls
constructor(fcon)
to register fcon as a safe
constructor.
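One possible use, sketched here under stated assumptions, is to teach pickle about code objects by delegating to marshal, which does support them. Function reduce_code is a hypothetical name; the import fallback assumes that later Python releases renamed module copy_reg to copyreg.

```python
import marshal
import pickle
import types

try:
    import copy_reg                 # the module name used in this chapter
except ImportError:
    import copyreg as copy_reg      # assumption: the name in later releases

def reduce_code(code):
    # reduction function: return (safe constructor, argument tuple)
    return marshal.loads, (marshal.dumps(code),)

# register reduce_code for the code-object type, and marshal.loads
# as the safe constructor that rebuilds the object on reload
copy_reg.pickle(types.CodeType, reduce_code, marshal.loads)

code = compile('6 * 7', '<demo>', 'eval')
restored = pickle.loads(pickle.dumps(code, 1))
result = eval(restored)
```

The registration is global to the process, so every later pickling of a code object goes through reduce_code. Pickles made this way inherit marshal's limits, including its dependence on the Python version that wrote them.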
11.1.3 The shelve Module
The shelve module
orchestrates modules cPickle (or
pickle, when cPickle is not
available in the current Python installation),
cStringIO (or StringIO, when
cStringIO is not available in the current Python
installation), and anydbm (and its underlying
modules for access to DBM-like archive files, as discussed later in
this chapter) in order to provide a lightweight persistence
mechanism.
shelve supplies a function open
that is polymorphic to anydbm.open. The mapping
object s returned by
shelve.open is less limited than the mapping
object a returned by
anydbm.open.
a's keys and values must
be strings. s's keys must
also be strings, but s's
values may be of any type or class that pickle can
save and restore. pickle customizations (e.g.,
copy_reg, __getinitargs__,
__getstate__, and __setstate__)
also apply to shelve, since
shelve delegates serialization to
pickle.
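A short sketch of a shelf holding values that are not strings follows. The temporary directory, the file name demo_shelf, and the stored values are illustrative assumptions; the underlying DBM-like format is whichever one the installation selects.

```python
import os
import shelve
import shutil
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'demo_shelf')
try:
    s = shelve.open(path)
    s['primes'] = [2, 3, 5, 7]                   # a list
    s['point'] = {'x': 1.5, 'y': (2, 3)}         # a nested dictionary
    s.close()

    # reopen the shelf and read the values back
    s = shelve.open(path)
    primes = s['primes']
    point = s['point']
    keys = sorted(s.keys())
    s.close()
finally:
    shutil.rmtree(workdir)
```

Keys remain plain strings; only the values go through pickle on their way in and out of the shelf.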
Beware a subtle trap when you use shelve and
mutable objects. When you operate on a mutable object held in a
shelf, the changes don't take unless you assign the
changed object back to the same index. For
example:
import shelve
s = shelve.open('data')
s['akey'] = range(4)
print s['akey'] # prints: [0, 1, 2, 3]
s['akey'].append('moreover') # trying direct mutation
print s['akey'] # doesn't take; prints: [0, 1, 2, 3]
x = s['akey'] # fetch the object
x.append('moreover') # perform mutation
s['akey'] = x # store the object back
print s['akey'] # now it takes, prints: [0, 1, 2, 3, 'moreover']
The following example handles the same task as the pickling example
earlier, but uses shelve to persist lists of
(filename,
line-number) pairs:
import fileinput, shelve
# map each word to a list of (filename, line-number) pairs
wordPos = {}
for line in fileinput.input():
    pos = fileinput.filename(), fileinput.filelineno()
    for word in line.split():
        wordPos.setdefault(word, []).append(pos)
shOut = shelve.open('indexfiles', 'n')
for word in wordPos:
    # no explicit serialization: the shelf pickles each list itself
    shOut[word] = wordPos[word]
shOut.close()
We must use shelve to read back the data stored to
the DBM-like file indexfiles, as shown in the
following example:
import sys, shelve, linecache
shIn = shelve.open('indexfiles')
for word in sys.argv[1:]:
    if not shIn.has_key(word):
        sys.stderr.write('Word %r not found in index file\n' % word)
        continue
    # the shelf returns the list itself, already unpickled
    places = shIn[word]
    for fname, lineno in places:
        print "Word %r occurs in line %s of file %s:" % (word, lineno, fname)
        print linecache.getline(fname, lineno),
These two examples are the simplest and most direct of the various
equivalent pairs of examples shown throughout this section. This
reflects the fact that module shelve is higher
level than the modules used in previous examples.