17.4 Optimization
"First
make it work. Then make it right. Then make it
fast." This quotation, often with slight variations,
is widely known as the golden rule of programming. As far as
I've been able to ascertain, the quotation is
attributed to Kent Beck, who credits his father with it. Being widely
known makes the principle no less important, particularly because
it's more honored in the breach than in the
observance. A negative form, slightly exaggerated for emphasis, is in
a quotation by Don Knuth: "Premature optimization is
the root of all evil in programming."
Optimization is premature if your code is not working yet. First make
it work. Optimization is also premature if your code is working but
you are not satisfied with the overall architecture and design.
Remedy structural flaws before worrying about optimization: first
make it work, then make it right. These first two steps are not
optional—working, well-architected code is always a must.
In contrast, you don't always need to make it fast.
Benchmarks may show that your code's performance is
already acceptable after the first two steps. When performance is not
acceptable, profiling often shows that all performance issues are in
a small subset, perhaps 10% to 20% of the code where your program
spends 80% or 90% of the time. Such performance-crucial regions of
your code are also known as its bottlenecks, or
hot spots. It's a waste of
effort to optimize large portions of code that account for, say, 10%
of your program's running time. Even if you made
that part run 10 times as fast (a rare feat), your
program's overall runtime would only decrease by 9%,
a speedup no user will even notice. If optimization is needed, focus
your efforts where they'll matter, on bottlenecks.
You can optimize bottlenecks while keeping your code 100% pure
Python. In some cases, you can resort to recoding some computational
bottlenecks as Python extensions, potentially gaining even better
performance.
17.4.1 Developing a Fast-Enough Python Application
Start by designing,
coding, and testing your application in Python, often using some
already available extension modules. This takes much less time than
it would take with a classic compiled language. Then benchmark the
application to find out if the resulting code is fast enough. Often
it is, and you're done—congratulations!
Since much of Python itself is coded in highly optimized C, as are
many of its standard and extension modules, your application may even
turn out to be already faster than typical C code. However, if the
application is too slow, you need to re-examine your algorithms and
data structures. Check for bottlenecks due to application
architecture, network traffic, database access, and operating system
interactions. For typical applications, each of these factors is more
likely than language choice to cause slowdowns. Tinkering with
large-scale architectural aspects can often speed up an application
dramatically, and Python is an excellent medium for such
experimentation.
If your program is still too slow, you should profile it to find out
where the time is going. Applications often exhibit computational
bottlenecks—small areas of the source code, generally between
10% and 20%, which account for 80% or more of the running time. You
can now optimize the bottlenecks, applying the techniques suggested
in the rest of this chapter.
If normal Python-level optimizations still leave some outstanding
computational bottlenecks, you can recode them as Python extension
modules, as covered in Chapter 24. In the end, your
application will run at roughly the same speed as if you had coded it
all in C, C++, or Fortran—or faster, when large-scale
experimentation has let you find a better architecture. Your overall
programming productivity with this process is not much less than if
you coded everything in Python. Future changes and maintenance are
easy, since you use Python to express the overall structure of the
program, and lower-level, harder-to-maintain languages only for a few
specific computational bottlenecks.
As you produce applications in a given area according to this
process, you will accumulate a library of reusable Python extension
modules for that area. You therefore become more and more productive
at developing other fast-running Python applications in the same
field.
Even if external constraints should eventually force you to recode
the whole application in a lower-level language,
you're still better off for having started in
Python. Rapid prototyping has long been acknowledged as the best way
to get a software architecture just right. A working prototype lets
you check that you have identified the right problems and taken the
best path to their solution. A prototype affords the kind of
large-scale architectural experimentation that can make a real
difference to performance. Starting your prototype with Python allows
a gradual migration to other languages by way of extension modules.
The application remains in a fully functional and testable state at
each stage. This ensures against the risk of compromising a
design's architectural integrity in the coding
stage. The resulting software is likely to be faster and more robust
than if all of the coding had been lower-level from the start, and
your productivity, while not quite as good as with a pure Python or
mostly Python application, is still better than if you had been
coding at a lower level throughout.
17.4.2 Benchmarking
Benchmarking is
similar to system testing: both activities are like running the
program as it's meant to be run for production
purposes. In both cases, you need to have at least some subset of the
program's intended functionality working, and you
need to use known, reproducible inputs. In the case of benchmarking,
you don't need to capture and check your
program's output: since you make it work and make it
right before you make it fast, you are already confident about your
program's correctness by the time you benchmark it.
You do need inputs that are representative of typical system
operations, particularly those that may be most challenging for your
program's performance. If your program performs
several kinds of operations, make sure you run one or two benchmarks
for each different kind of operation.
Elapsed time as measured by your wristwatch is probably precise
enough to benchmark most programs. Programs with hard real-time
constraints are obviously another matter, but they have needs very
different from those of normal programs in most respects. A 5% or 10%
difference in performance, except for programs with very peculiar
constraints, makes no practical difference to a
program's real-life usability.
When you benchmark "toy" programs
in order to help you choose an algorithm or data structure, you may
need more precision. In that case, you may want to set up an
artificial environment, with a machine as quiescent as possible, no
network activity, and accurate timekeeping. Python time operations
are covered in Chapter 12. The benchmarking
discussed in this section is a different kind of issue: an
approximation of real-life program operation, for the sole purpose of
checking whether the program's performance at each
task is acceptable, before embarking on profiling and other
optimization activities. For such system benchmarking, a situation
that approximates the program's normal operating
conditions is best, and accuracy in timing is not particularly
important.
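For such rough benchmarking, a simple wall-clock measurement around one representative run is generally all you need. The following is a minimal sketch, where main is a hypothetical function standing in for your program's entry point:
import time
start = time.time()
main()          # hypothetical entry point, run on known, reproducible inputs
print "workload completed in %.1f seconds" % (time.time() - start)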
17.4.3 Large-Scale Optimization
The
aspects of your program that are most important for performance are
large-scale ones: choice of algorithms, overall architecture, and
choice of data
structures.
The performance issues that you must almost always take into account
are those connected with the traditional big-O notation of computer
science. Informally, if you call N the input size
of an algorithm, big-O notation expresses algorithm performance, for
large values of N, as proportional to a function
of N (in precise computer science lingo, this
should actually be called big-Theta, but in practical use programmers
in the field call this big-O). An O(N) algorithm
is one where, for large enough N, handling twice
as much data takes about twice as much time, three times as much data
three times as much time, and so on, growing linearly with
N. An
O(N²)
algorithm is one where, for large enough N,
handling twice as much data takes about four times as much time,
three times as much data nine times as much time, and so on, growing
with N squared.
You will find more information on big-O notation, as well as other
issues about algorithms and their complexity, in any good book about
algorithms and data structures. Unfortunately, at the time of this
writing, there aren't yet any such books using
Python. However, if you are at least moderately familiar with C, I
can recommend Mastering
Algorithms with
C, by Kyle Loudon (O'Reilly).
To understand the practical importance of big-O considerations in
your programs, consider two different ways to accept all items from
an input iterator and accumulate them into a list in reverse order:
def slow(it):
result = [ ]
for item in it: result.insert(0, item)
return result
def fast(it):
result = [ ]
for item in it: result.append(item)
result.reverse( )
return result
We could express each of these functions more concisely, but the key
difference is best appreciated by presenting them in these elementary
terms. Function slow builds the result list by
inserting each input item before all previously received ones.
Function fast appends each input item after all
previously received ones, then reverses the result list just before
returning it. Intuitively, one might think that the final reversing
represents extra work, and therefore slow should
be faster than fast. But that's
not the way things work out.
Each call to result.append takes roughly the same
amount of time, independent of how many items are already in list
result, since there is always a free slot for an
extra item at the end of the list. The for loop in
function fast executes N times
to receive N items. Since each iteration of the
loop takes a constant time, overall loop time is
O(N). result.reverse also takes
time O(N), as it is directly proportional to the
total number of items. Thus, the total running time of
fast is also O(N). (If you
don't understand why a sum of two quantities, each
O(N), is also O(N), consider
that the sum of two linear functions of N is also
a linear function of N).
In contrast, each call to result.insert must make
space at slot 0 for the new item to insert, by
moving all items that are already in list result
forward one slot. That takes a time proportional to the number of
items that are already in the list. The overall amount of time to
receive N items is therefore proportional to
1+2+3+...+(N-1), a sum whose value is O(N²). Therefore, the total running time of slow is also O(N²).
It's almost always worth replacing an O(N²) solution with an O(N) one, unless you can somehow assign rigorous limits to the input size N. If N can grow without bounds, the O(N²) solution will inevitably turn out to be disastrously slower than the O(N) one for large enough values of N, no matter what the proportionality constants in each case may be (and no matter what profiling tells you). Unless you have other O(N²) or even worse bottlenecks elsewhere that you cannot eliminate, a part of the program that is O(N²) will inevitably turn into the program's bottleneck and dominate runtime for large enough values of N.
Do yourself a favor and watch out for the big O: all other
performance issues, in comparison, are insignificant.
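To see the difference concretely, you can time slow and fast on inputs of growing size. The following minimal sketch assumes both functions, as defined above, are in scope; the specific sizes are arbitrary:
import time
# when the input grows tenfold, expect fast's time to grow about tenfold,
# but slow's time to grow about a hundredfold
for n in (1000, 10000):
    for func in (slow, fast):
        start = time.clock()
        func(xrange(n))
        print "%s(%d) took %.4f seconds" % (func.__name__, n, time.clock() - start)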
Incidentally, function fast can be made
substantially faster by expressing it in more idiomatic Python. Just
replace the first two lines with the single statement:
result = list(it)
This change does not affect
fast's big-O character
(fast is still O(N) after the
change), but does speed things up by a constant factor. Often, in
Python, the simplest, clearest, most idiomatic way to express
something is also the fastest.
Choosing algorithms with good big-O characteristics is roughly the
same task in Python as in any other language. You just need a few
indications about the big-O performance of Python's
elementary building blocks.
17.4.3.1 List operations
Python lists are internally implemented
with vectors (also known as arrays), not with linked lists. This
fundamental implementation choice determines just about all
performance characteristics of Python lists, in big-O terms.
Concatenating two lists of length N1 and
N2 is O(N1+N2). Multiplying a
list of length N by the number
M is O(N*M). Accessing or
rebinding any list item is O(1) (also known as
constant time, meaning that the time taken does not depend on how
many items are in the list). len( ) on a list is
also O(1). Accessing any slice of length
M is O(M). Rebinding a slice of
length M with one of identical length is also
O(M). Rebinding a slice of length
M1 with one of different length
M2 is O(M1+M2+N1), where
N1 is the number of items after the slice in the
target list.
Most list methods, as shown way back in Table 4-3,
are equivalent to slice rebindings and have the same big-O
performance. Methods count,
index, remove, and
reverse, and operator in, are
O(N). Method sort is generally
O(N*log(N)), but has optimizations that let it be
O(N) in some important special cases, like when
the list is already sorted, reverse sorted, or sorted except for a
few items at the end (in Python 2.3, sort will
also be O(N) in a few more important special
cases). range(a,b,c) is
O((b-a)/c). xrange(a,b,c) is
O(1), but looping on
xrange's result is
O((b-a)/c).
17.4.3.2 String operations
Most methods on a string of length
N (be it plain or Unicode) are
O(N). len(astring) is
O(1). The fastest way to produce a copy of a
string with transliterations and/or removal of specified characters
is the string's method translate.
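For example, the following minimal sketch uses translate (with maketrans from standard module string) to delete all vowels and uppercase every 'l' in a single O(N) pass; the specific characters chosen are arbitrary:
import string
table = string.maketrans('l', 'L')
print 'hello world'.translate(table, 'aeiou')     # emits: hLL wrLd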
The most practically important big-O consideration involving strings
is covered in Section 17.4.5 later in
this chapter.
17.4.3.3 Dictionary operations
Python dictionaries are
internally implemented with hash tables. This fundamental
implementation choice determines just about all performance
characteristics of Python dictionaries, in big-O terms.
Accessing, rebinding, adding, or removing a dictionary item is
generally O(1), as are methods
has_key, get,
setdefault, and popitem, and
operator in.
d1.update(d2) is O(len(d2)). len(adict) is O(1). Methods keys, items, and values are O(N). Methods iterkeys, iteritems, and itervalues are O(1), but looping on the iterators that those methods return is O(N). When the keys in a dictionary are instances of classes that define __hash__ and equality comparison methods, dictionary performance
is of course affected by those methods. The indications presented in
this paragraph are valid only if both hashing and equality comparison
are O(1).
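For example, the following minimal sketch (the class and its attributes are hypothetical) shows a class whose instances are usable as dictionary keys; since its __hash__ and __eq__ are both O(1), dictionary operations on such keys remain O(1):
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __hash__(self):
        # O(1): hashes a fixed-size tuple of attributes
        return hash((self.x, self.y))
    def __eq__(self, other):
        # O(1): compares a fixed number of attributes
        return self.x == other.x and self.y == other.y
d = {}
d[Point(1, 2)] = 'a label'
print d[Point(1, 2)]          # emits: a label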
17.4.4 Profiling
Most
programs have hot spots (i.e., regions of source code that account
for most of the time elapsed during a program run).
Don't try to guess where your
program's hot spots are;
programmers' intuition is notoriously unreliable in
this field. Use module profile to collect profile
data over one or more runs of your program, with known inputs. Then,
use module pstats to collate, interpret, and
display that profile data. To gain accuracy, you can calibrate the
Python profiler for your machine (i.e., determine what overhead
profiling incurs on your machine). Module profile
can then subtract this overhead from the times it measures so that
the profile data you collect is closer to reality.
17.4.4.1 The profile module
The profile module supplies one function you will often use.
run(code, filename=None)
code is a string such as you could use with statement exec, normally a call to the main function of the program you're profiling. filename is the path of a file that run creates or rewrites with profile data. Usually
you call run a few times, specifying different
filenames, and possibly different arguments to your
program's main function, in order to exercise
various program parts proportionately. Then, you use module
pstats to display collated results.
You may call run without a
filename to obtain a summary report,
similar to the one module pstats could give you,
directly on standard output. However, this approach gives no control
at all over output format, nor does it offer any way to consolidate
several runs into one report. In practice, you rarely use this
feature: collecting profile data into files is generally preferable.
Module profile also supplies class
Profile, mentioned in the next section. By
instantiating Profile directly, you can access
advanced functionality, such as the ability to run a command in
specified local and global dictionaries. I do not cover such advanced
functionality of class profile.Profile further in
this book.
17.4.4.2 Calibration
To calibrate profile
for your machine, you need to use class Profile,
which module profile supplies and internally uses
in function run. An instance p of Profile supplies one method you use for calibration.
p.calibrate(N)
Loops N times, then returns a number that is the profiling overhead per call on your machine. N must be large if your machine is fast.
Call p.calibrate(10000)
a few times and check that the various numbers it returns are very
close to each other, then pick the smallest one of them. If the
numbers exhibit substantial variation, try again with larger values
of N.
The calibration procedure can be time consuming. However, you need to
perform it only once, repeating it only when you make changes that
could alter your machine's characteristics, such as
applying patches to your operating system, adding memory, or changing
Python version. Once you know your machine's
overhead, you can tell profile about it each time
you import it, right before using profile.run. The
simplest way to do this is as follows:
import profile
profile.Profile.bias = ...the overhead you measured...
profile.run('main( )', 'somefile')
17.4.4.3 The pstats module
The pstats module
supplies a single class, Stats, that you use to
analyze, consolidate, and report on the profile data contained in one
or more files written by function profile.run.
class Stats(filename, *filenames)
Instantiates Stats with one or more filenames of
files of profile data written by function
profile.run.
An instance s of class
Stats provides methods to add profile data and
sort and output results. Each method returns
s, so you can chain several calls in the
same expression. s's main
methods are as follows.
add
s.add(filename)
Adds another file of profile data to the set that s is holding for analysis.
print_callees, print_callers
s.print_callees(*restrictions)
s.print_callers(*restrictions)
Outputs the list of functions in
s's profile data, sorted
according to the latest call to
s.sort_stats, and
subject to the given restrictions, if any. You can call each printing
method with zero or more restrictions,
which are applied one after the other, in order, to reduce the number
of output lines. A restriction that is an integer
n limits the output to the first
n lines. A restriction that is a
floating-point value f between
0.0 and 1.0 limits the output
to a fraction f of the lines. A
restriction that is a string is compiled as a regular expression (as
covered in Chapter 9); only lines satisfying a
search method call on the regular expressions are
output. Restrictions are cumulative. For example,
s.print_callees(10, 0.5) outputs the first 5 lines (half of 10). Output restrictions apply
only after the summary and header lines: summary and header are
output unconditionally.
Each function f that is output is
accompanied by the list of
f's callers (the
functions that called f) or
f's callees (the
functions that f called) according to the
name of the method.
print_stats
s.print_stats(*restrictions)
Outputs statistics about
s's profile data, sorted
according to the latest call to
s.sort_stats, and
subject to the given restrictions, if any, as covered in
print_callees. After a few summary lines (date and
time on which profile data was collected, number of function calls,
and sort criteria used), the output, absent restrictions, is one line
per function, with six fields per line, labeled in a header line. For
each function f,
print_stats outputs six fields:
1. Total number of calls to function f
2. Total time spent in function f, exclusive of other functions that f called
3. Total time per call (i.e., field 2 divided by field 1)
4. Cumulative time spent in function f, and in all functions directly or indirectly called from f
5. Cumulative time per call (i.e., field 4 divided by field 1)
6. The name of function f
sort_stats
s.sort_stats(key, *keys)
Gives one or more keys (primary first, if more than one) on which to sort future output. Each key is a string. The sort is descending for keys indicating times or numbers, alphabetical (ascending) for key 'nfl'. The most frequently used keys when calling sort_stats are:
- 'calls': Number of calls to the function (like field 1 covered in print_stats)
- 'cumulative': Cumulative time spent in the function and all functions it called (like field 4 covered in print_stats)
- 'nfl': Name of the function, its module, and the line number of the function in its file (like field 6 covered in print_stats)
- 'time': Total time spent in the function itself, exclusive of functions it called (like field 2 covered in print_stats)
strip_dirs
s.strip_dirs( )
Alters s by stripping directory names from
all the module names that s holds, to make
future output more compact. s is unsorted
after s.strip_dirs( ),
and therefore you normally call
s.sort_stats with the
arguments you desire right after calling
s.strip_dirs.
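Putting these methods together, a typical reporting session might look like the following minimal sketch (the profile-data filename matches the earlier profile.run example):
import pstats
s = pstats.Stats('somefile')
s.strip_dirs().sort_stats('time').print_stats(10)     # the ten most time-consuming functions
s.print_callers(10)                                    # and the functions that call them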
17.4.5 Small-Scale Optimization
Fine
tuning of program operations is rarely important. Such tuning may
make a small but meaningful difference in some particularly hot spot,
but hardly ever is it a decisive factor. And yet, such fine tuning,
in the pursuit of mostly irrelevant microefficiencies, is where a
programmer's instincts are likely to lead. It is in
good part because of this that most optimization is premature and
best avoided. The most that can be said in favor of fine tuning is
that, if one idiom is always speedier than another when the
difference is measurable, it's worth getting into
the habit of always using the former and not the latter.
Most often, in Python, if you do what comes naturally and choose
simplicity and elegance, you end up with code that has good
performance as well as clarity and maintainability. In a few cases,
an approach that may not be intuitive offers performance advantages,
as discussed in the rest of this section.
The simplest possible optimization is to run your Python programs
using python -O or
-OO. -OO makes little direct
difference to performance compared to -O, but
-OO may save memory, since it removes docstrings
from the bytecode, and memory availability is sometimes (indirectly)
a performance bottleneck. The optimizer is not very powerful in
current releases of Python, but it may still gain you performance
advantages on the order of 10%, sometimes as large as 20%
(potentially even larger, if you make heavy use of
assert statements and if
__debug__: guards as suggested in Chapter 6). The best aspect of -O is
that it costs nothing—as long as your optimization
isn't premature, of course. -O
does impede use of debuggers, such as pdb, and may
thus make debugging somewhat harder if your program
isn't fully tested and working correctly. So,
don't use -O on a program
you're still developing.
17.4.5.1 Building up a string from pieces
The single Python
anti-idiom that's likeliest to kill your
program's performance, to the point that you should
never use it, is to build up a large string from pieces by looping on
string concatenation statements such as
big_string+=piece.
Since Python strings are immutable, such a concatenation makes Python
free the M bytes previously allocated for
big_string, and allocate and fill
M+K bytes for the new version. Doing this
repeatedly in a loop, you end up with roughly O(N²) performance, where N is the total number of characters. More often than not, O(N²) performance where O(N) is available is a
performance disaster. On some platforms, things may be even bleaker
due to memory fragmentation effects caused by freeing many memory
areas, all of different sizes, and allocating progressively larger
ones.
To achieve O(N) performance, accumulate
intermediate pieces in a list rather than building up the string
piece by piece. Lists, unlike strings, are mutable, so appending to a
list has O(1) performance (amortized). Change each
occurrence of
big_string+=piece
into
temp_list.append(piece).
Then, when you're done accumulating, use the
following to build your desired string result in
O(N) time:
big_string = ''.join(temp_list)
Other O(N) ways to build up big strings are to
concatenate the pieces to an instance of
array.array('c'), or to write the pieces to an
instance of cStringIO.StringIO.
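For example, here are minimal sketches of both alternatives, assuming pieces is a list of strings accumulated elsewhere:
import cStringIO, array
# write the pieces to a cStringIO.StringIO instance...
buf = cStringIO.StringIO()
for piece in pieces: buf.write(piece)
big_string = buf.getvalue()
# ...or concatenate them onto an array of characters
a = array.array('c')
for piece in pieces: a.fromstring(piece)
big_string = a.tostring()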
In the special case where you want to output the resulting string,
you may gain a further small slice of performance by using
writelines on temp_list
(never building big_string in memory).
When feasible (i.e., when you have the output file object open and
available in the loop), it's at least as effective
to perform a write call for each
piece, without any accumulation.
Although not nearly as crucial as += on a big
string in a loop, another case where removing string concatenation
may give a slight performance improvement is when
you're concatenating several values in an
expression:
oneway = str(x)+' eggs and '+str(y)+' slices of '+k+' ham'
another = '%s eggs and %s slices of %s ham' % (x, y, k)
Using operator % for string formatting is often a
good performance choice.
17.4.5.2 Searching and sorting
Operator
in, the most natural tool for searching, is
O(1) when the right-hand side operand is a
dictionary, but O(N) when the right-hand side
operand is a list. If you need to perform many searches on a
container, you're generally much better off using a
dictionary, rather than a list, as the container. Python dictionaries
are highly optimized for searching and fetching items by key.
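For example, the following minimal sketch (the values are arbitrary) builds a dictionary with the same members as a list; each in test on the dictionary is O(1), while each in test on the list is O(N):
aslist = range(10000)
asdict = {}
for x in aslist: asdict[x] = None      # a dictionary with the same members as the list
print 9999 in asdict                    # O(1): a single hash lookup
print 9999 in aslist                    # O(N): a linear scan of the whole list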
Method sort of Python lists is also a highly
optimized and sophisticated tool. You can rely on
sort's performance. Performance
dramatically degrades, however, if you pass sort a
custom callable to perform comparisons in order to sort a list based
on anything but built-in comparisons. To satisfy such needs, consider
using the decorate-sort-undecorate (DSU) idiom instead. This idiom
has the following steps:
- decorate: Build an auxiliary list A where each item is a tuple made up of the sort keys, ending with the item of the original list L or with the item's index
- sort: Call A.sort( ) without arguments
- undecorate: Extract the items in order from the now-sorted A
The decorate and undecorate steps are most often handily performed
with list comprehensions. If you need the sort to be in-place, assign
the final sorted list to L[:]. Otherwise, DSU
provides a sorted copy, without disturbing the original list
L.
For example, say we have in L a large list of
strings, each of at least two words, and we want to sort
L in-place by the second word of each string:
A = [ (s.split( )[1], s) for s in L ]
A.sort( )
L[:] = [ t[1] for t in A ]
This is much faster than passing to L.sort a
function that compares two strings by their second words, as in:
def cmp2ndword(a, b): return cmp(a.split( )[1], b.split( )[1])
L.sort(cmp2ndword)
On a series of benchmarks with Python 2.2 on lists of 10,000 strings,
I measured the DSU version as 7 to 10 times faster than the non-DSU
one.
17.4.5.3 Avoiding exec and from ... import *
If a function contains an
exec statement without explicit dictionaries, the
whole function slows down substantially. The presence of such an
exec statement forces the Python compiler to avoid
the modest but precious optimizations it normally performs because
such an exec might cause any alteration at all to
the function's namespace. A from
statement of the form:
from MyModule import *
causes similar performance loss, since it, too, can alter a
function's namespace unpredictably.
exec itself is also quite slow, particularly if
you apply it to a string of source code rather than to a code object.
By far the best approach, for performance, for correctness, and for
clarity, is to avoid exec altogether.
It's most often possible to find better (faster,
more solid, and clearer) solutions. If you must use
exec, always use it with explicit dictionaries. If
you need to exec a dynamically obtained string
more than once, compile the string one time and
repeatedly exec the resulting code object.
eval works on expressions, not on statements;
therefore, although it's still slow, at least it
avoids some of the worst performance impacts of
exec. With eval, too,
you're best advised to use explicit dictionaries,
and, if you need repeated evaluation of the same dynamically obtained
string, compile the string just once, then
repeatedly eval the resulting code object.
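For example, the following minimal sketch compiles once and executes repeatedly with an explicit dictionary, assuming source_string and expression_string are dynamically obtained strings:
code_obj = compile(source_string, '<dynamic>', 'exec')
for i in range(10):
    namespace = {'i': i}
    exec code_obj in namespace          # explicit dictionary: the enclosing namespace stays untouched
expr_obj = compile(expression_string, '<dynamic>', 'eval')
value = eval(expr_obj, namespace)       # evaluate the compiled expression in the same explicit dictionary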
17.4.5.4 Optimizing loops
Most of your
program's bottlenecks will be in loops, particularly
nested loops, because loop bodies often execute repeatedly. Python
does not implicitly perform any code hoisting: if you have any code
inside a loop that might be executed just once by hoisting it out of
the loop, and the loop is a performance bottleneck, hoist the code
out yourself. Sometimes the presence of code to hoist may not be
immediately obvious:
def slower(anobject, ahugenumber):
for i in xrange(ahugenumber): anobject.amethod(i)
def faster(anobject, ahugenumber):
themethod = anobject.amethod
for i in xrange(ahugenumber): themethod(i)
In this case, the code that faster hoists out of
the for loop is the attribute lookup
anobject.amethod. slower
repeats the lookup each and every time, while
faster performs it just once. The two functions
are not 100% equivalent: it is (just barely) conceivable that
executing amethod might cause such changes on
anobject that the next lookup for the same named
attribute fetches a different method object. This is part of why
Python doesn't perform such optimizations itself. In
practice, such subtle, obscure, and tricky cases happen far less than
one time in ten thousand. So you're quite safe
performing such optimizations yourself, when you're
trying to squeeze the last drop of performance out of some crucial
bottleneck.
It's faster for Python to use local variables than
global ones. So, if one of your loops is repeatedly accessing a
global variable whose value does not change between iterations of the
loop, put the value in a local variable and have the loop access the
local variable instead. This also applies to built-in functions:
def slightly_slower(asequence, adict):
for x in asequence: adict[x] = hex(x)
def slightly_faster(asequence, adict):
myhex = hex
for x in asequence: adict[x] = myhex(x)
Here, the speedup is very modest, on the order of 5% or so.
Do not bother caching None in a local variable: None is currently an ordinary built-in identifier, but it is scheduled to become a keyword in Python 2.3 or 2.4, at which point no such optimization will be needed anyway.
List comprehensions can be faster than loops, and so can
map and filter. For
optimization purposes, try changing loops into list comprehensions or
map and filter calls where
feasible. However, the performance advantage of
map and filter is nullified if
you have to use a lambda or an extra level of
function call. Only when you pass to map or
filter a built-in function, or a function
you'd have to call anyway even from an explicit
loop, do you stand to gain.
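For example, the following minimal sketch shows the distinction on a small task, building a list of hexadecimal representations:
hexes = map(hex, range(1000))                   # built-in function: map stands to gain
hexes = [hex(x) for x in range(1000)]           # list comprehension: also fast, and clearer
hexes = map(lambda x: hex(x), range(1000))      # the lambda adds a call level, nullifying map's gain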
The loops that you can replace most naturally with list
comprehensions, or map and
filter calls, are loops that build up a list by
repeatedly calling append on the list. In such
cases, if you know in advance the length of the resulting list, a
further optimization is available. Predefine the result list to the
right length (e.g., with result=[None]*N),
introduce an explicit index i that starts at
0 and grows by one at each iteration of the loop,
and change each call to result.append(x) into
result[i]=x. The following example shows this
optimization in the context of a typical microperformance benchmark
script:
import time
def slow(asequence):
result = [ ]
for x in asequence: result.append(-x)
return result
def middling(asequence):
return [ -x for x in asequence ]
def fast(asequence):
result = [None]*len(asequence)
for i in xrange(len(asequence)): result[i] = -asequence[i]
return result
biggie = xrange(500*1000)
tentimes = [None]*10
def timit(afunc):
lobi = biggie
start = time.clock( )
for x in tentimes: afunc(lobi)
stend = time.clock( )
return "%-10s: %.2f" % (afunc._ _name_ _, stend-start)
for afunc in slow, middling, fast, fast, middling, slow:
print timit(afunc)
Running this example with python
-O (on a PC with a 1.2 GHz Athlon CPU, Python
2.2.1) shows fast taking 4.30 seconds,
middling 4.81 to 4.84 seconds, and
slow 6.50 to 7.02 seconds, on Windows 98. The time
ranges on Linux are 4.19-4.20, 5.15-5.20, and 6.91-7.00,
respectively. With the current alpha version of Python 2.3 on Linux,
the time ranges are 3.35-3.37 for fast, 4.61-4.64
for middling, and 6.43-6.44 for
slow. In summary, on this machine,
slow is 35%-40% slower than
middling, and middling is about
15%-25% slower than fast (and Python 2.2 is
10%-25% slower than the current alpha of Python 2.3).
17.4.5.5 Optimizing I/O
If
your program does substantial amounts of I/O, it's
likely that performance bottlenecks are due to I/O, not to
computation. Such programs are said to be I/O bound, rather than CPU
bound. Your operating system tries to optimize I/O performance, but
you can help it in a couple of ways. One such way is to perform your
I/O in chunks of a size that is optimal for performance, rather than
simply being convenient for your program's
operations. Another way is to use threading.
From the point of view of a program's convenience
and simplicity, the ideal amount of data to read or write at a time
is generally small (one character or one line) or very large (an
entire file at a time). That's often okay, because
Python and your operating system work behind the scenes to let your
program use convenient logical chunks for I/O, while arranging
physical I/O operations with chunk sizes that are more attuned to
performance. Reading and writing a whole file at a time is quite
likely to be okay for performance as long as the file is not
inordinately large. Specifically, file-at-a-time I/O is fine as long
as the file's data fits in your
machine's physical memory, leaving enough physical
memory available for your program and operating system to perform
whatever other tasks they need to perform at the same time. The hard
problems of I/O-bound program performance tend to come with huge
files.
If performance is an issue, don't use a file
object's readline method, which
is limited in the amount of chunking and buffering it can perform.
Using writelines, on the other hand, gives no
performance problem when that method is the one most convenient for
your program. Loop directly on the file object (in Python 2.2) to get
one line at a time with the best performance. If the file
isn't too huge, time two versions of your program,
one that loops directly on the file object and one that calls method
readlines, which reads the whole file into memory.
Either solution may prove faster. In Python 2.1, you
can't loop directly on the file object. Instead, use
method xreadlines in a for
loop. xreadlines will be deprecated in Python 2.3,
but if you need top performance in this specific case and need to
support Python 2.1, there is no alternative.
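For example, here is a minimal sketch of the two alternatives, where handle_line is a hypothetical function and 'data.txt' a hypothetical text file:
f = open('data.txt')
for line in f:                       # Python 2.2: loop directly on the file object
    handle_line(line)
f.close()
f = open('data.txt')
for line in f.readlines():           # reads the whole file into memory first
    handle_line(line)
f.close()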
For binary files, specifically large binary files of whose contents
you need just a part on each run of your program, module
mmap, covered in Chapter 14, can
often give you both good performance and program simplicity.
Making an I/O-bound program multithreaded may sometimes afford
substantial performance gains if you can arrange your
program's architecture accordingly. Start a few
worker threads devoted exclusively to I/O, have the computational
threads request I/O operations from the I/O threads via
Queue instances, and try to post the request for
each input operation as soon as you know you'll
eventually need that data. Performance will increase only if there
are other tasks your computational threads can perform while an I/O
thread is blocked waiting for data. Basically, you get better
performance this way if you can manage to overlap computation and
waiting for data, by having different threads do the computing and
the waiting. See Chapter 14 for detailed coverage
of Python threading and a suggested architecture.
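Here is a minimal sketch of such an architecture, with one worker thread devoted to I/O reading whole files whose names the main (computational) thread posts on a Queue as early as possible; the filenames are hypothetical:
import threading, Queue

def io_worker(request_q, result_q):
    # dedicated I/O thread: fetch filenames from request_q, post file contents to result_q
    while True:
        filename = request_q.get()
        if filename is None:           # sentinel value: shut the worker down
            break
        f = open(filename)
        result_q.put((filename, f.read()))
        f.close()

request_q = Queue.Queue()
result_q = Queue.Queue()
worker = threading.Thread(target=io_worker, args=(request_q, result_q))
worker.start()
for name in ('first.dat', 'second.dat'):   # post each input request as soon as the need is known
    request_q.put(name)
# ...do other computation here, overlapping it with the worker's waiting for data...
for i in range(2):
    filename, data = result_q.get()        # collect each file's contents when actually needed
request_q.put(None)                        # ask the worker to terminate
worker.join()
Any gain comes entirely from the overlap: while the worker blocks waiting for the operating system, the main thread keeps computing.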