Published on Perl.com http://www.perl.com/pub/a/2002/02/12/bigo.html See this if you're having trouble printing code examples Optimizing Your Perl By Robert Spier Is your Perl program taking too long to run? This might be because you've chosen a data structure or algorithm that takes a long time to run. By rethinking how you've implemented a function, you might be able to realize huge gains in speed. Some Simple Complexity TheoryBefore we can talk about speeding up something, we need a way to describe how long something takes. Because we're talking about algorithms that may have varying amounts of input, the actual "time" to do something isn't conclusive. Computer scientists and mathematicians use a system called big-O notation to describe the order of magnitude of how long something will take. Big-O notation represents a worst-case analysis. There are other notations to represent the magnitude of minimum and actual runtimes. Don't let the talk of computer scientists and mathematicians scare you. The next few paragraphs talk about a way to describe the difference between orders of magnitude such as a second, minute, hour or day. Or 1, 10, 100 and 1,000. You don't need to know any fancy or scary math to understand it. It's simple. If the runtime of a function is constant, then this is written as O(1). (Note the capital or "big" O.) Constant means that no matter how much data is processed, (i.e., how much input there is) the time doesn't change. If the runtime is directly related to the amount of input
(a.k.a. linear), then this is O(n). If there is twice as much input, then the
algorithm will take twice as long to run. Some functions are Because we are only looking at orders of magnitude (is the operation
going to take one minute or one hour) we can ignore constant factors and
smaller terms. There are rules for analyzing code and determining its big-O, but the simplest way is to look at how many times you iterate over the data. if you don't iterate, then it's O(1). If you iterate over it once, then it's O(n). If you have two nested loops, then it's probably O(n^2). Example #1: GraphingI developed a new graph module for a project at my day job, because the existing one was overkill, yet did not support a feature I needed. It was easy to implement and worked very well, until the graph got very big. I needed flattened subgraphs to do dependency checks, but it was taking more than two minutes to compute (on a 1Ghz P3) for a 1,000 node graph. I needed the computation to be much faster, or my users would get mad. My initial Graph module looked like this. (Simplified to only show relevant parts.)
The add_edge method is The problematic section of the code looked like this:
And I actually needed to do that for multiple vertices;
This would make the entire flattening operation O(n^3). ( The dependency loop is O(n), flat_subgraph is effectively O(n), and in_edges is O(n)). That means that it takes much longer to compute as the Graph becomes bigger. (Imagine a curve with the following values: 1,8,27,64..., that's the curve of relative times.) CachingWith my users expecting instant feedback, something had to be done. The first thing I tried was to cache the in_edges computation.
This helped, but not as much as I had hoped. The in_edges calculation
was still expensive when it wasn't cached. It's still In situations where the same functions are called with the same arguments, caching can be useful. Mark-Jason Dominus has written a module called Memoize, which makes it easy to write caching functions. See his Web site http://perl.plover.com/MiniMemoize/ for more information. Change the StructureThe next thing I tried was to change the Graph data structure. When I originally designed it, my goal was simplicity, not necessarily speed. Now speed was becoming important and I had to make changes. I modified graph so that in_edges were calculated when the new edge was being added.
add_edge remains This was the solution I needed, getting a full flattening down from more than four minutes to only six seconds. The Importance of Good InterfacesOne of the things that made this fiddling possible was a good design for the Graph module's interface. The interface hid all the details of the internals from me, so that different implementations could be plugged in. (This is actually how I did the testing, I had a regular Graph, a Graph::Cached, and a Graph::Fast. Once I decided that I liked Graph::Fast the best, I renamed it to just Graph.) I'm glad I designed Graph the way I did. I put simplicity and design first. There is a phrase you might have heard: "Premature optimization is the root of all evil." If I had optimized the Graph module first, then I might have ended up with more complicated code that didn't work in all cases. My initial cut of Graph was slow, but worked consistently. Optimization happened later, after I had a baseline I could return to, to guarantee proper operation. Example #2: Duplicate Message FilterYou don't always need to try and optimize code to the fullest extent. Optimizing code takes your time and energy, and it isn't always worth it. If you spend two days to shave two seconds off a program, then it's probably not worth it. (Unless of course, that program is running hundreds of times each day.) If a program is "fast enough", then making it faster isn't worth it. Our second example is a duplicate message filter for e-mail. This filter is useful when you receive cc's of messages also sent to mailing lists you are on, but you don't want to see both. The general idea behind this filter is simple.
It's only the seen function that interests us. It searches some form of cache of id's. How we implement it can have a large effect on the speed of our mailfilter. (For more information on filtering mail with perl, take a look at Mail::Audit.) The two most obvious methods that come to mind for implementing seen are a simple linear search (O(n)) or the use of a persistent database/hash table to do the lookup. O(1). I'll ignore the cost of maintaining the database, eliminating old Message-Id's, etc. A linear approach writes out the message id's with some sort of separator. Some programs use null bytes, others use newlines. The database/hash method stores the message id's in a binary database like a DBM or Berkeley DB file. At the expense of using more disk space, lookups are generally faster. But - there's one more factor to take into account - overhead. The linear search has little overhead, while the DB file has a large overhead. (Overhead here means time to open the file and initialize the appropriate data structures.)
I performed a set of timings with a rough example implementation on my P2/400, and received the above results. The hash lookup speeds remained constant, while the linear lookup dropped off rapidly after 250 items in the cache. Returning to the big picture, we need to take our message-id cache size into account when determining the method to use. For this kind of cache, there's usually not a reason to keep a huge amount of entries around, only two or three days worth should catch most duplicate messages. (And probably even one day's worth in most cases.) For this problem, an ConclusionPerl is a powerful language, but it doesn't prevent the programmer from making poor decisions and shooting him/herself in the foot. One mistake is to choose the wrong data structure or algorithm for the job. By using a little bit of complexity theory, it's possible to optimize your code and get huge speed gains. I've only touched the surface of big-O notation and complexity theory. It's a highly researched topic that delves into the deep dark corners of math and computers. But, hopefully there's enough here to give you a basic tool set. As the second example showed, you don't necessarily want to spend too much time thinking about optimization. Over-optimization isn't always worth it. For More Information
Special ThanksWalt Mankowski Return to Related Articles from the O'Reilly Network .
Perl.com Compilation Copyright © 1998-2003 O'Reilly & Associates, Inc. |
|