9.2. Perl Code BenchmarkingIf you want to benchmark your Perl code, you can use the Benchmark module. For example, let's say that our code generates many long strings and finally prints them out. We wonder what is the most efficient way to handle this task—we can try to concatenate the strings into a single string, or we can store them (or references to them) in an array before generating the output. The easiest way to get an answer is to try each approach, so we wrote the benchmark shown in Example 9-3. Example 9-3. strings_benchmark.pluse Benchmark; use Symbol; my $fh = gensym; open $fh, ">/dev/null" or die $!; my($one, $two, $three) = map { $_ x 4096 } 'a'..'c'; timethese(100_000, { ref_array => sub { my @a; push @a, \($one, $two, $three); my_print(@a); }, array => sub { my @a; push @a, $one, $two, $three; my_print(@a); }, concat => sub { my $s; $s .= $one; $s .= $two; $s .= $three; my_print($s); }, }); sub my_print { for (@_) { print $fh ref($_) ? $$_ : $_; } } As you can see, we generate three big strings and then use three anonymous functions to print them out. The first one (ref_array) stores the references to the strings in an array. The second function (array) stores the strings themselves in an array. The third function (concat) concatenates the three strings into a single string. At the end of each function we print the stored data. If the data structure includes references, they are first dereferenced (relevant for the first function only). We execute each subtest 100,000 times to get more precise results. If your results are too close and are below 1 CPU clocks, you should try setting the number of iterations to a bigger number. Let's execute this benchmark and check the results: panic% perl strings_benchmark.pl Benchmark: timing 100000 iterations of array, concat, ref_array... array: 2 wallclock secs ( 2.64 usr + 0.23 sys = 2.87 CPU) concat: 2 wallclock secs ( 1.95 usr + 0.07 sys = 2.02 CPU) ref_array: 3 wallclock secs ( 2.02 usr + 0.22 sys = 2.24 CPU) First, it's important to remember that the reported wallclock times can be misleading and thus should not be relied upon. If during one of the subtests your computer was more heavily loaded than during the others, it's possible that this particular subtest will take more wallclocks to complete, but this doesn't matter for our purposes. What matters is the CPU clocks, which tell us the exact amount of CPU time each test took to complete. You can also see the fraction of the CPU allocated to usr and sys, which stand for the user and kernel (system) modes, respectively. This tells us what proportions of the time the subtest has spent running code in user mode and in kernel mode. Now that you know how to read the results, you can see that concatenation outperforms the two array functions, because concatenation only has to grow the size of the string, whereas array functions have to extend the array and, during the print, iterate over it. Moreover, the array method also creates a string copy before appending the new element to the array, which makes it the slowest method of the three. Let's make the strings much smaller. Using our original code with a small correction: my($one, $two, $three) = map { $_ x 8 } 'a'..'c'; we now make three strings of 8 characters, instead of 4,096. When we execute the modified version we get the following picture: Benchmark: timing 100000 iterations of array, concat, ref_array... array: 1 wallclock secs ( 1.59 usr + 0.01 sys = 1.60 CPU) concat: 1 wallclock secs ( 1.16 usr + 0.04 sys = 1.20 CPU) ref_array: 2 wallclock secs ( 1.66 usr + 0.05 sys = 1.71 CPU) Concatenation still wins, but this time the array method is a bit faster than ref_array, because the overhead of taking string references before pushing them into an array and dereferencing them afterward during print( ) is bigger than the overhead of making copies of the short strings. As these examples show, you should benchmark your code by rewriting parts of the code and comparing the benchmarks of the modified and original versions. Also note that benchmarks can give different results under different versions of the Perl interpreter, because each version might have built-in optimizations for some of the functions. Therefore, if you upgrade your Perl interpreter, it's best to benchmark your code again. You may see a completely different result. Another Perl code benchmarking method is to use the Time::HiRes module, which allows you to get the runtime of your code with a fine-grained resolution of the order of microseconds. Let's compare a few methods to multiply two numbers (see Example 9-4). Example 9-4. hires_benchmark_time.pluse Time::HiRes qw(gettimeofday tv_interval); my %subs = ( obvious => sub { $_[0] * $_[1] }, decrement => sub { my $a = shift; my $c = 0; $c += $_[0] while $a--; $c; }, ); for my $x (qw(10 100)) { for my $y (qw(10 100)) { for (sort keys %subs) { my $start_time = [ gettimeofday ]; my $z = $subs{$_}->($x,$y); my $end_time = [ gettimeofday ]; my $elapsed = tv_interval($start_time,$end_time); printf "%-9.9s: Doing %3.d * %3.d = %5.d took %f seconds\n", $_, $x, $y, $z, $elapsed; } print "\n"; } } We have used two methods here. The first (obvious) is doing the normal multiplication, $z=$x*$y. The second method is using a trick of the systems where there is no built-in multiplication function available; it uses only the addition and subtraction operations. The trick is to add $x for $y times (as you did in school before you learned multiplication). When we execute the code, we get: panic% perl hires_benchmark_time.pl decrement: Doing 10 * 10 = 100 took 0.000064 seconds obvious : Doing 10 * 10 = 100 took 0.000016 seconds decrement: Doing 10 * 100 = 1000 took 0.000029 seconds obvious : Doing 10 * 100 = 1000 took 0.000013 seconds decrement: Doing 100 * 10 = 1000 took 0.000098 seconds obvious : Doing 100 * 10 = 1000 took 0.000013 seconds decrement: Doing 100 * 100 = 10000 took 0.000093 seconds obvious : Doing 100 * 100 = 10000 took 0.000012 seconds Note that if the processor is very fast or the OS has a coarse time-resolution granularity (i.e., cannot count microseconds) you may get zeros as reported times. This of course shouldn't be the case with applications that do a lot more work. If you run this benchmark again, you will notice that the numbers will be slightly different. This is because the code measures absolute time, not the real execution time (unlike the previous benchmark using the Benchmark module). You can see that doing 10*100 as opposed to 100*10 results in quite different results for the decrement method. When the arguments are 10*100, the code performs the add 100 operation only 10 times, which is obviously faster than the second invocation, 100*10, where the code performs the add 10 operation 100 times. However, the normal multiplication takes a constant time. Let's run the same code using the Benchmark module, as shown in Example 9-5. Example 9-5. hires_benchmark.pluse Benchmark; my %subs = ( obvious => sub { $_[0] * $_[1] }, decrement => sub { my $a = shift; my $c = 0; $c += $_[0] while $a--; $c; }, ); for my $x (qw(10 100)) { for my $y (qw(10 100)) { print "\nTesting $x*$y\n"; timethese(300_000, { obvious => sub {$subs{obvious}->($x, $y) }, decrement => sub {$subs{decrement}->($x, $y)}, }); } } Now let's execute the code: panic% perl hires_benchmark.pl Testing 10*10 Benchmark: timing 300000 iterations of decrement, obvious... decrement: 4 wallclock secs ( 4.27 usr + 0.09 sys = 4.36 CPU) obvious: 1 wallclock secs ( 0.91 usr + 0.00 sys = 0.91 CPU) Testing 10*100 Benchmark: timing 300000 iterations of decrement, obvious... decrement: 5 wallclock secs ( 3.74 usr + 0.00 sys = 3.74 CPU) obvious: 0 wallclock secs ( 0.87 usr + 0.00 sys = 0.87 CPU) Testing 100*10 Benchmark: timing 300000 iterations of decrement, obvious... decrement: 24 wallclock secs (24.41 usr + 0.00 sys = 24.41 CPU) obvious: 2 wallclock secs ( 0.86 usr + 0.00 sys = 0.86 CPU) Testing 100*100 Benchmark: timing 300000 iterations of decrement, obvious... decrement: 23 wallclock secs (23.64 usr + 0.07 sys = 23.71 CPU) obvious: 0 wallclock secs ( 0.80 usr + 0.00 sys = 0.80 CPU) You can observe exactly the same behavior, but this time using the average CPU clocks collected over 300,000 tests and not the absolute time collected over a single sample. Obviously, you can use the Time::HiRes module in a benchmark that will execute the same code many times to report a more precise runtime, similar to the way the Benchmark module reports the CPU time. However, there are situations where getting the average speed is not enough. For example, if you're testing some code with various inputs and calculate only the average processing times, you may not notice that for some particular inputs the code is very ineffective. Let's say that the average is 0.72 seconds. This doesn't reveal the possible fact that there were a few cases when it took 20 seconds to process the input. Therefore, getting the variance[34] in addition to the average may be important. Unfortunately Benchmark.pm cannot provide such results—system timers are rarely good enough to measure fast code that well, even on single-user systems, so you must run the code thousands of times to get any significant CPU time. If the code is slow enough that each single execution can be measured, most likely you can use the profiling tools.
Copyright © 2003 O'Reilly & Associates. All rights reserved. |
|