Programming Perl

24.4. Fluent Perl

We've touched on a few idioms in the preceding sections (not to mention the preceding chapters), but there are many other idioms you'll commonly see if you read programs by accomplished Perl programmers. When we speak of idiomatic Perl in this context, we don't just mean a set of arbitrary Perl expressions with fossilized meanings. Rather, we mean Perl code that shows an understanding of the flow of the language, what you can get away with when, and what that buys you. And when to buy it.

We can't hope to list all the idioms you might see--that would take a book as big as this one. Maybe two. (See the Perl Cookbook, for instance.) But here are some of the important idioms, where "important" might be defined as "that which induces hissy fits in people who think they already know just how computer languages ought to work".

  • Use => in place of a comma anywhere you think it improves readability:

    return bless $mess => $class;
    This reads, "Bless this mess into the specified class." Just be careful not to use it after a word that you don't want autoquoted:
    sub foo () { "FOO" }
    sub bar () { "BAR" }
    print foo => bar;   # prints fooBAR, not FOOBAR
    Another good place to use => is near a literal comma that might get confused visually:
    join(", " => @array);
    Perl provides you with more than one way to do things so that you can exercise your ability to be creative. Exercise it!
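The autoquoting rule is easy to check for yourself. What follows is a minimal sketch rather than part of the original example; the array names are made up for illustration. It reuses the foo and bar constants from above and shows that adding parentheses turns the left-hand word back into a function call.

```perl
use strict;
use warnings;

sub foo () { "FOO" }
sub bar () { "BAR" }

# A bare identifier immediately to the left of => is quoted as a string.
my @quoted = (foo => bar);      # ("foo", "BAR")

# With parentheses it is no longer a bare word, so foo is called.
my @called = (foo() => bar);    # ("FOO", "BAR")

print "@quoted\n";   # foo BAR
print "@called\n";   # FOO BAR
```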

  • Use the singular pronoun to increase readability:

    for (@lines) {
        $_ .= "\n";
    }
    The $_ variable is Perl's version of a pronoun, and it essentially means "it". So the code above means "for each line, append a newline to it." Nowadays you might even spell that:
    $_ .= "\n" for @lines;
    The $_ pronoun is so important to Perl that its use is mandatory in grep and map. Here is one way to set up a cache of common results of an expensive function:
    %cache = map { $_ => expensive($_) } @common_args;
    $xval = $cache{$x} || expensive($x);

  • Omit the pronoun to increase readability even further.[1]

    [1]In this section, multiple bullet items in a row all refer to the subsequent example, since some of our examples illustrate more than one idiom.

  • Use loop controls with statement modifiers.

    while (<>) {
        next if /^=for\s+(index|later)/;
        $chars += length;
        $words += split;
        $lines += y/\n//;
    }
    This is a fragment of code we used to do page counts for this book. When you're going to be doing a lot of work with the same variable, it's often more readable to leave out the pronouns entirely, contrary to common belief.

    The fragment also demonstrates the idiomatic use of next with a statement modifier to short-circuit a loop.

    The $_ variable is always the loop control variable in grep and map, but the program's reference to it is often implicit:

    @haslen = grep { length } @random;
    Here we take a list of random scalars and only pick the ones that have a length greater than 0.
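A small sketch of the difference between "has a length" and "is true" may help here. The sample data and the @true array are assumptions made up for illustration; only the grep { length } line comes from the text.

```perl
use strict;
use warnings;

my @random = ("camel", "", 0, "0", " ");

# length with no argument measures $_, so only the empty string is dropped.
my @haslen = grep { length } @random;

# A plain truth test would also drop 0 and "0".
my @true   = grep { $_ } @random;

print scalar(@haslen), "\n";   # 4
print scalar(@true), "\n";     # 2
```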

  • Use for to set the antecedent for a pronoun:

    for ($episode) {
        s/fred/barney/g;
        s/wilma/betty/g;
        s/pebbles/bambam/g;
    }
    So what if there's only one element in the loop? It's a convenient way to set up "it", that is, $_. Linguistically, this is known as topicalization. It's not cheating, it's communicating.

  • Implicitly reference the plural pronoun, @_.

  • Use control flow operators to set defaults:

    sub bark {
        my Dog $spot = shift;
        my $quality  = shift || "yapping";
        my $quantity = shift || "nonstop"; 
        ...
    }
    Here we're implicitly using the other Perl pronoun, @_, which means "them". The arguments to a function always come in as "them". The shift operator knows to operate on @_ if you omit it, just as the ride operator at Disneyland might call out "Next!" without specifying which queue is supposed to shift. (There's no point in specifying, because there's only one queue that matters.)

    The || can be used to set defaults despite its origins as a Boolean operator, since Perl returns the first true value. Perl programmers often manifest a cavalier attitude toward the truth; the line above would break if, for instance, you tried to specify a quantity of 0. But as long as you never want to set either $quality or $quantity to a false value, the idiom works great. There's no point in getting all superstitious and throwing in calls to defined and exists all over the place. You just have to understand what it's doing. As long as it won't accidentally be false, you're fine.
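To make the pitfall concrete, here is a hedged sketch comparing || with the defined-or operator //. Note the assumption: // arrived in Perl 5.10, which postdates this text, so it is not something the example above could have used. The two helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this sketch.
sub quantity_or {
    my $n = shift;
    $n ||= "nonstop";      # any false value (0, "", undef) is replaced
    return $n;
}

sub quantity_dor {
    my $n = shift;
    $n //= "nonstop";      # only undef is replaced (Perl 5.10 or later)
    return $n;
}

print quantity_or(0),  "\n";   # nonstop
print quantity_dor(0), "\n";   # 0
print quantity_dor(),  "\n";   # nonstop
```

If a false value is never a legitimate argument, the || form remains perfectly adequate.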

  • Use assignment forms of operators, including control flow operators:

    $xval = $cache{$x} ||= expensive($x);
    Here we don't initialize our cache at all. We just rely on the ||= operator to call expensive($x) and assign it to $cache{$x} only if $cache{$x} is false. The result of that is whatever the new value of $cache{$x} is. Again, we take the cavalier approach towards truth, in that if we cache a false value, expensive($x) will get called again. Maybe the programmer knows that's okay, because expensive($x) isn't expensive when it returns false. Or maybe the programmer knows that expensive($x) never returns a false value at all. Or maybe the programmer is just being sloppy. Sloppiness can be construed as a form of creativity.
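The recomputation of false values can be observed directly. In this sketch, expensive is a hypothetical stand-in that simply doubles its argument, and %calls is an assumption added only to count invocations.

```perl
use strict;
use warnings;

my (%cache, %calls);

# Hypothetical stand-in for an expensive function; returns 0 for input 0.
sub expensive { my $x = shift; $calls{$x}++; return $x * 2 }

for my $x (3, 3, 0, 0) {
    my $xval = $cache{$x} ||= expensive($x);
}

print "calls for 3: $calls{3}\n";   # 1: the true result 6 was cached
print "calls for 0: $calls{0}\n";   # 2: the false result 0 never "sticks"
```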

  • Use loop controls as operators, not just as statements. And...

  • Use commas like small semicolons:

    while (<>) {
        $comments++, next if /^#/;
        $blank++, next    if /^\s*$/;
        last              if /^__END__/;
        $code++;
    }
    print "comment = $comments\nblank = $blank\ncode = $code\n";
    This shows an understanding that statement modifiers modify statements, while next is a mere operator. It also shows the comma being idiomatically used to separate expressions much like you'd ordinarily use a semicolon. (The difference being that the comma keeps the two expressions as part of the same statement, under the control of the single statement modifier.)

  • Use flow control to your advantage:

    while (<>) {
        /^#/       and $comments++, next;
        /^\s*$/    and $blank++, next;
        /^__END__/ and last;
        $code++;
    }
    print "comment = $comments\nblank = $blank\ncode = $code\n";
    Here's the exact same loop again, only this time with the patterns out in front. The perspicacious Perl programmer understands that it compiles down to exactly the same internal codes as the previous example. The if modifier is just a backward and (or &&) conjunction, and the unless modifier is just a backward or (or ||) conjunction.
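One way to convince yourself that the two spellings behave identically is to run both over the same data. This is only a sketch: it loops over an in-memory list instead of <>, and the function names and sample lines are invented for illustration.

```perl
use strict;
use warnings;

my @lines = ("# header\n", "\n", "print 1;\n", "# note\n",
             "print 2;\n", "__END__\n", "ignored\n");

sub count_with_modifiers {
    my %n = (comments => 0, blank => 0, code => 0);
    for (@_) {
        $n{comments}++, next if /^#/;
        $n{blank}++, next    if /^\s*$/;
        last                 if /^__END__/;
        $n{code}++;
    }
    return "$n{comments} $n{blank} $n{code}";
}

sub count_with_and {
    my %n = (comments => 0, blank => 0, code => 0);
    for (@_) {
        # "and" binds more loosely than the comma, so both
        # expressions on the right are guarded by the pattern.
        /^#/       and $n{comments}++, next;
        /^\s*$/    and $n{blank}++, next;
        /^__END__/ and last;
        $n{code}++;
    }
    return "$n{comments} $n{blank} $n{code}";
}

print count_with_modifiers(@lines), "\n";   # 2 1 2
print count_with_and(@lines), "\n";         # 2 1 2
```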

  • Use the implicit loops provided by the -n and -p switches.

  • Don't put a semicolon at the end of a one-line block:

    #!/usr/bin/perl -n
    $comments++, next LINE if /^#/;
    $blank++, next LINE    if /^\s*$/;
    last LINE              if /^__END__/;
    $code++;
    
    END { print "comment = $comments\nblank = $blank\ncode = $code\n" }
    This is essentially the same program as before. We put an explicit LINE label on the loop control operators because we felt like it, but we didn't really need to, since the implicit LINE loop supplied by -n is the innermost enclosing loop. We used an END to get the final print statement outside the implicit main loop, just as in awk.

  • Use here docs when the printing gets ferocious.

  • Use a meaningful delimiter on the here doc:

    END { print <<"COUNTS" }
    comment = $comments
    blank = $blank
    code = $code
    COUNTS
    Rather than using multiple prints, the fluent Perl programmer uses a multiline string with interpolation. And despite our calling it a Common Goof earlier, we've brazenly left off the trailing semicolon because it's not necessary at the end of the END block. (If we ever turn it into a multiline block, we'll put the semicolon back in.)

  • Do substitutions and translations en passant on a scalar:

    ($new = $old) =~ s/bad/good/g;
    Since lvalues are lvaluable, so to speak, you'll often see people changing a value "in passing" while it's being assigned. This could actually save a string copy internally (if we ever get around to implementing the optimization):
    chomp($answer = <STDIN>);
    Any function that modifies an argument in place can do the en passant trick. But wait, there's more!

  • Don't limit yourself to changing scalars en passant:

    for (@new = @old) { s/bad/good/g }
    Here we copy @old into @new, changing everything in passing (not all at once, of course--the block is executed repeatedly, one "it" at a time).
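Both en passant forms can be tried on sample data. The strings below are made up for illustration; the point of the sketch is that the originals are left untouched while the copies are edited.

```perl
use strict;
use warnings;

my $old = "bad dog, bad cat";
(my $new = $old) =~ s/bad/good/g;     # copy first, then edit the copy

my @old = ("bad apple", "bad egg");
my @new;
for (@new = @old) { s/bad/good/g }    # $_ aliases each element of @new

print "$old -> $new\n";     # bad dog, bad cat -> good dog, good cat
print "@old -> @new\n";     # bad apple bad egg -> good apple good egg
```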

  • Pass named parameters using the fancy => comma operator.

  • Rely on assignment to a hash to do even/odd argument processing:

    sub bark {
        my Dog $spot = shift;
        my %parm = @_;
        my $quality  = $parm{QUALITY}  || "yapping";
        my $quantity = $parm{QUANTITY} || "nonstop"; 
        ...
    }
    
    $fido->bark( QUANTITY => "once",
                  QUALITY => "woof" );
    Named parameters are often an affordable luxury. And with Perl, you get them for free, if you don't count the cost of the hash assignment.
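A common variation, sketched here as an assumption rather than as the authors' method, puts the defaults at the front of the hash assignment so that later pairs from @_ override them. Unlike the || form, this lets a caller pass a false value. The sketch uses a plain function instead of a method so that it runs on its own.

```perl
use strict;
use warnings;

# Plain function standing in for the bark method above.
sub bark {
    # Defaults first; any pairs in @_ overwrite them.
    my %parm = (QUALITY => "yapping", QUANTITY => "nonstop", @_);
    return "$parm{QUALITY}, $parm{QUANTITY}";
}

print bark(QUANTITY => "once", QUALITY => "woof"), "\n";   # woof, once
print bark(QUALITY => "growl"), "\n";                      # growl, nonstop
print bark(QUANTITY => 0), "\n";                           # yapping, 0
```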

  • Repeat Boolean expressions until false.

  • Use minimal matching when appropriate.

  • Use the /e modifier to evaluate a replacement expression:

    #!/usr/bin/perl -p
    1 while s/^(.*?)(\t+)/$1 . ' ' x (length($2) * 4 - length($1) % 4)/e;
    This program fixes any file you receive from someone who mistakenly thinks they can redefine hardware tabs to occupy 4 spaces instead of 8. It makes use of several important idioms. First, the 1 while idiom is handy when all the work you want to do in the loop is actually done by the conditional. (Perl is smart enough not to warn you that you're using 1 in a void context.) We have to repeat this substitution because each time we substitute some number of spaces in for tabs, we have to recalculate the column position of the next tab from the beginning.

    The (.*?) matches the smallest string it can up until the first tab, using the minimal matching modifier (the question mark). In this case, we could have used an ordinary greedy * like this: ([^\t]*). But that only works because a tab is a single character, so we can use a negated character class to avoid running past the first tab. In general, the minimal matcher is much more elegant, and doesn't break if the next thing that must match happens to be longer than one character.

    The /e modifier does a substitution using an expression rather than a mere string. This lets us do the calculations we need right when we need them.
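To watch the substitution work without the -p switch, the same statement can be wrapped in a function and applied to a few strings. The function name expand_tabs4 is hypothetical; the regular expression is the one shown above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the one-liner, for a single line of text.
sub expand_tabs4 {
    my $line = shift;
    1 while $line =~ s/^(.*?)(\t+)/$1 . ' ' x (length($2) * 4 - length($1) % 4)/e;
    return $line;
}

# "a" occupies column 0, so one tab needs 3 spaces to reach column 4.
print "[", expand_tabs4("a\tb"), "]\n";        # [a   b]

# The second tab is measured from the already-expanded text.
print "[", expand_tabs4("a\tbc\td"), "]\n";    # [a   bc  d]
```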

  • Use creative formatting and comments on complex substitutions:

    #!/usr/bin/perl -p
    1 while s{
        ^               # anchor to beginning
        (               # start first subgroup
            .*?         # match minimal number of characters
        )               # end first subgroup
        (               # start second subgroup
            \t+         # match one or more tabs
        )               # end second subgroup
    }
    {
        my $spacelen = length($2) * 4;  # account for full tabs
        $spacelen -= length($1) % 4;    # account for the uneven tab
        $1 . ' ' x $spacelen;           # make correct number of spaces
    }ex;
    This is probably overkill, but some people find it more impressive than the previous one-liner. Go figure.

  • Go ahead and use $` if you feel like it:

    1 while s/(\t+)/' ' x (length($1) * 4 - length($`) % 4)/e;
    Here's a shorter version that uses $`, which is known to impact performance. But we're only using its length, so it doesn't really count as bad.

  • Use the offsets directly from the @- (@LAST_MATCH_START) and @+ (@LAST_MATCH_END) arrays:

    1 while s/\t+/' ' x (($+[0] - $-[0]) * 4 - $-[0] % 4)/e;
    This one's even shorter. (If you don't see any arrays there, try looking for array elements instead.) See @- and @+ in Chapter 28, "Special Names".
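A quick sketch shows what the two offsets hold after a match. The sample string and the wrapper name expand_tabs_by_offset are assumptions for illustration; the substitution itself is the one above.

```perl
use strict;
use warnings;

my $line = "ab\t\tcd";

if ($line =~ /\t+/) {
    # $-[0] is where the whole match starts; $+[0] is just past its end.
    printf "match starts at %d, ends at %d, length %d\n",
        $-[0], $+[0], $+[0] - $-[0];    # starts at 2, ends at 4, length 2
}

# Hypothetical wrapper around the one-liner above.
sub expand_tabs_by_offset {
    my $text = shift;
    1 while $text =~ s/\t+/' ' x (($+[0] - $-[0]) * 4 - $-[0] % 4)/e;
    return $text;
}

print "[", expand_tabs_by_offset($line), "]\n";
```

Because $-[0] is the offset of the match start, it plays the same role that length($`) played in the previous version.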

  • Use eval with a constant return value:

    sub is_valid_pattern {
        my $pat = shift;
        return eval { "" =~ /$pat/; 1 } || 0;
    }
    You don't have to use the eval {} operator to return a real value. Here we always return 1 if it gets to the end. However, if the pattern contained in $pat blows up, the eval catches it and returns undef to the Boolean conditional of the || operator, which turns it into a defined 0 (just to be polite, since undef is also false but might lead someone to believe that the is_valid_pattern subroutine is misbehaving, and we wouldn't want that, now would we?).
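A short usage sketch follows; the sample patterns are made up for illustration. The malformed ones make the regex compiler die at run time, which is exactly what the eval is there to catch.

```perl
use strict;
use warnings;

sub is_valid_pattern {
    my $pat = shift;
    return eval { "" =~ /$pat/; 1 } || 0;
}

print is_valid_pattern('^\d+$'),     "\n";   # 1
print is_valid_pattern('(unclosed'), "\n";   # 0: unmatched parenthesis
print is_valid_pattern('*oops'),     "\n";   # 0: quantifier follows nothing
```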

  • Use modules to do all the dirty work.

  • Use object factories.

  • Use callbacks.

  • Use stacks to keep track of context.

  • Use negative subscripts to access the end of an array or string:

    use XML::Parser;
    
    $p = new XML::Parser Style => 'subs';
    setHandlers $p Char => sub { $out[-1] .= $_[1] };
    
    push @out, "";
    
    sub literal {
        $out[-1] .= "C<";
        push @out, "";
    }
    
    sub literal_ {
        my $text = pop @out;
        $out[-1] .= $text . ">";
    }
    ...
    This is a snippet from the 250-line program we used to translate the XML version of the old Camel book back into pod format so we could edit it for this edition with a Real Text Editor.

    The first thing you'll notice is that we rely on the XML::Parser module (from CPAN) to parse our XML correctly, so we don't have to figure out how. That cuts a few thousand lines out of our program right there (presuming we're reimplementing in Perl everything XML::Parser does for us,[2] including translation from almost any character set into UTF-8).

    [2]Actually, XML::Parser is just a fancy wrapper around James Clark's expat XML parser.

    XML::Parser uses a high-level idiom called an object factory. In this case, it's a parser factory. When we create an XML::Parser object, we tell it which style of parser interface we want, and it creates one for us. This is an excellent way to build a testbed application when you're not sure which kind of interface will turn out to be the best in the long run. The subs style is just one of XML::Parser's interfaces. In fact, it's one of the oldest interfaces, and probably not even the most popular one these days.

    The setHandlers line shows a method call on the parser, not in arrow notation, but in "indirect object" notation, which lets you omit the parens on the arguments, among other things. The line also uses the named parameter idiom we saw earlier.

    The line also shows another powerful concept, the notion of a callback. Instead of us calling the parser to get the next item, we tell it to call us. For named XML tags like <literal>, this interface style will automatically call a subroutine of that name (or the name with an underline on the end for the corresponding end tag). But the data between tags doesn't have a name, so we set up a Char callback with the setHandlers method.

    Next we initialize the @out array, which is a stack of outputs. We put a null string into it to represent that we haven't collected any text at the current tag embedding level (0 initially).

    Now is when that callback comes back in. Whenever we see text, it automatically gets appended to the final element of the array, via the $out[-1] idiom in the callback. At the outer tag level, $out[-1] is the same as $out[0], so $out[0] ends up with our whole output. (Eventually. But first we have to deal with tags.)

    Suppose we see a <literal> tag. Then the literal subroutine gets called, appends some text to the current output, then pushes a new context onto the @out stack. Now any text up until the closing tag gets appended to that new end of the stack. When we hit the closing tag, we pop the $text we've collected back off the @out stack, and append the rest of the transmogrified data to the new (that is, the old) end of stack, the result of which is to translate the XML string, <literal>text</literal>, into the corresponding pod string, C<text>.

    The subroutines for the other tags are just the same, only different.
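The stack mechanics can be followed without installing XML::Parser by calling the handlers by hand. In this sketch the text function is a hypothetical stand-in for the Char callback, and the sequence of calls imitates what a parser would do for the input "see <literal>print</literal> now".

```perl
use strict;
use warnings;

my @out = ("");                        # stack of output buffers

sub text     { $out[-1] .= shift }     # stand-in for the Char callback
sub literal  { $out[-1] .= "C<"; push @out, "" }
sub literal_ { my $text = pop @out; $out[-1] .= $text . ">" }

# Hand-written event sequence in place of a real parse.
text("see ");
literal();          # opens a new context on the stack
text("print");      # collected in the inner context
literal_();         # pops it and appends to the outer context
text(" now");

print "$out[0]\n";  # see C<print> now
```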

  • Use my without assignment to create an empty array or hash.

  • Split the default string on whitespace.

  • Assign to lists of variables to collect however many you want.

  • Use autovivification of undefined references to create them.

  • Autoincrement undefined array and hash elements to create them.

  • Use autoincrement of a %seen array to determine uniqueness.

  • Assign to a handy my temporary in the conditional.

  • Use the autoquoting behavior of braces.

  • Use an alternate quoting mechanism to interpolate double quotes.

  • Use the ?: operator to switch between two arguments to a printf.

  • Line up printf args with their % field:

    my %seen;
    while (<>) {
        my ($a, $b, $c, $d) = split;
        print unless $seen{$a}{$b}{$c}{$d}++;
    }
    if (my $tmp = $seen{fee}{fie}{foe}{foo}) {
        printf qq(Saw "fee fie foe foo" [sic] %d time%s.\n),
                                              $tmp,  $tmp == 1 ? "" : "s";
    }
    These nine lines are just chock full of idioms. The first line makes an empty hash because we don't assign anything to it. We iterate over input lines setting "it", that is, $_, implicitly, then using an argumentless split which splits "it" on whitespace. Then we pick off the first four words with a list assignment, throwing any subsequent words away. Then we remember the first four words in a four-dimensional hash, which automatically creates (if necessary) the first three reference elements and final count element for the autoincrement to increment. (Under use warnings, the autoincrement will never warn that you're using undefined values, because autoincrement is an accepted way to define undefined values.) We then print out the line if we've never seen a line starting with these four words before, because the autoincrement is a postincrement, which, in addition to incrementing the hash value, will return the old true value if there was one.

    After the loop, we test %seen again to see if a particular combination of four words was seen. We make use of the fact that we can put a literal identifier into braces and it will be autoquoted. Otherwise, we'd have to say $seen{"fee"}{"fie"}{"foe"}{"foo"}, which is a drag even when you're not running from a giant.

    We assign the result of $seen{fee}{fie}{foe}{foo} to a temporary variable even before testing it in the Boolean context provided by the if. Because assignment returns its left value, we can still test the value to see if it was true. The my tells your eye that it's a new variable, and we're not testing for equality but doing an assignment. It would also work fine without the my, and an expert Perl programmer would still immediately notice that we used one = instead of two ==. (A semiskilled Perl programmer might be fooled, however. Pascal programmers of any skill level will foam at the mouth.)

    Moving on to the printf statement, you can see the qq() form of double quotes we used so that we could interpolate ordinary double quotes as well as a newline. We could've directly interpolated $tmp there as well, since it's effectively a double-quoted string, but we chose to do further interpolation via printf. Our temporary $tmp variable is now quite handy, particularly since we don't just want to interpolate it, but also test it in the conditional of a ?: operator to see whether we should pluralize the word "time". Finally, note that we lined up the two fields with their corresponding % markers in the printf format. If an argument is too long to fit, you can always go to the next line for the next argument, though we didn't have to in this case.
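The same idioms can be exercised on in-memory data. This is a sketch under stated assumptions: the sample lines, the @first array, and the $report variable are invented, and it collects its results with push and sprintf instead of printing from <>.

```perl
use strict;
use warnings;

my @lines = (
    "fee fie foe foo and more\n",
    "fee fie foe foo again\n",
    "one two three four\n",
);

my %seen;
my @first;                     # lines whose first four words are new
for (@lines) {
    my ($w1, $w2, $w3, $w4) = split;
    push @first, $_ unless $seen{$w1}{$w2}{$w3}{$w4}++;
}

my $report = "";
if (my $tmp = $seen{fee}{fie}{foe}{foo}) {
    $report = sprintf qq(Saw "fee fie foe foo" %d time%s.),
                                               $tmp,  $tmp == 1 ? "" : "s";
}

print "$report\n";             # Saw "fee fie foe foo" 2 times.
print scalar(@first), "\n";    # 2
```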

Whew! Had enough? There are many more idioms we could discuss, but this book is already sufficiently heavy. But we'd like to talk about one more idiomatic use of Perl, the writing of program generators.




Copyright © 2002 O'Reilly & Associates. All rights reserved.