Using Perl (Running Linux, 4th Edition)

13.4. Using Perl

Perl may well be the best thing to happen to the Unix programming environment in years; it is worth the price of admission to Linux alone.[49] Perl is a text- and file-manipulation language, originally intended to scan large amounts of text, process it, and produce nicely formatted reports from that data. However, as Perl has matured, it has developed into an all-purpose scripting language capable of doing everything from managing processes to communicating via TCP/IP over a network. Perl is free software originally developed by Larry Wall, the Unix guru who brought us the rn newsreader and various popular tools, such as patch. Today it is maintained by Larry and a group of volunteers.

[49]Truth be told, Perl also exists now on other systems, such as Windows. But it is not even remotely as well-known and ubiquitous there as it is on Linux.

Perl's main strength is that it incorporates the most widely used features of other powerful languages, such as C, sed, awk, and various shells, into a single interpreted script language. In the past, performing a complicated job required juggling these various languages into complex arrangements, often entailing sed scripts piping into awk scripts piping into shell scripts and eventually piping into a C program. Perl gets rid of the common Unix philosophy of using many small tools to handle small parts of one large problem. Instead, Perl does it all, and it provides many different ways of doing the same thing. In fact, this chapter was written by an artificial intelligence program developed in Perl. (Just kidding, Larry.)

Perl provides a nice programming interface to many features that were sometimes difficult to use in other languages. For example, a common task of many Unix system administration scripts is to scan a large amount of text, cut fields out of each line of text based on a pattern (usually represented as a regular expression), and produce a report based on the data. Let's say we want to process the output of the Unix last command, which displays a record of login times for all users on the system, like so:

mdw       ttypf    loomer.vpizza.co Sun Jan 16 15:30 - 15:54  (00:23)
larry     ttyp1    muadib.oit.unc.e Sun Jan 16 15:11 - 15:12  (00:00)
johnsonm  ttyp4    mallard.vpizza.c Sun Jan 16 14:34 - 14:37  (00:03)
jem       ttyq2    mallard.vpizza.c Sun Jan 16 13:55 - 13:59  (00:03)
linus     FTP      kruuna.helsinki. Sun Jan 16 13:51 - 13:51  (00:00)
linus     FTP      kruuna.helsinki. Sun Jan 16 13:47 - 13:47  (00:00)

If we want to count up the total login time for each user (given in parentheses in the last field), we could write a sed script to splice the time values from the input, an awk script to sort the data for each user and add up the times, and another awk script to produce a report based on the accumulated data. Or, we could write a somewhat complex C program to do the entire task — complex because, as any C programmer knows, text processing functions within C are somewhat limited.

However, you can easily accomplish this task with a simple Perl script. The facilities of I/O, regular-expression pattern matching, sorting by associative arrays, and number crunching are all easily accessed from a Perl program with little overhead. Perl programs are generally short and to the point, without a lot of technical mumbo jumbo getting in the way of what you want your program to actually do.

Using Perl under Linux is really no different than on other Unix systems. Several good books on Perl already exist, including the O'Reilly books Programming Perl, by Larry Wall, Randal L. Schwartz, and Tom Christiansen; Learning Perl, by Randal L. Schwartz and Tom Christiansen; Advanced Perl Programming by Sriram Srinivasan; and Perl Cookbook by Tom Christiansen and Nathan Torkington. Nevertheless, we think Perl is such a great tool that it deserves something in the way of an introduction. After all, Perl is free software, as is Linux; they go hand in hand.

13.4.1. A Sample Program

What we really like about Perl is that it lets you immediately jump to the task at hand; you don't have to write extensive code to set up data structures, open files or pipes, allocate space for data, and so on. All these features are taken care of for you in a very friendly way.

The example of login times, just discussed, serves to introduce many of the basic features of Perl. First, we'll give the entire script (complete with comments) and then a description of how it works. This script reads the output of the last command (see the previous example) and prints an entry for each user on the system, describing the total login time and number of logins for each. Line numbers are printed to the left of each line for reference:

1       #!/usr/bin/perl 
2 
3       while (<STDIN>) {   # While we have input...   
4         # Find lines and save username, login time 
5         if (/^(\S*)\s*.*\((.*):(.*)\)$/) {   
6           # Increment total hours, minutes, and logins 
7           $hours{$1} += $2; 
8           $minutes{$1} += $3; 
9           $logins{$1}++; 
10        } 
11      } 
12 
13      # For each user in the array...        
14      foreach $user (sort(keys %hours)) { 
15         # Calculate hours from total minutes 
16         $hours{$user} += int($minutes{$user} / 60); 
17         $minutes{$user} %= 60; 
18         # Print the information for this user 
19         print "User $user, total login time "; 
20         # Perl has printf, too 
21         printf "%02d:%02d, ", $hours{$user}, $minutes{$user}; 
22         print "total logins $logins{$user}.\n"; 
23      }

Line 1 tells the loader that this script should be executed through Perl, not as a shell script. Line 3 is the beginning of the program. It is the head of a simple while loop, which C and shell programmers will be familiar with: the code within the braces from lines 4-10 should be executed while a certain expression is true. However, the conditional expression <STDIN> looks funny. Actually, this expression reads a single line from the standard input (represented in Perl through the name STDIN) and makes the line available to the program. This expression returns a true value whenever there is input.

Perl reads input one line at a time (unless you tell it to do otherwise). It also reads by default from standard input, again, unless you tell it to do otherwise. Therefore, this while loop will continuously read lines from standard input, until there are no lines left to be read.
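As a minimal sketch of this line-at-a-time behavior, the following fragment counts the lines available on a filehandle. The helper name count_lines and the sample text are assumptions made for illustration; the filehandle is opened on an in-memory string so that the sketch needs no input file, but the loop would work the same way on STDIN.

```perl
use strict;
use warnings;

# count_lines is a hypothetical helper: it reads a filehandle one line
# at a time and returns how many lines it saw.
sub count_lines {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {   # the loop ends once the input is exhausted
        $count++;
    }
    return $count;
}

# Open a filehandle on an in-memory string (sample data, not real input).
my $sample = "first line\nsecond line\nthird line\n";
open(my $fh, '<', \$sample) or die "cannot open in-memory file: $!";
my $total = count_lines($fh);
close($fh);

print "Read $total lines\n";
```

Each pass through the loop hands the program exactly one line, which is why the script never needs to know in advance how much input there is.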

The evil-looking mess on line 5 is just an if statement. As with most programming languages, the code within the braces (on lines 7-9) will be executed if the expression that follows the if is true. But what is the expression between the parentheses? Those readers familiar with Unix tools, such as grep and sed, will peg this immediately as a regular expression: a cryptic but useful way to represent a pattern to be matched in the input text. Regular expressions are usually found between delimiting slashes (/.../).

This particular regular expression matches lines of the form:

mdw       ttypf    loomer.vpizza.co Sun Jan 16 15:30 - 15:54  (00:23)

This expression also "remembers" the username (mdw) and the total login time for this entry (00:23). You needn't worry about the expression itself; building regular expressions is a complex subject. For now, all you need to know is that this if statement finds lines of the form given in the example, and splices out the username and login time for processing. The username is assigned to the variable $1, the hours to the variable $2, and the minutes to $3. (Scalar variables in Perl begin with the $ character, but unlike the shell, the $ must be used when assigning to the variable as well.) This assignment is done by the regular expression match itself: the text matched by each parenthesized group in a regular expression is saved for later use in the variables $1, $2, $3, and so on, numbered by the position of each group's opening parenthesis.
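To see the three saved values for yourself, you can try the expression on a single line outside the loop. The sketch below wraps the regular expression from line 5 in a helper whose name, parse_last_line, is our own invention; it is not part of the original script.

```perl
use strict;
use warnings;

# parse_last_line is a hypothetical helper wrapping the regular expression
# from line 5 of the script. It returns (username, hours, minutes), or an
# empty list if the line does not match.
sub parse_last_line {
    my ($line) = @_;
    chomp $line;
    if ($line =~ /^(\S*)\s*.*\((.*):(.*)\)$/) {
        return ($1, $2, $3);
    }
    return;
}

my ($user, $hours, $minutes) =
    parse_last_line("mdw       ttypf    loomer.vpizza.co Sun Jan 16 15:30 - 15:54  (00:23)\n");
print "user=$user hours=$hours minutes=$minutes\n";
```

For the sample line, this prints user=mdw hours=00 minutes=23. Lines without a parenthesized time at the end simply fail to match and are skipped.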

Lines 6-9 actually process these three pieces of information. And they do it in an interesting way: through the use of an associative array. Whereas a normal array is indexed with a number as a subscript, an associative array is indexed by an arbitrary string. This lends itself to many powerful applications; it allows you to associate one set of data with another set of data gathered on the fly. In our short program, the keys are the usernames, gathered from the output of last. We maintain three associative arrays, all indexed by username: hours, which records the total number of hours the user logged in; minutes, which records the number of minutes; and logins, which records the total number of logins.

As an example, referencing the variable $hours{'mdw'} returns the total number of hours that the user mdw was logged in. Similarly, if the username mdw is stored in the variable $1, referencing $hours{$1} produces the same effect.

In lines 6-9, we increment the values of these arrays according to the data on the present line of input. For example, given the input line:

jem       ttyq2    mallard.vpizza.c Sun Jan 16 13:55 - 13:59  (00:03)

line 7 adds the number of hours that jem was logged in (stored in the variable $2) to the entry of the hours array indexed with $1 (the username, jem). The Perl += operator adds its right-hand operand to the variable on its left, just like the corresponding C operator. Line 8 updates the value of minutes for the appropriate user similarly. Line 9 increments the value of the logins array by one, using the ++ operator.

Associative arrays are one of the most useful features of Perl. They allow you to build up complex databases while parsing text. It would be nearly impossible to use a standard array for this same task. We would first have to count the number of users in the input stream and then allocate an array of the appropriate size, assigning a position in the array to each user (through the use of a hash function or some other indexing scheme). An associative array, however, allows you to index data directly using strings and without regard for the size of the array in question. (Of course, performance issues always arise when attempting to use large arrays, but for most applications this isn't a problem.)
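Here is a small sketch of the same idea in isolation, assuming a made-up list of usernames in place of real last output. Note that nothing is declared or sized in advance; an entry comes into existence the first time its key is used.

```perl
use strict;
use warnings;

# Tally logins per user from a list of usernames (sample data).
my @seen = qw(mdw larry linus mdw linus linus);

my %logins;
foreach my $name (@seen) {
    $logins{$name}++;    # the entry is created on first use
}

foreach my $user (sort keys %logins) {
    print "$user has $logins{$user} logins\n";
}
```

The loop at the end visits the keys in sorted order, so larry is reported first, then linus, then mdw.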

Let's move on. Line 14 uses the Perl foreach statement, which you may be used to if you write shell scripts. (The foreach loop actually breaks down into a for loop, much like that found in C.) Here, in each iteration of the loop, the variable $user is assigned the next value in the list given by the expression sort(keys %hours). %hours simply refers to the entire associative array hours that we have constructed. The function keys returns a list of all the keys used to index the array, which is in this case a list of usernames. Finally, the sort function sorts the list returned by keys. Therefore, we are looping over a sorted list of usernames, assigning each username in turn to the variable $user.

Lines 16 and 17 simply correct for situations where the number of minutes is 60 or greater. Line 16 determines the number of whole hours contained in the minutes entry for this user and increments hours accordingly; line 17 then uses the %= (modulus) operator to leave only the remaining minutes, from 0 to 59. The int function returns the integral portion of its argument. (Yes, Perl handles floating-point numbers as well; that's why use of int is necessary.)
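The arithmetic is easy to check by hand. In the sketch below, normalize_time is a hypothetical helper that performs the same two steps as lines 16 and 17 on a pair of numbers rather than on the associative arrays.

```perl
use strict;
use warnings;

# normalize_time is a hypothetical helper: it folds excess minutes into
# hours and returns the corrected (hours, minutes) pair.
sub normalize_time {
    my ($hours, $minutes) = @_;
    $hours   += int($minutes / 60);   # whole hours hidden in the minutes
    $minutes %= 60;                   # what is left over
    return ($hours, $minutes);
}

# 2 hours and 135 minutes: 135 minutes is 2 hours and 15 minutes.
my ($h, $m) = normalize_time(2, 135);
print "$h hours, $m minutes\n";
```

Without int, 135 / 60 would yield 2.25, and the hours total would pick up a fractional part.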

Finally, lines 19-22 print the total login time and number of logins for each user. The simple print function just prints its arguments, like the awk function of the same name. Note that variable evaluation can be done within a print statement, as on lines 19 and 22. However, if you want to do some fancy text formatting, you need to use the printf function (which is just like its C equivalent). In this case, we wish to set the minimum output length of the hours and minutes values for this user to 2 characters wide, and to left-pad the output with zeroes. To do this, we use the printf command on line 21.
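The following sketch shows what the %02d format does to a few values. It uses sprintf, which formats exactly as printf does but returns the string instead of printing it; the helper name format_time is an assumption for illustration.

```perl
use strict;
use warnings;

# format_time is a hypothetical helper returning an "HH:MM" style string.
sub format_time {
    my ($hours, $minutes) = @_;
    return sprintf("%02d:%02d", $hours, $minutes);
}

my $short = format_time(1, 7);     # both fields are padded to two digits
my $long  = format_time(153, 3);   # the width is a minimum, not a limit

print "$short\n";
print "$long\n";
```

The width of 2 is only a minimum, so an hours value with three digits is printed in full rather than truncated.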

If this script is saved in the file logintime, we can execute it as follows:

papaya$ last | logintime 
User johnsonm, total login time 01:07, total logins 11. 
User kibo, total login time 00:42, total logins 3. 
User linus, total login time 98:50, total logins 208. 
User mdw, total login time 153:03, total logins 290. 
papaya$

Of course, this example doesn't serve well as a Perl tutorial, but it should give you some idea of what it can do. We encourage you to read one of the excellent Perl books out there to learn more.

13.4.2. More Features

The previous example introduced the most commonly used Perl features by demonstrating a living, breathing program. There is much more where that came from — in the way of both well-known and not-so-well-known features.

As we mentioned, Perl provides a report-generation mechanism beyond the standard print and printf functions. Using this feature, the programmer defines a report "format" that describes how each page of the report will look. For example, we could have included the following format definition in our example:

format STDOUT_TOP = 
User           Total login time     Total logins
-------------- -------------------- -------------------
.
format STDOUT =
@<<<<<<<<<<<<< @<<<<<<<<            @####
$user,         $thetime,            $logins{$user}
.

The STDOUT_TOP definition describes the header of the report, which will be printed at the top of each page of output. The STDOUT format describes the look of each line of output. Each field is described beginning with the @ character; @<<<< specifies a left-justified text field, and @#### specifies a numeric field. The line below the field definitions gives the names of the variables to use in printing the fields. Here, we have used the variable $thetime to store the formatted time string.

To use this report for the output, we replace lines 19-22 in the original script with the following:

$thetime = sprintf("%02d:%02d", $hours{$user}, $minutes{$user});
write;

The first line uses the sprintf function to format the time string and save it in the variable $thetime; the second line is a write command that tells Perl to go off and use the given report format to print a line of output.

Using this report format, we'll get something looking like this:

User           Total login time     Total logins
-------------- -------------------- -------------------
johnsonm       01:07                   11
kibo           00:42                    3
linus          98:50                  208
mdw            153:03                 290

Using other report formats we can achieve different (and better-looking) results.

Perl comes with a huge number of modules that you can plug in to your programs for quick access to very powerful features. A popular online archive called CPAN (for Comprehensive Perl Archive Network) contains even more modules: net modules that let you send mail and carry on with other networking tasks, modules for dumping data and debugging, modules for manipulating dates and times, modules for math functions — the list could go on for pages.

If you hear of an interesting module, check first to see whether it's already loaded on your system. You can look at the directories where modules are located (probably under /usr/lib/perl5) or just try loading in the module and see if it works. Thus, the command:

$ perl -MCGI -e 1
Can't locate CGI.pm in @INC...

gives you the sad news that the CGI.pm module is not on your system. CGI.pm is popular enough to be included in the standard Perl distribution, and you can install it from there, but for many modules you will have to go to CPAN (and some don't make it into CPAN either). CPAN, which is maintained by Jarkko Hietaniemi and Andreas König, resides on dozens of mirror sites around the world because so many people want to download its modules. The easiest way to get onto CPAN is to visit http://www.perl.com/CPAN-local/.
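You can make the same check from inside a program. The following is a rough sketch, not a standard facility: module_available is a hypothetical helper, File::Basename is used only because it ships with Perl, and No::Such::Module is a deliberately nonexistent name.

```perl
use strict;
use warnings;

# module_available is a hypothetical helper: it tries to load a module by
# name and reports success or failure instead of dying.
sub module_available {
    my ($name) = @_;
    # Refuse anything that does not look like a module name, because the
    # name is interpolated into a string eval below.
    return 0 unless $name =~ /^\w+(?:::\w+)*$/;
    return eval("require $name; 1") ? 1 : 0;
}

foreach my $module ('File::Basename', 'No::Such::Module') {
    my $status = module_available($module) ? "installed" : "not found";
    printf "%-20s %s\n", $module, $status;
}
```

A script can use a test like this to fall back to a simpler approach when an optional module is missing.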

The following program — which we wanted to keep short, and therefore neglected to find a useful task to perform — shows two modules, one that manipulates dates and times in a sophisticated manner and another that sends mail. The disadvantage of using such powerful features is that a huge amount of code is loaded from them, making the runtime size of the program quite large:

#! /usr/local/bin/perl

# We will illustrate Date and Mail modules
use Date::Manip;
use Mail::Mailer;

# Illustration of Date::Manip module
if ( Date_IsWorkDay( "today", 1) )  {

    # Today is a workday
    $date = ParseDate( "today" );

}
else {

    # Today is not a workday, so choose next workday
    $date=DateCalc( "today" , "+ 1 business day" );

}

# Convert date from compact string to readable string like "April  8"
$printable_date = UnixDate( $date , "%B %e" );

# Illustration of Mail::Mailer module
my ($to) = "the_person\@you_want_to.mail_to";
my ($from) = "owner_of_script\@system.name";

$mail = Mail::Mailer->new;

$mail->open(
            {
                From => $from,
                To => $to,
                Subject => "Automated reminder",
            }
           );

print $mail <<"MAIL_BODY";
If you are at work on or after
$printable_date,
you will get this mail.
MAIL_BODY

$mail->close;

# The mail has been sent! (Assuming there were no errors.)

The reason packages are so easy to use is that Perl added object-oriented features in version 5. The Date module used in the previous example is not object-oriented, but the Mail module is. The $mail variable in the example is a Mailer object, and it makes mailing messages straightforward through methods like new, open, and close.

To do some major task like parsing HTML, just load the appropriate module (HTML::Parser, for instance) and issue a new command to create the proper object; the methods you need for the task will then be available through that object.

If you want to give a graphical interface to your Perl script, you can use the Tk module, which originally was developed for use with the Tcl language, the Gtk module, which uses the newer GIMP Toolkit (GTK), or the Qt module, which uses the Qt toolkit that also forms the base of the KDE. The book Learning Perl/Tk by Nancy Walsh (O'Reilly) shows you how to do graphics with the Perl/Tk module.

Another abstruse feature of Perl is its ability to (more or less) directly access several Unix system calls, including interprocess communications. For example, Perl provides the functions msgctl, msgget, msgsnd, and msgrcv from System V IPC. Perl also supports the BSD socket implementation, allowing communications via TCP/IP directly from a Perl program. No longer is C the exclusive language of networking daemons and clients. A Perl program loaded with IPC features can be very powerful indeed — especially considering that many client-server implementations call for advanced text processing features such as those provided by Perl. It is generally easier to parse protocol commands transmitted between client and server from a Perl script, rather than write a complex C program to do the work.
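As a small taste of the socket functions, the sketch below uses socketpair to create two connected sockets within one process and passes a line of text in each direction. It assumes a Unix-like system; the message texts are invented for illustration, and a real client and server would of course live in separate processes.

```perl
use strict;
use warnings;
use Socket;
use IO::Handle;

# Create a pair of connected Unix-domain stream sockets.
socketpair(my $left, my $right, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair failed: $!";

# Flush after every print so each line is sent immediately.
$left->autoflush(1);
$right->autoflush(1);

# One end sends a line and the other end reads it.
print $left "HELO papaya\n";
my $request = <$right>;
chomp $request;

# The answer travels back the other way.
print $right "250 hello\n";
my $reply = <$left>;
chomp $reply;

print "request was: $request\n";
print "reply was: $reply\n";
```

Once a socket is open, it behaves like any other Perl filehandle, so print and the <...> operator work on it unchanged.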

As an example, take the well-known SMTP daemon, which handles the sending and receiving of electronic mail. SMTP uses text commands such as MAIL FROM and RCPT TO to enable the client to communicate with the server. Either the client or the server, or both, can be written in Perl, and can have full access to Perl's text- and file-manipulation features as well as the vital socket communication functions.
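To give an idea of how little code the text-handling side needs, here is a rough sketch that picks apart a few SMTP command lines with regular expressions. The helper parse_smtp_command, the hostname, and the addresses are all made up for illustration, and the patterns cover only a small part of what a real mail server must accept.

```perl
use strict;
use warnings;

# parse_smtp_command is a hypothetical helper. It splits one line of an
# SMTP dialogue into an upper-cased command and its argument, and returns
# an empty list for lines it does not recognize.
sub parse_smtp_command {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                        # strip the line ending
    if ($line =~ /^(MAIL FROM|RCPT TO):\s*<([^>]*)>/i) {
        return (uc($1), $2);                     # commands carrying an address
    }
    if ($line =~ /^([A-Za-z]{4})(?:\s+(.*))?$/) {
        return (uc($1), defined $2 ? $2 : '');   # HELO, DATA, QUIT, ...
    }
    return;
}

# Sample dialogue lines (invented for this sketch).
my @dialogue = (
    "HELO papaya\r\n",
    "MAIL FROM:<mdw\@papaya>\r\n",
    "rcpt to:<larry\@papaya>\r\n",
    "QUIT\r\n",
);

foreach my $line (@dialogue) {
    my ($verb, $arg) = parse_smtp_command($line);
    printf "%-10s %s\n", $verb, $arg;
}
```

The equivalent C code would need explicit buffer handling and string comparisons for every command.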

Perl is a fixture of CGI programming — that is, writing small programs that run on a web server and help web pages become more interactive.

13.4.3. Pros and Cons

One of the features of (some might say "problems with") Perl is the ability to abbreviate (and obfuscate) code considerably. In the first script, we have used several common shortcuts. For example, the while (<STDIN>) loop on line 3 reads each line of input into the variable $_ without saying so. Most operations, including the pattern match on line 5, act on the variable $_ by default, so it's usually not necessary to reference $_ by name.
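The sketch below runs the same loop twice over some made-up sample text, first in the abbreviated style and then with every use of $_ written out, to show that the two forms behave identically.

```perl
use strict;
use warnings;

# Sample input, held in a string so that no input file is needed.
my $sample = "mdw logged in\nlarry logged in\nmdw logged out\n";

# Terse form: the loop and the match both use $_ without naming it.
open(my $in, '<', \$sample) or die "cannot open in-memory file: $!";
my $terse = 0;
while (<$in>) {
    $terse++ if /^mdw/;
}
close($in);

# Explicit form: the same loop with every use of $_ spelled out.
open($in, '<', \$sample) or die "cannot open in-memory file: $!";
my $explicit = 0;
while (defined($_ = <$in>)) {
    $explicit++ if $_ =~ /^mdw/;
}
close($in);

print "terse=$terse explicit=$explicit\n";
```

Both counters end up at 2, since two of the three sample lines begin with mdw.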

Perl also gives you several ways of doing the same thing, which can, of course, be either a blessing or a curse depending on how you look at it. In Programming Perl, Larry Wall gives the following example of a short program that simply prints its standard input. All the following statements do the same thing:

while ($_ = <STDIN>) { print; }
while (<STDIN>) { print; }
for (;<STDIN>;) { print; }
print while $_ = <STDIN>;
print while <STDIN>;

The programmer can use the syntax most appropriate for the situation at hand.

Perl is popular, and not just because it is useful. Because Perl provides much in the way of eccentricity, it gives hackers something to play with, so to speak. Perl programmers are constantly outdoing each other with trickier bits of code. Perl lends itself to interesting kludges, neat hacks, and both very good and very bad programming. Unix programmers see it as a challenging medium to work with — because Perl is relatively new, not all the possibilities have been exploited. Even if you find Perl too baroque for your taste, there is still something to be said for its artistry. The ability to call oneself a "Perl hacker" is a point of pride within the Unix community.