11.2 Freely Available awks

There are three versions of awk whose source code is freely available. They are the Bell Labs awk, GNU awk, and mawk, by Michael Brennan. This section discusses the extensions that are common to two or more of them, and then looks at each version in detail and describes how to obtain it.

11.2.1 Common Extensions

This section discusses extensions to the awk language that are available in two or more of the freely available awks.[2]

11.2.1.1 Deleting all elements of an array

All three free awks extend the delete statement, making it possible to delete all the elements of an array at one time. The syntax is:

    delete array

Normally, to delete every element from an array, you have to use a loop, like this:

    for (i in data)
        delete data[i]

With the extended version of the delete statement, you can simply use:

    delete data

This is particularly useful for arrays with lots of subscripts; this version is considerably faster than the one using a loop.

Even though it no longer has any elements, you cannot use the array name as a simple variable. Once an array, always an array.

This extension appeared first in gawk, then in mawk and the Bell Labs awk.

11.2.1.2 Obtaining individual characters

All three awks extend field splitting and array splitting as follows. If the value of FS is the empty string, then each character of the input record becomes a separate field. This greatly simplifies cases where it's necessary to work with individual characters.

Similarly, if the third argument to the split() function is the empty string, each character in the original string will become a separate element of the target array.

Without these extensions, you have to use repeated calls to the substr() function to obtain individual characters.

This extension appeared first in mawk, then in gawk and the Bell Labs awk.

11.2.1.3 Flushing buffered output

The 1993 version of the Bell Labs awk introduced a new function that is not in the POSIX standard, fflush(). Like close(), the argument to fflush() is the name of an open file or pipe. Unlike close(), the fflush() function only works on output files and pipes.

Most programs buffer their output, storing data to be written to a file or pipe in an internal chunk of memory until there's enough to send on to the destination. Occasionally, it's useful for the programmer to be able to explicitly flush the buffer, that is, force all buffered data to actually be delivered. This is the purpose of the fflush() function.

This function appeared first in the Bell Labs awk, then in gawk and mawk.
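
The three extensions described so far can be tried together in one short program. What follows is only a sketch: the shell function name spell_chars is made up for this illustration, and it assumes that the awk on your system is one of the three free awks and accepts "/dev/stdout" as the name of standard output in fflush().

```shell
# Sketch: split-into-characters, whole-array delete, and fflush() together.
# spell_chars is a hypothetical name used only for this illustration.
spell_chars() {
    awk -v word="$1" 'BEGIN {
        # Empty third argument: each character becomes one array element.
        n = split(word, chars, "")
        for (i = 1; i <= n; i++)
            printf "%s%s", chars[i], (i < n ? "-" : "\n")

        # Extended delete: remove every element in one statement.
        delete chars
        count = 0
        for (k in chars)
            count++
        print "elements left: " count

        # Ask awk to deliver any buffered standard output now
        # (assumes this awk accepts "/dev/stdout" as a name here).
        fflush("/dev/stdout")
    }'
}
```

Running spell_chars awk should print the word one character at a time, joined by dashes, followed by a line reporting that no elements are left in the array.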

11.2.1.4 Special filenames

With any version of awk, you can write directly to the special UNIX file, /dev/tty, that is a name for the user's terminal. This can be used to direct prompts or messages to the user's attention when the output of the program is directed to a file:

    printf "Enter your name:" >"/dev/tty"

This prints "Enter your name:" directly on the terminal, no matter where the standard output and the standard error are directed.

The three free awks support several special filenames, as listed in Table 11.4.
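
As a small illustration of how such a special filename is used, here is a sketch that keeps diagnostics out of the data stream by writing them to /dev/stderr. The function name check_fields and the two-field rule are assumptions made up for the example; it also assumes /dev/stderr is among the names your awk recognizes.

```shell
# Sketch: send complaints to standard error, good records to standard output.
# check_fields is a hypothetical name; the two-field rule is only an example.
check_fields() {
    awk '
    NF != 2 {
        # The special filename is a string constant, so it is quoted.
        printf "line %d: expected 2 fields, got %d\n", NR, NF > "/dev/stderr"
        next
    }
    { print $1, $2 }'
}
```

With this in place, a command such as check_fields < data > clean.out leaves only well-formed records in clean.out, while the complaints still reach the terminal.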

Note that a special filename, like any filename, must be quoted when specified as a string constant.

The /dev/stdin, /dev/stdout, and /dev/stderr special files originated in V8 UNIX. Gawk was the first to build in special recognition of these files, followed by mawk and the Bell Labs awk.

11.2.1.5 The nextfile statement

The nextfile statement is similar to next, but it operates at a higher level. When nextfile is executed, the current data file is abandoned, and processing starts over at the top of the script, using the first record of the following file. This is useful when you know that you only need to process part of a file; there's no need to then set up a loop to skip records using next.

The nextfile statement originated in gawk, and then was added to the Bell Labs awk. It will be available in mawk, starting with version 1.4.

11.2.1.6 Regular expression record separators (gawk and mawk)

Gawk and mawk allow RS to be a full regular expression, not just a single character. In that case, the records are separated by the longest text in the input that matches the regular expression. Gawk also sets RT (the record terminator) to the actual input text that matched RS. An example of this is given below.

The ability to have RS be a regular expression first appeared in mawk, and was later added to gawk.

11.2.2 Bell Labs awk

The Bell Labs awk is, of course, the direct descendant of the original V7 awk, and of the "new" awk that first became available with System V Release 3.1. Source code is freely available via anonymous FTP to the host netlib.bell-labs.com. It is in the file /netlib/research/awk.bundle.Z. This is a compressed shell archive file. Be sure to use "binary" or "image" mode to transfer the file. This version of awk requires an ANSI C compiler.

There have been several distinct versions; we will identify them here according to the year they became available.

The first version of new awk became available in late 1987.
It had almost everything we've described in the previous four chapters (although there are footnotes that indicate those things that are not available). This version is still in use on SunOS 4.1.x systems and some System V Release 3 UNIX systems.

In 1989, for System V Release 4, several new things were added. The only difference between this version and POSIX awk is that POSIX uses CONVFMT for number-to-string conversions, while the 1989 version still used OFMT. The new features were:

In 1993, Brian Kernighan of Bell Labs was able to release the source code to his awk. At this point, CONVFMT became available, and the fflush() function, described above, was added. A bug-fix release was made in August of 1994.

In June of 1996, Brian Kernighan made another release. It can be retrieved either from the FTP site given above, or via a World Wide Web browser from Dr. Kernighan's Web page (http://cm.bell-labs.com/who/bwk), which refers to this version as "the one true awk." :-) This version adds several features that originated in gawk and mawk, described earlier in this chapter in the "Common Extensions" section.

11.2.3 GNU awk (gawk)

The Free Software Foundation GNU project's version of awk, gawk, implements all the features of the POSIX awk, and many more. It is perhaps the most popular of the freely available implementations; gawk is used on Linux systems, as well as various other freely available UNIX-like systems, such as NetBSD and FreeBSD.

Source code for gawk is available via anonymous FTP[4] to the host ftp.gnu.ai.mit.edu. It is in the file /pub/gnu/gawk-3.0.3.tar.gz (there may be a later version there by the time you read this). This is a tar file compressed with the gzip program, whose source code is available in the same directory. There are many sites worldwide that "mirror" the files from the main GNU distribution site; if you know of one close to you, you should get the files from there. Be sure to use "binary" or "image" mode to transfer the file(s).

Besides the common extensions listed earlier, gawk has a number of additional features. We examine them in this section.

11.2.3.1 Command line options

Gawk has several very useful command-line options. Like most GNU programs, these options are spelled out and begin with two dashes, "--".

There are a number of other options that are less important for everyday programming and script portability; see the gawk documentation for details.

Although POSIX awk allows you to have multiple instances of the -f option, there is no easy way to use library functions from a command-line program. The --source option in gawk makes this possible.

    gawk --source 'script' -f mylibs.awk file1 file2

This example runs the program in script, which can use awk functions from the file mylibs.awk. The input data comes from file1 and file2.

11.2.3.2 An awk program search path

Gawk allows you to specify an environment variable named AWKPATH that defines a search path for awk program files. By default, it is defined to be .:/usr/local/share/awk. Thus, when a filename is specified with the -f option, the two default directories will be searched, beginning with the current directory. Note that if the filename contains a "/", then no search is performed.

For example, if mylibs.awk was a file of awk functions in /usr/local/share/awk, and myprog.awk was a program in the current directory, we run gawk like this:

    gawk -f myprog.awk -f mylibs.awk datafile1

Gawk would find each file in the appropriate place. This makes it much easier to have and use awk library functions.

11.2.3.3 Line continuation

Gawk allows you to break lines after either a "?" or ":". You can also continue strings across newlines using a backslash.

11.2.3.4 Extended regular expressions

Gawk provides several additional regular expression operators. These are common to most GNU programs that work with regular expressions. The extended operators are listed in Table 11.5.

You can think of "\w" as a shorthand for the (POSIX) notation [[:alnum:]_] and "\W" as a shorthand for [^[:alnum:]_]. The following table gives examples of what the middle four operators match, borrowed from Effective AWK Programming.
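
The word-boundary operators are easiest to understand by trying them. The sketch below assumes that gawk is installed and that the "middle four" operators are \<, \>, \y, and \B, as they are named in the gawk documentation; the table itself is not reproduced here, so treat that as an assumption. The function name match_words is invented for the example.

```shell
# Sketch, assuming gawk and its \<, \>, \y and \B regular expression operators.
# match_words is a hypothetical helper name.
match_words() {
    gawk '
    /\<away/     { print $0 ": a word starts with away" }
    /stow\>/     { print $0 ": a word ends with stow" }
    /\yballs?\y/ { print $0 ": ball or balls as a whole word" }
    /\Brat\B/    { print $0 ": rat in the middle of a word" }'
}
```

Feeding it the words away, stowaway, stow, balls, and crate, one per line, should report a match for every word except stowaway, where neither "away" nor "stow" sits at a word boundary.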

11.2.3.5 Regular expression record terminators

Besides allowing RS to be a regular expression, gawk sets the variable RT (record terminator) to the actual input text that matched the value of RS.

Here is a simple example, due to Michael Brennan, that shows the power of gawk's RS and RT variables. As we have seen, one of the most common uses of sed is its substitute command (s/old/new/g). By setting RS to the pattern to match, and ORS to the replacement text, a simple print statement can print the unchanged text followed by the replacement text.

    $

There is one wrinkle; at end of file, RT will be empty, so we use a printf statement to print the record.[5] We could run the program like this.

    $

11.2.3.6 Separating fields

Besides the regular way that awk lets you split the input into records and the record into fields, gawk gives you some additional capabilities.

First, as mentioned above, if the value of FS is the empty string, then each character of the input record becomes a separate field.

Second, the special variable FIELDWIDTHS can be used to split out data that occurs in fixed-width columns. Such data may or may not have whitespace separating the values of the fields.

    FIELDWIDTHS = "5 6 8 3"

Here, the record has four fields: $1 is five characters wide, $2 is six characters wide, and so on. Assigning a value to FIELDWIDTHS causes gawk to start using it for field splitting. Assigning a value to FS causes gawk to return to the regular field splitting mechanism. Use FS = FS to make this happen without having to save the value of FS in an extra variable.

This facility would be of most use when working with fixed-width field data, where there may not be any whitespace separating fields, or when intermediate fields may be all blank.

11.2.3.7 Additional special files

Gawk has a number of additional special filenames that it interprets internally. All of the special filenames are listed in Table 11.7.

The first three were described earlier. The fourth filename provides access to any open file descriptor that may have been inherited from gawk's parent process (usually the shell). You can use file descriptor 0 for standard input, 1 for standard output, and 2 for standard error.

The second group of special files, labeled "obsolete," have been in gawk for a while, but are being phased out. They will be replaced by a PROCINFO array, whose subscripts are the desired item and whose element value is the associated value.

For example, you would use PROCINFO["pid"] to get the current process ID, instead of using getline pid < "/dev/pid". Check the gawk documentation to see if PROCINFO is available and if these filenames are still supported.

11.2.3.8 Additional variables

Gawk has several more system variables. They are listed in Table 11.8.

We have already seen the record terminator variable, RT, so we'll proceed to the other variables that we haven't covered yet.

All pattern matching and string comparison in awk is case sensitive. Gawk introduced the IGNORECASE variable so that you can specify that regular expressions be interpreted without regard for upper- or lowercase characters. Beginning with version 3.0 of gawk, string comparisons can also be done without case sensitivity.

The default value of IGNORECASE is zero, which means that pattern matching and string comparison are performed the same as in traditional awk. If IGNORECASE is set to a non-zero value, then case distinctions are ignored. This applies to all places where regular expressions are used, including the field separator FS, the record separator RS, and all string comparisons. It does not apply to array subscripting.

Two more gawk variables are of interest. ARGIND is set automatically by gawk to be the index in ARGV of the current input file name. This variable gives you a way to track how far along you are in the list of filenames.

Finally, if an error occurs doing a redirection for getline or during a close(), gawk sets ERRNO to a string describing the error. This makes it possible to provide descriptive error messages when something goes wrong.

11.2.3.9 Additional functions

Gawk has one additional string function, and two functions for dealing with the current date and time. They are listed in Table 11.9.

11.2.3.10 A general substitution function

The 3.0 version of gawk introduced a new general substitution function, named gensub(). The sub() and gsub() functions have some problems.

For all these reasons, gawk introduced the gensub() function. The function takes at least three arguments. The first is a regular expression to search for. The second is the replacement string. The third is a flag that controls how many substitutions should be performed. The fourth argument, if present, is the original string to change. If it is not provided, the current input record ($0) is used.

The pattern can have subpatterns delimited by parentheses. For example, it can have "/(part) (one|two|three)/". Within the replacement string, a backslash followed by a digit represents the text that matched the nth subpattern.

The flag is either a string beginning with g or G, in which case the substitution happens globally, or it is a number indicating that the nth occurrence should be replaced.

The fourth argument is the string in which to make the change. Unlike sub() and gsub(), the target string is not changed. Instead, the new string is the return value from gensub().

11.2.3.11 Time management for programmers

Awk programs are very often used for processing the log files produced by various programs. Often, each record in a log file contains a timestamp, indicating when the record was produced. For both conciseness and precision, the timestamp is written as the result of the UNIX time(2) system call, which is the number of seconds since midnight, January 1, 1970 UTC. (This date is often referred to as "the Epoch.") To make it easier to generate and process log file records with these kinds of timestamps in them, gawk has two functions, systime() and strftime().

The systime() function is primarily intended for generating timestamps to go into log records. Suppose, for example, that we use an awk script to respond to CGI queries to our WWW server. We might log each query to a log file.

    {
        ...
        printf("%s:%s:%d\n", User, Host, systime()) >> "/var/log/cgi/querylog"
        ...
    }

Such a record might look like:

    arnold:some.domain.com:831322007

The strftime() function[7] makes it easy to turn timestamps into human-readable dates. The format string is similar to the one used by sprintf(); it consists of literal text mixed with format specifications for different components of date and time.
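
A minimal sketch of strftime() in use follows, assuming gawk is installed. The function name format_stamp is invented for the example, and TZ=UTC0 is set only so that the result does not depend on the local time zone.

```shell
# Sketch, assuming gawk; TZ=UTC0 pins the time zone so the result is repeatable.
# format_stamp is a hypothetical helper name.
format_stamp() {
    TZ=UTC0 gawk -v stamp="$1" 'BEGIN {
        # Literal text (dashes, colons, a space) mixed with % specifications.
        print strftime("%Y-%m-%d %H:%M:%S", stamp)
    }'
}
```

Given the timestamp from the sample record above, format_stamp 831322007 should print 1996-05-05 18:46:47.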

The list of available formats is quite long. See your local strftime(3) manpage, and the gawk documentation for the full list. Our hypothetical CGI log file might be processed by this program:

    # cgiformat --- process CGI logs
    # data format is user:host:timestamp
    #1
    BEGIN { FS = ":"; SUBSEP = "@" }

    #2
    {
        # make data more obvious
        user = $1; host = $2; time = $3
        # store first contact by this user
        if (! ((user, host) in first))
            first[user, host] = time
        # count contacts
        count[user, host]++
        # save last contact
        last[user, host] = time
    }

    #3
    END {
        # print the results
        for (contact in count) {
            i = strftime("%y-%m-%d %H:%M", first[contact])
            j = strftime("%y-%m-%d %H:%M", last[contact])
            printf "%s -> %d times between %s and %s\n",
                contact, count[contact], i, j
        }
    }

The first step is to set FS to ":" to split the field correctly. We also use a neat trick and set the subscript separator to "@", so that the arrays become indexed by "user@host" strings.

In the second step, we look to see if this is the first time we've seen this user. If so (they're not in the first array), we add them. Then we increment the count of how many times they've connected. Finally we store this record's timestamp in the last array. This element keeps getting overwritten each time we see a new connection by the user. That's OK; what we will end up with is the last (most recent) connection stored in the array.

The END procedure formats the data for us. It loops through the count array, formatting the timestamps in the first and last arrays for printing. Consider a log file with the following records in it.

    $

Here's what running the program produces:

    $

11.2.4 Michael's awk (mawk)

The third freely available awk is mawk, written by Michael Brennan. This program is upwardly compatible with POSIX awk, and has a few extensions as well. It is solid and performs very well. Source code for mawk is freely available via anonymous FTP from ftp.whidbey.net. It is in /pub/brennan/mawk1.3.3.tar.gz.
(There may be a later version there by the time you read this.) This is also a tar file compressed with the gzip program. Be sure to use "binary" or "image" mode to transfer the file.

Mawk's primary advantages are its speed and robustness. Although it has fewer features than gawk, it almost always outperforms it.[8] Besides UNIX systems, mawk also runs under MS-DOS.

The common extensions described above are also available in mawk.