home | O'Reilly's CD bookshelfs | FreeBSD | Linux | Cisco | Cisco Exam  


sed & awksed & awkSearch this book

11.2. Freely Available awks

There are three versions of awk whose source code is freely available. They are the Bell Labs awk, GNU awk, and mawk, by Michael Brennan. This section discusses the extensions that are common to two or more of them, and then looks at each version in detail and describes how to obtain it.

11.2.1. Common Extensions

This section discusses extensions to the awk language that are available in two or more of the freely available awks.[73]

[73]As the maintainer of gawk and the author of many of the extensions described here and in the section below on gawk, my opinion about the usefulness of these extensions may be biased. Figure You should make your own evaluation. [A.R.]

11.2.1.4. Special filenames

With any version of awk, you can write directly to the special UNIX file, /dev/tty, that is a name for the user's terminal. This can be used to direct prompts or messages to the user's attention when the output of the program is directed to a file:

printf "Enter your name:" >"/dev/tty"

This prints "Enter your name:" directly on the terminal, no matter where the standard output and the standard error are directed.

The three free awks support several special filenames, as listed in Table 11.4.

Table 11.4. Special Filenames

Filename Description
/dev/stdin Standard input (not mawk)[74]
/dev/stdout Standard output
/dev/stderr Standard error

[74]The mawk manpage recommends using "-" for the standard input, which is most portable.

Note that a special filename, like any filename, must be quoted when specified as a string constant.

The /dev/stdin, /dev/stdout, and /dev/stderr special files originated in V8 UNIX. Gawk was the first to build in special recognition of these files, followed by mawk and the Bell Labs awk.

A printerr() function

Error messages inform users about problems often related to missing or incorrect input. You can simply inform the user with a print statement. However, if the output of the program is redirected to a file, the user won't see it. Therefore, it is good practice to specify explicitly that the error message be sent to the terminal.

The following printerr() function helps to create consistent user error messages. It prints the word "ERROR" followed by a supplied message, the record number, and the current record. The following example directs output to /dev/tty:

function printerr (message) {
	# print message, record number and record
	printf("ERROR:%s (%d) %s\n", message, NR, $0) > "/dev/tty"
}

If the output of the program is sent to the terminal screen, then error messages will be mixed in with the output. Outputting "ERROR" will help the user recognize error messages.

In UNIX, the standard destination for error messages is standard error. The rationale for writing to standard error is the same as above. To write to standard error explicitly, you must use the convoluted syntax "cat 1>&2" as in the following example:

print "ERROR" | "cat 1>&2"

This directs the output of the print statement to a pipe which executes the cat command. You can also use the system() function to execute a UNIX command such as cat or echo and direct its output to standard error.

When the special file /dev/stderr is available, this gets much simpler:

print "ERROR" > "/dev/stderr"  # recent awks only

11.2.2. Bell Labs awk

The Bell Labs awk is, of course, the direct descendant of the original V7 awk, and of the "new" awk that first became avaliable with System V Release 3.1. Source code is freely available via anonymous FTP to the host netlib.bell-labs.com. It is in the file /netlib/research/awk.bundle.Z. This is a compressed shell archive file. Be sure to use "binary," or "image" mode to transfer the file. This version of awk requires an ANSI C compiler.

There have been several distinct versions; we will identify them here according to the year they became available.

The first version of new awk became available in late 1987. It had almost everything we've described in the previous four chapters (although there are footnotes that indicate those things that are not available). This version is still in use on SunOS 4.1.x systems and some System V Release 3 UNIX systems.

In 1989, for System V Release 4, several new things were added. The only difference between this version and POSIX awk is that POSIX uses CONVFMT for number-to-string conversions, while the 1989 version still used OFMT. The new features were:

  • Escape characters in command-line assignments were now interpreted.

  • The tolower() and toupper() functions were added.

  • printf was improved: dynamic width and precision were added, and the behavior for "%c" was rationalized.

  • The return value from the srand() function was defined to be the previous seed. (The awk book didn't state what srand() returned.)

  • It became possible to use regular expressions as simple expressions. For example:

    if (/cute/ || /sweet/)
    	print "potential here!"
  • The -v option was added to allow setting variables on the command line before execution of the BEGIN procedure.

  • Multiple -f options could now be used to have multiple source files. (This originated in MKS awk, was adopted by gawk, and then added to the Bell Labs awk.)

  • The ENVIRON array was added. (This was developed independently for both MKS awk and gawk, and then added to the Bell Labs awk.)

In 1993, Brian Kernighan of Bell Labs was able to release the source code to his awk. At this point, CONVFMT became available, and the fflush() function, described above, was added. A bug-fix release was made in August of 1994.

In June of 1996, Brian Kernighan made another release. It can be retrieved either from the FTP site given above, or via a World Wide Web browser from Dr. Kernighan's Web page (http://cm.bell-labs.com/who/bwk), which refers to this version as "the one true awk." Figure This version adds several features that originated in gawk and mawk, described earlier in this chapter in the "Common Extensions" section.

11.2.3. GNU awk (gawk)

The Free Software Foundation GNU project's version of awk, gawk, implements all the features of the POSIX awk, and many more. It is perhaps the most popular of the freely available implementations; gawk is used on Linux systems, as well as various other freely available UNIX-like systems, such as NetBSD and FreeBSD.

Source code for gawk is available via anonymous FTP[75] to the host ftp.gnu.org. It is in the file ftp://ftp.gnu.org/gnu/gawk/gawk-3.0.4.tar.gz (there may be a later version there by the time you read this). This is a tar file compressed with the gzip program, whose source code is available in the same directory. There are many sites worldwide that "mirror" the files from the main GNU distribution site; if you know of one close to you, you should get the files from there. Be sure to use "binary" or "image" mode to transfer the file(s).

[75]If you don't have Internet access and wish to get a copy of gawk, contact the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 U.S.A. The telephone number is 617-542-5942, and the fax number is 617-542-2652.

Besides the common extensions listed earlier, gawk has a number of additional features. We examine them in this section.

11.2.3.1. Command line options

Gawk has several very useful command-line options. Like most GNU programs, these options are spelled out and begin with two dashes, "--".

There are a number of other options that are less important for everyday programming and script portability; see the gawk documentation for details.

Although POSIX awk allows you to have multiple instances of the -f option, there is no easy way to use library functions from a command-line program. The --source option in gawk makes this possible.

gawk --source 'script' -f mylibs.awk file1 file2

This example runs the program in script, which can use awk functions from the file mylibs.awk. The input data comes from file1 and file2.

11.2.3.4. Extended regular expressions

Gawk provides several additional regular expression operators. These are common to most GNU programs that work with regular expressions. The extended operators are listed in Table 11.5.

Table 11.5. Gawk Extended Regular Expressions

Special Operators

Usage

\w

Matches any word-constituent character (a letter, digit, or underscore).

\W

Matches any character that is not word-constituent.

\<

Matches the empty string at the beginning of a word.

\>

Matches the empty string at the end of a word.

\y

Matches the empty string at either the beginning or end of a word (the word boundary). Other GNU software uses "\b", but that was already taken.

\B

Matches the empty string within a word.

\`

Matches the empty string at the beginning of a buffer. This is the same as a string in awk, and thus is the same as ^. It is provided for compatibility with GNU Emacs and other GNU software.

\'

Matches the empty string at the end of a buffer. This is the same as a string in awk, and thus is the same as $. It is provided for compatibility with GNU Emacs and other GNU software.

You can think of "\w" as a shorthand for the (POSIX) notation [[:alnum:]_] and "\W" as a shorthand for [^[:alnum:]_]. The following table gives examples of what the middle four operators match, borrowed from Effective AWK Programming.

Table 11.6. Examples of gawk Extended Regular Expression Operators

Expression Matches Does Not Match
\<away away stowaway
stow\> stow stowaway
\yballs?\y ball or balls ballroom or baseball
\Brat\B crate dirty rat

11.2.3.8. Additional variables

Gawk has several more system variables. They are listed in Table 11.8.

Table 11.8. Additional gawk System Variables

Variable Description
ARGIND The index in ARGV of the current input file.
ERRNO A message describing the error if getline or close() fail.
FIELDWIDTHS

A space-separated list of numbers describing the widths of the input fields.

IGNORECASE

If non-zero, pattern matches and string comparisons are case-independent.

RT The value of the input text that matched RS.

We have already seen the record terminator variable, RT, so we'll proceed to the other variables that we haven't covered yet.

All pattern matching and string comparison in awk is case sensitive. Gawk introduced the IGNORECASE variable so that you can specify that regular expressions be interpreted without regard for upper- or lowercase characters. Beginning with version 3.0 of gawk, string comparisons can also be done without case sensitivity.

The default value of IGNORECASE is zero, which means that pattern matching and string comparison are performed the same as in traditional awk. If IGNORECASE is set to a non-zero value, then case distinctions are ignored. This applies to all places where regular expressions are used, including the field separator FS, the record separator RS, and all string comparisons. It does not apply to array subscripting.

Two more gawk variables are of interest. ARGIND is set automatically by gawk to be the index in ARGV of the current input file name. This variable gives you a way to track how far along you are in the list of filenames.

Finally, if an error occurs doing a redirection for getline or during a close(), gawk sets ERRNO to a string describing the error. This makes it possible to provide descriptive error messages when something goes wrong.

11.2.3.10. A general substitution function

The 3.0 version of gawk introduced a new general substitution function, named gensub(). The sub() and gsub() functions have some problems.

For all these reasons, gawk introduced the gensub() function. The function takes at least three arguments. The first is a regular expression to search for. The second is the replacement string. The third is a flag that controls how many substitutions should be performed. The fourth argument, if present, is the original string to change. If it is not provided, the current input record ($0) is used.

The pattern can have subpatterns delimited by parentheses. For example, it can have "/(part) (one|two|three)/". Within the replacement string, a backslash followed by a digit represents the text that matched the nth subpattern.

$ echo part two | gawk '{ print gensub(/(part) (one|two|three)/, "\\2", "g") }'
two

The flag is either a string beginning with g or G, in which case the substitution happens globally, or it is a number indicating that the nth occurrence should be replaced.

$ echo a b c a b c a b c | gawk '{ print gensub(/a/, "AA", 2) }'
a b c AA b c a b c

The fourth argument is the string in which to make the change. Unlike sub() and gsub(), the target string is not changed. Instead, the new string is the return value from gensub().

$ gawk '
BEGIN { old = "hello, world"
        new = gensub(/hello/, "goodbye", 1, old)
        printf("<%s>, <%s>\n", old, new)
}'
<hello, world>, <goodbye, world>

11.2.3.11. Time management for programmers

Awk programs are very often used for processing the log files produced by various programs. Often, each record in a log file contains a timestamp, indicating when the record was produced. For both conciseness and precision, the timestamp is written as the result of the UNIX time(2) system call, which is the number of seconds since midnight, January 1, 1970 UTC. (This date is often referred to as "the Epoch.") To make it easier to generate and process log file records with these kinds of timestamps in them, gawk has two functions, systime() and strftime().

The systime() function is primarily intended for generating timestamps to go into log records. Suppose, for example, that we use an awk script to respond to CGI queries to our WWW server. We might log each query to a log file.

{
...
printf("%s:%s:%d\n", User, Host, systime()) >> "/var/log/cgi/querylog"
...
}

Such a record might look like

arnold:some.domain.com:831322007

The strftime() function[78] makes it easy to turn timestamps into human-readable dates. The format string is similar to the one used by sprintf(); it consists of literal text mixed with format specifications for different components of date and time.

[78]This function is patterned after the function of the same name in ANSI C.

$ gawk 'BEGIN { print strftime("Today is %A, %B %d, %Y") }'
Today is Sunday, May 05, 1996

The list of available formats is quite long. See your local strftime(3) manpage, and the gawk documentation for the full list. Our hypothetical CGI log file might be processed by this program:

# cgiformat --- process CGI logs
# data format is user:host:timestamp
#1
BEGIN {	FS = ":"; SUBSEP = "@" }

#2
{
# make data more obvious
	user = $1; host = $2; time = $3
# store first contact by this user
	if (! ((user, host) in first))
		first[user, host] = time
# count contacts
	count[user, host]++
# save last contact
	last[user, host] = time
}

#3
END {
# print the results
	for (contact in count) {
		i = strftime("%y-%m-%d %H:%M", first[contact])
		j = strftime("%y-%m-%d %H:%M", last[contact])
		printf "%s -> %d times between %s and %s\n",
			contact, count[contact], i, j
	}
}

The first step is to set FS to ":" to split the field correctly. We also use a neat trick and set the subscript separator to "@", so that the arrays become indexed by "user@host" strings.

In the second step, we look to see if this is the first time we've seen this user. If so (they're not in the first array), we add them. Then we increment the count of how many times they've connected. Finally we store this record's timestamp in the last array. This element keeps getting overwritten each time we see a new connection by the user. That's OK; what we will end up with is the last (most recent) connection stored in the array.

The END procedure formats the data for us. It loops through the count array, formatting the timestamps in the first and last arrays for printing. Consider a log file with the following records in it.

$ cat /var/log/cgi/querylog
arnold:some.domain.com:831322007
mary:another.domain.org:831312546
arnold:some.domain.com:831327215
mary:another.domain.org:831346231
arnold:some.domain.com:831324598

Here's what running the program produces:

$ gawk -f cgiformat.awk /var/log/cgi/querylog
mary@another.domain.org -> 2 times between 96-05-05 12:09 and 96-05-05 21:30
arnold@some.domain.com -> 3 times between 96-05-05 14:46 and 96-05-05 15:29


Library Navigation Links

Copyright © 2003 O'Reilly & Associates. All rights reserved.