Freely Available awks (sed & awk, Second Edition)

There are three versions of awk whose source code is freely available. They are the Bell Labs awk, GNU awk, and mawk, by Michael Brennan. This section discusses the extensions that are common to two or more of them, and then looks at each version in detail and describes how to obtain it.

11.2.3. GNU awk (gawk)

The Free Software Foundation GNU project's version of awk, gawk, implements all the features of the POSIX awk, and many more. It is perhaps the most popular of the freely available implementations; gawk is used on Linux systems, as well as various other freely available UNIX-like systems, such as NetBSD and FreeBSD.

Source code for gawk is available via anonymous FTP [75] to the host ftp.gnu.org. It is in the file ftp://ftp.gnu.org/gnu/gawk/gawk-3.0.4.tar.gz (there may be a later version there by the time you read this). This is a tar file compressed with the gzip program, whose source code is available in the same directory. There are many sites worldwide that "mirror" the files from the main GNU distribution site; if you know of one close to you, you should get the files from there. Be sure to use "binary" or "image" mode to transfer the file(s).

[75]If you don't have Internet access and wish to get a copy of gawk, contact the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 U.S.A. The telephone number is 617-542-5942, and the fax number is 617-542-2652.

Besides the common extensions listed earlier, gawk has a number of additional features. We examine them in this section.

11.2.3.1. Command line options

Gawk has several very useful command-line options. As in most GNU programs, these options are spelled out and begin with two dashes, "--".

--lint and --lint-old cause gawk to check your program, both at parse-time and at run-time, for constructs that are dubious or nonportable to other versions of awk. The --lint-old option warns about constructs that are not portable to the original version of awk. It is separate from --lint, since most systems now have some version of new awk.
--traditional disables GNU-specific extensions, such as the time functions and gensub() (see below). With this option, gawk is intended to behave the same as the Bell Labs awk.
--re-interval enables full POSIX regular expression matching, by allowing gawk to recognize interval expressions (such as "/stuff{1,3}/").
--posix disables all extensions that are not specified in the POSIX standard. This option also turns on recognition of interval expressions.

There are a number of other options that are less important for everyday programming and script portability; see the gawk documentation for details.

Although POSIX awk allows you to have multiple instances of the -f option, there is no easy way to use library functions from a command-line program. The --source option in gawk makes this possible.

gawk --source 'script' -f mylibs.awk file1 file2

This example runs the program in script, which can use awk functions from the file mylibs.awk. The input data comes from file1 and file2.

11.2.3.2. An awk program search path

Gawk allows you to specify an environment variable named AWKPATH that defines a search path for awk program files. By default, it is defined to be .:/usr/local/share/awk. Thus, when a filename is specified with the -f option, the two default directories will be searched, beginning with the current directory. Note that if the filename contains a "/", then no search is performed.

For example, if mylibs.awk were a file of awk functions in /usr/local/share/awk, and myprog.awk were a program in the current directory, we would run gawk like this:

gawk -f myprog.awk -f mylibs.awk datafile1

Gawk would find each file in the appropriate place. This makes it much easier to have and use awk library functions.

11.2.3.3. Line continuation

Gawk allows you to break lines after either a "?" or ":". You can also continue strings across newlines using a backslash.

$ gawk 'BEGIN { print "hello, \
> world" }'
hello, world

11.2.3.4. Extended regular expressions

Gawk provides several additional regular expression operators. These are common to most GNU programs that work with regular expressions. The extended operators are listed in Table 11.5.

Table 11.5. Gawk Extended Regular Expressions

Special Operators	Usage
\w	Matches any word-constituent character (a letter, digit, or underscore).
\W	Matches any character that is not word-constituent.
\<	Matches the empty string at the beginning of a word.
\>	Matches the empty string at the end of a word.
\y	Matches the empty string at either the beginning or end of a word (the word boundary). Other GNU software uses "\b", but that was already taken.
\B	Matches the empty string within a word.
\`	Matches the empty string at the beginning of a buffer. In awk, a buffer is the same as a string, so this operator is equivalent to ^. It is provided for compatibility with GNU Emacs and other GNU software.
\'	Matches the empty string at the end of a buffer. In awk, a buffer is the same as a string, so this operator is equivalent to $. It is provided for compatibility with GNU Emacs and other GNU software.

You can think of "\w" as a shorthand for the (POSIX) notation [[:alnum:]_] and "\W" as a shorthand for [^[:alnum:]_]. The following table gives examples of what the middle four operators match, borrowed from Effective AWK Programming.

Table 11.6. Examples of gawk Extended Regular Expression Operators

Expression	Matches	Does Not Match
\<away	away	stowaway
stow\>	stow	stowaway
\yballs?\y	ball or balls	ballroom or baseball
\Brat\B	crate	dirty rat

11.2.3.5. Regular expression record terminators

Besides allowing RS to be a regular expression, gawk sets the variable RT (record terminator) to the actual input text that matched the value of RS.

Here is a simple example, due to Michael Brennan, that shows the power of gawk's RS and RT variables. As we have seen, one of the most common uses of sed is its substitute command (s/old/new/g). By setting RS to the pattern to match, and ORS to the replacement text, a simple print statement can print the unchanged text followed by the replacement text.

$ cat simplesed.awk
# simplesed.awk --- do s/old/new/g using just print
#    Thanks to Michael Brennan for the idea
#
# NOTE! RS and ORS must be set on the command line
{
    if (RT == "")
        printf "%s", $0
    else
        print
}

There is one wrinkle; at end of file, RT will be empty, so we use a printf statement to print the record.[76] We could run the program like this.

[76]See Effective AWK Programming [Robbins], Section 16.2.8, for an elaborate version of this program.

$ cat simplesed.data
"This OLD house" is a great show.
I like shopping for old things at garage sales.
$ gawk -f simplesed.awk RS="old|OLD" ORS="brand new" simplesed.data
"This brand new house" is a great show.
I like shopping for brand new things at garage sales.

11.2.3.6. Separating fields

Besides the regular way that awk lets you split the input into records and the record into fields, gawk gives you some additional capabilities.

First, as mentioned above, if the value of FS is the empty string, then each character of the input record becomes a separate field.

Second, the special variable FIELDWIDTHS can be used to split out data that occurs in fixed-width columns. Such data may or may not have whitespace separating the values of the fields.

FIELDWIDTHS = "5 6 8 3"

Here, the record has four fields: $1 is five characters wide, $2 is six characters wide, and so on. Assigning a value to FIELDWIDTHS causes gawk to start using it for field splitting. Assigning a value to FS causes gawk to return to the regular field splitting mechanism. Use FS = FS to make this happen without having to save the value of FS in an extra variable.

This facility would be of most use when working with fixed-width field data, where there may not be any whitespace separating fields, or when intermediate fields may be all blank.

11.2.3.7. Additional special files

Gawk has a number of additional special filenames that it interprets internally. All of the special filenames are listed in Table 11.7.

Table 11.7. Gawk's Special Filenames

Filename	Description
/dev/stdin	Standard input.
/dev/stdout	Standard output.
/dev/stderr	Standard error.
/dev/fd/n	The file referenced as file descriptor n.
Obsolete Filename	Description
/dev/pid	Returns a record containing the process ID number.
/dev/ppid	Returns a record containing the parent process ID number.
/dev/pgrpid	Returns a record containing the process group ID number.
/dev/user	Returns a record with the real and effective user IDs, the real and effective group IDs, and if available, any secondary group IDs.

The first three were described earlier. The fourth filename provides access to any open file descriptor that may have been inherited from gawk's parent process (usually the shell). You can use file descriptor 0 for standard input, 1 for standard output, and 2 for standard error.

The second group of special files, labeled "obsolete," has been in gawk for a while but is being phased out. These files will be replaced by a PROCINFO array, whose subscripts are the desired items and whose element values are the associated values.

For example, you would use PROCINFO["pid"] to get the current process ID, instead of using getline pid < "/dev/pid". Check the gawk documentation to see if PROCINFO is available and if these filenames are still supported.

11.2.3.8. Additional variables

Gawk has several more system variables. They are listed in Table 11.8.

Table 11.8. Additional gawk System Variables

Variable	Description
ARGIND	The index in ARGV of the current input file.
ERRNO	A message describing the error if getline or close() fails.
FIELDWIDTHS	A space-separated list of numbers describing the widths of the input fields.
IGNORECASE	If non-zero, pattern matches and string comparisons are case-independent.
RT	The value of the input text that matched RS.

We have already seen the record terminator variable, RT, so we'll proceed to the other variables that we haven't covered yet.

All pattern matching and string comparison in awk is case sensitive. Gawk introduced the IGNORECASE variable so that you can specify that regular expressions be interpreted without regard for upper- or lowercase characters. Beginning with version 3.0 of gawk, string comparisons can also be done without case sensitivity.

The default value of IGNORECASE is zero, which means that pattern matching and string comparison are performed the same as in traditional awk. If IGNORECASE is set to a non-zero value, then case distinctions are ignored. This applies to all places where regular expressions are used, including the field separator FS, the record separator RS, and all string comparisons. It does not apply to array subscripting.

Two more gawk variables are of interest. ARGIND is set automatically by gawk to be the index in ARGV of the current input file name. This variable gives you a way to track how far along you are in the list of filenames.

Finally, if an error occurs doing a redirection for getline or during a close(), gawk sets ERRNO to a string describing the error. This makes it possible to provide descriptive error messages when something goes wrong.

11.2.3.9. Additional functions

Gawk has one additional string function, and two functions for dealing with the current date and time. They are listed in Table 11.9.

Table 11.9. Additional gawk Functions

Gawk Function	Description
gensub(r, s, h, t)	If h is a string starting with g or G, globally substitutes s for r in t. Otherwise, h is a number, and only the h'th occurrence is replaced. Returns the new value; t is unchanged. If t is not supplied, it defaults to $0.
systime()	Returns the current time of day in seconds since the Epoch (midnight UTC, January 1, 1970).
strftime(format, timestamp)	Formats timestamp (of the same form returned by systime()) according to format. If timestamp is omitted, the current time is used. If format is also omitted, a default format is used whose output is similar to that of the date command.

11.2.3.10. A general substitution function

The 3.0 version of gawk introduced a new general substitution function, named gensub(). The sub() and gsub() functions have some problems.

You can change either the first occurrence of a pattern or all the occurrences of a pattern. There is no way to change, say, only the third occurrence of a pattern but not the ones before it or after it.

Both sub() and gsub() change the actual target string, which may be undesirable.

It is impossible to get sub() and gsub() to emit a literal backslash followed by the matched text, because an ampersand preceded by a backslash is never replaced.[77]

There is no way to get at parts of the matched text, analogous to the \(...\) construct in sed.

[77]A full discussion is given in Effective AWK Programming [Robbins], Section 12.3. The details are not for the faint of heart.

For all these reasons, gawk introduced the gensub() function. The function takes at least three arguments. The first is a regular expression to search for. The second is the replacement string. The third is a flag that controls how many substitutions should be performed. The fourth argument, if present, is the original string to change. If it is not provided, the current input record ($0) is used.

The pattern can have subpatterns delimited by parentheses. For example, it can have "/(part) (one|two|three)/". Within the replacement string, a backslash followed by a digit n represents the text that matched the nth subpattern.

$ echo part two | gawk '{ print gensub(/(part) (one|two|three)/, "\\2", "g") }'
two

The flag is either a string beginning with g or G, in which case the substitution happens globally, or it is a number n, indicating that only the nth occurrence should be replaced.

$ echo a b c a b c a b c | gawk '{ print gensub(/a/, "AA", 2) }'
a b c AA b c a b c

The fourth argument is the string in which to make the change. Unlike sub() and gsub(), the target string is not changed. Instead, the new string is the return value from gensub().

$ gawk '
BEGIN { old = "hello, world"
        new = gensub(/hello/, "goodbye", 1, old)
        printf("<%s>, <%s>\n", old, new)
}'
<hello, world>, <goodbye, world>

11.2.3.11. Time management for programmers

Awk programs are very often used for processing the log files produced by various programs. Often, each record in a log file contains a timestamp, indicating when the record was produced. For both conciseness and precision, the timestamp is written as the result of the UNIX time(2) system call, which is the number of seconds since midnight, January 1, 1970 UTC. (This date is often referred to as "the Epoch.") To make it easier to generate and process log file records with these kinds of timestamps in them, gawk has two functions, systime() and strftime().

The systime() function is primarily intended for generating timestamps to go into log records. Suppose, for example, that we use an awk script to respond to CGI queries to our WWW server. We might log each query to a log file.

{
...
printf("%s:%s:%d\n", User, Host, systime()) >> "/var/log/cgi/querylog"
...
}

Such a record might look like

arnold:some.domain.com:831322007

The strftime() function [78] makes it easy to turn timestamps into human-readable dates. The format string is similar to the one used by sprintf(); it consists of literal text mixed with format specifications for different components of date and time.

[78]This function is patterned after the function of the same name in ANSI C.

$ gawk 'BEGIN { print strftime("Today is %A, %B %d, %Y") }'
Today is Sunday, May 05, 1996

The list of available formats is quite long. See your local strftime(3) manpage, and the gawk documentation for the full list. Our hypothetical CGI log file might be processed by this program:

# cgiformat --- process CGI logs
# data format is user:host:timestamp
#1
BEGIN {	FS = ":"; SUBSEP = "@" }

#2
{
# make data more obvious
	user = $1; host = $2; time = $3
# store first contact by this user
	if (! ((user, host) in first))
		first[user, host] = time
# count contacts
	count[user, host]++
# save last contact
	last[user, host] = time
}

#3
END {
# print the results
	for (contact in count) {
		i = strftime("%y-%m-%d %H:%M", first[contact])
		j = strftime("%y-%m-%d %H:%M", last[contact])
		printf "%s -> %d times between %s and %s\n",
			contact, count[contact], i, j
	}
}

The first step is to set FS to ":" to split the fields correctly. We also use a neat trick and set the subscript separator to "@", so that the arrays become indexed by "user@host" strings.

In the second step, we look to see if this is the first time we've seen this user. If so (they're not in the first array), we add them. Then we increment the count of how many times they've connected. Finally we store this record's timestamp in the last array. This element keeps getting overwritten each time we see a new connection by the user. That's OK; what we will end up with is the last (most recent) connection stored in the array.

The END procedure formats the data for us. It loops through the count array, formatting the timestamps in the first and last arrays for printing. Consider a log file with the following records in it.

$ cat /var/log/cgi/querylog
arnold:some.domain.com:831322007
mary:another.domain.org:831312546
arnold:some.domain.com:831327215
mary:another.domain.org:831346231
arnold:some.domain.com:831324598

Here's what running the program produces:

$ gawk -f cgiformat.awk /var/log/cgi/querylog
mary@another.domain.org -> 2 times between 96-05-05 12:09 and 96-05-05 21:30
arnold@some.domain.com -> 3 times between 96-05-05 14:46 and 96-05-05 15:29
