Chapter 11 A Flock of awks
There are also several commercial versions of awk.
In this section, we review the ones that we know about.
Mortice Kern Systems (MKS) in Waterloo, Ontario (Canada)[9] supplies
awk as part of the MKS Toolkit for MS-DOS/Windows, OS/2, Windows 95,
and Windows NT.
The MKS version implements POSIX awk. It has the following extensions:
-
The exp(), int(), log(), sqrt(), tolower(), and toupper() functions
use $0 if given no argument.
-
An additional function ord() is available. This function takes
a string argument, and returns the numeric value of the first character
in the string. It is similar to the function of the same name in Pascal.
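If your awk lacks ord(), a rough stand-in can be written as a user-defined function. The following is a minimal sketch in portable awk, wrapped in a shell function for convenience; it is not MKS's implementation. The wrapper name ord_of and the -1 return value for an empty or non-ASCII argument are our own assumptions, and only 7-bit ASCII characters are handled.

```shell
# ord_of STRING: print the numeric code of the first character of STRING.
# Hypothetical helper that emulates an ord() function in portable awk.
ord_of() {
    awk -v s="$1" '
    # Return the ASCII code of the first character of str, or -1 if
    # str is empty or the character is outside the 7-bit ASCII range.
    function ord(str,    c, i) {
        c = substr(str, 1, 1)
        if (c == "")
            return -1
        # Linear search: compare against each character code in turn.
        for (i = 1; i < 128; i++)
            if (sprintf("%c", i) == c)
                return i
        return -1
    }
    BEGIN { print ord(s) }'
}

ord_of "Awk"    # prints 65
```

The linear search is fine for occasional use; a program that calls ord() many times would do better to build a lookup table once in a BEGIN procedure.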
Thompson Automation Software[10] makes a version of awk (tawk)[11]
for MS-DOS/Windows, Windows 95 and NT, and Solaris.
Tawk is interesting on several counts.
First, unlike other versions of awk, which are interpreters, tawk
is a compiler.
Second, tawk comes with a screen-oriented debugger, written in awk!
The source for the debugger is included.
Third, tawk allows you to link your compiled program with arbitrary
functions written in C.
Tawk has received rave reviews in the comp.lang.awk newsgroup.
Tawk comes with an awk interface that acts like POSIX awk,
compiling and running your program.
You can, however, compile your program into a standalone executable file.
The tawk compiler actually compiles into a compact intermediate form.
The intermediate representation is linked with a library that executes
the program when it is run, and it is at link time that other C routines
can be integrated with the awk program.
Tawk is a very full-featured implementation of awk. Besides implementing
the features of POSIX awk (based on new awk), it extends the language
in some fundamental ways, and also has a very large number of built-in
functions.
This section provides a "laundry list" of the new features in tawk.
A full treatment of them is beyond the scope of this book; the tawk
documentation does a nice job of presenting them.
By now, you should be familiar enough with awk that the value of
these features will be apparent.
Where relevant, we'll contrast the tawk feature with a comparable
feature in gawk.
-
Additional special patterns, INIT, BEGINFILE, and ENDFILE.
INIT is like BEGIN, but the actions in its procedure are run before[12]
those of the BEGIN procedure.
BEGINFILE and ENDFILE let you define per-file
start-up and clean-up actions.
Unlike a rule based on FNR == 1, these actions are executed even when
files are empty.
-
Controlled regular expressions. You can add a flag to a regular
expression ("/match me/") that tells tawk how to treat the regular expression.
An i flag ("/match me/i") indicates
that case should be ignored when doing matching.
An s flag indicates that the shortest possible
text should be matched, instead of the longest.
-
An abort [expr] statement. This is similar to exit,
except that tawk exits immediately, bypassing any END procedure.
The expr, if provided, becomes the return value from tawk to its
parent program.
-
True multidimensional arrays. Conventional awk simulates multidimensional
arrays by concatenating the values of the subscripts, separated by the
value of SUBSEP, to generate a (hopefully) unique index in a regular
associative array. While implementing this feature for compatibility, tawk
also provides true multidimensional arrays.

    a[1][1] = "hello"
    a[1][2] = "world"
    for (i in a[1])
        print a[1][i]

Multidimensional arrays guarantee that the indices will be unique, and
also have the potential for greater performance when the number of elements
gets to be very large.
-
Automatic sorting of arrays. When looping over every element of an array using
the for (item in array) construct, tawk will first sort the indices
of the array, so that array elements are processed in order. You can
control whether this sorting is turned on or off, and if on, whether the
sorting is numeric or alphabetic, and in ascending or descending order.
While the sorting incurs a performance penalty, it is likely to be less
than the overhead of sorting the array yourself using awk code, or piping
the results into an external invocation of sort.
-
Scope control for functions and variables.
You can declare that functions and variables are global to an
entire program, global within a "module" (source file), local to
a module, and local to a function. Regular awk only gives you
global variables, global functions, and extra function parameters,
which act as local variables.
This feature is a very nice one, making it much easier to write
libraries of awk functions without having to worry about variable names
inadvertently conflicting with those in other library functions or in
the user's main program.
-
RS can be a regular expression. This is similar to gawk and mawk;
however, the regular expression cannot be one that requires more than
one character of look-ahead. The text that matched RS is saved
in the variable RSM (record separator match), similar to
gawk's RT variable.
-
Describing fields, instead of the field separators.
The variable FPAT can be a regular expression that describes the
contents of the fields. Successive occurrences of text that matches
FPAT become the contents of the fields.
-
Controlling the implicit file processing loop.
The variable ARGI tracks the position in ARGV of the
current input data file. Unlike gawk's ARGIND variable,
assigning a value to ARGI can be used to make tawk skip over
input data files.
-
Fixed-length records. By assigning a value to the RECLEN variable,
you can make tawk read records in fixed-length chunks. If RS is
not matched within RECLEN characters, then tawk returns a record
that is RECLEN characters long.
-
Hexadecimal constants.
You can specify C-style hexadecimal constants (0xDEAD and 0xBEEF
being two rather famous ones) in tawk programs.
This helps when using the built-in bit manipulation functions
(see the next section).
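The SUBSEP simulation that conventional awk uses for multidimensional arrays, mentioned in the list above, can be tried in any POSIX awk. The sketch below is our own illustration, not tawk code; the function name subsep_demo and the array and variable names are hypothetical. It stores two elements with comma subscripts, recovers the subscripts with split(), and then shows why the generated index is only "hopefully" unique.

```shell
subsep_demo() {
    awk 'BEGIN {
        # a[1, 1] is shorthand for a[1 SUBSEP 1]: one flat array.
        a[1, 1] = "hello"
        a[1, 2] = "world"

        # Split each key on SUBSEP to recover the subscripts, and
        # count the elements whose first subscript is 1.
        n = 0
        for (key in a) {
            split(key, idx, SUBSEP)
            if (idx[1] == 1)
                n++
        }
        print n " elements in row 1"

        # Test for a subscript pair without creating the element.
        if ((1, 2) in a)
            print a[1, 2]

        # Two different subscript pairs can yield the same key when
        # a subscript itself contains SUBSEP.
        left = "x" SUBSEP "y"
        right = "y" SUBSEP "z"
        c[left, "z"] = 1
        if (("x", right) in c)
            print "collision"
    }'
}

subsep_demo
```

The last test succeeds even though the pair ("x", right) was never stored, which is exactly the ambiguity that true multidimensional arrays avoid.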
Whew! That's a rather long list, but these features bring
additional power to programming in awk.
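One point from the list, that a rule based on FNR == 1 never fires for an empty file, is easy to check with any awk. The sketch below is our own; the function name count_file_starts and the temporary file names are hypothetical. It counts how often an FNR == 1 rule fires when one of three input files is empty.

```shell
# count_file_starts FILE...: print how many times an FNR == 1 rule
# fires across the given input files.
count_file_starts() {
    awk 'FNR == 1 { starts++ } END { print starts + 0 }' "$@"
}

dir=$(mktemp -d)
printf 'one\n' > "$dir/a"
: > "$dir/empty"                # an empty file contributes no records
printf 'two\n' > "$dir/b"
count_file_starts "$dir/a" "$dir/empty" "$dir/b"    # prints 2
rm -r "$dir"
```

Three files are named, but only two start-up actions run, so any per-file bookkeeping tied to FNR == 1 silently skips the empty file.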
Besides extending the language, tawk provides a large number of
additional built-in functions.
Here is another "laundry list," this time of the different classes of
functions available. Each class has two or more functions associated
with it. We'll briefly describe the functionality of each class.
-
Extended string functions. Extensions to the standard string functions
and new string functions
allow you to match and substitute for subpatterns within patterns
(similar to gawk's gensub() function), assign to substrings within
strings, and split a string into an array based on a pattern that
matches elements, instead of the separator. There are additional
printf formats, and string translation functions.
While undoubtedly some of these functions could be written as
user-defined functions, having them built in provides greater performance.
-
Bit manipulation functions. You can perform bitwise AND, OR, and XOR
operations on (integer) values.
These could also be written as user-defined functions, but with a loss
of performance.
-
More I/O functions. There is a suite of functions modeled after those
in the stdio(3) library. In particular, the ability to seek within
a file, and do I/O in fixed-size amounts, is quite useful.
-
Directory operation functions. You can make, remove, and change directories,
as well as remove and rename files.
-
File information functions. You can retrieve file permissions, size, and
modification times.
-
Directory reading functions. You can get the current directory name,
as well as read a list of all the filenames in a directory.
-
Time functions. There are functions to retrieve the current time of day,
and format it in various ways. These functions are not quite as flexible
as gawk's strftime() function.
-
Execution functions. You can sleep for a specific amount of time, and
start other programs running. Tawk's spawn() function is
interesting because it allows you to provide values for the new
program's environment, and also indicate whether the program should or
should not run asynchronously.
This is particularly valuable on non-UNIX systems, where the command
interpreters (such as MS-DOS's command.com) are quite limited.
-
File locking. You can lock and unlock files and ranges within files.
-
Screen functions. You can do screen-oriented I/O. Under UNIX,
these functions are implemented on top of the curses(3) library.
-
Packing and unpacking of binary data. You can specify how binary data
structures are laid out. This, together with the new I/O functions, makes
it possible to do binary I/O, something you would normally have to do in C
or C++.
-
Access to internal state. You can get or set the value of any awk variable
through function calls.
-
Access to MS-DOS low-level facilities. You can use system interrupts,
and peek and poke values at memory addresses. These features are
obviously for experts only.
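To illustrate the remark that the bit manipulation functions could be written as user-defined functions, here is a minimal sketch of bitwise AND in portable awk. It is our own illustration, not tawk's built-in; the names bit_and and bitand are hypothetical, and the arguments are assumed to be non-negative integers small enough to be represented exactly.

```shell
# bit_and X Y: print the bitwise AND of two non-negative integers.
bit_and() {
    awk -v x="$1" -v y="$2" '
    # Bitwise AND of two non-negative integers, built up one bit at
    # a time from the least significant end.
    function bitand(a, b,    result, place) {
        result = 0
        place = 1
        while (a > 0 && b > 0) {
            if (a % 2 == 1 && b % 2 == 1)
                result += place
            a = int(a / 2)
            b = int(b / 2)
            place *= 2
        }
        return result
    }
    BEGIN { print bitand(x, y) }'
}

bit_and 12 10    # prints 8
```

The loop runs once per bit, which is where the loss of performance relative to a built-in comes from.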
From this list, it becomes clear that tawk provides a nice alternative
to C and to Perl for serious programming tasks.
As an example, the
screen functions and internal state functions are used to
implement the tawk debugger in awk.
Videosoft[13] sells software called VSAwk that brings awk-style
programming into the Visual Basic environment.
VSAwk is a Visual Basic control that works in an event-driven fashion.
Like awk, VSAwk gives you startup and cleanup actions, splits
the input record into fields, and lets you write
expressions and call the awk built-in functions.
VSAwk resembles UNIX awk mostly in its data processing model, not
its syntax.
Nevertheless, it's interesting to see how people apply the concepts
from awk to the environment provided by a very different language.