Up to this point, we've
shown you tools to do basic batch editing of text files. These tools,
although powerful, have limitations: you can script
ex commands, but the range of text manipulation is
quite limited. If you need more powerful and flexible batch editing
tools, you need to look at programming languages that are designed
for text manipulation. One of the earliest Unix languages to do this
is awk, created by Al Aho, Peter Weinberger, and
Brian Kernighan. Even if you've never programmed
before, there are some simple but powerful ways that you can use
awk. Whenever you have a text file
that's arranged in columns from which you need to
extract data, awk should come to mind.
For example, every Red Hat Linux system stores its version number in
/etc/redhat-release. On my system, it looks like
this:
Red Hat Linux release 7.1 (Seawolf)
When applying new RPM files to your system, it is often helpful to
know which Red Hat version you're using. On the
command line, you can retrieve just that number with:
awk '{print $5}' /etc/redhat-release
What's going on here? By default,
awk splits each input line into fields on
whitespace, as is explained below. In effect, it's
like you are looking at one row of a spreadsheet. In spreadsheets,
columns are usually named with letters. In awk,
columns are numbered and you can see only one row (that is, one line
of input) at a time. The Red Hat version number is in the fifth
column. Similar to the way shells use $ for
variable interpolation, the values of columns in
awk are retrieved using variables that start with
$ and are followed by an integer.
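To see the column numbering at work without a Red Hat system at hand, here is a minimal sketch that feeds the sample line from above to awk through a pipe. The variable names (line, version, count) are made up for this illustration.

```shell
# Sample line copied from the text; the pipe stands in for /etc/redhat-release.
line='Red Hat Linux release 7.1 (Seawolf)'

# $5 is the fifth whitespace-separated field.
version=$(printf '%s\n' "$line" | awk '{ print $5 }')

# NF holds the number of fields in the current line.
count=$(printf '%s\n' "$line" | awk '{ print NF }')

echo "version=$version fields=$count"
```

The line has six fields, and the fifth one is the version number.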
As you can guess, this is a fairly simple demonstration of
awk, which includes support for regular
expressions, branching and looping, and subroutines. For a more
complete reference on using awk, see
Effective awk Programming or sed &
awk Pocket Reference, both published by
O'Reilly.
Since there are many flavors of
awk, such as nawk and
gawk (Section 18.11), this article tries to
provide a usable reference for the most common elements of the
language. Dialect differences, when they occur, are noted. With the
exception of array subscripts, values in
[brackets] are optional;
don't type the [ or
].
You can specify a script directly on the
command line, or you can store a script in a
scriptfile and specify it with
-f. In most versions, the -f option
can be used multiple times. The variable
var can be assigned a value on the command
line. The value can be a literal, a shell variable
($name), or a command
substitution
(`cmd`),
but the value is available only after a line of input is read (i.e.,
after the BEGIN statement). awk operates on one or
more file(s). If none are specified (or if
- is specified), awk reads from
the standard input (Section 43.1).
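The following sketch shows both ways of supplying a script, plus a command-line variable assignment. The temporary directory, the file names data.txt and second.awk, and the variable name prefix are all invented for the illustration.

```shell
# Hypothetical temporary files, used only for this illustration.
tmpdir=$(mktemp -d)
printf '%s\n' 'a b' 'c d' > "$tmpdir/data.txt"

# The same one-line program, stored in a script file.
printf '%s\n' '{ print prefix, $2 }' > "$tmpdir/second.awk"

# Script given directly on the command line; prefix is set by var=value.
inline=$(awk '{ print prefix, $2 }' prefix=row "$tmpdir/data.txt")

# Script read from a file with -f.
fromfile=$(awk -f "$tmpdir/second.awk" prefix=row "$tmpdir/data.txt")

# A command-line assignment is not yet visible inside BEGIN.
early=$(awk 'BEGIN { print "[" prefix "]" }' prefix=row "$tmpdir/data.txt")

printf '%s\n' "$inline" "$fromfile" "$early"
rm -rf "$tmpdir"
```

Both invocations print the same two lines, while the BEGIN-only program prints empty brackets because prefix has not been assigned yet.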
The other recognized options are:
-Fc
Set the field separator to character c.
This is the same as setting the system variable
FS. nawk allows
c to be a regular
expression (Section 32.4). Each record (by
default, one input line) is divided into fields by whitespace (blanks
or tabs) or by some other user-definable field separator. Fields are
referred to by the variables $1,
$2, . . .
$n.
$0 refers to the entire record. For example, to
print the first three (colon-separated) fields of each line of file on separate lines:
awk -F: '{ print $1; print $2; print $3 }' file
-v var=value
Assign a value to variable
var. This allows assignment before the
script begins execution. (Available in nawk only.)
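Here is a small sketch of the two options in use. The record is made-up data in the style of /etc/passwd, and the variable name who is hypothetical. The -v form assumes a nawk-compatible awk.

```shell
# A made-up colon-separated record in the style of /etc/passwd.
record='alice:x:1001:1001:Alice Example:/home/alice:/bin/sh'

# -F: makes the colon the field separator.
fields=$(printf '%s\n' "$record" | awk -F: '{ print $1; print $3; print $6 }')

# -v assigns before BEGIN runs, so the value is already visible there.
greeting=$(awk -v who=world 'BEGIN { print "hello, " who }')

printf '%s\n' "$fields" "$greeting"
```

Compare the second command with a plain var=value argument, which would leave who empty inside BEGIN.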
20.10.2. Patterns and Procedures
awk scripts consist of patterns and
procedures:
pattern { procedure }
Both are optional. If pattern is missing,
{procedure}
is applied to all records. If
{procedure}
is missing, the matched record is written to the standard output.
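A quick way to convince yourself of these defaults is to run the three combinations against the same input. The fruit data below is invented for the sketch.

```shell
# Made-up input: a name and a count on each line.
input=$(printf '%s\n' 'apple 3' 'banana 12' 'cherry 7')

# Procedure only: runs for every record.
names=$(printf '%s\n' "$input" | awk '{ print $1 }')

# Pattern only: matching records are printed unchanged.
big=$(printf '%s\n' "$input" | awk '$2 > 5')

# Pattern and procedure together.
both=$(printf '%s\n' "$input" | awk '$2 > 5 { print $1 }')

printf '%s\n' "$names" "$big" "$both"
```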
20.10.2.1. Patterns
pattern
can be any of the following:
/regular expression/
relational expression
pattern-matching expression
BEGIN
END
Expressions can be composed of quoted strings, numbers, operators,
functions, defined variables, and any of the predefined variables
described later in Section 20.10.3.
Regular expressions use the extended set of metacharacters, as
described in Section 32.15. In addition,
^ and $ (Section 32.5) can be used to refer to the beginning and end
of a field, respectively, rather than the beginning and end of a
record (line).
Relational expressions use the relational operators listed in Section 20.10.4 later in this
article. Comparisons can be either string or numeric. For example,
$2>$1
selects records for which the second field is greater than the first.
Pattern-matching expressions use the operators ~
(match) and !~ (don't match). See
Section 20.10.4 later in
this article.
The BEGIN pattern lets you specify procedures
that will take place before the first input
record is processed. (Generally, you set global variables here.)
The END pattern lets you specify procedures that will take place
after the last input record is read.
Except for BEGIN and END, patterns can be combined with the
Boolean operators
|| (OR), &&
(AND), and ! (NOT). A range of lines can also be
specified using comma-separated patterns:
pattern,pattern
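The sketch below tries a range pattern, a Boolean combination, and a BEGIN/END pair on the same made-up input. The marker words START and STOP are arbitrary choices for the illustration.

```shell
# Made-up input with two marker lines.
input=$(printf '%s\n' 'header' 'START' 'one' 'two' 'STOP' 'footer')

# Range pattern: from a record matching START through the next one matching STOP.
range=$(printf '%s\n' "$input" | awk '/START/,/STOP/')

# Boolean combination: not a marker line, and longer than three characters.
combo=$(printf '%s\n' "$input" | awk '!/START|STOP/ && length($0) > 3')

# BEGIN runs before any input is read, END after the last record.
framed=$(printf '%s\n' "$input" | awk 'BEGIN { n = 0 } { n++ } END { print n, "records" }')

printf '%s\n' "$range" "$combo" "$framed"
```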
20.10.2.2. Procedures
procedure
can consist of one or more commands, functions, or variable
assignments, separated by newlines or semicolons
(;), and contained within curly braces
({}).
Commands fall into four groups:
Variable or array assignments
Printing commands
Built-in functions
Control-flow commands
20.10.2.3. Simple pattern-procedure examples
Print the first field of each line:
{ print $1 }
Print all lines that contain pattern:
/pattern/
Print first field of lines that contain pattern:
/pattern/{ print $1 }
Print records containing more than two fields:
NF > 2
Interpret input records as a group of lines up to a blank line:
BEGIN { FS = "\n"; RS = "" }
Print fields 2 and 3 in switched order, but only on lines whose first
field matches the string URGENT:
$1 ~ /URGENT/ { print $3, $2 }
Count and print the number of lines that contain pattern:
/pattern/ { ++x }
END { print x }
Add numbers in second column and print total:
{ total += $2 }
END { print "column total is", total }
Print lines that contain fewer than 20 characters:
length($0) < 20
Print each line that begins with Name: and that
contains exactly seven fields:
NF == 7 && /^Name:/
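Two of the examples above are easier to follow with some input to run them on. This sketch uses invented data: a pair of blank-line-separated address records, and a short list of expenses. Setting RS to the empty string is what makes a blank line end a record.

```shell
# Two made-up address records separated by a blank line.
records=$(printf 'alice\n555-1234\n\nbob\n555-9876\n' |
    awk 'BEGIN { FS = "\n"; RS = "" } { print NR ": " $1 " has " NF " lines" }')

# Sum the second column of made-up data.
total=$(printf '%s\n' 'rent 700' 'food 250' 'misc 50' |
    awk '{ total += $2 } END { print "column total is", total }')

printf '%s\n' "$records" "$total"
```

With these settings each line of a record becomes one field, so $1 is the name and NF counts the lines in the record.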
20.10.3. awk System Variables
nawk supports all awk variables. gawk supports both the
nawk and awk variables.
Version  Variable  Description
awk      FILENAME  Current filename
         FS        Field separator (default is whitespace)
         NF        Number of fields in current record
         NR        Number of the current record
         OFMT      Output format for numbers (default is %.6g)
         OFS       Output field separator (default is a blank)
         ORS       Output record separator (default is a newline)
         RS        Record separator (default is a newline)
         $0        Entire input record
         $n        nth field in current record; fields are separated by FS
nawk     ARGC      Number of arguments on command line
         ARGV      An array containing the command-line arguments
         ENVIRON   An associative array of environment variables
         FNR       Like NR, but relative to the current file
         RSTART    First position in the string matched by match function
         RLENGTH   Length of the string matched by match function
         SUBSEP    Separator character for array subscripts (default is \034)
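As a rough illustration of how several of these variables behave together, the sketch below reads two small files and prints one summary line per record. The file names one.txt and two.txt and their contents are invented. FNR requires a nawk-compatible awk.

```shell
# Hypothetical temporary files, used only for this illustration.
tmpdir=$(mktemp -d)
printf '%s\n' 'a b c' 'd e' > "$tmpdir/one.txt"
printf '%s\n' 'f g h i' > "$tmpdir/two.txt"

# OFS separates the comma-separated print arguments; $NF is the last field.
report=$(cd "$tmpdir" &&
    awk 'BEGIN { OFS = "|" } { print FILENAME, NR, FNR, NF, $NF }' one.txt two.txt)

printf '%s\n' "$report"
rm -rf "$tmpdir"
```

NR keeps counting across files, while FNR starts again at 1 for two.txt.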
20.10.4. Operators
This table lists the operators available in awk, in order of
increasing precedence.
Symbol               Meaning
= += -= *= /= %= ^=  Assignment (^= only in nawk and gawk)
?:                   C conditional expression (nawk and gawk)
||                   Logical OR
&&                   Logical AND
~ !~                 Match regular expression and negation
< <= > >= != ==      Relational operators
(blank)              Concatenation
+ -                  Addition, subtraction
* / %                Multiplication, division, and modulus
+ - !                Unary plus and minus, and logical negation
^                    Exponentiation (nawk and gawk)
++ --                Increment and decrement, either prefix or postfix
$                    Field reference
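A few one-line experiments make the precedence order easier to remember. This is only a sketch; the variable name ans is made up, and the ^ and ?: lines assume a nawk-compatible awk.

```shell
out=$(awk 'BEGIN {
    print 2 + 3 * 4          # * binds tighter than +
    print 1 " " 2 + 3        # + binds tighter than concatenation
    print -2 ^ 2             # ^ binds tighter than unary minus
    ans = (1 || 0 && 0) ? "yes" : "no"   # && binds tighter than ||
    print ans
}')

printf '%s\n' "$out"
```

When in doubt, add parentheses; they cost nothing and spare the next reader a trip to the table.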
20.10.5. Variables and Array Assignments
Variables can be assigned a value
with an equal sign (=). For example:
FS = ","
Expressions using the operators +,
-, *, /, and
% (modulus) can be assigned to variables.
Arrays can be created with the
split function (see below), or they can
simply be named in an assignment statement. Array elements can be
subscripted with numbers
(array[1], ..., array[n])
or with names (as associative arrays). For example, to
count the number of occurrences of a pattern, you could use the
following script:
/pattern/ { array["pattern"]++ }
END { print array["pattern"] }
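The same idea extends to counting many different strings at once by using each one as its own subscript. The sketch below counts words in made-up input; the array name seen is hypothetical, and the output is piped through sort because awk does not promise any particular order when walking an associative array.

```shell
counts=$(printf '%s\n' 'red blue red' 'green red blue' |
    awk '{ for (i = 1; i <= NF; i++) seen[$i]++ }
         END { for (word in seen) print word, seen[word] }' |
    sort)

printf '%s\n' "$counts"
```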
The following alphabetical list of statements and functions includes
all that are available in awk,
nawk, or gawk. Unless otherwise
mentioned, the statement or function is found in all versions. New
statements and functions introduced with nawk are
also found in gawk.
atan2
atan2(y,x)
Returns
the arctangent of
y/x in radians.
(nawk)
break
Exit from a while, for, or
do loop.
close
close(filename-expr)
close(command-expr)
In some implementations of awk, you can have only
ten files open simultaneously and one pipe; modern
versions allow more than one pipe open. Therefore,
nawk provides a close
statement that allows you to close a file or a pipe.
close takes as an argument the same expression
that opened the pipe or file. (nawk)
continue
Begin next iteration of while,
for, or do loop
immediately.
cos
cos(x)
Return cosine of x (in radians).
(nawk)
delete
delete array[element]
Delete element of
array. (nawk)
do
do body while (expr)
Looping statement. Execute statements in
body, then evaluate
expr. If expr
is true, execute body again. More than one
command must be put inside braces
({}). (nawk)
exit
exit[expr]
Do not execute remaining instructions and do not read new input.
END procedure, if any, will be
executed. The expr, if any, becomes
awk's exit
status (Section 34.12).
exp
exp(arg)
Return e raised to the power arg.
for
for ([init-expr]; [test-expr]; [incr-expr]) command
C-language-style looping construct. Typically,
init-expr assigns the initial value of a
counter variable. test-expr is a
relational expression that is evaluated each time before executing
the command. When
test-expr is false, the loop is exited.
incr-expr is used to increment the counter
variable after each pass. A series of
commands must be put within braces
({}). For example:
for (i = 1; i <= 10; i++)
printf "Element %d is %s.\n", i, array[i]
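As a further sketch of the same construct, a counter can just as well run downward. This one walks the fields of a made-up input line from last to first and prints them in reverse order.

```shell
# Count down from NF to 1; print a space between fields and a newline at the end.
reversed=$(printf '%s\n' 'one two three' |
    awk '{ for (i = NF; i >= 1; i--) printf "%s%s", $i, (i > 1 ? " " : "\n") }')

printf '%s\n' "$reversed"
```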
for
for (item in array) command
For each item in an associative
array, do
command. More than one
command must be put inside braces
({}). Refer to each element of the array as
array[item].
getline
getline [var] [<file]
or
command | getline [var]
Read next line of input. Original awk does not
support the syntax to open multiple input streams. The first form
reads input from file, and the second form
reads the standard output of a Unix
command. Both forms read one line at a
time, and each time the statement is executed, it gets the next line
of input. The line of input is assigned to $0 and
parsed into fields, setting NF. Plain getline also updates
NR and FNR, command | getline updates only NR, and
getline <file updates neither. If
var is specified, the result is assigned
to var and the $0 is
not changed. Thus, if the result is assigned to a variable, the
current line does not change. getline is
actually a function, and it returns 1 if it reads a record
successfully, 0 if end-of-file is encountered, and -1 if for some
reason it is otherwise unsuccessful. (nawk)
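Here is a minimal sketch of both forms, assuming a nawk-compatible awk. The temporary file extra.txt and the variable names greeting, line, last, and n are invented for the illustration. Note the close calls, which use the same expression that opened the pipe or file.

```shell
# Hypothetical temporary file, used only for this illustration.
tmpdir=$(mktemp -d)
printf '%s\n' 'first' 'second' > "$tmpdir/extra.txt"

out=$(awk -v file="$tmpdir/extra.txt" 'BEGIN {
    # command | getline var: read one line of command output into var.
    "echo hello" | getline greeting
    close("echo hello")

    # getline var < file: returns 1 for each line read, 0 at end-of-file.
    while ((getline line < file) > 0) {
        n++
        last = line
    }
    close(file)

    print greeting, n, last
}')

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

Testing the return value with > 0 matters: a loop written as while (getline line < file) would spin forever if the file could not be opened, because -1 counts as true.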
gsub
gsub(r,s[,t])
Globally substitute s for each match of
the regular expression
r in the string
t. Return the number of substitutions. If
t is not supplied, defaults to
$0. (nawk)
if
if (condition) command [else command]
If condition is true, do
command(s), otherwise do
command(s) in
else clause (if any).
condition can be an expression that uses
any of the relational operators
<, <=,
==, !=,
>=, or >, as well as the
pattern-matching operators
~ or !~ (e.g., if ($1
~ /[Aa].*[Zz]/)). A series of
commands must be put within braces
({}).
index
index(str,substr)
Return position of first substring substr
in string str, or 0 if not found.
int
int(arg)
Return the integer part of arg, truncating toward zero.
length
length(arg)
Return the length of arg.
log
log(arg)
Return the natural logarithm of arg.
match
match(s,r)
Function that matches the pattern, specified by the regular
expression r, in the
string s and returns either the position
in s where the match begins or 0 if no
occurrences are found. Sets the values of RSTART
and RLENGTH. (nawk)
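RSTART and RLENGTH are handy as arguments to substr when you want the matched text itself. A small sketch on a made-up input line, assuming a nawk-compatible awk:

```shell
out=$(printf '%s\n' 'order id=4217 shipped' | awk '{
    # match returns the starting position, or 0 when nothing matches.
    if (match($0, /[0-9]+/))
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
}')

printf '%s\n' "$out"
```

The first run of digits starts at character 10 and is 4 characters long.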
next
Read next input line and start new cycle through pattern/procedures
statements.
print
print [args]
[destination]
Print args on output, followed by a
newline. args is usually
one or more fields, but it may also be one or more of the predefined
variables -- or arbitrary expressions. If no
args are given, prints
$0 (the current input record). Literal strings
must be quoted. Fields are printed in the order they are listed. If
separated by commas (,) in the argument list, they are separated in
the output by the OFS character. If separated by
spaces, they are concatenated in the output.
destination is a Unix redirection or pipe
expression (e.g., >file) that redirects the default
standard output.
printf
printf format [,
expression(s)]
[destination]
Formatted print statement. Fields or variables can be formatted
according to instructions in the format
argument. The number of expressions must
correspond to the number specified in the format sections.
format follows the conventions of the
C-language printf statement. Here are a few of
the most common formats:
%s
A string.
%d
A decimal number.
%n.mf
A floating-point number, where n is the
minimum total field width (counting the decimal point and any sign)
and m is the number of digits after the decimal point.
%[-]nc
n specifies minimum field length for
format type c, while -
left-justifies value in field; otherwise value is right-justified.
format can also contain embedded escape
sequences: \n (newline) or \t
(tab) are the most common. destination is
a Unix redirection or pipe expression (e.g., >file) that redirects the default
standard output.
For example, using the following script:
{printf "The sum on line %s is %d.\n", NR, $1+$2}
and the following input line:
5 5
produces this output, followed by a newline:
The sum on line 1 is 10.
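The width and justification formats are easiest to see with visible delimiters around each value. This sketch uses a made-up input line and square brackets as markers.

```shell
# %-8s left-justifies in 8 columns, %8.2f right-justifies with 2 decimals,
# and %3d right-justifies an integer in 3 columns.
out=$(printf '%s\n' 'widget 3.14159 7' |
    awk '{ printf "[%-8s][%8.2f][%3d]\n", $1, $2, $3 }')

printf '%s\n' "$out"
```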
rand
rand( )
Generate a random number between 0 and 1. This function returns the same
series of numbers each time the script is executed, unless the random
number generator is seeded using the srand( )
function. (nawk)
return
return [expr]
Used at end of user-defined functions to exit the function,
returning value of expression
expr, if any. (nawk)
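A minimal sketch of a user-defined function that uses return, assuming a nawk-compatible awk. The function name max is made up for the illustration.

```shell
out=$(awk 'function max(a, b) { return (a > b) ? a : b }
           BEGIN { print max(3, 9), max("pear", "apple") }')

printf '%s\n' "$out"
```

The first call compares numbers; the second compares strings, so "pear" wins because it sorts after "apple".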
sin
sin(x)
Return
sine of x (in radians).
(nawk)
split
split(string,array[,sep])
Split string into elements of
array: array[1], ..., array[n].
string is split at each occurrence of
separator sep. (In
nawk, the separator may be a regular expression.)
If sep is not specified,
FS is used. The number of array elements created
is returned.
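For instance, a date string can be taken apart at its hyphens. The array name parts and the date itself are invented for this sketch.

```shell
out=$(awk 'BEGIN {
    # split returns the number of elements it created.
    n = split("2024-01-15", parts, "-")
    print n, parts[1], parts[2], parts[3]
}')

printf '%s\n' "$out"
```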
sprintf
sprintf (format [,
expression(s)])
Return the value of expression(s), using
the specified format (see
printf). Data is formatted but not printed.
sqrt
sqrt(arg)
Return square root of arg.
srand
srand([expr])
Use expr to
set a new seed for random number generator. Default is time of day.
Returns the old seed. (nawk)
sub
sub(r,s[,t])
Substitute s for first match of the
regular expression
r in the string
t. Return 1 if successful; 0 otherwise. If
t is not supplied, defaults to
$0. (nawk)
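The difference between sub and gsub shows up clearly when both are applied to copies of the same string. A sketch on made-up input, assuming a nawk-compatible awk; the variable names first, all, n1, and n2 are hypothetical.

```shell
out=$(printf '%s\n' 'a-b-c-d' | awk '{
    first = $0; all = $0
    n1 = sub(/-/, "+", first)    # replaces only the first match
    n2 = gsub(/-/, "+", all)     # replaces every match
    print first, n1
    print all, n2
}')

printf '%s\n' "$out"
```

Both functions change the target string in place and return a count, which is why the copies are made first.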
substr
substr(string,m[,n])
Return substring of string, beginning at
character position m and
consisting of the next n characters. If
n is omitted, include all characters to
the end of string.
system
system(command)
Function that executes the specified Unix
command and returns its
status
(Section 34.12). The status of the command that is
executed typically indicates its success (0) or failure (nonzero).
The output of the command is not available for processing within the
nawk script. Use
command|getline to read the output of the command into the
script. (nawk)
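A small sketch of checking the status that system returns, assuming a nawk-compatible awk and the standard true and false commands. The exact nonzero value for a failing command can vary, so the script only tests whether it differs from 0.

```shell
out=$(awk 'BEGIN {
    ok = system("true")      # succeeds, so the status is 0
    bad = system("false")    # fails, so the status is nonzero
    print ok, (bad != 0)
}')

printf '%s\n' "$out"
```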
tolower
tolower(str)
Translate all uppercase characters in str to
lowercase and return the new string. (nawk)
toupper
toupper(str)
Translate all lowercase characters in str
to uppercase and return the new string. (nawk)
while
while (condition)
command
Do command while
condition is true (see
if for a description of allowable conditions). A
series of commands must be put within braces
({}).
-- DG