B.2. Language Summary for awk
This section summarizes how awk processes input records and
describes the various syntactic elements that make up an awk program.
B.2.3. Patterns
A pattern can be any of the following:
/regular expression/
relational expression
BEGIN
END
pattern, pattern
Regular expressions use the extended set of metacharacters and must be
enclosed in slashes. For a full discussion of regular expressions,
see Chapter 3, "Understanding Regular Expression Syntax".
Relational expressions use the relational operators
listed under "Expressions" later in this chapter. The BEGIN pattern is applied before the first line
of input is read and the END pattern is applied
after the last line of input is read. Use ! to negate the match; i.e., to handle
lines not matching the pattern. You can address a range of lines, just as in sed:
pattern, pattern
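As a minimal sketch of these pattern types, the following shell commands run awk over five made-up input lines. The sample data and the shell variable names (data, out_regex, and so on) are assumptions for illustration only. A pattern with no action prints each matching line.

```shell
# Hypothetical sample input: a word and a number on each line.
data='alpha 1
beta 2
gamma 3
delta 4
epsilon 5'

# Regular expression pattern: lines containing "ta".
out_regex=$(printf '%s\n' "$data" | awk '/ta/')

# Relational expression pattern: second field greater than 3.
out_rel=$(printf '%s\n' "$data" | awk '$2 > 3')

# Negated match: lines that do NOT contain "ta".
out_not=$(printf '%s\n' "$data" | awk '!/ta/')

# Range pattern: from the first line matching /beta/
# through the next line matching /delta/.
out_range=$(printf '%s\n' "$data" | awk '/beta/, /delta/')

# BEGIN runs before any input is read; END runs after the last line.
out_begin_end=$(printf '%s\n' "$data" |
    awk 'BEGIN { print "start" } END { print "lines:", NR }')

printf '%s\n' "$out_regex" "$out_rel" "$out_not" "$out_range" "$out_begin_end"
```

Here the regular expression selects the beta and delta lines, the relational expression selects the delta and epsilon lines, and the range selects beta through delta.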
Patterns, except BEGIN and END,
can be expressed in compound forms using the following operators:
&& | Logical And |
|| | Logical Or |
Sun's version of nawk (SunOS 4.1.x) does not support treating regular
expressions as parts of a larger Boolean expression. For example,
"/cute/ && /sweet/" and "/fast/ || /quick/"
do not work in that version.
In addition, the C conditional operator ?:
(pattern ? pattern : pattern) may be used in a pattern.
Patterns can be placed in parentheses to ensure proper evaluation.
BEGIN and END patterns must be
associated with actions. If multiple BEGIN and
END rules are written, they are merged into a
single rule before being applied.
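The compound forms can be sketched as follows, assuming an awk that accepts regular expressions inside Boolean expressions (unlike the SunOS 4.1.x nawk noted above). The four input lines and the shell variable names are made up for illustration.

```shell
# Hypothetical sample input.
data='cute sweet
cute sour
fast car
slow car'

# Logical AND: both regular expressions must match the line.
out_and=$(printf '%s\n' "$data" | awk '/cute/ && /sweet/')

# Logical OR: either regular expression may match.
out_or=$(printf '%s\n' "$data" | awk '/sweet/ || /fast/')

# Parentheses group the OR so that ! negates the whole expression.
out_group=$(printf '%s\n' "$data" | awk '!(/cute/ || /fast/)')

# Conditional operator: test /cute/ on the first line, /car/ on the rest.
out_cond=$(printf '%s\n' "$data" | awk '(NR == 1) ? /cute/ : /car/')

# Two BEGIN rules behave like one rule; their actions run in order.
out_begin=$(awk 'BEGIN { print "first" } BEGIN { print "second" }')

printf '%s\n' "$out_and" "$out_or" "$out_group" "$out_cond" "$out_begin"
```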
B.2.4. Regular Expressions
Table B.1 summarizes the regular expressions
as described in Chapter 3, "Understanding Regular Expression Syntax". The metacharacters are
listed in order of precedence.
Table B.1. Regular Expression Metacharacters
Special Characters | Usage |
c | Matches any literal character c that is not a metacharacter. |
\ | Escapes any metacharacter that follows, including itself. |
^ | Anchors following regular expression to the beginning of string. |
$ | Anchors preceding regular expression to the end of string. |
. | Matches any single character, including newline. |
[...] | Matches any one of the class of characters enclosed between the brackets. A circumflex (^) as the first character inside brackets reverses the match to all characters except those listed in the class. A hyphen (-) is used to indicate a range of characters. The close bracket (]) as the first character in a class is a member of the class. All other metacharacters lose their meaning when specified as members of a class, except \, which can be used to escape ], even if it is not first. |
r1|r2 | Between two regular expressions, r1 and r2, it allows either of the regular expressions to be matched. |
(r1)(r2) | Used for concatenating regular expressions. |
r* | Matches any number (including zero) of the regular expression that immediately precedes it. |
r+ | Matches one or more occurrences of the preceding regular expression. |
r? | Matches 0 or 1 occurrences of the preceding regular expression. |
(r) | Used for grouping regular expressions. |
Regular expressions can also make use of the escape sequences for
accessing special characters, as defined in Section B.2.5.2 later in this appendix.
Note that ^ and $ work on
strings; they do not match against newlines
embedded in a record or string.
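A few of the metacharacters from Table B.1 can be tried out with a short list of words. This is only a sketch; the word list and the shell variable names are assumptions chosen for illustration.

```shell
# Hypothetical word list, one word per line.
words='cat
cart
carrot
at
c.t
dog'

# ^ and $ anchor the match; . matches any single character.
out_dot=$(printf '%s\n' "$words" | awk '/^c.t$/')

# \ escapes the dot so that it matches only a literal period.
out_esc=$(printf '%s\n' "$words" | awk '/^c\.t$/')

# ? makes the preceding "r" optional (0 or 1 occurrences).
out_opt=$(printf '%s\n' "$words" | awk '/^car?t$/')

# + requires one or more occurrences of the preceding "r".
out_plus=$(printf '%s\n' "$words" | awk '/^car+ot$/')

# | selects either alternative; parentheses group the alternatives.
out_alt=$(printf '%s\n' "$words" | awk '/^(cat|dog)$/')

# [^c] matches any one character except "c".
out_neg=$(printf '%s\n' "$words" | awk '/^[^c]/')

printf '%s\n' "$out_dot" "$out_esc" "$out_opt" "$out_plus" "$out_alt" "$out_neg"
```

The first expression matches both cat and c.t, while the escaped version matches only c.t.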
Within a pair of brackets, POSIX allows special notations for
matching non-English characters. They are described in
Table B.2.
Table B.2. POSIX Character List Facilities
Notation | Facility |
[.symbol.] | Collating symbols. A collating symbol is a multi-character sequence that should be treated as a unit. |
[=equiv=] | Equivalence classes. An equivalence class lists a set of characters that should be considered equivalent, such as "e" and "è". |
[:class:] | Character classes. Character class keywords describe different classes of characters such as alphabetic characters, control characters, and so on. |
[:alnum:] | Alphanumeric characters |
[:alpha:] | Alphabetic characters |
[:blank:] | Space and tab characters |
[:cntrl:] | Control characters |
[:digit:] | Numeric characters |
[:graph:] | Printable and visible (non-space) characters |
[:lower:] | Lowercase characters |
[:print:] | Printable characters |
[:punct:] | Punctuation characters |
[:space:] | Whitespace characters |
[:upper:] | Uppercase characters |
[:xdigit:] | Hexadecimal digits |
Note that these facilities (as of this writing) are still not
widely implemented.
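Where an awk does implement the POSIX character classes, they are written inside a bracket expression, so the brackets appear doubled. The following sketch assumes such an awk; the input lines and shell variable names are made up for illustration.

```shell
# Hypothetical sample input; the last line contains a space.
lines='abc
123
a1b2
a b'

# Lines made up entirely of digits.
out_digit=$(printf '%s\n' "$lines" | awk '/^[[:digit:]]+$/')

# Lines made up entirely of alphabetic characters.
out_alpha=$(printf '%s\n' "$lines" | awk '/^[[:alpha:]]+$/')

# Lines made up entirely of alphanumeric characters.
out_alnum=$(printf '%s\n' "$lines" | awk '/^[[:alnum:]]+$/')

# Lines containing a space or tab anywhere.
out_blank=$(printf '%s\n' "$lines" | awk '/[[:blank:]]/')

printf '%s\n' "$out_digit" "$out_alpha" "$out_alnum" "$out_blank"
```

If the awk at hand does not support these notations, explicit ranges such as [0-9] or [A-Za-z] can be used instead, at the cost of locale independence.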
B.2.5. Expressions
An expression can be made up of constants, variables, operators, and
functions. A constant is a string (any sequence of characters) or a
numeric value. A variable is a symbol that references a value. You
can think of it as a name that retrieves a particular
numeric or string value.
B.2.5.3. Variables
There are three kinds of variables: user-defined, built-in, and
fields. By convention, the names of built-in or system variables
consist of all capital letters.
The name of a variable cannot start with a digit.
Otherwise, it consists of letters, digits, and underscores.
Case is significant in variable names.
A variable does not need to be declared or initialized. A variable
can contain either a string or numeric value. An uninitialized
variable has the empty string ("") as its string value and 0
as its numeric value. Awk attempts to decide whether a value should
be processed as a string or a number depending upon the operation.
The assignment of a variable has the form:
var = expr
It assigns the value of the expression to
var. The following expression assigns a
value of 1 to the variable x.
x = 1
The name of the variable is used to reference the value:
{ print x }
prints the value of the variable x. In this case,
it would be 1.
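The behavior of uninitialized variables, case sensitivity, and string-versus-number handling can be seen in a small sketch. The variable names (s, n, x, X, y, z) are arbitrary choices for illustration; everything runs in a BEGIN rule, so no input is read.

```shell
out_vars=$(awk 'BEGIN {
    # s and n are never assigned before use.
    print "[" s "]"     # used as a string: the empty string
    print n + 0         # used as a number: 0

    x = 1
    print x             # the name references the value

    X = "upper"         # a different variable from x
    print x, X

    y = "3" + 4         # addition treats both operands as numbers
    print y

    z = 3 " apples"     # concatenation treats both operands as strings
    print z
}')

printf '%s\n' "$out_vars"
```

The six print statements produce [], 0, 1, 1 upper, 7, and 3 apples, one per line.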
See Section B.2.5.5 later in this appendix for information on
built-in variables. A field variable is referenced using
$n, where
n is any number from 0 to NF,
that references the field by position ($0 is the entire record).
The number can be supplied by a
variable, such as $NF meaning the last field, or by a
constant, such as $1 meaning the first field.
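A short sketch of field references, assuming a single made-up input line of four words. The variable n is arbitrary, and the parenthesized form $(NF - 1) goes slightly beyond the text above: it is shown on the understanding that awk accepts any numeric expression after the $.

```shell
# Hypothetical input record with four fields.
line='one two three four'

out_fields=$(printf '%s\n' "$line" | awk '{
    print $1          # constant: first field
    print $NF         # variable: last field
    print NF          # number of fields in the record
    n = 2
    print $n          # field number held in a user-defined variable
    print $(NF - 1)   # expression: next-to-last field
    print $0          # the entire record
}')

printf '%s\n' "$out_fields"
```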
B.2.5.5. System variables
Awk defines a number of special variables that can be referenced or
reset inside a program, as shown in Table B.4 (defaults are listed in parentheses).
Table B.4. Awk System Variables
Variable | Description |
ARGC | Number of arguments on command line |
ARGV | An array containing the command-line arguments |
CONVFMT | String conversion format for numbers (%.6g). (POSIX) |
ENVIRON | An associative array of environment variables |
FILENAME | Current filename |
FNR | Like NR, but relative to the current file |
FS | Field separator (a blank) |
NF | Number of fields in current record |
NR | Number of the current record |
OFMT | Output format for numbers (%.6g) |
OFS | Output field separator (a blank) |
ORS | Output record separator (a newline) |
RLENGTH | Length of the string matched by match() function |
RS | Record separator (a newline) |
RSTART | First position in the string matched by match() function |
SUBSEP | Separator character for array subscripts (\034) |
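Several of these variables can be seen at work in a brief sketch. The colon-separated input, the test strings, and the shell variable names are assumptions for illustration only.

```shell
# FS and OFS are reset in a BEGIN rule; NR and NF are maintained by awk.
out_sys=$(printf 'a:b:c\nd:e\n' |
    awk 'BEGIN { FS = ":"; OFS = "-" } { print NR, NF, $1, $NF }')

# match() sets RSTART and RLENGTH as a side effect.
out_match=$(awk 'BEGIN {
    if (match("hello world", /wor/))
        print RSTART, RLENGTH
}')

# OFMT controls how print formats a non-integer number.
out_ofmt=$(awk 'BEGIN { OFMT = "%.2f"; print 3.14159 }')

printf '%s\n' "$out_sys" "$out_match" "$out_ofmt"
```

The first command prints 1-3-a-c and 2-2-d-e: the record number, the field count, the first field, and the last field, joined by the new output field separator.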
B.2.5.6. Operators
Table B.5 lists the operators available in awk,
in order of precedence (low to high).
Table B.5. Operators
Operators | Description |
= += -= *= /= %= ^= **= | Assignment |
?: | C conditional expression |
|| | Logical OR |
&& | Logical AND |
~ !~ | Match regular expression and negation |
< <= > >= != == | Relational operators |
(blank) | Concatenation |
+ - | Addition, subtraction |
* / % | Multiplication, division, and modulus |
+ - ! | Unary plus and minus, and logical negation |
^ ** | Exponentiation |
++ -- | Increment and decrement, either prefix or postfix |
$ | Field reference |
NOTE:
While "**" and "**=" are common extensions, they are not
part of POSIX awk.
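The effect of the precedence order in Table B.5 can be checked with a few expressions. This sketch uses only POSIX operators (^ rather than **); the variable names are arbitrary, and results are assigned before printing to keep > and ?: out of the print statement itself.

```shell
out_ops=$(awk 'BEGIN {
    print 2 + 3 * 4        # multiplication binds tighter than addition
    print 2 ^ 3 ^ 2        # exponentiation groups right to left: 2 ^ 9
    print -2 ^ 2           # exponentiation binds tighter than unary minus
    print 1 " " 2 + 3      # addition binds tighter than concatenation

    x = 5; x += 2
    print x                # compound assignment

    r = (3 > 2) ? "yes" : "no"
    print r                # conditional expression

    i = 1; j = i++
    print j, i             # postfix increment yields the old value

    m = ("abc" ~ /b/)
    print m                # a successful match has the value 1
}')

printf '%s\n' "$out_ops"
```

The output is 14, 512, -4, 1 5, 7, yes, 1 2, and 1, one per line. When in doubt about how an expression groups, parentheses make the intent explicit.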
Copyright © 2003 O'Reilly & Associates. All rights reserved.