B.2 Language Summary for awk

This section summarizes how awk processes input records and describes the various syntactic elements that make up an awk program.

B.2.1 Records and Fields

Each line of input is split into fields. By default, the field delimiter is one or more spaces and/or tabs. You can change the field separator by using the -F command-line option. Doing so also sets the value of FS. The following command line changes the field separator to a colon:

    awk -F: -f awkscr /etc/passwd

You can also assign the delimiter to the system variable FS. This is typically done in the BEGIN procedure, but the assignment can also be passed as a parameter on the command line.

    awk -f awkscr FS=: /etc/passwd

Each input line forms a record containing any number of fields. Each field can be referenced by its position in the record. "$1" refers to the value of the first field, "$2" to the second field, and so on. "$0" refers to the entire record. The following action prints the first field of each input line:

    { print $1 }

The default record separator is a newline. The following procedure sets FS and RS so that awk interprets an input record as any number of lines up to a blank line, with each line being a separate field.

    BEGIN { FS = "\n"; RS = "" }

It is important to know that when RS is set to the empty string, newline always separates fields, in addition to whatever value FS may have. This is discussed in more detail in both The AWK Programming Language and Effective AWK Programming.

B.2.2 Format of a Script

An awk script is a set of pattern-matching rules and actions:

    pattern { action }
An action is one or more statements that will be performed on those input lines that match the pattern. If no pattern is specified, the action is performed for every input line. The following example uses the print statement to print each line in the input file:

    { print }

If only a pattern is specified, then the default action consists of the print statement, as shown above. Function definitions can also appear:

    function name (parameter-list) { statements }
This syntax defines the function name, making available the list of parameters for processing in the body of the function. Variables specified in the parameter-list are treated as local variables within the function. All other variables are global and can be accessed outside the function. When calling a user-defined function, no space is permitted between the name of the function and the opening parenthesis. Spaces are allowed in the function's definition. User-defined functions are described in Chapter 9, Functions.

B.2.2.1 Line termination

A line in an awk script is terminated by a newline or a semicolon. Using semicolons to put multiple statements on a line, while permitted, reduces the readability of most programs. Blank lines are permitted between statements.

Program control statements (do, if, for, or while) continue on the next line, where a dependent statement is listed. If multiple dependent statements are specified, they must be enclosed within braces.

    if (NF > 1) {
        name = $1
        total += $2
    }

You cannot use a semicolon to avoid using braces for multiple statements.

You can type a single statement over multiple lines by escaping the newline with a backslash (\). You can also break lines following any of the following characters:

    ,  {  &&  ||

Gawk also allows you to continue a line after either a "?" or a ":". Strings cannot be broken across a line (except in gawk, using "\" followed by a newline).

B.2.2.2 Comments

A comment begins with a "#" and ends with a newline. It can appear on a line by itself or at the end of a line. Comments are descriptive remarks that explain the operation of the script. Comments cannot be continued across lines by ending them with a backslash.

B.2.3 Patterns

A pattern can be any of the following:
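As a hedged illustration, the run below exercises the pattern forms most commonly seen in awk scripts: BEGIN, a regular expression, a relational expression, a pattern-matching expression, and END. The sample input (a name and a score per line) and the rules themselves are invented for this sketch.

```shell
# Hypothetical sample input: a name and a score, separated by a space.
patterns_out=$(printf 'alice 90\nbob 45\ncarol 72\n' | awk '
BEGIN      { print "start" }              # runs before any input is read
/^a/       { print "regex:", $1 }         # regular expression pattern
$2 > 70    { print "relational:", $1 }    # relational expression
$1 ~ /ob$/ { print "match:", $1 }         # pattern-matching expression
END        { print "records:", NR }       # runs after all input is read
')
echo "$patterns_out"
```

Each rule is tested against every record in the order the rules appear, so the first input line triggers both the regular expression rule and the relational rule.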
B.2.4 Regular Expressions

Table 13.2 summarizes the regular expressions as described in Chapter 3. The metacharacters are listed in order of precedence.
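A small sketch of several common metacharacters at work may help here. The word list and the output labels are invented; the rules use only anchors, ".", "*", grouping with alternation, and a negated bracket expression.

```shell
# Hypothetical word list, one word per line.
regex_out=$(printf 'cat\ncart\ncoat\ndog\n' | awk '
/^c.*t$/       { print "wrapped:", $0 }   # starts with c, ends with t
/^(cat|dog)$/  { print "either:", $0 }    # exactly cat or exactly dog
/^[^c]/        { print "not-c:", $0 }     # first character is not c
')
echo "$regex_out"
```

Because "^" and "$" anchor the match to the ends of the record, the second rule accepts "cat" but not "cart".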
Regular expressions can also make use of the escape sequences for accessing special characters, as defined in the section "Escape sequences" later in this appendix. Note that ^ and $ work on strings; they do not match against newlines embedded in a record or string.

Within a pair of brackets, POSIX allows special notations for matching non-English characters. They are described in Table 13.3.
Note that these facilities (as of this writing) are still not widely implemented.

B.2.5 Expressions

An expression can be made up of constants, variables, operators and functions. A constant is a string (any sequence of characters) or a numeric value. A variable is a symbol that references a value. You can think of it as a name that retrieves a particular numeric or string value.

B.2.5.1 Constants

There are two types of constants, string and numeric. A string constant must be quoted while a numeric constant is not.

B.2.5.2 Escape sequences

The escape sequences described in Table 13.4 can be used in strings and regular expressions.
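The following sketch shows a handful of widely supported escape sequences in a string and in a regular expression. It is illustrative only and is not a substitute for the table; the variable names escape_out and tab_out are made up for the example.

```shell
# Escape sequences inside awk string constants.
escape_out=$(awk 'BEGIN {
    printf "name\tvalue\n"     # tab between the words, newline at the end
    print "a\\b"               # doubled backslash yields one backslash
    print "say \"hi\""         # escaped double quotes inside a string
    print "\101"               # octal 101 is the letter A in ASCII
}')
echo "$escape_out"

# The same tab escape used inside a regular expression.
tab_out=$(printf 'x\ty\n' | awk '/\t/ { print "tab found" }')
echo "$tab_out"
```

The awk program is wrapped in single quotes so that the shell passes the backslashes through to awk unchanged.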
B.2.5.3 Variables

There are three kinds of variables: user-defined, built-in, and fields. By convention, the names of built-in or system variables consist of all capital letters.

The name of a variable cannot start with a digit. Otherwise, it consists of letters, digits, and underscores. Case is significant in variable names.

A variable does not need to be declared or initialized. A variable can contain either a string or numeric value. An uninitialized variable has the empty string ("") as its string value and 0 as its numeric value. Awk attempts to decide whether a value should be processed as a string or a number depending upon the operation.

The assignment of a variable has the form:

    var = expr
It assigns the value of the expression to var. The following expression assigns a value of 1 to the variable x.

    x = 1

The name of the variable is used to reference the value:

    { print x }

prints the value of the variable x. In this case, it would be 1.

See the section "System variables" below for information on built-in variables. A field variable is referenced using $n, where n is any number from 0 to NF, which references the field by position. It can be supplied by a variable, such as $NF meaning the last field, or by a constant, such as $1 meaning the first field.

B.2.5.4 Arrays

An array is a variable that can be used to store a set of values. The following statement assigns a value to an element of an array:

    array[index] = value
In awk, all arrays are associative arrays. What makes an associative array unique is that its index can be a string or a number.

An associative array makes an "association" between the indices and the elements of an array. For each element of the array, a pair of values is maintained: the index of the element and the value of the element. Unlike in a conventional array, the elements are not stored in any particular order.

You can use the special for loop to read all the elements of an associative array.

    for (item in array)
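A minimal sketch of that loop follows, assuming invented input of a fruit name and a quantity per line. The array name count is hypothetical. Because traversal order is unspecified, the output is piped through sort to make it predictable.

```shell
# Hypothetical input: a fruit name and a quantity on each line.
array_out=$(printf 'apple 3\npear 5\napple 4\n' | awk '
{ count[$1] += $2 }              # the index is a string taken from field 1
END {
    for (item in count)          # visits every index, in no particular order
        print item, count[item]
}' | sort)
echo "$array_out"
```

This prints "apple 7" and "pear 5": the two apple lines accumulate into one element because they share the same index.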
The index of the array is available as item, while the value of an element of the array can be referenced as array[item].

You can use the operator in to test that an element exists by testing to see if its index exists.

    index in array

tests that array[index] exists, but you cannot use it to test the value of the element referenced by array[index].

You can also delete individual elements of the array using the delete statement.

B.2.5.5 System variables

Awk defines a number of special variables that can be referenced or reset inside a program, as shown in Table 13.5 (defaults are listed in parentheses).
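As a sketch of a few of these variables in use, the run below sets FS and OFS and reads NR and NF, all of which are standard awk variables. The colon-separated sample input is made up for the example.

```shell
# Hypothetical colon-separated input, loosely modeled on /etc/passwd.
sysvar_out=$(printf 'root:x:0\ndaemon:x:1\n' | awk '
BEGIN { FS = ":"; OFS = "-" }    # input and output field separators
{ print NR, NF, $1 }             # record number, field count, first field
END { print "total", NR }        # NR keeps the final record count in END
')
echo "$sysvar_out"
```

The commas in the print statements are replaced by the value of OFS, so the first output line is "1-3-root".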
B.2.5.6 Operators

Table 13.6 lists the operators that are available in awk, in order of precedence (low to high).
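A few one-line expressions give a feel for how precedence affects evaluation. This is only a sketch covering four cases chosen for illustration, not a restatement of the table.

```shell
precedence_out=$(awk 'BEGIN {
    print 2 + 3 * 4          # multiplication binds tighter than addition
    print 2 ^ 3 ^ 2          # exponentiation groups right to left: 2 ^ 9
    print 1 " " 2 + 3        # addition binds tighter than concatenation
    print (1 || 0 && 0)      # logical AND binds tighter than logical OR
}')
echo "$precedence_out"
```

The four lines printed are 14, 512, "1 5", and 1. When in doubt, parentheses make the intended grouping explicit.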
B.2.6 Statements and Functions

An action is enclosed in braces and consists of one or more statements and/or expressions. The difference between a statement and a function is that a function returns a value, and its argument list is specified within parentheses. (The formal syntactical difference does not always hold true: printf is considered a statement, but its argument list can be put in parentheses; getline is a function that does not use parentheses.)

Awk has a number of predefined arithmetic and string functions. A function is typically called as follows:

    return = function(arg1, arg2)
where return is a variable created to hold what the function returns. (In fact, the return value of a function can be used anywhere in an expression, not just on the right-hand side of an assignment.) Arguments to a function are specified as a comma-separated list. The left parenthesis follows the name of the function. (With built-in functions, a space is permitted between the function name and the parentheses.)
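The sketch below calls three built-in string functions (length, substr, and index) to show a return value being assigned to a variable, printed directly, and used inside a larger expression. The variable names s and n are chosen only for this example.

```shell
func_out=$(awk 'BEGIN {
    s = "hello world"
    n = length(s)                  # return value assigned to a variable
    print n
    print substr(s, 1, 5)          # return value used directly by print
    print index(s, "world") + 1    # return value used inside an expression
}')
echo "$func_out"
```

This prints 11, hello, and 8 (index reports position 7 for "world", and the expression adds 1).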