8.4.1. Associative Arrays
In awk, all arrays are associative arrays. What
makes an associative array unique is that its index can be a string or
a number.
In most programming languages, the indices of arrays are exclusively
numeric. In these implementations, an array is a sequence of
locations where values are stored. The indices of the array are
derived from the order in which the values are stored. There is no
need to keep track of indices. For instance, the index of the first
element of an array is "1" (or "0" in many languages), meaning the first location in the array.
An associative array makes an "association" between the indices and
the elements of an array. For each element of the array, a pair of
values is maintained: the index of the element and the value of the
element. Unlike in a conventional array, the elements are not stored in any
particular order. Thus, even though you can use numeric subscripts
in awk, the numbers do not have the same meaning that they do in other
programming languages--they do not necessarily refer to
sequential locations. However, with numeric indices, you can still
access all the elements of an array in sequence, as we did in previous
examples. You can create a loop to increment a counter that
references the elements of the array in order.
Sometimes, the distinction between numeric and string indices is
important. For instance, if you use "04" as the index to an element
of the array, you cannot reference that element using "4" as its
subscript. You'll see how to handle this problem in a sample program
date-month, shown later in this chapter.
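The distinction can be seen in a minimal sketch. The function name subscript_demo and the array a are made up for illustration; the point is only that the subscripts "04" and 4 name different elements until the string is coerced to a number.

```shell
# subscript_demo: hypothetical helper showing that "04" and 4 are
# different subscripts, and that adding 0 reconciles them.
subscript_demo() {
    awk 'BEGIN {
        a["04"] = "April"
        # the number 4 becomes the subscript "4", which is not "04"
        if (4 in a)
            print "a[4] found"
        else
            print "a[4] not found"
        # adding 0 to the string "04" yields the number 4
        key = "04" + 0
        a[key] = "April"
        print a[4]
    }'
}
subscript_demo
```

The first test fails because only a["04"] exists at that point; after the coercion, the element is stored under "4" and a[4] retrieves it.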
Associative arrays are a distinctive feature of awk, and a very
powerful one that allows you to use a string as an index to another
value. For instance, you could use a word as the index to its
definition. If you know the word, you can retrieve the definition.
For example, you could use the first field of the input line as the
index to the second field with the following assignment:
array[$1] = $2
Using this technique, we could take our list of acronyms and load it
into an array named acro.
acro[$1] = $2
Each element of the array would be the description of an acronym and
the subscript used to retrieve the element would be the acronym
itself. The following expression:
acro["BASIC"]
produces:
Beginner's All-Purpose Symbolic Instruction Code
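As a rough sketch of how the pieces fit together, the following shell function loads such a list and looks up one entry. It assumes that the acronym and its description are separated by a tab; the function name lookup_acro and the variable want are our own inventions, not part of the original program.

```shell
# lookup_acro: hypothetical helper. Reads "acronym<TAB>description" lines
# on standard input and prints the description of the acronym given as $1.
lookup_acro() {
    awk -F'\t' -v want="$1" '
    # load each line into the array, indexed by the acronym
    { acro[$1] = $2 }
    # after all input is read, look up the requested acronym
    END {
        if (want in acro)
            print acro[want]
        else
            print want ": not found"
    }'
}

printf "BASIC\tBeginner's All-Purpose Symbolic Instruction Code\n" |
    lookup_acro BASIC
```

Testing with "want in acro" before referencing the element avoids creating an empty element for an acronym that is not in the list.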
There is a special looping syntax for accessing all the elements of an
associative array. It is a version of the for
loop.
for ( variable in array )
    do something with array[variable]
The array is the name of an array, as it
was defined. The variable is any variable,
which you can think of as a temporary variable similar to a counter
that is incremented in a conventional for loop.
This variable is set to a particular subscript each time through the
loop. (Because variable is an arbitrary
name, you often see item used, regardless of what
variable name was used for the subscript when the array was loaded.)
For example, the following for loop prints the name
of the acronym item and the definition referenced by that name,
acro[item].
for ( item in acro )
    print item, acro[item]
In this example, the print statement prints the current subscript
("BASIC," for instance) followed by the element of the
acro array referenced by the subscript ("Beginner's
All-Purpose Symbolic Instruction Code").
This syntax can be applied to arrays with numeric subscripts.
However, the order in which the items are retrieved is somewhat
random.
The order is very likely to vary among awk implementations; be careful
to write your programs so that they don't depend on any one version of
awk.
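One way to avoid depending on the traversal order is to sort the output after awk has produced it. The sketch below does this with the shell's own pipe; the function name list_keys and the array seen are illustrative names, not something from the programs in this chapter.

```shell
# list_keys: hypothetical helper. Prints each distinct first field of
# its input once, sorted, whatever order awk visits the subscripts in.
list_keys() {
    awk '{ seen[$1] = 1 }
    END {
        for (item in seen)
            print item
    }' | sort
}

printf 'pear\napple\nfig\napple\n' | list_keys
```

Because sort runs on the finished output, the result is the same whichever awk is installed.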
It is important to remember that all array indices in awk are strings.
Even when you use a number as an index, awk automatically converts it
to a string first. You don't have to worry about this when you use
integer indices, since they get converted to strings as integers, no
matter what the value may be of OFMT (original awk
and earlier versions of new awk) or CONVFMT (POSIX
awk). But if you use a real number as an index, the number to string
conversion might affect you. For instance:
$ gawk 'BEGIN { data[1.23] = "3.21"; CONVFMT = "%d"
> printf "<%s>\n", data[1.23] }'
<>
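Here, nothing was printed between the angle brackets. The element was stored
under the subscript "1.23", but by the time of the second reference CONVFMT
had changed, so 1.23 was converted to just 1, and data["1"], which was never
assigned, has the empty string as its value.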
NOTE:
Not all implementations of awk get the number to
string conversion right when CONVFMT has changed
between one use of a number and the next. Test the above example with
your awk to be sure it works correctly.
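If a program must use real numbers as subscripts, one possible way to sidestep the problem is to build the subscript string explicitly with sprintf() and reuse that string. This is a sketch of that idea only; the function name stable_key_demo and the variable key are hypothetical.

```shell
# stable_key_demo: hypothetical helper. The subscript is fixed as a
# string once, so later changes to CONVFMT cannot alter it.
stable_key_demo() {
    awk 'BEGIN {
        key = sprintf("%.2f", 1.23)    # key is now the string "1.23"
        data[key] = "3.21"
        CONVFMT = "%d"
        printf "<%s>\n", data[key]
    }'
}
stable_key_demo
```

Since key already holds a string, no number-to-string conversion takes place when it is used as a subscript.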
Now let's return to our student grade program for an example. Let's
say that we wanted to report how many students got an "A," how many
got a "B," and so on. Once we determine the grade, we could increment
a counter for that grade. We could set up individual variables for
each letter grade and then test which one to increment.
if (grade == "A")
    ++gradeA
else if (grade == "B")
    ++gradeB
.
.
.
However, an array makes this task much easier. We can define an array
called class_grade, and simply use the letter grade
(A through F) as the index to the array.
++class_grade[grade]
Thus, if the grade is an "A" then the value of
class_grade["A"] is incremented by one. At the end
of the program, we can print out these values in the
END rule using the special for
loop:
for (letter_grade in class_grade)
    print letter_grade ":", class_grade[letter_grade] | "sort"
The variable letter_grade references a single
subscript of the array class_grade each time
through the loop. The output is piped to sort, to
make sure the grades come out in the proper order. (Piping output to
programs is discussed in Chapter 10, "The Bottom Drawer".) Since this
is the last addition we make to the grades.awk
script, we can look at the full listing.
# grades.awk -- average student grades and determine
# letter grade as well as class averages.
# $1 = student name; $2 - $NF = test scores.
# set output field separator to tab.
BEGIN { OFS = "\t" }
# action applied to all input lines
{
    # add up grades
    total = 0
    for (i = 2; i <= NF; ++i)
        total += $i
    # calculate average
    avg = total / (NF - 1)
    # assign student's average to element of array
    student_avg[NR] = avg
    # determine letter grade
    if (avg >= 90) grade = "A"
    else if (avg >= 80) grade = "B"
    else if (avg >= 70) grade = "C"
    else if (avg >= 60) grade = "D"
    else grade = "F"
    # increment counter for letter grade array
    ++class_grade[grade]
    # print student name, average and letter grade
    print $1, avg, grade
}
# print out class statistics
END {
    # calculate class average
    for (x = 1; x <= NR; x++)
        class_avg_total += student_avg[x]
    class_average = class_avg_total / NR
    # determine how many above/below average
    for (x = 1; x <= NR; x++)
        if (student_avg[x] >= class_average)
            ++above_average
        else
            ++below_average
    # print results
    print ""
    print "Class Average: ", class_average
    print "At or Above Average: ", above_average
    print "Below Average: ", below_average
    # print number of students per letter grade
    for (letter_grade in class_grade)
        print letter_grade ":", class_grade[letter_grade] | "sort"
}
Here's a sample run:
$ cat grades.test
mona 70 77 85 83 70 89
john 85 92 78 94 88 91
andrea 89 90 85 94 90 95
jasper 84 88 80 92 84 82
dunce 64 80 60 60 61 62
ellis 90 98 89 96 96 92
$ awk -f grades.awk grades.test
mona 79 C
john 88 B
andrea 90.5 A
jasper 85 B
dunce 64.5 D
ellis 93.5 A
Class Average: 83.4167
At or Above Average: 4
Below Average: 2
A: 2
B: 2
C: 1
D: 1
8.4.4. Using split() to Create Arrays
The built-in function split() can parse any
string into elements of an array. This function can be useful to
extract "subfields" from a field. The syntax of the
split() function is:
n = split(string, array, separator)
string is the input string to be parsed
into elements of the named array. The
array's indices start at 1 and go to n, the
number of elements in the array. The elements will be split based on
the specified separator character. If a
separator is not specified, then the field separator
(FS) is used. The
separator can be a full regular expression,
not just a single character. Array splitting behaves identically to
field splitting; see Section 7.5.1 in Chapter 7.
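To illustrate a regular expression used as the separator, here is a small sketch that splits a string on either a slash or a hyphen. The function name split_fields and the array part are our own names, and the bracket expression "[/-]" is just one possible pattern.

```shell
# split_fields: hypothetical helper. Splits its argument wherever a
# "/" or a "-" occurs and prints each element with its index.
split_fields() {
    awk -v str="$1" 'BEGIN {
        n = split(str, part, "[/-]")
        for (i = 1; i <= n; ++i)
            print i, part[i]
    }'
}
split_fields "5/11-55"
```

The return value n tells the loop how many elements were created, so the elements can be read back in order.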
For example, if you had a record in which the first field consisted of
the person's full name (which implies a field separator other than whitespace,
such as a tab), you could use the
split() function to extract the person's
first and last names. The following statement breaks up the first
field into elements of the array fullname:
z = split($1, fullname, " ")
A single space is specified as the delimiter; as with the default field
separator, this splits the string on runs of spaces and tabs. The person's first name
can be referenced as:
fullname[1]
and the person's last name can be referenced as:
fullname[z]
because z contains the number of elements in the
array. This works, regardless of whether the person's full name
contains a middle name. If z is the value returned
by split(), you can write a loop to read
all the elements of this array.
z = split($1, array, " ")
for (i = 1; i <= z; ++i)
    print i, array[i]
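Putting the name example into a complete command might look like the following sketch. It assumes tab-separated records whose first field is a full name; the function name first_last and the sample record are made up for illustration.

```shell
# first_last: hypothetical helper. Expects tab-separated records whose
# first field is a full name; prints the first and last names only.
first_last() {
    awk -F'\t' '{
        z = split($1, fullname, " ")
        print fullname[1], fullname[z]
    }'
}

printf 'Mary Ann Smith\t555-0100\n' | first_last
```

Because fullname[z] is always the last element, a middle name, if present, is simply skipped.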
The next section contains additional examples of using the
split() function.
8.4.5. Making Conversions
This section looks at two examples that demonstrate similar methods of
converting output from one format to another.
When working on the index program shown in
Chapter 12, "Full-Featured Applications", we needed a quick way to assign roman
numerals to volume numbers. In other words, volume 4 needed to be
identified as "IV" in the index. Since there was no immediate
prospect of the number of volumes exceeding 10, we wrote a script that
took as input a number between 1 and 10 and converted it to a roman
numeral.
This shell script takes the first argument from the command
line and echoes it as input to the awk program.
echo $1 |
awk '# romanum -- convert number 1-10 to roman numeral
# define numerals as list of roman numerals 1-10
BEGIN {
    # create array named numerals from list of roman numerals
    split("I,II,III,IV,V,VI,VII,VIII,IX,X", numerals, ",")
}
# look for number between 1 and 10
$1 > 0 && $1 <= 10 {
    # print specified element
    print numerals[$1]
    exit
}
{   print "invalid number"
    exit
}'
This script defines a list of 10 roman numerals, then uses
split() to load them into an array named
numerals. This is done in the
BEGIN action because it only needs to be done once.
The second rule checks that the first field of the input line contains
a number between 1 and 10. If it does, this number is used as the
index to the numerals array, retrieving the
corresponding element. The exit statement
terminates the program. The last rule is executed only if there is no
valid entry.
Here's an example of how it works:
$ romanum 4
IV
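The range test in romanum also accepts a value such as 2.5, for which there is no matching element. If that matters, a stricter variation is possible. The sketch below is our own variant, not the script used for the index program: the name romanum_strict is hypothetical, and the added pattern match simply insists on a whole number.

```shell
# romanum_strict: hypothetical variant of romanum that rejects
# anything other than a whole number from 1 through 10.
romanum_strict() {
    echo "$1" |
    awk 'BEGIN {
        split("I,II,III,IV,V,VI,VII,VIII,IX,X", numerals, ",")
    }
    # accept only digits, then check the range
    $1 ~ /^[0-9]+$/ && $1 >= 1 && $1 <= 10 {
        # adding 0 lets an input such as "04" find element 4
        print numerals[$1 + 0]
        exit
    }
    {   print "invalid number"
        exit
    }'
}
romanum_strict 4
```

As in the original, the exit statement in the second rule keeps the last rule from running once a valid number has been handled.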
Following along on the same idea, here's a script that converts dates
in the form "mm-dd-yy" or "mm/dd/yy" to "month day, year."
awk '
# date-month -- convert mm/dd/yy or mm-dd-yy to month day, year
# build list of months and put in array.
BEGIN {
    # the 3-step assignment is done for printing in book
    listmonths = "January,February,March,April,May,June,"
    listmonths = listmonths "July,August,September,"
    listmonths = listmonths "October,November,December"
    split(listmonths, month, ",")
}
# check that there is input
$1 != "" {
    # split on "/" the first input field into elements of array
    sizeOfArray = split($1, date, "/")
    # check that only one field is returned
    if (sizeOfArray == 1)
        # try to split on "-"
        sizeOfArray = split($1, date, "-")
    # must be invalid
    if (sizeOfArray == 1)
        exit
    # add 0 to number of month to coerce numeric type
    date[1] += 0
    # print month day, year
    print month[date[1]], (date[2] ", 19" date[3])
}'
This script reads from standard input. The BEGIN
action creates an array named month whose elements
are the names of the months of the year. The second rule verifies
that we have a non-empty input line. The first statement in the
associated action splits the first field of input looking for
"/" as the delimiter. sizeOfArray contains
the number of elements in the array. If the delimiter does not occur in the
string, split() stores the whole string as a single element and returns 1. Thus, we can test
the value of sizeOfArray to determine if we have
several elements. If we do not, we assume that perhaps "-" was used
as the delimiter. If that fails to produce an array with multiple
elements, we assume the input is invalid, and exit. If we have
successfully parsed the input, date[1] contains the
number of the month. This value can be used as the index to the array
month, so that an element of one array serves as the subscript of another. However,
before using date[1], we coerce the type of
date[1] by adding 0 to it. Because array subscripts are strings, a month
entered with a leading zero, such as "06", would be used as the subscript "06",
which does not match the element stored under "6". Adding 0 converts the value
to the number 6, which then becomes the subscript "6". The element referenced by
date[1] is used as the subscript for
month.
Here's a sample run:
$ echo "5/11/55" | date-month
May 11, 1955