12.2.5. Merging Entries with the Same Keys
The pagenums.idx program reduced entries that were
the same except for the page number. Now we'll to process
entries that share the same primary key. We also want to look for
consecutive page numbers and combine them in ranges.
The combine.idx is quite similar to the
pagenums.idx script, making another pass through
the index, comparing entries with the same primary key. The following
pseudocode abstracts this comparison. (To make this discussion easier, we
will omit tertiary keys and show how to compare primary and secondary
keys.) After the entries are processed by
pagenums.idx, no two entries exist that share the
same primary and secondary keys. Therefore, we don't have to compare
secondary keys.
PRIMARY = $1
SECONDARY = $2
PAGE = $3
if (PRIMARY == prevPRIMARY)
print :SECONDARY:
else
print PRIMARY:SECONDARY
prevPRIMARY = PRIMARY
prevSECONDARY = SECONDARY
If the primary keys match, we output only the secondary
key. For instance, if there are three entries:
XView:18
XView:about:3, 7, 10
XView:as object-oriented system:17
they will be output as:
XView:18
:about:3, 7, 10
:as object-oriented system:17
We drop the primary key when it is the same. The actual code is a
little more difficult because there are tertiary keys. We have to
test primary and secondary keys to see if they are unique or the same,
but we don't have to test tertiary keys. (We only need to know that
they are there.)
You no doubt noticed that the above pseudocode does not output page
numbers. The second role of this script is to examine page numbers
and combine a list of consecutive numbers. The page numbers are a
comma-separated list that can be loaded into an array, using the
split() function.
To see if numbers are consecutive, we loop through the array comparing
each element to 1 + the previous element.
eachpage[j-1]+1 == eachpage[j]
In other words, if adding 1 to the previous element produces the
current element, then they are consecutive. The previous element
becomes the first page number in the range and the current element
becomes the last page in the range. This is done within a
while loop until the conditional is not true, and
the page numbers are not consecutive. Then we output the first page
number and the last page number separated by a hyphen:
23-25
The actual code looks more complicated than this because it is called
from a function that must recognize volume and page number pairs. It
first has to split the volume from the list of page numbers and then
it can call the function (rangeOfPages())
to process the list of numbers.
Here is the full listing of combine.idx:
#!/work/bin/nawk -f
# ------------------------------------------------
# combine.idx -- merge keys with same PRIMARY key
# and combine consecutive page numbers
# Author: Dale Dougherty
# Version 1.1 7/10/90
#
# input should be PRIMARY:SECONDARY:PAGELIST
# ------------------------------------------------
BEGIN { FS = ":"; OFS = ""}
# main routine -- applies to all input lines
# It compares the keys and merges the duplicates.
{
# assign first field
PRIMARY=$1
# split second field, getting SEC and TERT keys.
sizeOfArray = split($2, array, ";")
SECONDARY = array[1]
TERTIARY = array[2]
# test that tertiary key exists
if (sizeOfArray > 1) {
# tertiary key exists
isTertiary = 1
# two cases where ";" might turn up
# check SEC key for list of "see also"
if (SECONDARY ~ /\([sS]ee also/){
SECONDARY = $2
isTertiary = 0
}
# check TERT key for "see also"
if (TERTIARY ~ /\([sS]ee also/){
TERTIARY = substr($2, (index($2, ";") + 1))
}
}
else # tertiary key does not exist
isTertiary = 0
# assign third field
PAGELIST = $3
# Conditional to compare primary key of this entry to that
# of previous entry. Then compare secondary keys. This
# determines which non-duplicate keys to output.
if (PRIMARY == prevPrimary) {
if (isTertiary && SECONDARY == prevSecondary)
printf (";\n::%s", TERTIARY)
else
if (isTertiary)
printf ("\n:%s; %s", SECONDARY, TERTIARY)
else
printf ("\n:%s", SECONDARY)
}
else {
if (NR != 1)
printf ("\n")
if ($2 != "")
printf ("%s:%s", PRIMARY, $2)
else
printf ("%s", PRIMARY)
prevPrimary = PRIMARY
}
prevSecondary = SECONDARY
} # end of main procedure
# routine for "See" entries (primary key only)
NF == 1 { printf ("\n") }
# routine for all other entries
# It handles output of the page number.
NF > 1 {
if (PAGELIST)
# calls function numrange() to look for
# consecutive page numbers.
printf (":%s", numrange(PAGELIST))
else
if (! isTertiary || (TERTIARY && SECONDARY)) printf (":")
} # end of NF > 1
# END procedure outputs newline
END { printf ("\n") }
# Supporting Functions
# numrange -- read list of Volume^Page numbers, detach Volume
# from Page for each Volume and call rangeOfPages
# to combine consecutive page numbers in the list.
# PAGE = volumes separated by semicolons; volume and page
# separated by ^.
function numrange(PAGE, listOfPages, sizeOfArray)
{
# Split up list by volume.
sizeOfArray = split(PAGE, howManyVolumes,";")
# Check to see if more than 1 volume.
if (sizeOfArray > 1) {
# if more than 1 volume, loop through list
for (i = 1; i <= sizeOfArray; ++i) {
# for each Volume^Page element, detach Volume
# and call rangeOfPages function on Page to
# separate page numbers and compare to find
# consecutive numbers.
if (split(howManyVolumes[i],volPage,"^") == 2)
listOfPages = volPage[1] "^"
rangeOfPages(volPage[2])
# collect output in listOfPages
if (i == 1)
result = listOfPages
else
result=result ";" listOfPages
} # end for loop
}
else { # not more than 1 volume
# check for single volume index with volume number
# if so, detach volume number.
# Both call rangeOfPages on the list of page numbers.
if (split(PAGE,volPage,"^") == 2 )
# if Volume^Page, detach volume and then call rangeOfPages
listOfPages = volPage[1] "^" rangeOfPages(volPage[2])
else # No volume number involved
listOfPages = rangeOfPages(volPage[1])
result = listOfPages
} # end of else
return result # Volume^Page list
} # End of numrange function
# rangeOfPages -- read list of comma-separated page numbers,
# load them into an array, and compare each one
# to the next, looking for consecutive numbers.
# PAGENUMBERS = comma-separated list of page numbers
function rangeOfPages(PAGENUMBERS, pagesAll, sizeOfArray,pages,
listOfPages, d, p, j) {
# close-up space on troff-generated ranges
gsub(/ - /, ",-", PAGENUMBERS)
# split list up into eachpage array.
sizeOfArray = split(PAGENUMBERS, eachpage, ",")
# if more than 1 page number
if (sizeOfArray > 1){
# for each page number, compare it to previous number + 1
p = 0 # flag indicates assignment to pagesAll
# for loop starts at 2
for (j = 2; j-1 <= sizeOfArray; ++j) {
# start by saving first page in sequence (firstpage)
# and loop until we find last page (lastpage)
firstpage = eachpage[j-1]
d = 0 # flag indicates consecutive numbers found
# loop while page numbers are consecutive
while ((eachpage[j-1]+1) == eachpage[j] ||
eachpage[j] ~ /^-/) {
# remove "-" from troff-generated range
if (eachpage[j] ~ /^-/) {
sub(/^-/, "", eachpage[j])
}
lastpage = eachpage[j]
# increment counters
++d
++j
} # end of while loop
# use values of firstpage and lastpage to make range.
if (d >= 1) {
# there is a range
pages = firstpage "-" lastpage
}
else # no range; only read firstpage
pages = firstpage
# assign range to pagesAll
if (p == 0) {
pagesAll = pages
p = 1
}
else {
pagesAll = pagesAll "," pages
}
}# end of for loop
# assign pagesAll to listOfPages
listOfPages = pagesAll
} # end of sizeOfArray > 1
else # only one page
listOfPages = PAGENUMBERS
# add space following comma
gsub(/,/, ", ", listOfPages)
# return changed list of page numbers
return listOfPages
} # End of rangeOfPages function
This script consists of minimal BEGIN and
END procedures. The main routine does the work of
comparing primary and secondary keys. The first part of this routine
assigns the fields to variables. The second field contains the
secondary and tertiary keys and we use
split() to separate them. Then we test
that there is a tertiary key and set the flag
isTertiary to either 1 or 0.
The next part of the main procedure contains the conditional
expressions that look for identical keys. As we said in our
discussion of the pseudocode for this part of the program, entries
with wholly identical keys have already been removed by the
pagenums.idx.
The conditionals in this procedure determine what keys to output based
on whether or not each is unique. If the primary key is unique, it is
output, along with the rest of the entry. If the primary key matches
the previous key, we compare secondary keys. If the secondary key is
unique, then it is output, along with the rest of the entry. If the
primary key matches the previous primary key, and the secondary key
matches the previous secondary key, then the tertiary key must be
unique. Then we only output the tertiary key, leaving the primary and
secondary keys blank.
The different forms are shown below:
primary
primary:secondary
:secondary
:secondary:tertiary
::tertiary
primary:secondary:tertiary
The main procedure is followed by two additional routines. The first
of them is executed only when NF equals one. It
deals with the first of the forms on the list above. That is, there
is no page number so we must output a newline to finish the entry.
The second procedure deals with all entries that have page numbers.
This is the procedure where we call a function to take apart the list
of page numbers and look for consecutive pages. It calls the
numrange() function, whose main purpose is
to deal with a multivolume index where a list of page numbers might
look like:
I^35,55; II^200
This function calls split() using a
semicolon delimiter to separate each volume. Then we call
split() using a "^" delimiter to detach the
volume number from the list of page numbers. Once we have the list of
pages, we call a second function
rangeOfPages() to look for consecutive
numbers. On a single-book index, such as the sample shown in this
chapter, the numrange() function really
does nothing but call rangeOfPages(). We
discussed the meat of the rangeOfPages()
function earlier. The eachpage array is created
and a while loop is used to go through the array
comparing an element to the one previous. This function returns the
list of pages.
Sample output from this program follows:
Xlib:6
:repainting canvas:88
Xlib.h header file:89, 294
Xv_Font type:310
XView:18
:about:3, 7, 10
:as object-oriented system:17
:compiling programs:41
:concept of windows differs from X:25
:data types; table of:20
:example of programming interface:44
:frames and subframes:26
:generic functions:21
:Generic Object:18, 24
:libraries:42
:notification:10, 35
:objects:23-24;
:: table of:20;
:: list of:43
:packages:18, 43
:programmer's model:17-23
:programming interface:41
:programs; initialization:45
:reserved names:43
:reserved prefixes:43
:structure of applications:41
:subwindows:28
:types:43
:window objects:25
In particular, notice the entry for "objects" under "XView." This is
an example of a secondary key with multiple tertiary keys. It is also
an example of an entry with a consecutive page range.