12.3 Spare Details of the masterindex Program

This section presents a few details of the masterindex program that might otherwise escape attention. Each fragment is extracted to show how it solves a particular problem.

12.3.1 How to Hide a Special Character

Our first fragment is from the input.idx script, whose job it is to standardize the index entries before they are sorted. This program takes as its input a record consisting of two tab-separated fields: the index entry and its page number. A colon is used as part of the syntax for indicating the parts of an index entry.

Because the program uses a colon as a special character, we must provide a way to pass a literal colon through the program. To do this, we allow the indexer to specify two consecutive colons in the input. However, we can't simply convert the sequence to a literal colon because the rest of the program modules called by masterindex read three colon-separated fields. The solution is to convert the colon to its octal value using the gsub() function.

#< from input.idx
# convert literal colon to octal value
$1 ~ /::/ {
        # substitute octal value for "::"
        gsub(/::/, "\\72", $1)

In the awk source, "\\72" produces the literal characters \72 in the field, and 72 is the octal value of a colon. (You can find this value by scanning a table of hexadecimal and octal equivalents in the file /usr/pub/ascii.) In the last program module, we use gsub() to convert the octal value back to a colon. Here's the code from format.idx.

#< from format.idx
# convert octal colon to "literal" colon
# make sub for each field, not $0, so that fields are not parsed
        gsub(/\\72/, ":", $1)
        gsub(/\\72/, ":", $2)
        gsub(/\\72/, ":", $3)

The first thing you notice is that we make this substitution for each of the three fields separately, instead of having one substitution command that operates on $0. The reason for this is that the input fields are colon-separated. When awk scans an input line, it breaks the line into fields. If you change the contents of $0 at any point in the script, awk will reevaluate the value of $0 and parse the line into fields again. Thus, if you have three fields prior to making the substitution, and the substitution makes one change, adding a colon to $0, then awk will recognize four fields. By doing the substitution for each field, we avoid having the line parsed again into fields.
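The effect of re-parsing is easy to observe. The following sketch is not taken from masterindex: the shell function names hide_colon, show_colon, and show_colon_wrong are hypothetical, and the records are simplified. It encodes a doubled colon in a tab-separated record, then decodes a colon-separated record both field by field and through $0 so that the two values of NF can be compared.

```shell
# Illustrative sketch; the function names below are hypothetical.

# Encode "::" in the first tab-separated field as the literal text \72.
hide_colon() {
    awk 'BEGIN { FS = OFS = "\t" }
         { gsub(/::/, "\\72", $1); print }'
}

# Decode \72 one field at a time: awk does not re-split, so NF stays 3.
show_colon() {
    awk -F: '{
        gsub(/\\72/, ":", $1)
        gsub(/\\72/, ":", $2)
        gsub(/\\72/, ":", $3)
        print NF
    }'
}

# Decode on $0 instead: awk re-splits the record and finds a fourth field.
show_colon_wrong() {
    awk -F: '{ gsub(/\\72/, ":"); print NF }'
}

printf '%s\n' 'X\72Y:resources:12' | show_colon         # prints 3
printf '%s\n' 'X\72Y:resources:12' | show_colon_wrong   # prints 4
```

The only difference between the last two functions is the target of gsub(), which is enough to change the number of fields that later code would see.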

12.3.2 Rotating Two Parts

Above we talked about the colon syntax for separating the primary and secondary keys. With some kinds of entries, it makes sense to classify the item under its secondary key as well. For instance, we might have a group of program statements or user commands, such as "sed command." The indexer might create two entries: one for "sed command" and one for "command: sed." To make coding this kind of entry easier, we implemented a coding convention that uses a tilde (~) character to mark the two parts of this entry so that the first and second part can be swapped to create the second entry automatically.[5] Thus, coding the following index entry

.XX "sed~command"

produces two entries:

sed command	 43
command: sed	 43

[5] The idea of rotating index entries was derived from The AWK Programming Language. There, however, an entry is automatically rotated where a blank is found; the tilde is used to prevent a rotation by "filling in" the space. Rather than have rotation be the default action, we use a different coding convention, where the tilde indicates where the rotation should occur.

Here's the code that rotates entries.

#< from input.idx 
# Match entries that need rotating that contain a single tilde
$1 ~ /~/ && $1 !~ /~~/ { 
	# split first field into array named subfield 
	n = split($1, subfield, "~")
	if (n == 2) {
		# print entry without "~" and then rotated
		printf("%s %s::%s\n", subfield[1], subfield[2], $2)
		printf("%s:%s:%s\n", subfield[2], subfield[1], $2)
	}
	next
}

The pattern-matching rule matches any entry containing a tilde but not two consecutive tildes, which indicate a literal tilde. The procedure uses the split() function to break the first field into two "subfields." This gives us two substrings, one before and one after the tilde. The original entry is output and then the rotated entry is output, both using the printf statement.
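To try the rotation rule in isolation, it can be wrapped as in the sketch below. The shell function name rotate_entries is hypothetical, the input is assumed to be the tab-separated entry and page number described earlier, and the final pass-through rule is an addition for the sake of the demonstration, not part of input.idx.

```shell
# Illustrative sketch; rotate_entries is a hypothetical wrapper.
rotate_entries() {
    awk 'BEGIN { FS = "\t" }
    # entry with a single tilde: print it plain, then rotated
    $1 ~ /~/ && $1 !~ /~~/ {
        n = split($1, subfield, "~")
        if (n == 2) {
            printf("%s %s::%s\n", subfield[1], subfield[2], $2)
            printf("%s:%s:%s\n", subfield[2], subfield[1], $2)
        }
        next
    }
    # assumption for this demo only: pass every other record through
    { print }'
}

printf 'sed~command\t43\n' | rotate_entries
# prints:
# sed command::43
# command:sed:43
```

Note that an entry with two separate tildes, such as "a~b~c", splits into three parts, fails the n == 2 test, and is then skipped by next, so this sketch prints nothing for it.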

Because the tilde is used as a special character, we use two consecutive tildes to represent a literal tilde in the input. The following code occurs in the program after the code that swaps the two parts of an entry.

#< from input.idx 
# Match entries that contain two tildes 
$1 ~ /~~/ { 
	# replace ~~ with ~	
	gsub(/~~/, "~", $1)
}

Unlike the colon, which retains a special meaning throughout the masterindex program, the tilde has no significance after this module so we can simply output a literal tilde.

12.3.3 Finding a Replacement

The next fragment also comes from input.idx . The problem was to look for two colons separated by text and change the second colon to a semicolon. If the input line contains

class: class initialize: (see also methods)

then the result is:

class: class initialize; (see also methods)

The problem is fairly simple to formulate: we want to change the second colon, not the first one. It is pretty easy to solve in sed because of the ability to select and recall a portion of what is matched in the replacement section (using \(...\) to surround the portion to match and \1 to recall the first portion). Lacking the same ability in awk, you have to be more clever. Here's one possible solution:

#< from input.idx
#  replace 2nd colon with semicolon
if (sub(/:.*:/, "&;", $1)) 
         sub(/:;/, ";", $1)

The first substitution matches the entire span between two colons. It makes a replacement with what is matched (&) followed by a semicolon. This substitution occurs within a conditional expression that evaluates the return value of the sub() function. Remember, this function returns 1 if a substitution is made - it does not return the resulting string. In other words, if we make the first substitution, then we make the second one. The second substitution replaces ":;" with ";". Because we can't make the replacement directly, we do it indirectly by making the context in which the second colon appears distinct.
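The two-step substitution can be exercised on its own with a sketch like the following. The shell function name second_colon is hypothetical, and the input is assumed to be a tab-separated entry and page number so that $1 holds the whole entry.

```shell
# Illustrative sketch; second_colon is a hypothetical wrapper.
second_colon() {
    awk 'BEGIN { FS = "\t" }
    {
        # step 1: mark the end of the colon-to-colon span with ";"
        if (sub(/:.*:/, "&;", $1))
            # step 2: collapse the marked ":;" into ";"
            sub(/:;/, ";", $1)
        print $1
    }'
}

printf 'class: class initialize: (see also methods)\t12\n' | second_colon
# prints: class: class initialize; (see also methods)
```

Because the regular expression matches the longest span, the colon that gets replaced is the last one in the field. With exactly two colons that is the second colon; with one colon the field is left unchanged.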

12.3.4 A Function for Reporting Errors

The purpose of the input.idx program is to allow variations (or less kindly, inconsistencies) in the coding of index entries. By reducing these variations to one basic form, the other programs are made easier to write.

The other side of this is that if the input.idx program cannot accept an entry, it must report it to the user and drop the entry so that it does not affect the other programs. The input.idx program has a function used for error reporting called printerr(), as shown below:

function printerr (message) {
	# print message, record number and record
	printf("ERROR:%s (%d) %s\n", message, NR, $0) > "/dev/tty"
}

This function makes it easier to report errors in a standard manner. It takes as an argument a message, which is usually a string that describes the error. It outputs this message along with the record number and the record itself. The output is directed to the user's terminal, "/dev/tty". This is a good practice since the standard output of the program might be, as it is in this case, directed to a pipe or to a file. We could also send the error message to standard error, like so:

print "ERROR:" message " (" NR ") "  $0 | "cat 1>&2"

This opens a pipe to cat, with cat's standard output redirected to the standard error. If you are using gawk, mawk, or the Bell Labs awk, you could instead say:

printf("ERROR:%s (%d) %s\n", message, NR, $0) > "/dev/stderr"

In the program, the printerr() function is called as follows:

printerr("No page number")

When this error occurs, the user sees the following error message:

ERROR:No page number (612) geometry management:set_values_almost
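The following sketch shows the standard-error variant in a small filter, so that the separation between normal output and error reports can be checked. The shell function name check_pages and the rule that calls printerr() are assumptions made for the demonstration; only the body of printerr() follows the pipe-to-cat form discussed above.

```shell
# Illustrative sketch; check_pages and its single rule are hypothetical.
check_pages() {
    awk 'BEGIN { FS = "\t" }
    function printerr(message) {
        # report on standard error through a pipe to cat
        printf("ERROR:%s (%d) %s\n", message, NR, $0) | "cat 1>&2"
    }
    # drop any record that lacks a page number, after reporting it
    $2 == "" { printerr("No page number"); next }
    { print }'
}

# Good records go to standard output; the report goes to standard error.
printf 'widget class\t14\ngeometry management:set_values_almost\n' |
    check_pages > good.tmp 2> errors.tmp
cat errors.tmp    # ERROR:No page number (2) geometry management:set_values_almost
rm -f good.tmp errors.tmp
```

Keeping the report off standard output matters here because the surviving records are meant to be piped to the next stage.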

12.3.5 Handling See Also Entries

One type of index entry is a "see also." Like a "see" reference, it refers the reader to another entry. However, a "see also" entry may have a page number as well. In other words, this entry contains information of its own but refers the reader elsewhere for additional information. Here are a few sample entries.

error procedure	34
error procedure (see also XtAppSetErrorMsgHandler)	35
error procedure (see also XtAppErrorMsg)

The first entry in this sample has a page number while the last one does not. When the input.idx program finds a "see also" entry, it checks to see if a page number ($2) is supplied. If there is one, it outputs two records, the first of which is the entry without the page number and the second of which is an entry and page number without the "see also" reference.

#< input.idx
# if no page number
        if ($2 == "") {
                print $0 ":"
                next
        }
        else {
        # output two entries:
        # print See Also entry w/out page number
                print $1 ":"
        # remove See Also
                sub(/ *~zz\(see also.*$/, "", $1)
                sub(/;/, "", $1)
        # print as normal entry
                if ( $1 ~ /:/ )
                        print $1 ":" $2
                else
                        print $1 "::" $2
                next
        }

The next problem to be solved was how to get the entries sorted in the proper order. The sort program, using the options we gave it, sorted the secondary keys for "see also" entries together under "s." (The -d option causes the parenthesis to be ignored.) To change the order of the sort, we alter the sort key by adding the sequence "~zz" to the front of it.

#< input.idx
# add "~zz" for sort at end
        sub(/\([Ss]ee [Aa]lso/, "~zz(see also", $1)

The tilde is not interpreted by the sort but it helps us identify the string later when we remove it. Adding "~zz" assures us of sorting to the end of the list of secondary or tertiary keys.
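The effect of the marker on the sort order can be seen with a simplified sketch. The shell function name mark_see_also is hypothetical, the substitution is applied to the whole record rather than to $1, and the sort is run as a plain dictionary-order sort in the C locale instead of with the full set of options used by masterindex.

```shell
# Illustrative sketch; mark_see_also is a hypothetical wrapper.
mark_see_also() {
    # prefix the "see also" reference with "~zz" so it sorts last
    awk '{ sub(/\([Ss]ee [Aa]lso/, "~zz(see also"); print }'
}

entries='error procedure:warning
error procedure:(see also XtAppErrorMsg)
error procedure:message'

# Without the marker, "see also" sorts between "message" and "warning".
printf '%s\n' "$entries" | LC_ALL=C sort -d

# With the marker, it sorts after "warning".
printf '%s\n' "$entries" | mark_see_also | LC_ALL=C sort -d
```

With -d the sort compares only letters, digits, and blanks, so the tilde, colon, and parenthesis are skipped and the "zz" is what pushes the entry to the end.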

The pagenums.idx script removes the sort string from "see also" entries. However, as we described earlier, we look for a series of "see also" entries for the same key and create a list. Therefore, we also strip the part that is the same for all of these entries (everything up to and including the words "see also," plus the closing parenthesis) and put only the reference itself in an array:

#< pagenums.idx
# remove secondary key along with "~zz"
      sub(/^.*~zz\([Ss]ee +[Aa]lso */, "", SECONDARY)
      sub(/\) */, "", SECONDARY)
# assign to next element of seeAlsoList
      seeAlsoList[++eachSeeAlso] = SECONDARY "; "

There is a function that outputs the list of "see also" entries, separating each of them by a semicolon. Thus, the output of the "see also" entry by pagenums.idx looks like:

error procedure:(see also XtAppErrorMsg; XtAppSetErrorMsgHandler.)
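The collection step can be sketched as a small stand-alone filter. This is not the pagenums.idx code: the shell function name list_see_also is hypothetical, the END block is a guess at what a list-printing function might do, and the input is assumed to be colon-separated records that all share one primary key and carry the "~zz" marker in the second field.

```shell
# Illustrative sketch; list_see_also and its END block are hypothetical.
list_see_also() {
    awk -F: '
    {
        primary = $1
        SECONDARY = $2
        # remove everything through "~zz(see also" and the closing ")"
        sub(/^.*~zz\([Ss]ee +[Aa]lso */, "", SECONDARY)
        sub(/\) */, "", SECONDARY)
        # assign to next element of seeAlsoList
        seeAlsoList[++eachSeeAlso] = SECONDARY "; "
    }
    END {
        if (eachSeeAlso == 0)
            exit
        list = ""
        for (i = 1; i <= eachSeeAlso; i++)
            list = list seeAlsoList[i]
        sub(/; $/, "", list)    # drop the separator after the last item
        print primary ":(see also " list ".)"
    }'
}

printf '%s\n' 'error procedure:~zz(see also XtAppErrorMsg)' \
              'error procedure:~zz(see also XtAppSetErrorMsgHandler)' |
    list_see_also
# prints: error procedure:(see also XtAppErrorMsg; XtAppSetErrorMsgHandler.)
```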

12.3.6 Alternative Ways to Sort

In this program, we chose not to support troff font and point size requests in index entries. If you'd like to support special escape sequences, one way to do so is shown in The AWK Programming Language. For each record, take the first field and prepend it to the record as the sort key. Now that there is a duplicate of the first field, remove the escape sequences from the sort key. Once the entries are sorted, you can remove the sort key. This process prevents the escape sequences from disturbing the sort.
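A minimal sketch of this prepend, sort, and strip sequence follows. It is not the code from The AWK Programming Language: the shell function name sort_ignoring_fonts is hypothetical, records are assumed to be tab-separated, and only the one-letter font escapes \fB, \fI, \fR, and \fP are removed from the key.

```shell
# Illustrative sketch; sort_ignoring_fonts is a hypothetical name.
sort_ignoring_fonts() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        key = $1
        gsub(/\\f[BIRP]/, "", key)   # strip font escapes from the key only
        print key, $0                # prepend the cleaned key to the record
    }' |
    LC_ALL=C sort -t "$(printf '\t')" -k 1,1 |   # sort on the key
    cut -f 2-                                    # remove the key again
}

printf '%s\t%s\n' '\fBzeta\fP' 10 'alpha' 3 '\fIbeta\fP' 7 |
    sort_ignoring_fonts
# prints the records in the order alpha, \fIbeta\fP, \fBzeta\fP
```

The original records come out unchanged, font escapes included; only their order reflects the cleaned key.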

Yet another way is to do something similar to what we did for "see also" entries. Because special characters are ignored in the sort, we could use the input.idx program to convert a troff font change sequence such as "\fB" to "~~~" and "\fI" to "~~~~", or to any other convenient escape sequence. This would get the sequence through the sort program without disturbing the sort. (This technique was used by Steve Talbott in his original indexing script.)

The only additional problem that needs to be recognized in both cases is that two entries for the same term, one with font information and one without, will be treated as different entries when one is compared to the other.