29.4 Inside spell

[If you have ispell (29.2), there's not a whole lot of reason for using spell any more. Not only is ispell more powerful, it's a heck of a lot easier to update its spelling dictionaries. Nonetheless, we decided to include this article, because it makes clear the kinds of rules that spelling checkers go through to expand on the words in their dictionaries. -TOR]

On many UNIX systems, the directory /usr/lib/spell contains the main program invoked by the spell command along with auxiliary programs and data files.

% ls -l /usr/lib/spell
total 888
-rwxr-xr-x   1 bin          545 Dec  9  1988 compress
-rwxr-xr-x   1 bin        16324 Dec  9  1988 hashcheck
-rwxr-xr-x   1 bin        14828 Dec  9  1988 hashmake
-rw-r--r--   1 bin        53872 Dec  9  1988 hlista
-rw-r--r--   1 bin        53840 Dec  9  1988 hlistb
-rw-r--r--   1 bin         6336 Dec  9  1988 hstop
-rw-rw-rw-   1 root      252312 Nov 27 16:24 spellhist
-rwxr-xr-x   1 bin        21634 Dec  9  1988 spellin
-rwxr-xr-x   1 bin        23428 Dec  9  1988 spellprog

On some systems, the spell command is a shell script that pipes its input through deroff -w (29.10) and sort -u (36.6) to remove formatting codes and prepare a sorted word list, one word per line. On other systems, it is a stand-alone program that does these steps internally. Two separate spelling lists are maintained, one for American usage and one for British usage (invoked with the -b option to spell). These lists, hlista and hlistb, cannot be read or updated directly. They are compressed files, compiled from a list of words represented as nine-digit hash codes. (Hash coding is a technique for searching quickly: each word is converted to a number, and the lookup is done on the number rather than on the text.)
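
The front end described above (strip the text down to words, then sort them into a unique list) can be approximated in a few lines of shell. This is only a sketch, not the actual spell script: word_list is a made-up function name, and tr stands in for deroff -w so that the example runs on systems without deroff. Unlike deroff, tr does not remove formatting requests; it only splits the input on non-letters.

```shell
#!/bin/sh
# Sketch of the spell front end: reduce text on stdin to a sorted list of
# unique words, one per line. word_list is a hypothetical name.
word_list() {
    # tr is an assumed stand-in for "deroff -w": every run of non-letters
    # becomes a newline. The sed step drops the empty line that appears
    # when the input starts with a non-letter.
    # LC_ALL=C keeps the sort order the same from one system to the next.
    tr -cs 'A-Za-z' '\n' | sed '/^$/d' | LC_ALL=C sort -u
}

# Demo: "the" appears twice in the input but only once in the word list.
printf 'The printerr uses the the output\n' | word_list
```

Each word in the resulting list is then looked up exactly once, which is why the list is reduced to unique words before the dictionary is consulted.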

The main program invoked by spell is spellprog. It loads the list of hash codes from either hlista or hlistb into a table, and looks for the hash code corresponding to each word on the sorted word list. This eliminates all words (or hash codes) actually found in the spelling list. For the remaining words, spellprog tries to see if it can derive a recognizable word by performing various operations on the word stem, based on suffix and prefix rules. A few of these manipulations follow:

-y+iness +ness -y+i+less +less -y+ies -t+ce -t+cy
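
One way to read this notation is "remove what follows the minus sign from the stem, then add what follows each plus sign." Running a rule backward on a suspect word therefore yields a candidate stem. The function below is a sketch of that reading for suffix rules only. The name stem_for and its interface are invented for the illustration, and it says nothing about how spellprog is actually coded.

```shell
#!/bin/sh
# stem_for RULE WORD
# Print the stem that RULE would have derived WORD from, or return 1 if
# WORD does not end in the suffix the rule adds. Hypothetical helper;
# handles suffix rules such as -y+iness, +ness, -y+i+less.
stem_for() {
    rule=$1 word=$2
    case $rule in
        -*) removed=${rule#-}            # "-y+iness" -> "y+iness"
            removed=${removed%%+*}       # keep text before the first "+": "y"
            added=${rule#*+} ;;          # text after the first "+": "iness"
        *)  removed=
            added=${rule#+} ;;           # "+ness" -> "ness"
    esac
    added=$(printf '%s' "$added" | tr -d '+')   # "i+less" -> "iless"
    case $word in
        ?*"$added") printf '%s\n' "${word%"$added"}$removed" ;;
        *) return 1 ;;
    esac
}

stem_for -y+iness  happiness    # prints happy
stem_for -y+i+less penniless    # prints penny
stem_for -t+ce     silence      # prints silent
```

The stem printed by such a step is what gets looked up again in the spelling list.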

The new words created as a result of these manipulations will be checked once more against the spell table. However, before the stem-derivative rules are applied, the remaining words are checked against a table of hash codes built from the file hstop. The stop list contains typical misspellings that stem-derivative operations might allow to pass. For instance, the misspelled word thier would be traced back to the stem thy by the suffix rule -y+ier, and so would be accepted as correct. The hstop file accounts for as many cases of this type of error as possible.

The final output consists of words not found in the spell list, even after the program tried to search for their stems, and words that were found in the stop list.
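
The order of these checks matters, and a toy model makes it easier to see why. In the sketch below, plain space-separated word lists stand in for the hashed hlista and hstop tables, and only one rule (-y+ier) is implemented. The names report, in_list, SPELL_LIST, and STOP_LIST are all made up for the illustration; the real spellprog is a compiled program that works on hash codes.

```shell
#!/bin/sh
# Toy model of the lookup order: spelling list, then stop list, then rules.
# The word lists below are illustrative, not taken from any real dictionary.
SPELL_LIST='happy thy their'
STOP_LIST='thier'

# in_list WORD LIST: succeed if WORD is one of the space-separated words in LIST.
in_list() {
    case " $2 " in *" $1 "*) return 0 ;; esac
    return 1
}

# report WORD: print WORD if it would appear in the final output.
report() {
    word=$1
    # 1. Found in the spelling list as is: accepted, nothing printed.
    if in_list "$word" "$SPELL_LIST"; then return 0; fi
    # 2. The stop list is consulted before any rule is tried.
    if in_list "$word" "$STOP_LIST"; then printf '%s\n' "$word"; return 0; fi
    # 3. Undo the rule -y+ier and look the stem up again.
    case $word in
        ?*ier) if in_list "${word%ier}y" "$SPELL_LIST"; then return 0; fi ;;
    esac
    # 4. Nothing matched: the word is reported.
    printf '%s\n' "$word"
}

for w in happy happier thier zzz; do report "$w"; done
```

Run as written, this prints thier and zzz. If STOP_LIST is emptied, thier slips through, because the rule reduces it to thy, which is in the spelling list.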

You can get a better sense of these rules in action by using the -v or -x option. The -v option eliminates the last lookup in the table, and produces a list of words that are not actually in the spelling list along with possible derivatives. It allows you to see which words were found as a result of stem-derivative operations, and prints the rule used. (Refer to the sample file in article 29.1.)

% spell -v sample
Alcuin
ditroff
LaserWriter
PostScript
printerr
TranScript
+out  output
+s    uses
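
To pull just the derived words out of a -v listing, it is enough to pick the lines that begin with a rule. The sketch below assumes the layout shown above (rule, whitespace, word). The helper name derived_words is invented, and the listing is fed from a here-document so that the example runs even where spell is not installed.

```shell
#!/bin/sh
# derived_words: read "spell -v" output on stdin and print each word that was
# accepted through a rule, followed by the rule used. Hypothetical helper;
# assumes rule lines start with + or - and carry the word in the second field.
derived_words() {
    awk '/^[+-]/ { print $2, $1 }'
}

derived_words <<'EOF'
Alcuin
ditroff
LaserWriter
PostScript
printerr
TranScript
+out  output
+s    uses
EOF
```

On a system that has spell, the same function could be used as: spell -v sample | derived_words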

The -x option makes spell begin at the stem-derivative stage, and prints the various attempts it makes to find the word stem of each word.

% spell -x sample
...
=into
=LaserWriter
=LaserWrite
=LaserWrit
=laserWriter
=laserWrite
=laserWrit
=output
=put
...
LaserWriter
...

The stem is preceded by an equal sign (=). At the end of the output are the words whose stem does not appear in the spell list.

One other file you should know about is spellhist. On some systems, each time you run spell, the output is appended through tee (13.9) into spellhist, in effect creating a list of all the misspelled or unrecognized words for your site. The spellhist file is something of a "garbage" file that keeps on growing. You will want to reduce it or remove it periodically. To extract useful information from spellhist, you might use the sort and uniq -c (35.20) commands to compile a list of misspelled words or special terms that occur most frequently (see article 29.7 for a similar example). It is possible to add these words back into the basic spelling dictionary, but this is too complex a process to describe here. It's probably easier just to use a local spelling dictionary (29.1). Even better, use ispell; not only is it a more powerful spelling program, it is much easier to update the word lists it uses (29.5).
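
The sort and uniq -c approach might look something like the following. The function name top_words is made up for the example, and the demo runs on a few inline words rather than a real spellhist file, whose location and contents will differ from site to site.

```shell
#!/bin/sh
# top_words [N]: read one word per line on stdin and print the N most
# frequent words with their counts (default 10). Hypothetical helper.
top_words() {
    # sort groups identical words together, uniq -c counts each group,
    # and sort -rn puts the highest counts first.
    sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Demo on inline data. On a real system the input would be the history
# file, for example:  top_words 20 < /usr/lib/spell/spellhist
printf '%s\n' ditroff Alcuin ditroff PostScript ditroff Alcuin | top_words 2
```

Words near the top of such a list are the natural candidates for a local spelling dictionary.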

- DD from UNIX Text Processing, Hayden Books, 1987