16.4. Inside spell
[If
you have ispell (Section 16.2), there's not a whole lot of
reason for using spell any more. Not only is
ispell more powerful, it's a
heck of a lot easier to update its spelling dictionaries.
Nonetheless, we decided to include this article, because it clarifies
the kinds of rules that spellcheckers go through to expand on the
words in their dictionaries. -- TOR]
On many Unix systems, the directory
/usr/lib/spell contains the main program invoked
by the spell command along with auxiliary programs
and data files.
On some systems, the spell command is a shell
script that pipes its input through deroff -w
and sort -u (Section 22.6) to remove
formatting codes and prepare a sorted word list, one word per line.
On other systems, it is a standalone program that does these steps
internally. Two separate spelling lists are maintained, one for
American usage and one for British usage (invoked with the
-b option to spell).
These lists,
hlista and hlistb, cannot
be read or updated directly. They are compressed files, compiled from
a list of words represented as nine-digit hash codes. (Hash coding is
a special technique used to search for information quickly.)
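As a rough illustration of the shell-script form described above, the front end can be sketched as a pipeline that ends in sort -u. This sketch substitutes tr for deroff -w because deroff is not installed everywhere; tr only splits text into words and does not strip formatting codes. The function name wordlist is made up for this example and is not part of spell.

```shell
#!/bin/sh
# Sketch of the front end of a script-based spell. The real pipeline is
#   deroff -w file | sort -u
# Here tr stands in for deroff -w (an assumption made for portability);
# unlike deroff, tr does NOT remove troff requests or macros.
wordlist() {
    LC_ALL=C tr -cs 'A-Za-z' '\n' |  # each run of non-letters becomes one newline
    sed '/^$/d' |                    # drop the empty line left by a leading non-letter
    LC_ALL=C sort -u                 # sort and discard duplicate words
}
```

Running printf 'The cat saw the cat.\n' | wordlist prints The, cat, saw, and the, one word per line. Capitals sort first because the sketch forces the C locale.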
The main program invoked by
spell is spellprog. It loads
the list of hash codes from either hlista or
hlistb into a table, and it looks for the hash
code corresponding to each word on the sorted word list. This
eliminates all words (or hash codes) actually found in the spelling
list. For the remaining words, spellprog tries to
derive a recognizable word by performing various operations on the
word stem based on suffix and prefix rules. A few of these
manipulations follow:
-y+iness +ness -y+i+less +less -y+ies -t+ce -t+cy
The new words created as a result of these manipulations will be
checked once more against the spell table. However, before the
stem-derivative rules are applied, the
remaining words are checked against a table of hash codes built from
the file hstop. The stop list
contains typical misspellings that stem-derivative operations might
allow to pass. For instance, the misspelled word
thier would be converted into
thy using the suffix rule -y+ier. The
hstop file accounts for as many cases of this
type of error as possible.
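The order of these checks can be sketched in a few lines of shell. This is only a toy model of the behavior described above, not spellprog's actual code: it uses plain-text word lists instead of hash tables, implements just a handful of suffix rules, and the names SPELL_LIST, STOP_LIST, in_list, stems, and check_word are invented for the example.

```shell
#!/bin/sh
# Toy stand-ins for the hashed lists (hypothetical contents).
SPELL_LIST='happy
kind
thy
use'
STOP_LIST='thier'

# in_list WORD LIST: succeed if WORD matches a whole line of LIST.
in_list() {
    printf '%s\n' "$2" | grep -qxF -- "$1"
}

# Print candidate stems for a word, one per line.
# A rule such as -y+iness is undone by stripping "iness" and adding "y".
stems() {
    w=$1
    case $w in *iness) printf '%s\n' "${w%iness}y" ;; esac  # -y+iness
    case $w in *ness)  printf '%s\n' "${w%ness}"   ;; esac  # +ness
    case $w in *ies)   printf '%s\n' "${w%ies}y"   ;; esac  # -y+ies
    case $w in *ier)   printf '%s\n' "${w%ier}y"   ;; esac  # -y+ier
    case $w in *s)     printf '%s\n' "${w%s}"      ;; esac  # +s
}

# Print the word and return 1 if it would be reported as misspelled.
check_word() {
    word=$1
    in_list "$word" "$SPELL_LIST" && return 0                    # found directly
    in_list "$word" "$STOP_LIST" && { echo "$word"; return 1; }  # stop list wins
    for stem in $(stems "$word"); do                             # try derived stems
        in_list "$stem" "$SPELL_LIST" && return 0
    done
    echo "$word"                                                 # nothing matched
    return 1
}
```

With these lists, check_word happiness and check_word uses print nothing, because the stems happy and use are found. check_word thier prints thier: the stop list catches it before the -y+ier rule can reduce it to thy.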
The final output consists of words not found in the spell
list -- even after the program tried to search for their
stems -- and words that were found in the stop list.
You can
get a better sense of these rules in action by using the
-v or -x option. The
-v option eliminates the last look-up in the table
and produces a list of words that are not actually in the spelling
list, along with possible derivatives. It allows you to see which
words were found as a result of stem-derivative operations and prints
the rule used. (Refer to the sample file in
Section 16.1.)
% spell -v sample
Alcuin
ditroff
LaserWriter
PostScript
printerr
TranScript
+out output
+s uses
The
-x option makes spell begin at
the stem-derivative stage and prints the various attempts it makes to
find the stem of each word.
% spell -x sample
...
=into
=LaserWriter
=LaserWrite
=LaserWrit
=laserWriter
=laserWrite
=laserWrit
=output
=put
...
LaserWriter
...
The stem is preceded by an equals sign (=). At the
end of the output are the words whose stems do not appear in the
spell list.
One
other file you should know about is
spellhist. On some systems, each time you run
spell, the output is appended through tee (Section 43.8) into
spellhist, in effect creating a list of all the
misspelled or unrecognized words for your site. The
spellhist file is something of a
"garbage" file that keeps on
growing: you will want to reduce it or remove it periodically. To
extract useful information from spellhist,
you might use the sort and uniq -c (Section 21.20)
commands to compile a list of misspelled words or special terms that
occur most frequently. It is possible to add these words back into
the basic spelling dictionary, but this is too complex a process to
describe here. It's probably easier just to use a
local spelling dictionary (Section 16.1). Even better, use ispell;
not only is it a more powerful spelling program, it is much easier to
update the word lists it uses (Section 16.5).
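A minimal sketch of the sort and uniq -c approach mentioned above follows. It assumes spellhist holds one word per line, as spell's output does; the function name frequent_words is made up for this example.

```shell
#!/bin/sh
# Read words on standard input (one per line) and print each distinct
# word preceded by its count, most frequent first.
frequent_words() {
    LC_ALL=C sort |       # group identical words on adjacent lines
    uniq -c |             # collapse each group to "count word"
    LC_ALL=C sort -rn     # order by count, largest first
}
```

For example, frequent_words < spellhist | head -20 would show the twenty most common entries, assuming spellhist is in the current directory. Its real location varies from system to system.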
-- DD
Copyright © 2003 O'Reilly & Associates. All rights reserved.