29.10 Just the Words, Please

In various kinds of textual analysis scripts, you sometimes need just the words (29.8). I know two ways to do this. The deroff command was designed to strip out troff (43.13) constructs and punctuation from files. The command deroff -w will give you a list of just the words in a document; pipe to sort -u (36.6) if you want only one of each.

deroff has one major failing, though. It only considers a word to be a string of characters beginning with a letter of the alphabet. A single character won't do, which leaves out one-letter words like the indefinite article "A."

A substitute is tr (35.11), which can perform various kinds of character-by-character conversions. To produce a list of all the individual words in a file, type:
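A minimal sketch of such a tr command, assuming a BSD- or POSIX-style tr that accepts bare ranges like A-Za-z. The file name sample.txt and its contents are hypothetical and exist only to make the example runnable:

```shell
# Hypothetical sample input; any text file would do.
printf 'A cat, a hat -- and I.\n' > sample.txt

# -c complements the set A-Za-z (so it matches every non-letter);
# -s squeezes each run of the resulting newlines (\012) down to one.
words=$(tr -cs A-Za-z '\012' < sample.txt)
printf '%s\n' "$words"

# Pipe through sort -u to keep only one copy of each word.
printf '%s\n' "$words" | sort -u
```

Note that the one-letter words "A", "a", and "I" come through, which is the case deroff -w misses. If the input starts with a non-letter, the first output line will be empty.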
The -c option "complements" the first string passed to tr; -s squeezes out repeated characters. This has the effect of saying: "Take any non-alphabetic characters you find (one or more) and convert them to newlines (\012)."

(Wouldn't it be nice if tr just recognized standard UNIX regular expression syntax (26.4)? Then, instead of complementing a set of letters with -c, you could write a negated character class such as [^A-Za-z].)

The System V version of tr (35.11) has slightly different syntax. You'd get the same effect with:
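A sketch of what the System V form might look like, assuming the historical System V conventions that ranges are written inside square brackets and that [\012*] in the second string means "repeat newline as often as needed." The file sample.txt is again hypothetical:

```shell
# Hypothetical sample input; any text file would do.
printf 'A cat, a hat -- and I.\n' > sample.txt

# System V style: bracketed ranges in the first string, and a
# repeated newline ([\012*]) as the second string.
words_sysv=$(tr -cs '[A-Z][a-z]' '[\012*]' < sample.txt)
printf '%s\n' "$words_sysv"
```

On a tr that does not use the System V conventions, the square brackets in the first string are taken literally, so any [ or ] characters in the input would be kept as part of a word instead of being turned into newlines.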