16.9. Just the Words, Please
In various textual-analysis scripts, you sometimes need just the words (Section 16.7).
I know two ways to do this. The deroff command was designed to strip out troff (Section 45.11) constructs and punctuation from files. The command deroff -w will give you a list of just the words in a document; pipe to sort -u (Section 22.6) if you want only one of each.
deroff has one major failing, though. It treats a word as a string of two or more characters beginning with a letter of the alphabet. A single character won't do, which leaves out one-letter words like the indefinite article "a."
A substitute is tr (Section 21.11), which can perform various kinds of character-by-character conversions.
To produce a list of all the individual words in a file, type the following:
% tr -cs A-Za-z '\012' < file
The -c option "complements" the first string passed to tr, so the command acts on every character that is not in A-Za-z; -s squeezes each run of a repeated output character down to a single one. This has the effect of saying: "Take any nonalphabetic characters you find (one or more) and convert them to a single newline (\012)."
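As a rough sketch of how this fits into a script, the tr command can be wrapped in a small shell function and fed to sort -u to get one copy of each word. The function name words and the sample sentence are made up for illustration; LC_ALL=C is set on sort only so that the ordering is predictable (uppercase before lowercase).

```shell
# words: hypothetical helper; reads text on stdin and writes one word per line.
words() {
    # -c acts on everything NOT in A-Za-z; -s squeezes each run of
    # the resulting newlines into a single newline.
    tr -cs A-Za-z '\012'
}

# Sample input (invented for this example), reduced to its unique words.
printf 'A dog, a cat; a DOG!\n' | words | LC_ALL=C sort -u
```

Note that one-letter words such as "A" and "a" survive, unlike with deroff -w. If the input starts with a nonalphabetic character (an opening quote, say), the first output line will be empty, so a script may want to filter blank lines.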
(Wouldn't it be nice if tr just recognized standard Unix regular expression syntax (Section 32.4)? Then, instead of -c A-Za-z, you'd say '[^A-Za-z]'. It's no less obscure, but at least it's used by other programs, so there's one less thing to learn.)
The System V version of tr (Section 21.11) has slightly different syntax. You'd get the same effect with this:
% tr -cs '[A-Z][a-z]' '[\012*]' < file
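On systems whose tr supports POSIX character classes, the same idea can probably be written without caring which of the two syntaxes is in use. This is a sketch under that assumption, not something the original text prescribes; the function name alpha_words is hypothetical, and LC_ALL=C is an added assumption that pins [:alpha:] to the ASCII letters.

```shell
# alpha_words: hypothetical helper; assumes a tr that understands
# POSIX character classes such as [:alpha:].
alpha_words() {
    # In the C locale, [:alpha:] matches exactly A-Z and a-z.
    LC_ALL=C tr -cs '[:alpha:]' '\012'
}

# The apostrophe and the digit both count as nonalphabetic separators.
printf "it's 2 words\n" | alpha_words
```

Like the A-Za-z version, this splits contractions at the apostrophe, so "it's" comes out as "it" and "s".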
Copyright © 2003 O'Reilly & Associates. All rights reserved.