

UNIX Power Tools

Chapter 29: Spell Checking, Word Counting, and Textual Analysis

29.10 Just the Words, Please

In various kinds of textual analysis scripts, you sometimes need just the words (29.8).

I know two ways to do this. The deroff command was designed to strip out troff (43.13) constructs and punctuation from files. The command deroff -w will give you a list of just the words in a document; pipe to sort -u (36.6) if you want only one of each.

deroff has one major failing, though. It considers a word to be a string of at least two characters that begins with a letter of the alphabet. A single character won't do, which leaves out one-letter words like the indefinite article "a."

A substitute is tr (35.11), which can perform various kinds of character-by-character conversions.

To produce a list of all the individual words in a file, type:



% tr -cs A-Za-z '\012' < file

The -c option "complements" the first string passed to tr; -s squeezes each run of repeated output characters down to a single character. This has the effect of saying: "Take any non-alphabetic characters you find (one or more) and convert them to a single newline (\012)."
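To see the effect on a small input, here is a minimal sketch. The sample sentence and the variable name words are made up for illustration; the tr command itself is the one shown above.

```shell
# Feed a made-up sentence to tr on standard input.
# Every run of non-letters (spaces, punctuation, digits, the final newline)
# is turned into a single newline, so each word lands on its own line.
words=$(printf 'A cat, a dog -- and 2 birds.\n' | tr -cs A-Za-z '\012')

# Print the result: one word per line.
printf '%s\n' "$words"
```

Note that the one-letter words "A" and "a" survive, which is the case deroff -w misses. If the input starts with a non-letter, the output will start with an empty line.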

(Wouldn't it be nice if tr just recognized standard UNIX regular expression syntax (26.4)? Then, instead of -c A-Za-z, you'd say '[^A-Za-z]'. It's not any less obscure, but at least it's used by other programs, so there's one less thing to learn.)

The System V version of tr (35.11) has slightly different syntax. You'd get the same effect with:

% tr -cs '[A-Z][a-z]' '[\012*]' < file
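On a system whose tr follows POSIX, the character class [:alpha:] can stand in for the explicit ranges. This is a sketch under that assumption, not something the commands above depend on. The sample sentence and the variable name unique_words are hypothetical; the sort -u step is the one mentioned earlier for keeping only one copy of each word.

```shell
# Assumes a POSIX-style tr that understands [:alpha:] and the [c*] repeat form.
# Split a made-up sentence into words, then keep one copy of each.
unique_words=$(printf 'the cat and the hat\n' | tr -cs '[:alpha:]' '[\n*]' | sort -u)

# Print the sorted, de-duplicated word list, one word per line.
printf '%s\n' "$unique_words"
```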

- TOR

