

UNIX Power Tools

Chapter 29: Spell Checking, Word Counting, and Textual Analysis
 

29.6 Counting Lines, Words, and Characters: wc

The wc (word count) command counts the number of lines, words, and characters in the files you specify. (Like most UNIX utilities (1.30), wc reads from its standard input if you don't specify a filename.) For example, the file letter has 120 lines, 734 words, and 4297 characters:

% wc letter
     120     734    4297 letter

You can restrict what is counted by specifying the options -l (count lines only), -w (count words only), and -c (count characters only; strictly speaking, current versions of wc count bytes with -c and use -m for multibyte characters). For example, you can count the number of lines in a file:

% wc -l letter
     120 letter

or you can count the number of files in a directory:

% cd man_pages
% ls | wc -w
     233

The first example uses a file as input; the second example pipes the output of an ls command to the input of wc. (Be aware that the -a option (16.11) makes ls list dot files. If your ls command is aliased (10.2) to include -a or other options that add words to the normal output, such as the line total nnn from ls -l, then you may not get the results you want. A filename that contains spaces also counts as more than one word.)
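To see how much the choice of wc option matters when counting directory entries, here is a small sketch. The scratch directory and the filenames in it are invented for the illustration, and it assumes your system has mktemp.

```shell
# Sketch: compare word-based and line-based counts of directory entries.
# The scratch directory and its filenames are made up for this example.
dir=`mktemp -d`
touch "$dir/notes" "$dir/draft one" "$dir/draft two"

by_words=`ls "$dir" | wc -w | tr -d ' '`   # "draft one" counts as two words
by_lines=`ls "$dir" | wc -l | tr -d ' '`   # ls writes one name per line to a pipe

echo "wc -w sees $by_words, wc -l sees $by_lines"
rm -r "$dir"
```

With these three files, the script prints "wc -w sees 5, wc -l sees 3". Counting lines is the safer habit, though even that miscounts a filename that contains a newline.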

The fact that you can pipe the output of a command through wc lets you use wc to perform addition and subtraction. For example, I once wrote a shell script that involved, among other things, splitting files into several pieces, and I needed the script to keep track of how many files were created. (The script ran csplit (35.10 ) on each file, producing an arbitrary number of new files named file.00 , file.01 , file.02 , etc.) Here's the code I used to solve this problem:

before=`ls $file* | wc -l`              # count the file
# ... split the file by running it through csplit ...
after=`ls $file* | wc -l`               # count file plus new splits
num_files=`expr $after - $before`       # evaluate the difference
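Here is a runnable sketch of the same before-and-after counting idea. It is not the original script: split -l stands in for csplit, and the file name chapter and its contents are assumptions made up for the example.

```shell
# Sketch: count how many new files a splitting step created.
# "chapter" and its five lines are invented; split stands in for csplit.
tmp=`mktemp -d`
file="$tmp/chapter"
printf '1\n2\n3\n4\n5\n' > "$file"

before=`ls "$file"* | wc -l`            # just the original file
split -l 2 "$file" "$file."             # makes chapter.aa, chapter.ab, chapter.ac
after=`ls "$file"* | wc -l`             # original plus the new pieces
num_files=`expr $after - $before`       # unquoted, so padding from wc is dropped

echo "$num_files"
rm -r "$tmp"
```

Five lines split two at a time give three pieces, so this prints 3.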

As another trick, the following command will tell you how many more words are in new.file than in old.file :

% expr `wc -w < new.file` - `wc -w < old.file`

[The C and Korn shells have built-in arithmetic commands and don't really need expr, but expr works in all shells. -JP]
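For comparison, here is roughly what the built-in route might look like in a shell that supports POSIX arithmetic expansion (ksh, bash, and modern sh). The two files are created inside the sketch only so that it runs on its own; their contents are invented.

```shell
# Sketch: the same subtraction with $(( )) instead of expr.
# new.file and old.file are created here just for the demonstration.
tmp=`mktemp -d`
printf 'one two three four five\n' > "$tmp/new.file"
printf 'one two three\n' > "$tmp/old.file"

new_count=`wc -w < "$tmp/new.file"`
old_count=`wc -w < "$tmp/old.file"`
diff=$(( $new_count - $old_count ))     # spaces around the numbers are harmless here

echo "$diff"
rm -r "$tmp"
```

This prints 2. The arithmetic happens inside the shell, so no extra process is started for expr.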

Notice that you should have wc read the input files by using a < character. If instead you say:

% expr `wc -w new.file` - `wc -w old.file`

the filenames will show up in the expressions and produce a syntax error. [1]

[1] You could also type cat new.file | wc -w, but this involves two commands, so it's less efficient (13.2).
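If you want to see the difference between the two forms for yourself, a sketch like this one captures both outputs. The file name sample and its contents are assumptions for the example.

```shell
# Sketch: what wc prints with a filename argument versus redirected input.
tmp=`mktemp -d`
printf 'alpha beta gamma\n' > "$tmp/sample"

with_name=`wc -w "$tmp/sample"`      # the count, then the filename
count_only=`wc -w < "$tmp/sample"`   # the count alone; wc never sees a name

ends_with_name=no
case "$with_name" in
*sample) ends_with_name=yes ;;       # the filename is part of the output
esac

echo "with argument: $with_name"
echo "with redirect: $count_only"
rm -r "$tmp"
```

Only the redirected form leaves a bare number that expr can use as an operand.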

Taking this concept further, here's a simple shell script to calculate the difference in word count between two files:








count_1=`wc -w < "$1"`   # number of words in file 1
count_2=`wc -w < "$2"`   # number of words in file 2

diff_12=`expr $count_1 - $count_2`   # difference in word count

# if $diff_12 is negative, reverse order and don't show the minus sign:
case "$diff_12" in
-*) echo "$2 has `expr $diff_12 : '-\(.*\)'` more words than $1" ;;
*)  echo "$1 has $diff_12 more words than $2" ;;
esac

If this script were called count.it , then you could invoke it like this:

% count.it draft.2 draft.1
draft.1 has 23 more words than draft.2

You could modify this script to count lines or characters.
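One way to make that modification is to pass the wc option in as an argument. The function below is a sketch of that idea, not part of the original count.it; the name count_diff and the two test files are hypothetical.

```shell
# Sketch: count_diff OPTION FILE1 FILE2 prints FILE1's count minus FILE2's.
# count_diff is a made-up helper; OPTION is -l, -w, or -c.
count_diff() {
    count_1=`wc "$1" < "$2"`
    count_2=`wc "$1" < "$3"`
    # Unquoted on purpose, so the shell drops any leading spaces from wc.
    expr $count_1 - $count_2
}

tmp=`mktemp -d`
printf 'a b\nc d\ne f\n' > "$tmp/long"     # 3 lines, 12 bytes
printf 'a b\n' > "$tmp/short"              # 1 line, 4 bytes

line_diff=`count_diff -l "$tmp/long" "$tmp/short"`
char_diff=`count_diff -c "$tmp/long" "$tmp/short"`
echo "lines: $line_diff, characters: $char_diff"
rm -r "$tmp"
```

This prints "lines: 2, characters: 8". Note that expr returns a nonzero exit status when the result is 0, which matters if you test the function's status rather than its output.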

NOTE: Unless the counts are very large, the output of many versions of wc will have leading spaces. This can cause trouble in scripts if you aren't careful. For instance, in the script above, the command:

echo "$1 has $count_1 words"

might print:

draft.2 has       79 words

See the extra spaces? Understanding how the shell handles quoting (8.14) will help here. If you can, let the shell read the wc output and remove extra spaces. For example, without quotes, the shell passes four separate words to echo, and echo adds a single space between each word:

echo $1 has $count_1 words

that might print:

draft.2 has 79 words

That's especially important to understand when you use wc with commands like test or expr, which don't expect spaces in their arguments. If you can't use the shell to strip out the spaces, delete them by piping the wc output through tr -d ' ' (35.11).
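As a sketch of the tr approach, the following strips the padding once, when the count is captured, so that the variable can be quoted safely afterward. The file name draft and its contents are invented for the example.

```shell
# Sketch: remove padding from wc output before handing it to test.
tmp=`mktemp -d`
printf 'one two three\n' > "$tmp/draft"

count=`wc -w < "$tmp/draft" | tr -d ' '`   # digits only, whatever wc pads with

# Quoting "$count" is safe now, because it holds no spaces.
if [ "$count" -gt 2 ]; then
    verdict="more than two words"
else
    verdict="two words or fewer"
fi

echo "draft has $count words ($verdict)"
rm -r "$tmp"
```

This prints "draft has 3 words (more than two words)" with no gap before the number.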

Finally, two notes about file size:

  • wc -c isn't an efficient way to count the characters in large numbers of files. wc opens and reads each file, which takes time. The fourth or fifth column of output from ls -l (depending on your version) gives the character count without opening the file. You can sum ls -l counts for multiple files with the addup (49.7) command. For example:

    % ls -l files | addup 4
    670518

  • Using character counts (as in the item above) doesn't give you the total disk space used by files. That's because, in general, each file takes at least one disk block to store. The du (24.9) command gives accurate disk usage.
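If you don't have the addup script handy, a one-line awk program can do the same summing. This is a sketch that swaps awk in for addup and assumes the size is in the fifth column of your ls -l output. The two small files are created just for the demonstration.

```shell
# Sketch: total the sizes from ls -l with awk (a stand-in for addup).
tmp=`mktemp -d`
printf 'hello\n' > "$tmp/a"      # 6 bytes
printf 'hi\n' > "$tmp/b"         # 3 bytes

# Assumption: column 5 holds the size; use 4 if your ls omits the group.
total=`ls -l "$tmp/a" "$tmp/b" | awk '{ sum += $5 } END { print sum }'`
echo "$total"

# du reports allocated disk blocks, so its figures can differ from the byte total.
du "$tmp/a" "$tmp/b"
rm -r "$tmp"
```

The awk total here is 9. What du prints depends on the filesystem and its block size.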

- DG, JP

