16.26 Finding Text Files with findtext

Some of my directories, my bin (4.2) for instance, hold text files (like shell scripts and documentation) as well as non-text files (executable binary files, compressed files, archives, etc.). If I'm trying to find a certain file with grep (27.1) or a pager (25.3, 25.4), the non-text files can print garbage on my screen. I want some way to say "only look at the files that have text in them."

The findtext shell script does that. It runs file (25.8) to guess what's in each file, then prints only the names of the files that look like text.

So, for example, instead of typing:

% egrep something *

I type:

% egrep something `findtext *`

Here's the script, then some explanation of how to set it up on your system:

#!/bin/sh

# PIPE OUTPUT OF file THROUGH sed TO PRINT FILENAMES FROM LINES
# WE LIKE.  NOTE: DIFFERENT VERSIONS OF file RETURN DIFFERENT
# MESSAGES.  CHECK YOUR SYSTEM WITH strings /usr/bin/file OR
# cat /etc/magic AND ADAPT THIS.
/usr/bin/file "$@" |
sed -n  '
/MMDF mailbox/b print
/Interleaf ASCII document/b print
/PostScript document/b print
/Frame Maker MIF file/b print
/c program text/b print
/fortran program text/b print
/assembler program text/b print
/shell script/b print
/c-shell script/b print
/shell commands/b print
/c-shell commands/b print
/English text/b print
/ascii text/b print
/\[nt\]roff, tbl, or eqn input text/b print
/executable .* script/b print
b

:print
s/:[TAB].*//p'

The script is simple: It runs file on the command-line arguments. The output of file looks like this:

COPY2PC:        directory
Ex24348:        empty
FROM_consult.tar.Z:     compressed data block compressed 16 bits
GET_THIS:       ascii text
hmo:            English text
msg:            English text
1991.ok:        [nt]roff, tbl, or eqn input text

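The output is piped to a sed (34.24) script that selects the lines that seem to be from text files. Each pattern that matches branches to the print label; the command there strips off everything after the filename (starting at the colon) and prints the filename. A line that matches none of the patterns reaches the bare b command, which jumps to the end of the script without printing anything. In the script, [TAB] stands for a literal tab character.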
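To watch the branch-to-label logic work without depending on what your own file command prints, here is a minimal sketch that feeds a few canned lines through a cut-down copy of the sed script. The function names sample_file_output and filter_text and the TAB variable are made up for this illustration, and the sketch assumes that file separates the name from the description with a colon and a tab, as in the listing above.

```shell
#!/bin/sh
# Hold one literal tab so it can be spliced into the sed script.
TAB=$(printf '\t')

# Canned lines in the format shown above; no real file command is run.
sample_file_output() {
    printf 'COPY2PC:\tdirectory\n'
    printf 'FROM_consult.tar.Z:\tcompressed data block compressed 16 bits\n'
    printf 'GET_THIS:\tascii text\n'
    printf 'hmo:\tEnglish text\n'
    printf '1991.ok:\t[nt]roff, tbl, or eqn input text\n'
}

# Cut-down version of the filter: three patterns instead of fifteen.
filter_text() {
sed -n '
/English text/b print
/ascii text/b print
/\[nt\]roff, tbl, or eqn input text/b print
b

:print
s/:'"$TAB"'.*//p'
}

sample_file_output | filter_text
# prints:
# GET_THIS
# hmo
# 1991.ok
```

The directory and the compressed file match no pattern, so they fall through to the bare b and are dropped.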

Different versions of file produce different output. Some versions also read an /etc/magic file. To find the kinds of names your file calls text files, use commands like:

% strings /usr/bin/file > possible
% cat /etc/magic >> possible
% vi possible

The possible file will have a list of descriptions that strings found in the file binary; some of them are for text files. If your system has an /etc/magic file, it will have lines like these:

0    long         0x1010101       MMDF mailbox
0    string       <!OPS           Interleaf ASCII document
0    string       %!              PostScript document
0    string       <MIFFile        Frame Maker MIF file

Save the descriptions of text-type files from the right-hand column.
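If the magic file is long, a short filter can pull out the right-hand column before you start editing. This is only a sketch: the function name magic_descriptions is hypothetical, and it assumes the simple four-column layout shown above (offset, type, test value, description) with no spaces inside the test value. Real magic files can be messier, so check the result by eye.

```shell
#!/bin/sh
# Hypothetical helper: print the description column of magic-style lines.
# Assumes four whitespace-separated columns and a test value without spaces.
magic_descriptions() {
    awk '!/^#/ && NF >= 4 {
        # Delete the first three columns and the space that follows them.
        sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+/, "")
        print
    }' "$@"
}

# Try it on the sample lines from above.
printf '%s\n' \
    '0    long         0x1010101       MMDF mailbox' \
    '0    string       <!OPS           Interleaf ASCII document' \
    '0    string       %!              PostScript document' \
    '0    string       <MIFFile        Frame Maker MIF file' |
magic_descriptions
```

With a filename argument, as in magic_descriptions /etc/magic, the function reads that file instead of standard input. It lists every description, not just the text-type ones, so you still have to choose which lines to keep.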

Then, turn each line of your edited possible file into a sed command:

/description/b print

Watch for special characters in the file descriptions. I had to handle two special cases in the last two patterns of the script above:

I had to change the string executable %s script from our file command to /executable .* script/b print in the sed script. That's because our file command replaces %s with a name like /bin/ksh.
Characters that sed treats as regular expression metacharacters, such as the brackets in [nt]roff, need to be escaped with backslashes. I used \[nt\]roff in the script.
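The conversion and the backslash escaping can be done mechanically. The following sketch assumes one description per line on standard input; the function name descriptions_to_sed is made up for this example. It escapes brackets, slashes, dots, stars, carets, and dollar signs, and it assumes the descriptions contain no backslashes of their own. It does not know about placeholders like %s, so those still need a hand edit.

```shell
#!/bin/sh
# Hypothetical helper: turn each description line on standard input
# into a sed command of the form /description/b print.
descriptions_to_sed() {
    # First expression: put a backslash before ] [ / . * ^ $
    # Second expression: wrap the whole line in /.../b print
    # The | delimiter avoids clashing with the slashes being inserted.
    sed -e 's|[][/.*^$]|\\&|g' \
        -e 's|.*|/&/b print|'
}

printf '%s\n' 'English text' '[nt]roff, tbl, or eqn input text' |
descriptions_to_sed
# prints:
# /English text/b print
# /\[nt\]roff, tbl, or eqn input text/b print
```

Run it as descriptions_to_sed < possible once the possible file holds only the descriptions you want, then paste the output into the script above the bare b line.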

If you have perl (37.1), you can make a simpler version of this script, since perl has a built-in test for whether or not a file is a text file. Perl picks a "text file" by checking the first block or so for strange control codes or characters with the high bit set. If there are too many (more than 10% when this was written; current Perl documentation puts the threshold at a third), it's not a text file. You can't tune the Perl script to, for example, skip a certain kind of file by type. It looks like this:

% perl -le '-T && print while $_ = shift' *

If you want to put that into an alias (10.2), the C shell's quoting problems (47.2, 8.15) make it tough to do. Thanks to makealias (10.8), though, here's an alias that does the job:

alias findtext 'perl -le '\''-T && print while $_ = shift'\'' *'
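In a Bourne-type shell, a function sidesteps the quoting problem. This is a sketch rather than part of the original script: the function reuses the name findtext for convenience, and unlike the alias it tests whatever names you pass as arguments instead of always expanding * in the current directory.

```shell
# Hypothetical Bourne-shell counterpart to the C shell alias.
# Prints each argument that Perl's -T test judges to be a text file.
findtext() {
    perl -le '-T && print while $_ = shift' "$@"
}

# Usage, with the same pattern as the script version:
#   egrep something `findtext *`
```

Because the loop is while $_ = shift, it stops early at an argument that Perl treats as false, such as a file named 0.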

- JP