16.26 Finding Text Files with findtext
Some of my directories - my bin (4.2), for instance - have some text files (like shell scripts and documentation) as well as non-text files (executable binary files, compressed files, archives, etc.). If I'm trying to find a certain file - with grep (27.1) or a pager (25.3, 25.4) - the non-text files can print garbage on my screen. I want some way to say "only look at the files that have text in them."
The findtext shell script does that. It runs file (25.8) to guess what's in each file. It only prints filenames of text files.
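So, for example, instead of typing a search such as grep something * and getting garbage from the binaries, I could type grep something `findtext *` so that grep reads only the text files.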
Here's the script, then some explanation of how to set it up on your system:
#!/bin/sh
# PIPE OUTPUT OF file THROUGH sed TO PRINT FILENAMES FROM LINES
# WE LIKE.  NOTE: DIFFERENT VERSIONS OF file RETURN DIFFERENT
# MESSAGES.  CHECK YOUR SYSTEM WITH strings /usr/bin/file OR
# cat /etc/magic AND ADAPT THIS.
/usr/bin/file "$@" |
sed -n '
/MMDF mailbox/b print
/Interleaf ASCII document/b print
/PostScript document/b print
/Frame Maker MIF file/b print
/c program text/b print
/fortran program text/b print
/assembler program text/b print
/shell script/b print
/c-shell script/b print
/shell commands/b print
/c-shell commands/b print
/English text/b print
/ascii text/b print
/\[nt\]roff, tbl, or eqn input text/b print
/executable .* script/b print
b
:print
s/:[TAB].*//p'
The script is simple: It runs file on the command-line arguments. The output of file looks like this:
COPY2PC:             directory
Ex24348:             empty
FROM_consult.tar.Z:  compressed data block compressed 16 bits
GET_THIS:            ascii text
hmo:                 English text
msg:                 English text
1991.ok:             [nt]roff, tbl, or eqn input text
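The output is piped to a sed script that selects the lines that seem to be from text files. After the print label, the substitute command strips off the colon, the tab (shown as [TAB] in the script; type a real tab character there), and everything after them, then prints what is left: the filename.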
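The selection step can be tried without running file at all. The sketch below is only an illustration: the function names sample and pick_text are made up for it, the pattern list is cut down to two entries, and it assumes your file separates the name from the description with a colon and a tab, as the script above does.

```shell
#!/bin/sh
# A few lines in the "name:<TAB>description" shape (assumed format).
# printf turns each \t into a real tab character.
sample() {
    printf 'COPY2PC:\tdirectory\n'
    printf 'GET_THIS:\tascii text\n'
    printf 'hmo:\tEnglish text\n'
    printf 'FROM_consult.tar.Z:\tcompressed data\n'
}

# Hold a literal tab so the sed script does not need one typed in.
tab=$(printf '\t')

# Lines matching a text description branch to :print; everything else
# reaches the bare b, which ends the script for that line with no output.
pick_text() {
    sed -n "
/English text/b print
/ascii text/b print
b
:print
s/:${tab}.*//p"
}

sample | pick_text
```

With the four sample lines, only GET_THIS and hmo come out the other end; the directory and the compressed file never reach the print label.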
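Different versions of file produce different output. Some versions also read an /etc/magic file. To find the kinds of names your file calls text files, use the commands named in the script's comments: strings /usr/bin/file > possible, which saves its output in a file named possible, and cat /etc/magic.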
The possible file will have a list of descriptions that strings found in the file binary; some of them are for text files. If your system has an /etc/magic file, it will have lines like these:
0    long     0x1010101    MMDF mailbox
0    string   <!OPS        Interleaf ASCII document
0    string   %!           PostScript document
0    string   <MIFFile     Frame Maker MIF file
Save the descriptions of text-type files from the right-hand column.
Then, turn each line of your edited possible file into a sed command of the form /description/b print, like the pattern lines in the script above.
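If the edited list is long, sed itself can do the wrapping. This is a rough sketch rather than part of the original script; make_branches is a made-up name, and it assumes one description per line with no blank lines.

```shell
#!/bin/sh
# Wrap every input line as "/description/b print".
# Using | as the s delimiter keeps the slashes in the replacement
# literal; & stands for the whole matched line.
make_branches() {
    sed 's|.*|/&/b print|'
}

# Two sample descriptions stand in for the edited list.
printf 'English text\nascii text\n' | make_branches
```

On a real list you would run something like make_branches < possible and paste the result into the script. Descriptions that contain characters special to sed (slashes, brackets, dots, and so on) still need editing by hand afterward.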
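Watch for special characters in the file descriptions. I had to handle two special cases in the last two pattern lines of the script above. First, the square brackets in [nt]roff, tbl, or eqn input text are special to sed, so each one needs a backslash in front of it: \[nt\]roff. Second, the pattern executable .* script uses .* to match whatever text file prints between those two words.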
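To see why those two patterns are written the way they are, here is a small test you can run by hand. It is a sketch only: the variable names are invented, and the second sample description (executable /bin/sh script) is a guess at what a file of this vintage might print, not something taken from a real system.

```shell
#!/bin/sh
# One line shaped like file output; printf turns \t into a real tab.
line=$(printf '1991.ok:\t[nt]roff, tbl, or eqn input text')

# Unescaped, [nt] is a bracket expression meaning "one n or one t",
# so sed looks for nroff or troff and never sees the literal brackets.
unescaped=$(printf '%s\n' "$line" | sed -n '/[nt]roff, tbl/p')

# Escaped, \[ and \] match the bracket characters themselves.
escaped=$(printf '%s\n' "$line" | sed -n '/\[nt\]roff, tbl/p')

# .* soaks up whatever comes between the two fixed words.
# The description below is a hypothetical example.
wild=$(printf 'run:\texecutable /bin/sh script\n' |
    sed -n '/executable .* script/p')

echo "unescaped pattern matched: ${unescaped:-nothing}"
echo "escaped pattern matched:   ${escaped:-nothing}"
echo "wildcard pattern matched:  ${wild:-nothing}"
```

The unescaped pattern finds nothing in the sample line, because the character in front of roff is a closing bracket, not an n or a t.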
If you have perl (37.1), you can make a simpler version of this script, since perl has a built-in test for whether or not a file is a text file. Perl picks a "text file" by checking the first block or so for strange control codes or metacharacters. If there are too many (more than 10% in older versions of Perl; current Perl documentation puts the cutoff at about a third), it's not a text file. You can't tune the Perl script to, for example, skip a certain kind of file by type. But the Perl version is simpler! It looks like this:
alias findtext 'perl -le '\''-T && print while $_ = shift'\'' *'
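The alias above is written for the C shell. For Bourne-type shells, a function along these lines should behave much the same way; treat it as a sketch rather than a drop-in replacement. It differs from the alias in two ways that are my own choices: it takes filenames as arguments instead of always using *, and it loops over the argument list so that a file named 0 does not end the loop early.

```shell
#!/bin/sh
# findtext, Perl flavor, for Bourne-type shells.
# -T is Perl's built-in "looks like a text file" test;
# -l makes print add a newline after each name.
findtext() {
    perl -le 'for (@ARGV) { print if -T $_ }' "$@"
}
```

You would call it as findtext *, or inside backquotes on another command's line, just like the script version.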