9.4 Search/Index Gateway

One of the most useful CGI applications is a web server search/index gateway. This allows a user to search all of the files on the server for particular information. Here is a very simple gateway to do just that. We rely on the UNIX command fgrep [1] to search all our files, and then filter its output to something attractive and useful. First, let's look at the form's front end:

[1] The fgrep used in the example is GNU fgrep version 2.0, which supports the -A and -B options.

<HTML>
<HEAD><TITLE>Search Gateway</TITLE></HEAD>
<BODY>
<H1>Search Gateway</H1>
<HR>
<FORM ACTION="/cgi-bin/search.pl" METHOD="POST">
What would you like to search for:
<BR>
<INPUT TYPE="text" NAME="query" SIZE=40>
<P>
<INPUT TYPE="submit" VALUE="Start Searching!">
<INPUT TYPE="reset"  VALUE="Clear your form">
</FORM>
<HR>
</BODY>
</HTML>

Nothing fancy. The form contains just one field to hold the search query. Now, here is the program:

#!/usr/local/bin/perl
$webmaster = "Shishir Gundavaram (shishir\@bu\.edu)";
$fgrep = "/usr/local/bin/fgrep";
$document_root = $ENV{'DOCUMENT_ROOT'};

The fgrep UNIX command is used to perform the actual searching in the directory pointed to by the variable document_root. fgrep searches for fixed strings; in other words, wildcards and regular expressions are not evaluated.

&parse_form_data (*SEARCH);
$query = $SEARCH{'query'};

The form data (here, a single field) is decoded and stored in the SEARCH associative array.

if ($query eq "") {
    &return_error (500, "Search Error", "Please enter a search query.");
} elsif ($query !~ /^(\w+)\z/) {    # \z, unlike $, also rejects a trailing newline
    &return_error (500, "Search Error", "Invalid characters in query.");
} else {

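If the query is empty, or contains any character other than a letter, digit, or underscore (A-Z, a-z, 0-9, _), an error message is returned. Because the query is later interpolated into a shell command line, this check also keeps shell metacharacters out of that command.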
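The validation step can be tried on its own. The following is a minimal sketch, not part of the original program: check_query is a hypothetical helper that returns an error message, or an empty string for an acceptable query. It anchors the pattern with \A and \z so that a query ending in a newline is rejected as well.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original program): returns an error
# message for a bad query, or "" if the query is acceptable.
sub check_query {
    my ($query) = @_;
    return "Please enter a search query." if !defined $query || $query eq "";
    # \A and \z anchor to the true start and end of the string, so a
    # trailing newline counts as an invalid character.
    return "Invalid characters in query." if $query !~ /\A\w+\z/;
    return "";
}

for my $q ("perl", "cgi_bin", "", "foo; rm -rf /", "perl\n") {
    my $error = check_query($q);
    (my $shown = $q) =~ s/\n/\\n/g;    # make the newline visible when printing
    printf "[%s] %s\n", $shown, $error eq "" ? "ok" : $error;
}
```

Only the first two queries are accepted; the others produce one of the two error messages.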

    print "Content-type: text/html", "\n\n";
    print "<HTML>", "\n";
    print "<HEAD><TITLE>Search Results</TITLE></HEAD>";
    print "<BODY>", "\n";
    print "<H1>Results of searching for: ", $query, "</H1>";
    print "<HR>";
    open (SEARCH, "$fgrep -A2 -B2 -i -n -s $query $document_root/* |");

A pipe is opened to read the output of the fgrep command. We use the following command-line options:

-A2 and -B2 display two lines after and two lines before each match, respectively
-i indicates case insensitivity
-n displays the line numbers
-s instructs fgrep to suppress error messages about nonexistent or unreadable files
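The options above can also be assembled as a list instead of a single string. The sketch below is an alternative to the program's approach, not the author's method: fgrep_command is a hypothetical helper, and the two file paths are made-up examples. A list of arguments can be handed to the list form of a piped open (Perl 5.8 or later), which runs the command without a shell; the caller then has to supply the filenames itself, since no shell is there to expand the * wildcard.

```perl
use strict;
use warnings;

# Hypothetical helper: build the fgrep argument list used by the gateway.
sub fgrep_command {
    my ($fgrep, $query, @files) = @_;
    return ($fgrep,
            "-A2",    # two lines of context after each match
            "-B2",    # two lines of context before each match
            "-i",     # ignore case
            "-n",     # prefix each line with its line number
            "-s",     # suppress messages about unreadable files
            $query, @files);
}

# The paths below are illustrative only.
my @cmd = fgrep_command("/usr/local/bin/fgrep", "perl",
                        "/docs/index.html", "/docs/faq.html");
print join(" ", @cmd), "\n";

# With fgrep installed, the output could then be read like this:
#   open(my $search, "-|", @cmd) or die "Cannot run fgrep: $!";
#   while (my $line = <$search>) { ... }
#   close($search);
```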

Here is what the output format looks like:

/abc/cde/filename.abc-57-Previous, previous line
/abc/cde/filename.abc-58-Previous line
/abc/cde/filename.abc:59:Matched line
/abc/cde/filename.abc-60-Following line
/abc/cde/filename.abc-61-Following, following line

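As you can see, a match typically produces five lines of output: the matched line, marked with colons around the line number, plus two context lines on either side, marked with hyphens. There are fewer lines when the match is near the start or end of a file, and more when nearby matches share context. fgrep prints the "--" boundary string between groups of output, whether the groups come from different files or from the same file.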

    $count = 0;
    $matches = 0;
    %accessed_files = ();

Three important variables are initialized. The first one, count, is used to keep track of the number of lines returned per match. The matches variable stores the number of different files that contain the specified query. And finally, the accessed_files associative array keeps track of the filenames that contain a match.

We could have used another grep command that returned just filenames, and then our processing would be much easier. But I want to display the actual text found, so I chose more complicated output. Thus, I have to do a little fancy parsing and text substitution to change the lines of fgrep output into something that looks good on a web browser. What we want to display is:

The name of each file found, with a hypertext link so the user can go directly to a file
The text found with the search string highlighted
A summary of the files found

The following code performs these steps.

    while (<SEARCH>) {
        if ( ($file, $type, $line) = m|^(/\S+)([\-:])\d+\2(.*)| ) {

The while loop iterates through the data returned by fgrep. If a line resembles the format presented above, this block of code is executed. The regular expression is explained below.

[Graphic: Figure from the text]
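The pattern can be read piece by piece. The sketch below writes the same regular expression with the /x modifier so that each part carries a comment, then applies it to three sample lines. The sample lines are illustrative and assume GNU grep's convention of a colon on both sides of the line number for a matched line and a hyphen on both sides for a context line.

```perl
use strict;
use warnings;

# The same pattern as in the program, spread out with /x and commented.
my $pattern = qr{
    ^ (/\S+)     # capture 1: absolute path of the file (no whitespace)
      ([\-:])    # capture 2: separator, hyphen for context, colon for a match
      \d+        # the line number (not captured)
      \2         # the same separator again
      (.*)       # capture 3: the text of the line
}x;

my @sample = (
    "/abc/cde/filename.abc-58-Previous line",
    "/abc/cde/filename.abc:59:Matched line",
    "--",
);

for my $output (@sample) {
    if (my ($file, $type, $line) = $output =~ $pattern) {
        print "file=$file type=$type line=$line\n";
    } else {
        print "no match: $output\n";
    }
}
```

The backreference \2 is what makes the pattern insist on the same separator before and after the line number, so the "--" boundary line falls through to the else branch.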

            unless ($count) {
                if ( defined ($accessed_files{$file}) ) {
                    next;
                } else {
                    $accessed_files{$file} = 1;
                }
                $file =~ s/^\Q$document_root\E\/(.*)/$1/;
                $matches++;
                print qq|<A HREF="/$file">$file</A><BR><BR>|;
            }

If count is equal to zero (which means we are either on the first line of output or on the line right after a boundary), the associative array is checked to see if an element exists for the current filename. If it exists, next skips the rest of the loop body and the while loop moves on to the next line of output; because count stays at zero, every remaining line of that group is skipped as well, so only the first group of matches in each file is displayed. If not, the filename is recorded, the matches variable is incremented, and a hypertext anchor is linked to the relative pathname of the matched file.

Remember, if there is more than one match per file, fgrep returns the matched lines as separate entities (separated by the "--" string). Since we want only one link per filename, the associative array has to be used to "cache" the filename.
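The caching idea is independent of the rest of the program. As a minimal sketch, first_occurrences below is a hypothetical helper that applies the same associative-array test to a plain list of names and keeps only the first appearance of each one.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the filename cache: return only the
# first occurrence of each name, in the order the names were first seen.
sub first_occurrences {
    my @names = @_;
    my %accessed_files;
    my @unique;
    for my $name (@names) {
        next if defined $accessed_files{$name};    # already linked once
        $accessed_files{$name} = 1;
        push @unique, $name;
    }
    return @unique;
}

my @links = first_occurrences("a.html", "b.html", "a.html", "c.html", "b.html");
print join(", ", @links), "\n";    # prints "a.html, b.html, c.html"
```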

            $count++;
            $line =~ s/<(([^>]|\n)*)>/&lt;$1&gt;/g;

The count variable is incremented so that the next time through the loop, the previous block of code will not be executed, and therefore a hypertext link will not be created. Also, all HTML tags are "escaped" by the regular expression illustrated below, so that they appear as regular text when this dynamic document is displayed. If we did not escape these tags, the browser would interpret them as regular HTML statements, and display formatted output.

[Graphic: Figure from the text]

We could totally remove all tags by using:

$line =~ s/<(([^>]|\n)*)>//g;
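To compare the two substitutions side by side, the following sketch wraps each one in a small function and runs both on the same line. The function names and the sample line are made up for illustration. Note that both expressions act only on complete tags: a lone "<" or "&" in the text is left untouched.

```perl
use strict;
use warnings;

# Escape complete tags so the browser shows them as text.
sub escape_tags {
    my ($line) = @_;
    $line =~ s/<(([^>]|\n)*)>/&lt;$1&gt;/g;
    return $line;
}

# Remove complete tags altogether.
sub remove_tags {
    my ($line) = @_;
    $line =~ s/<(([^>]|\n)*)>//g;
    return $line;
}

my $line = 'See <A HREF="/faq.html">the FAQ</A> first';
print escape_tags($line), "\n";
print remove_tags($line), "\n";
```

The first call turns every "<" and ">" that delimits a tag into &lt; and &gt;, while the second leaves only the text "See the FAQ first".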

Let's continue with the program:

            if ($line =~ /^[^A-Za-z0-9]*$/) {
                next;
            }

If a line contains no alphanumeric characters at all (A-Z, a-z, 0-9), as with a blank line or a row of punctuation, the line will not be displayed.
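A quick way to see which lines this test lets through is to run it on a few samples. In the sketch below, is_displayable is a hypothetical wrapper around the same regular expression, and the sample lines are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line has at least one letter or digit,
# that is, if the gateway would display it.
sub is_displayable {
    my ($line) = @_;
    return $line !~ /^[^A-Za-z0-9]*$/ ? 1 : 0;
}

for my $line ("", "   ", "<!-- -->", "----------", "Chapter 9", "&lt;HR&gt;") {
    printf "[%s] %s\n", $line, is_displayable($line) ? "shown" : "skipped";
}
```

Only the last two samples are shown. The final one is an escaped tag: the letters in &lt; and &gt; count as alphanumeric, so a line consisting solely of tags is still displayed once its tags have been escaped.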

            if ($type eq ":") {
                $line =~ s/($query)/<B>$1<\/B>/ig;
            }
            print $line, "<BR>";

For the matched line, the query is emboldened using the <B> ... </B> HTML tags, and printed.
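The highlighting step is easy to try in isolation. In the sketch below, highlight is a hypothetical function built around the same substitution. The call to quotemeta is an added precaution, not something the original program does, for the case where a query might contain regular expression metacharacters; the gateway's own validation already restricts the query to word characters.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every case-insensitive occurrence of the query
# in <B> ... </B>, keeping the capitalization found in the text.
sub highlight {
    my ($line, $query) = @_;
    my $pattern = quotemeta $query;    # precaution; see the note above
    $line =~ s/($pattern)/<B>$1<\/B>/ig;
    return $line;
}

print highlight("Perl and perl5 are PERLs", "perl"), "\n";
```

Because the replacement reuses the captured text rather than the query itself, "Perl", "perl", and "PERL" each keep their original case inside the bold tags.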

        } else {
            if ($count) {
                print "<HR>";
                $count = 0;
            }
        }
    }

The else branch is executed when a line does not match the pattern, which in practice means the "--" boundary string. If any lines of the current group were processed, a horizontal rule is output and the counter is reset to zero.
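The whole loop can be exercised without running fgrep by feeding it canned lines. The following sketch is a stand-alone rearrangement of the code above, not the program itself: render_results is a hypothetical function, the sample lines and the /www/docs document root are invented, and \Q...\E is added around the document root so that it is matched literally.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the loop: takes fgrep-style lines and
# returns the HTML fragment plus the number of files that had matches.
sub render_results {
    my ($query, $document_root, @lines) = @_;
    my ($count, $matches, %accessed_files) = (0, 0);
    my $html = "";

    for (@lines) {
        if (my ($file, $type, $line) = m|^(/\S+)([\-:])\d+\2(.*)|) {
            unless ($count) {
                next if defined $accessed_files{$file};
                $accessed_files{$file} = 1;
                $file =~ s/^\Q$document_root\E\/(.*)/$1/;
                $matches++;
                $html .= qq|<A HREF="/$file">$file</A><BR><BR>\n|;
            }
            $count++;
            $line =~ s/<(([^>]|\n)*)>/&lt;$1&gt;/g;
            next if $line =~ /^[^A-Za-z0-9]*$/;
            $line =~ s/($query)/<B>$1<\/B>/ig if $type eq ":";
            $html .= "$line<BR>\n";
        } elsif ($count) {
            $html .= "<HR>\n";    # boundary after a displayed group
            $count = 0;
        }
    }
    return ($html, $matches);
}

# Invented sample: two groups from a.html and one from b.html.
my @lines = (
    "/www/docs/a.html-3-<H1>Intro</H1>",
    "/www/docs/a.html:4:Perl is handy",
    "/www/docs/a.html-5-for CGI work",
    "--",
    "/www/docs/a.html:20:more perl here",
    "--",
    "/www/docs/b.html:7:Use perl5",
);
my ($html, $matches) = render_results("perl", "/www/docs", @lines);
print $html, "Files with matches: $matches\n";
```

On this input the second group from a.html is skipped entirely, so the output contains two links, one horizontal rule, and a file count of 2.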

    print "<P>", "<HR>";
    print "Total number of files containing matches: ", $matches, "<BR>";
    print "<HR>";
    print "</BODY></HTML>", "\n";
    close (SEARCH);
}
exit (0);

Finally, the total number of files that contained matches to the query is displayed, as shown in Figure 9.11.

Figure 9.11: Search results

[Graphic: Figure 9-11]

This is a very simple example of a search/index utility. It can be quite slow if you need to search hundreds (or thousands) of documents, because every query scans every file again. However, there are numerous indexing engines (as well as corresponding CGI gateways) that are extremely fast and powerful. These include Swish and Glimpse. See Appendix E for information on where to retrieve those packages.