Inverted Index Search (CGI Programming with Perl)

Example 12-3. indexer.pl

#!/usr/bin/perl -w
# This is not a CGI, so taint mode not required

use strict;

use File::Find;
use DB_File;
use Getopt::Long;
require "stem.pl";

use constant DB_CACHE      => 0;
use constant DEFAULT_INDEX => "/usr/local/apache/data/index.db";

my( %opts, %index, @files, $stop_words );

GetOptions( \%opts, "dir=s",
                    "cache=s",
                    "index=s",
                    "ignore",
                    "stop=s",
                    "numbers",
                    "stem" );

die usage( ) unless $opts{dir} && -d $opts{dir};

$opts{'index'}        ||= DEFAULT_INDEX;
$DB_BTREE->{cachesize}  = $opts{cache} || DB_CACHE;

tie %index, "DB_File", $opts{'index'}, O_RDWR|O_CREAT, 0640, $DB_BTREE
    or die "Cannot tie database: $!\n";

# Record the options only after the tie, so that they are stored in the index
$index{"!OPTION:stem"}   = 1 if $opts{'stem'};
$index{"!OPTION:ignore"} = 1 if $opts{'ignore'};

find( sub { push @files, $File::Find::name }, $opts{dir} );
$stop_words = load_stopwords( $opts{stop} ) if $opts{stop};

process_files( \%index, \@files, \%opts, $stop_words );

untie %index;


sub load_stopwords {
    my $file  = shift;
    my $words = {};
    local( *INFO, $_ );
    
    die "Cannot find stop file: $file\n" unless -e $file;
    
    open INFO, $file or die "$!\n";
    while ( <INFO> ) {
        next if /^#/;
        $words->{lc $1} = 1 if /(\S+)/;
    }
    
    close INFO;
    return $words;
}

sub process_files {
    my( $index, $files, $opts, $stop_words ) = @_;
    local( *FILE, $_ );
    local $/ = "\n\n";
    
    for ( my $file_id = 0; $file_id < @$files; $file_id++ ) {
        my $file = $files->[$file_id];
        my %seen_in_file;
        
        next unless -T $file;
        
        print STDERR "Indexing $file\n";
        $index->{"!FILE_NAME:$file_id"} = $file;
        
        open FILE, $file or die "Cannot open file: $file!\n";
        
        while ( <FILE> ) {
            
            tr/A-Z/a-z/ if $opts->{ignore};
            s/<.+?>//gs; # Woa! what about < or > in comments or js??
            
            while ( /([a-z\d]{2,})\b/gi ) {
                my $word = $1;
                next if $stop_words->{lc $word};
                next if $word =~ /^\d+$/ && !$opts->{numbers};
                
                ( $word ) = stem( $word ) if $opts->{stem};
                
                # Append this file's ID to the word's colon-separated list
                $index->{$word} = ( exists $index->{$word} ? 
                    "$index->{$word}:" : "" ) . "$file_id" 
                    unless $seen_in_file{$word}++;
            }
        }
    }
}

sub usage {
    my $usage = <<End_of_Usage;

Usage: $0 -dir directory [options]

The options are:

  -cache         DB_File cache size (in bytes)
  -index         Path to index, default:/usr/local/apache/data/index.db
  -ignore        Case-insensitive index
  -stop          Path to stopwords file
  -numbers       Include numbers in index
  -stem          Stem words

End_of_Usage
    return $usage;
}

$index = {
    "!FILE_NAME:1" => "/usr/local/apache/htdocs/sports/sprint.html",
    "!FILE_NAME:2" => "/usr/local/apache/htdocs/sports/olympics.html",
    "!FILE_NAME:3" => "/usr/local/apache/htdocs/sports/celtics.html",
    browser        => "1:2",
    code           => "3",
    color          => "2:3",
    comment        => "2",
    content        => "1",
    cool           => "2:3",
    copyright      => "1:2:3"
};

12.3.1. Search Application

The indexer application makes our life easier when it comes time to write the CGI application to perform the actual search. The CGI application should parse the form input, open the DBM file created by the indexer, search for possible matches and then return HTML output.

Example 12-4 contains the program.

Example 12-4. indexed_search.cgi

#!/usr/bin/perl -wT

use DB_File;
use CGI;
use CGIBook::Error;
use File::Basename;
require "stem.pl";

use strict;

use constant INDEX_DB => "/usr/local/apache/data/index.db";

my( %index, $paths, $path );

my $q     = new CGI;
my $query = $q->param("query");
$query    = "" unless defined $query;
my @words = grep { length } split /[\s,]+/, $query;

tie %index, "DB_File", INDEX_DB, O_RDONLY, 0640, $DB_BTREE
    or error( $q, "Cannot open database" );

$paths = search( \%index, \@words );

print $q->header,
      $q->start_html( "Inverted Index Search" ),
      $q->h1( "Search for: " . $q->escapeHTML( $query ) );

unless ( @$paths ) {
    print $q->h2( $q->font( { -color => "#FF0000" }, 
                            "No Matches Found" ) );
}

foreach $path ( @$paths ) {
    my $file = basename( $path );
    next unless $path =~ s/^\Q$ENV{DOCUMENT_ROOT}\E//o;
    $path = to_uri_path( $path );
    print $q->a( { -href => "$path" }, "$path" ), $q->br;
} 

print $q->end_html;
untie %index;



sub search {
    my( $index, $words ) = @_;
    my $do_stemming = exists $index->{"!OPTION:stem"} ? 1 : 0;
    my $ignore_case = exists $index->{"!OPTION:ignore"} ? 1 : 0;
    my( %matches, $word, $file_index );
    
    foreach $word ( @$words ) {
        my $match;
        
        if ( $do_stemming ) {
            my( $stem )  = stem( $word );
            $match = $index->{$stem};
        }
        elsif ( $ignore_case ) {
            $match = $index->{lc $word};
        }
        else {
            $match = $index->{$word};
        }
        
        next unless $match;
        
        foreach $file_index ( split /:/, $match ) {
            my $filename = $index->{"!FILE_NAME:$file_index"};
            $matches{$filename}++;
        }
    }
    my @files = map  { $_->[0] }
                sort { $matches{$a->[0]} <=> $matches{$b->[0]} || 
                       $a->[1] <=> $b->[1] }
                map  { [ $_, -M $_ ] }
                keys %matches;
    
    return \@files;
}

sub to_uri_path {
    my $path = shift;
    my( $name, @elements );
    
    do {
        ( $name, $path ) = fileparse( $path );
        unshift @elements, $name;
        chop $path;
    } while $path;
    
    # The leading empty element yields a leading slash (an absolute URL path)
    return join '/', '', @elements;
}

The modules should be familiar to you by now. The INDEX_DB constant contains the path of the index created by the indexer application.

Since a query can include multiple words, we split it on whitespace and commas and store the resulting words in the @words array. We use tie to open the index DBM file in read-only mode. In other words, we bind the index file to the %index hash. If we cannot open the file, we call our error function to return an error to the browser.
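The splitting step can be tried on its own. The following is a minimal sketch, not part of indexed_search.cgi: split_query is a hypothetical helper name, and the pattern is one reasonable reading of "whitespace or a comma". It avoids capturing parentheses because split returns captured separators as extra fields.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of indexed_search.cgi): break a raw query
# string into words, treating runs of whitespace and commas as separators.
sub split_query {
    my $query = shift;
    return () unless defined $query;

    # A leading separator yields an empty first field, so drop empty strings.
    return grep { length } split /[\s,]+/, $query;
}

my @words = split_query( "  perl, cgi   search" );
print join( "|", @words ), "\n";    # prints "perl|cgi|search"
```

Filtering out empty strings means a query with leading spaces or a stray comma never produces a lookup for an empty word.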

The real searching is done, appropriately enough, in the search function, which takes a reference to the index hash and a reference to the list of words we are searching for. The first thing we do is peek into the index to see whether the stem and ignore options were set when the index was built. We then iterate through the @$words array, searching for possible matches. If stemming was enabled, we stem the word and look up the stem. Otherwise, we look up the word as-is, or in lowercase if the index is case-insensitive. If the lookup returns a value, we have a match. Otherwise, we ignore the word and continue.

If there is a match, we split the colon-separated list of file IDs where that particular word is found. Since we don't want duplicate entries in our final list, we use the full path of each matching file as a key in the %matches hash; the value counts how many of the query words that file contains.
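To see the lookup in isolation, here is a small sketch that uses an ordinary in-memory hash in place of the tied DBM file. The data is a subset of the sample index layout shown earlier, count_matches is a hypothetical helper name, and stemming and case folding are left out to keep it short.

```perl
use strict;
use warnings;

# In-memory stand-in for the tied DBM hash (sample data, same layout as the index).
my %index = (
    "!FILE_NAME:1" => "/usr/local/apache/htdocs/sports/sprint.html",
    "!FILE_NAME:2" => "/usr/local/apache/htdocs/sports/olympics.html",
    "!FILE_NAME:3" => "/usr/local/apache/htdocs/sports/celtics.html",
    browser        => "1:2",
    color          => "2:3",
    content        => "1",
);

# Hypothetical helper: count, for each file, how many query words it contains.
sub count_matches {
    my( $index, @words ) = @_;
    my %matches;

    foreach my $word ( @words ) {
        my $ids = $index->{$word} or next;    # word not indexed: skip it
        foreach my $id ( split /:/, $ids ) {
            $matches{ $index->{"!FILE_NAME:$id"} }++;
        }
    }
    return \%matches;
}

my $matches = count_matches( \%index, qw( browser color missing ) );
foreach my $file ( sort keys %$matches ) {
    print "$matches->{$file} $file\n";
}
```

With this data, olympics.html is counted twice because it contains both "browser" and "color", while the other two files are counted once each and the unknown word is ignored.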

After the loop has finished executing, we are left with the matching files in %matches. We would like to add some order to our results and display them according to the number of words matching and then by the file's modification time. So, we sort the keys according to the number of matches and then by the value returned by the -M operator, which is the file's age in days, so that among files with the same number of matches the most recently modified come first. The sorted list is stored in the @files array.

We could calculate the modification time of the files during each comparison like this:

my @files = sort { $matches{$a} <=> $matches{$b} ||
                   -M $a <=> -M $b }
            keys %matches;

However, this is inefficient because we might calculate the modification time for each file multiple times. A more efficient algorithm involves precalculating the modification times as we have done in the program.

This strategy has become known as the Schwartzian Transform, made famous by Randal Schwartz. It's beyond the scope of this book to explain this, but if you're interested, see Joseph Hall's explanation of the Transform, located at: http://www.5sigma.com/perl/schwtr.html. Ours is a slight variation because we perform a two-part sort.
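For readers who want to see the pattern at work, the following sketch applies the same map, sort, map pipeline to made-up data. The file names, match counts, and ages are invented for illustration, and age_of is a hypothetical stand-in for -M so that the example runs without touching the filesystem and can count how often the expensive key is computed.

```perl
use strict;
use warnings;

# Made-up sample data: match counts per file, plus a stand-in for -M
# (file age in days).
my %matches = ( "a.html" => 2,   "b.html" => 1,   "c.html" => 2   );
my %age     = ( "a.html" => 5.0, "b.html" => 0.5, "c.html" => 1.5 );

my $lookups = 0;
sub age_of { $lookups++; return $age{ $_[0] } }    # hypothetical, replaces -M

my @files = map  { $_->[0] }                        # 3. unwrap the names
            sort { $matches{ $a->[0] } <=> $matches{ $b->[0] }
                   || $a->[1] <=> $b->[1] }         # 2. sort on the cached keys
            map  { [ $_, age_of( $_ ) ] }           # 1. compute each key once
            keys %matches;

print "@files\n";               # prints "b.html c.html a.html"
print "lookups: $lookups\n";    # prints "lookups: 3"
```

Each file's age is looked up exactly once, however many comparisons the sort performs. The two files tied on match count are ordered by age, youngest first.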

We output the HTTP and HTML document headers, and proceed to check to see if we have any matches. If not, we return a simple message. Otherwise, we iterate through the @files array, setting $path to the current element each time through the loop. We strip off the part of the path that matches the server's root directory. That should give us the path that corresponds to a URL. However, on non-Unix filesystems, we won't have forward slashes ("/") separating directories. So we call the to_uri_path function, which uses the File::Basename module to strip off successive elements of the path and then rebuild it with forward slashes. Note that this will work on many operating systems like Win32 and MacOS, but it will not work on systems that do not use a single character to delimit parts of the path (like VMS; although, the chances that you're actually doing CGI development on a VMS machine are pretty slim).
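The path conversion can be exercised on any platform, because File::Basename lets us choose which system's rules to parse with via fileparse_set_fstype. The sketch below is a hypothetical variant of to_uri_path, not the function from the example: it takes the filesystem type as an extra argument, and it prepends a slash so that the result is an absolute URL path.

```perl
use strict;
use warnings;
use File::Basename qw( fileparse fileparse_set_fstype );

# Hypothetical variant of to_uri_path: the extra $fstype argument asks
# File::Basename to parse the path by another system's rules.
sub to_uri_path_sketch {
    my( $path, $fstype ) = @_;
    my $old_type = fileparse_set_fstype( $fstype );
    my @elements;

    do {
        ( my $name, $path ) = fileparse( $path );
        unshift @elements, $name;
        chop $path;                       # drop the trailing separator
    } while length $path;

    fileparse_set_fstype( $old_type );    # restore the previous setting
    return "/" . join "/", @elements;
}

print to_uri_path_sketch( '\sports\sprint.html', "MSWin32" ), "\n";
print to_uri_path_sketch( '/sports/sprint.html', "Unix" ), "\n";
```

Both calls print /sports/sprint.html. Like the original, the sketch assumes that the path starts with a directory separator and that the separator is a single character.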

We build proper links with this newly formatted path, print the remainder of our results, close the binding between the database and the hash, and exit.
