12.3. Inverted Index Search

The applications that we've looked at so far search through each and every file in the specified directory, looking for particular words or phrases. This is not only time consuming, but also places a heavy burden on the server. We clearly need a different approach to searching.

A more efficient approach is to create an index (like the one you can find at the back of this and other books) containing all the words from specific documents along with the name of the document in which they appear. In this section, we will discuss an application that creates an inverted index. The index is inverted in the sense that a particular word is used to find the file(s) in which it appears, rather than the other way around. In the following section, we will look at the CGI script that searches this index and presents the results in a nice format. Example 12-3 creates the indexer.

Example 12-3. indexer.pl

```perl
#!/usr/bin/perl -wT
# This is not a CGI, so taint mode is not required

use strict;
use File::Find;
use DB_File;
use Fcntl;
use Getopt::Long;
require "stem.pl";

use constant DB_CACHE      => 0;
use constant DEFAULT_INDEX => "/usr/local/apache/data/index.db";

my( %opts, %index, @files, $stop_words );

GetOptions( \%opts, "dir=s", "cache=s", "index=s", "ignore",
                    "stop=s", "numbers", "stem" );

die usage( ) unless $opts{dir} && -d $opts{dir};
$opts{index} ||= DEFAULT_INDEX;

$DB_BTREE->{cachesize} = $opts{cache} || DB_CACHE;

tie %index, "DB_File", $opts{index}, O_RDWR|O_CREAT, 0640, $DB_BTREE
    or die "Cannot tie database: $!\n";

$index{"!OPTION:stem"}   = 1 if $opts{stem};
$index{"!OPTION:ignore"} = 1 if $opts{ignore};

find( sub { push @files, $File::Find::name }, $opts{dir} );

$stop_words = load_stopwords( $opts{stop} ) if $opts{stop};

process_files( \%index, \@files, \%opts, $stop_words );

untie %index;

sub load_stopwords {
    my $file  = shift;
    my $words = {};
    local( *INFO, $_ );

    die "Cannot find stop file: $file\n" unless -e $file;
    open INFO, $file or die "$!\n";

    while ( <INFO> ) {
        next if /^#/;
        $words->{lc $1} = 1 if /(\S+)/;
    }

    close INFO;
    return $words;
}

sub process_files {
    my( $index, $files, $opts, $stop_words ) = @_;
    local( *FILE, $_ );
    local $/ = "\n\n";

    for ( my $file_id = 0; $file_id < @$files; $file_id++ ) {
        my $file = $files->[$file_id];
        my %seen_in_file;

        next unless -T $file;
        print STDERR "Indexing $file\n";

        $index->{"!FILE_NAME:$file_id"} = $file;

        open FILE, $file or die "Cannot open file: $file!\n";

        while ( <FILE> ) {
            tr/A-Z/a-z/ if $opts->{ignore};
            s/<.+?>//gs;   # Whoa! What about < or > in comments or JavaScript?

            while ( /([a-z\d]{2,})\b/gi ) {
                my $word = $1;
                next if $stop_words->{lc $word};
                next if $word =~ /^\d+$/ && not $opts->{numbers};
                ( $word ) = stem( $word ) if $opts->{stem};

                $index->{$word} = ( exists $index->{$word}
                                        ? "$index->{$word}:" : "" ) . $file_id
                    unless $seen_in_file{$word}++;
            }
        }

        close FILE;
    }
}

sub usage {
    my $usage = <<End_of_Usage;
Usage: $0 -dir directory [options]

The options are:
  -cache     DB_File cache size (in bytes)
  -index     Path to index, default: /usr/local/apache/data/index.db
  -ignore    Case-insensitive index
  -stop      Path to stopwords file
  -numbers   Include numbers in index
  -stem      Stem words

End_of_Usage

    return $usage;
}
```

We use File::Find to get a list of all the files in the specified directory, as well as the files in any subdirectories. The File::Basename module provides us with functions to extract the filename, given the full path. You might be wondering why we need this feature, considering that we could use a simple regular expression to get at the filename. If we used a regex, we would have to determine what platform the application is running on and extract the filename accordingly; this module takes care of that for us. We use DB_File to create and store the index. Note that we could also store the index in an RDBMS, although a DBM file is certainly adequate for many sites. The method for creating indexes is the same no matter what format we use for storage.
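The traversal step can be sketched with a small standalone script. The directory layout here is hypothetical, created on the fly with File::Temp, so the example is self-contained:

```perl
#!/usr/bin/perl -w
# Minimal sketch of how find() collects file paths. The files and
# subdirectory are hypothetical, created in a temporary directory.
use strict;
use File::Find;
use File::Temp qw(tempdir);

my $dir = tempdir( CLEANUP => 1 );
mkdir "$dir/sub" or die "mkdir: $!";
for my $f ( "$dir/a.html", "$dir/sub/b.html" ) {
    open my $fh, ">", $f or die "open: $!";
    print $fh "<p>hello</p>\n";
    close $fh;
}

my @files;
# find() calls the anonymous sub for everything it visits (including
# directories), setting $File::Find::name to the full path; the -f
# test keeps only plain files.
find( sub { push @files, $File::Find::name if -f }, $dir );

print scalar(@files), " files found\n";   # 2 files found
```

Because find descends into subdirectories on its own, sub/b.html is picked up without any extra code.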
Getopt::Long helps us handle command-line options, and stem.pl, a Perl 4 library, contains algorithms to automatically "stem" (that is, remove) word suffixes.

We use the DB_CACHE constant to hold the size of the DB_File memory cache. Increasing the size of this cache (up to a certain point) improves the insertion rate at the expense of memory. In other words, it increases the rate at which we store the words in the index. A cache size of 0 is used as the default. DEFAULT_INDEX contains the default path to the file that will hold our data. The user can specify a different file by using the -index option, as you will see shortly.

The GetOptions function (part of the Getopt::Long module) allows us to extract any command-line options and store them in a hash. We pass a reference to a hash and a list of options to GetOptions. The options that take arguments contain an "s" to indicate that they each take a string.

This application allows you to pass several options that affect the indexing process. The -dir option is the only one that is required; it specifies the directory containing the files to be indexed. You can use the -cache option to specify the cache size and -index to specify the path to the index. The -ignore option creates an index where all the words are turned into lowercase (case-insensitive). This increases the rate at which the index is created and decreases the size of the index. If you want numbers in documents to be included in the index, you can specify the -numbers option. You can use the -stop option to specify a file that contains "stop" words -- words that are generally found in most of your documents. Typical stop words include "a", "an", "to", "it", and "the", but you can also include words that are more specific to your documents. Finally, the -stem option stems word suffixes before storing words in the index. This will help us find words in documents much more easily.
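The stop-file format that load_stopwords expects can be illustrated with a short sketch; the file contents below are hypothetical:

```perl
#!/usr/bin/perl -w
# Sketch of the stop-file format parsed by load_stopwords(): lines
# starting with "#" are comments, and only the first word on each
# line is taken, lowercased. The file contents here are made up.
use strict;

my $stop_text = <<'END';
# common English stop words
a
an
The    anything after the first word on a line is ignored
END

my %words;
for ( split /\n/, $stop_text ) {
    next if /^#/;
    $words{lc $1} = 1 if /(\S+)/;
}

print join( ",", sort keys %words ), "\n";   # a,an,the
```

Note that "The" is stored as "the", so later lookups against lowercased words from the documents will match regardless of case.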
For example, if a user searches for "tomatoes", our search application will return documents that contain "tomato" as well as "tomatoes". An important note here is that stemming also creates a case-insensitive index. Here's an example of how you would use these various options:

```
$ ./indexer.pl -dir /usr/local/apache/htdocs/sports \
               -cache 16000000 \
               -index /usr/local/apache/data/sports.db \
               -stop my_stop_words.txt \
               -stem
```

%index is the hash that will hold the index. We use the tie function to bind the hash to the file specified by $opts{index}. This allows us to transparently store a hash in a file, which we can later retrieve and modify. In this example, we are using DB_File, as it is faster and more efficient than other DBM implementations.

If the -stem option was used, we record this in our index so that our CGI script knows whether to apply stemming to the query as well. We could have stored this information in another database file, but that would require opening two files for each search. Instead, we name this key with an exclamation point so that it can't collide with any of the words we're indexing.

We use the find function (part of the File::Find module) to get a list of all the files in the specified directory. find expects its first argument to be a code reference, which can be either a reference to a subroutine or an inlined anonymous subroutine, as is the case above. As find traverses the directory (as well as all subdirectories), it executes the code specified by the first argument, setting the $File::Find::name variable to the path of each file. This builds an array of the paths to all the files under the original directory.

If a stop file was specified and it exists, we call the load_stopwords function to read through the file and return a reference to a hash. The most important function in this application is process_files, which iterates through all the files and stores the words in $index.
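The tie mechanics can be demonstrated without Berkeley DB. This sketch substitutes SDBM_File, a DBM implementation bundled with Perl, for DB_File; the tie interface is the same, and the database path is a throwaway temporary file:

```perl
#!/usr/bin/perl -w
# Sketch of persisting a hash with tie(). SDBM_File stands in for
# DB_File here so the example runs without Berkeley DB installed;
# the key/value layout mimics our index.
use strict;
use Fcntl;
use SDBM_File;
use File::Temp qw(tempdir);

my $db = tempdir( CLEANUP => 1 ) . "/index";

tie my %index, "SDBM_File", $db, O_RDWR|O_CREAT, 0640
    or die "Cannot tie database: $!\n";
$index{"!OPTION:stem"} = 1;        # record an indexing option
$index{tomato}         = "0:2";    # word => colon-joined file IDs
untie %index;

# Re-open read-only: the data was transparently written to disk.
tie my %check, "SDBM_File", $db, O_RDONLY, 0640
    or die "Cannot tie database: $!\n";
my $val = $check{tomato};
untie %check;

print "$val\n";   # 0:2
```

Every assignment to the tied hash goes straight to the file, which is why the indexer can build indexes larger than available memory.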
Finally, we close the binding between the hash and the file and exit. At this point, we will have a file containing the index. Let's look at the functions now.

The load_stopwords function opens the stop words file, ignores all comments (lines starting with "#"), and extracts the first word found on each line (\S+). The word is converted to lowercase by the lc function and stored as a key in the hash referenced by $words. Since we are going to find words with mixed case in our files, it is much easier and quicker to compare them to this list if all our stop words are either completely uppercase or completely lowercase.

Before we discuss the process_files function, let's look at the arguments it expects. The first argument, $index, is a reference to an empty hash that will eventually contain the words from all the files, as well as pointers to the documents in which they are found. $files is a reference to a list of all the files to parse. $opts is a reference to the hash of our command-line arguments, and the final argument, $stop_words, is a reference to a hash containing our stop words. If the user chose to ignore case, we convert all words into lowercase, thus creating a case-insensitive index.

We set Perl's default input record separator, $/, to paragraph mode. In other words, one read on a filehandle will return a paragraph, as opposed to a single line. This allows us to index the files at a faster rate. We iterate through the @$files array with a for loop, storing the current index in $file_id and the path of the current file in $file. Since this application creates a human-searchable index, we deal only with text files; we use the -T operator to skip any non-text files.

The first entry into the %$index hash is a "unique" key that associates a number with the full path to the file. Since this hash will also hold all the words that we find, we use the "!FILE_NAME" prefix to keep our number-to-file mappings separate from the words.
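Paragraph mode is easy to see in isolation. This sketch reads from an in-memory filehandle (the text is made up) so it needs no real file:

```perl
#!/usr/bin/perl -w
# Sketch of Perl's paragraph-style reading: with $/ set to "\n\n",
# one read returns an entire blank-line-separated paragraph rather
# than a single line. The sample text is hypothetical.
use strict;

my $text = "First paragraph,\nstill the first.\n\nSecond paragraph.\n";
open my $fh, "<", \$text or die "open: $!";   # in-memory filehandle

local $/ = "\n\n";            # one read = one paragraph
my @paragraphs = <$fh>;
close $fh;

print scalar(@paragraphs), " paragraphs\n";   # 2 paragraphs
```

Fewer reads per file means less per-read overhead, which is why the indexer processes paragraphs instead of lines.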
We start our indexing process by iterating through the file a paragraph at a time; the $_ variable holds the contents. If the -ignore option was specified by the user, we convert the paragraph that we have just read to lowercase. We also strip all HTML tags from the paragraph, since we don't want them to be indexed. The regex looks for a string starting with "<", followed by one or more characters (including newlines), until it finds the first ">".

We iterate through the paragraph using a regex that extracts words at least two characters long, matching letters as well as digits (\d matches 0-9). The matched word is stored in $1. Before we check whether the word we extracted is a stop word, we need to convert it to lowercase, since we converted all the stop words to lowercase earlier in this script. If the word is, indeed, a stop word, we skip it and continue. We also skip numbers if the -numbers option is not specified. If the -stem option is specified, we call the stem function (part of the stem.pl library) to remove any suffixes from the word and convert it to lowercase.

Finally, we are ready to store the word in the index, where the value represents the file that we are currently parsing. Unfortunately, this isn't that simple. The last command is a little long and complicated; it helps to read it backwards. First, we check whether we have seen the word in this file previously by using the %seen_in_file hash. The first time through, there will not be an entry in the hash, so the expression will evaluate to false (and thus pass the unless check); thereafter, it will contain the number of times we have seen the word in the file and evaluate to true (and thus fail the unless check). So the first time we see the word in a file, we add it to our index. If the word was previously indexed for another file, we join the $file_id of this file to the previous entry with a colon. Otherwise, we just add $file_id as this word's only value thus far.
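The whole word-extraction and append-once logic can be sketched on two hypothetical in-memory "files":

```perl
#!/usr/bin/perl -w
# Sketch of the core indexing step: strip tags, extract words of two
# or more characters, and append each file's ID (colon-separated) to
# every word it contains, at most once per file. The two documents
# are made up.
use strict;

my @docs = ( "<p>red tomato</p>", "<p>ripe tomato tomato</p>" );
my %index;

for my $file_id ( 0 .. $#docs ) {
    my %seen_in_file;
    my $text = $docs[$file_id];
    $text =~ s/<.+?>//gs;                 # remove HTML tags

    while ( $text =~ /([a-z\d]{2,})\b/gi ) {
        my $word = lc $1;
        # Append this file's ID unless the word was already seen in
        # THIS file; %seen_in_file resets for each document.
        $index{$word} = ( exists $index{$word} ? "$index{$word}:" : "" )
                        . $file_id
            unless $seen_in_file{$word}++;
    }
}

print "$index{tomato}\n";   # 0:1  (appears in both files, once each)
```

Even though "tomato" occurs twice in the second document, %seen_in_file ensures the file ID 1 is appended only once.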
When this function finishes, the %$index hash will look something like this:

```perl
$index = {
    "!FILE_NAME:1" => "/usr/local/apache/htdocs/sports/sprint.html",
    "!FILE_NAME:2" => "/usr/local/apache/htdocs/sports/olympics.html",
    "!FILE_NAME:3" => "/usr/local/apache/htdocs/sports/celtics.html",
    browser        => "1:2",
    code           => "3",
    color          => "2:3",
    comment        => "2",
    content        => "1",
    cool           => "2:3",
    copyright      => "1:2:3"
};
```

Now, we are ready to implement the CGI application that will search this index.

12.3.1. Search Application

The indexer application makes our life easier when it comes time to write the CGI application to perform the actual search. The CGI application should parse the form input, open the DBM file created by the indexer, search for possible matches, and then return HTML output. Example 12-4 contains the program.

Example 12-4. indexed_search.cgi

```perl
#!/usr/bin/perl -wT

use strict;
use DB_File;
use Fcntl;
use CGI;
use CGIBook::Error;
use File::Basename;
require "stem.pl";

use constant INDEX_DB => "/usr/local/apache/data/index.db";

my( %index, $paths, $path );

my $q     = new CGI;
my $query = $q->param( "query" );
my @words = split /[\s,]+/, $query;

tie %index, "DB_File", INDEX_DB, O_RDONLY, 0640
    or error( $q, "Cannot open database" );

$paths = search( \%index, \@words );

print $q->header,
      $q->start_html( "Inverted Index Search" ),
      $q->h1( "Search for: $query" );

unless ( @$paths ) {
    print $q->h2( $q->font( { -color => "#FF0000" }, "No Matches Found" ) );
}

foreach $path ( @$paths ) {
    my $file = basename( $path );
    next unless $path =~ s/^\Q$ENV{DOCUMENT_ROOT}\E//o;
    $path = to_uri_path( $path );
    print $q->a( { -href => "$path" }, "$path" ), $q->br;
}

print $q->end_html;

untie %index;

sub search {
    my( $index, $words ) = @_;
    my $do_stemming = exists $index->{"!OPTION:stem"}   ? 1 : 0;
    my $ignore_case = exists $index->{"!OPTION:ignore"} ? 1 : 0;
    my( %matches, $word, $file_index );

    foreach $word ( @$words ) {
        my $match;

        if ( $do_stemming ) {
            my( $stem ) = stem( $word );
            $match = $index->{$stem};
        }
        elsif ( $ignore_case ) {
            $match = $index->{lc $word};
        }
        else {
            $match = $index->{$word};
        }

        next unless $match;

        foreach $file_index ( split /:/, $match ) {
            my $filename = $index->{"!FILE_NAME:$file_index"};
            $matches{$filename}++;
        }
    }

    my @files = map  { $_->[0] }
                sort { $matches{$b->[0]} <=> $matches{$a->[0]}
                           ||
                       $a->[1] <=> $b->[1] }
                map  { [ $_, -M $_ ] }
                keys %matches;

    return \@files;
}

sub to_uri_path {
    my $path = shift;
    my( $name, @elements );

    do {
        ( $name, $path ) = fileparse( $path );
        unshift @elements, $name;
        chop $path;
    } while $path;

    return join '/', @elements;
}
```

The modules should be familiar to you by now. The INDEX_DB constant contains the path of the index created by the indexer application. Since a query can include multiple words, we split it on any combination of whitespace and commas and store the resulting words in the @words array.

We use tie to open the index DBM file in read-only mode. In other words, we bind the index file to the %index hash. If we cannot open the file, we call our error function to return an error to the browser.

The real searching is done, appropriately enough, in the search function, which takes a reference to the index hash and a reference to the list of words we are searching for. The first thing we do is peek into the index to see whether the stem option was set when the index was built. We then proceed to iterate through the @$words array, searching for possible matches. If stemming was enabled, we stem the word and compare that. Otherwise, we check to see whether the particular word exists in the index as-is, or as a lowercase word if the index is not case-sensitive. If any of these comparisons succeeds, we have a match. Otherwise, we ignore the word and continue. If there is a match, we split the colon-separated list of file IDs where that particular word is found.
Since we don't want duplicate entries in our final list, we store the full path of each matching file in the %matches hash, counting the number of query words found in it. After the loop has finished executing, we are left with the matching files in %matches.

We would like to add some order to our results and display them according to the number of words matched and then by each file's modification time. So, we sort the keys first by the number of matches and then by the data returned by the -M operator, and store the result in the @files array. We could calculate the modification time of the files during each comparison, like this:

```perl
my @files = sort { $matches{$b} <=> $matches{$a} || -M $a <=> -M $b }
            keys %matches;
```

However, this is inefficient because we may calculate the modification time for each file multiple times. A more efficient algorithm involves precalculating the modification times, as we have done in the program. This strategy has become known as the Schwartzian Transform, made famous by Randal Schwartz. It's beyond the scope of this book to explain it fully, but if you're interested, see Joseph Hall's explanation of the Transform, located at: http://www.5sigma.com/perl/schwtr.html. Ours is a slight variation because we perform a two-part sort.

We output the HTTP and HTML document headers and proceed to check whether we have any matches. If not, we return a simple message. Otherwise, we iterate through the @files array, setting $path to the current element each time through the loop. We strip off the part of the path that matches the server's document root directory. That should give us the path that corresponds to a URL. However, on non-Unix filesystems, we won't have forward slashes ("/") separating directories. So we call the to_uri_path function, which uses the File::Basename module to strip off successive elements of the path and then rebuild it with forward slashes.
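The Schwartzian Transform itself can be sketched with made-up data; hypothetical match counts and file ages stand in for %matches and the -M operator, so no real files are needed:

```perl
#!/usr/bin/perl -w
# Sketch of the Schwartzian Transform used in search(). The match
# counts and ages below are hypothetical, standing in for %matches
# and -M so the example needs no real files.
use strict;

my %matches = ( "a.html" => 1, "b.html" => 3, "c.html" => 3 );
my %age     = ( "a.html" => 5, "b.html" => 9, "c.html" => 2 );

my @files = map  { $_->[0] }                  # 3. unwrap the filename
            sort { $matches{$b->[0]} <=> $matches{$a->[0]}  # most matches first
                       ||
                   $a->[1] <=> $b->[1] }      # then smallest age (newest) first
            map  { [ $_, $age{$_} ] }         # 1. compute the age exactly once
            keys %matches;

print "@files\n";   # c.html b.html a.html
```

Each key's age is computed once in the first map and carried through the sort inside an anonymous array, instead of being recomputed on every comparison.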
Note that this will work on many operating systems, like Win32 and MacOS, but it will not work on systems that do not use a single character to delimit parts of a path (like VMS; although the chances that you're actually doing CGI development on a VMS machine are pretty slim). We build proper links with this newly formatted path, print the remainder of our results, close the binding between the database and the hash, and exit.

Copyright © 2001 O'Reilly & Associates. All rights reserved.