18.8. Processing Every Word in a File

18.8.2. Solution

Read in each line with fgets( ), separate the line into words, and process each word:

    $fh = fopen('great-american-novel.txt','r') or die($php_errormsg);
    while (! feof($fh)) {
        // compare against false so a final line containing only "0" is not skipped
        if (($s = fgets($fh,1048576)) !== false) {
            $words = preg_split('/\s+/',$s,-1,PREG_SPLIT_NO_EMPTY);
            // process words
        }
    }
    fclose($fh) or die($php_errormsg);

18.8.3. Discussion

Here's how to calculate average word length in a file:

    $word_count = $word_length = 0;

    if ($fh = fopen('great-american-novel.txt','r')) {
        while (! feof($fh)) {
            if (($s = fgets($fh,1048576)) !== false) {
                $words = preg_split('/\s+/',$s,-1,PREG_SPLIT_NO_EMPTY);
                foreach ($words as $word) {
                    $word_count++;
                    $word_length += strlen($word);
                }
            }
        }
        fclose($fh);
    }

    // avoid dividing by zero when the file is empty or could not be opened
    if ($word_count > 0) {
        printf("The average word length over %d words is %.02f characters.",
               $word_count, $word_length/$word_count);
    }

How you process every word depends on how "word" is defined. The code in this recipe uses the Perl-compatible regular-expression engine's \s whitespace metacharacter, which matches space, tab, newline, carriage return, and formfeed. Section 2.6 breaks apart a line into words by splitting on a space, which is useful in that recipe because the words have to be rejoined with spaces.

The Perl-compatible engine also has a word-boundary assertion (\b) that matches between a word character (alphanumeric or underscore) and a nonword character (anything else). The most noticeable difference between using \b and \s to delimit words is in how words with embedded punctuation are treated. The term 6 o'clock is two words when split by whitespace (6 and o'clock); it's four words when split by word boundaries (6, o, ', and clock).

18.8.4. See Also

Section 13.3 for regular expressions that match words; Section 1.5 for breaking apart a line by words; documentation on fgets( ) at http://www.php.net/fgets, on preg_split( ) at http://www.php.net/preg-split, and on the Perl-compatible regular expression extension at http://www.php.net/pcre.

Copyright © 2003 O'Reilly & Associates.
All rights reserved.
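
As a supplement to the Discussion in 18.8.3, here is a minimal sketch of how splitting on \s and splitting on \b treat the term 6 o'clock. The variable names are illustrative only. Splitting on \b alone also returns the whitespace between words as a piece, so the sketch trims each piece and drops the empty ones; that cleanup step is an assumption made here so the result matches the four words listed in the Discussion.

```php
<?php
$term = "6 o'clock";

// Split on runs of whitespace: embedded punctuation stays inside the word.
$by_whitespace = preg_split('/\s+/', $term, -1, PREG_SPLIT_NO_EMPTY);

// Split at word boundaries: every switch between word and nonword
// characters starts a new piece, including the space between words.
$pieces = preg_split('/\b/', $term, -1, PREG_SPLIT_NO_EMPTY);

// Assumed cleanup step: trim each piece and discard whitespace-only pieces.
$pieces = array_map('trim', $pieces);
$by_boundary = array_values(array_filter($pieces, function ($p) {
    return $p !== '';
}));

print implode(' | ', $by_whitespace) . "\n";  // 6 | o'clock
print implode(' | ', $by_boundary) . "\n";    // 6 | o | ' | clock
```

Whether the apostrophe should count as its own word depends on the text being processed; for prose with contractions, splitting on whitespace usually gives counts closer to what a reader would expect.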