6.16. Detecting Duplicate WordsProblemYou want to check for doubled words in a document. SolutionUse backreferences in your regular expression. Discussion
Parentheses in a pattern make the regular expression engine remember what matched that part of the pattern. Later in your pattern, you can refer to the actual string that matched with This sample code reads its input files by paragraph, with the definition of paragraph following Perl's notion of two or more contiguous newlines. Within each paragraph, it finds all duplicate words. It ignores case and can match across newlines.
Here we use $/ = ''; # paragrep mode while (<>) { while ( m{ \b # start at a word boundary (begin letters) (\S+) # find chunk of non-whitespace \b # until another word boundary (end letters) ( \s+ # separated by some whitespace \1 # and that very same chunk again \b # until another word boundary ) + # one or more sets of those }xig ) { print "dup word '$1' at paragraph $.\n"; } } That code finds the duplicated test in the following paragraph: This is a test test of the duplicate word finder.
The use of a word boundary anchor surrounding
Here's another interesting demonstration of using backreferences. Imagine you had two words in which the end of the first word was the same as the start of the next one, such as Conventional byte-by-byte processing the way a C programmer would write it would take a great deal of tricky code to solve this problem. But with a backtracking pattern matcher, it just takes one simple pattern match.
$a = 'nobody';
$b = 'bodysnatcher';
if ("$a $b" =~ /^(\w+)(\w+) \2(\w+)$/) {
print "$2 overlaps in $1-$2-$3\n";
}
You might think that
Actually, that won't quite work out if the overlap is itself the product of a doubling, as in /^(\w+?)(\w+) \2(\w+)$/, which solves this problem. Backtracking is more powerful than you imagine. Example 6.11 offers another take on the prime factorization problem from Chapter 2, Numbers . Example 6.11: prime-pattern#!/usr/bin/perl # prime_pattern -- find prime factors of argument using pattern matching for ($N = ('o' x shift); $N =~ /^(oo+?)\1+$/; $N =~ s/$1/o/g) { print length($1), " "; } print length ($N), "\n"; Although not practical, this approach marvelously demonstrates the power of backtracking and is therefore very instructional.
Here's another example. Using a brilliant insight first illustrated by Doug McIlroy (or so says Andrew Hume), you can find solutions to Diophantine equations of order one with regular expressions. Consider the equation
# solve for 12x + 15y + 16z = 281, maximizing x
if (($X, $Y, $Z) =
(('o' x 281) =~ /^(o*)\1{11}(o*)\2{14}(o*)\3{15}$/))
{
($x, $y, $z) = (length($X), length($Y), length($Z));
print "One solution is: x=$x; y=$y; z=$z.\n";
} else {
print "No solution.\n";
}
Because the first ('o' x 281) =~ /^(o+)\1{11}(o+)\2{14}(o+)\3{15}$/ An important lesson to be learned from these amazing feats of mathematical prowess by a lowly pattern matcher is that a pattern matching engine, particularly a backtracking one, very much wants to give you an answer and will work phenomenally hard to do so. But solving a regular expression with backreferences can take time exponentially proportional to the length of the input to complete. For all but trivial inputs, such algorithms make continental drift seem brisk. See AlsoThe explanation of backreferences in the "Regular Expressions" section of perlre (1), and in "the fine print" section of Chapter 2 of Programming Perl ; the "The Doubled-Word Thing" section in Chapter 2 of Mastering Regular Expressions |
|