One of the toughest things to learn about regular expressions
is just what they do match.
The problem is that a regular expression tends to find the longest
possible match - which can be more than you want.
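To see the longest-match behavior in isolation, here is a minimal sketch using sed. The sample text and the function name greedy_demo are made up for illustration. The pattern could stop at the quote that closes "first", but it runs on to the last quote on the line:

```shell
#!/bin/sh
# greedy_demo is a hypothetical helper used only for this illustration.
# The pattern ".*" between two quotes could match just "first",
# but sed takes the longest match: everything up to the LAST quote.
greedy_demo() {
    echo 'xx "first" and "second" yy' | sed 's/".*"/[&]/'
}

greedy_demo    # prints: xx ["first" and "second"] yy
```

If only the first quoted string is wanted, a pattern such as "[^"]*" (a quote, any non-quote characters, a quote) is one way to keep the match from running on.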
showmatch
Here's a simple script called showmatch that is useful for testing
regular expressions, when writing sed scripts, etc. Given a regular
expression and a filename, it finds lines in the file matching that
expression, just like grep, but it uses a row of carets (^^^^) to
highlight the portion of the line that was actually matched.
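#! /bin/sh
# showmatch - mark string that matches pattern
pattern=$1; shift
nawk 'match($0,pattern) > 0 {
    s = substr($0,1,RSTART-1)        # text before the match
    m = substr($0,RSTART,RLENGTH)    # the matched text itself
    gsub (/[^\b- ]/, " ", s)         # blank out printing characters; tabs stay so the carets line up
    gsub (/./, "^", m)               # turn the match into a row of carets
    printf "%s\n%s%s\n", $0, s, m
}' pattern="$pattern" "$@"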
For example:

% showmatch 'CD-...' mbox
and CD-ROM publishing. We have recognized
    ^^^^^^
that documentation will be shipped on CD-ROM; however,
                                      ^^^^^^
xgrep
xgrep is a related script that simply retrieves only the matched text.
This allows you to extract patterned data from a file.
For example, you could extract only the numbers from a table
containing both text and numbers.
It's also great for counting the number of occurrences of some pattern
in your file, as shown below.
Just be sure that your expression only matches what you want.
If you aren't sure, leave off the wc command and glance at the output.
For example, the regular expression [0-9]* will match numbers
like 3.2 twice: once for the 3 and again for the 2!
You want to include a dot (.) and/or comma (,),
depending on how your numbers are written.
For example: [0-9][.0-9]* matches a leading digit, possibly
followed by more dots and digits.
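The difference between the two kinds of pattern can be checked with a small stand-in for xgrep. This is only a sketch: list_matches is a hypothetical helper, not the xgrep script itself, and it assumes an awk that has the match() function (nawk, or any POSIX awk).

```shell
#!/bin/sh
# list_matches RE: print every match of RE found on standard input,
# one match per line. (Hypothetical helper for illustration only.)
list_matches() {
    awk -v re="$1" '{
        rest = $0
        # Stop when nothing matches, or on an empty match,
        # so that the loop cannot run forever.
        while (match(rest, re) > 0 && RLENGTH > 0) {
            print substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
        }
    }'
}

echo 'version 3.2' | list_matches '[0-9][0-9]*'     # two matches: 3 and 2
echo 'version 3.2' | list_matches '[0-9][.0-9]*'    # one match: 3.2
```

Note that [0-9][.0-9]* will also pick up a period that ends a sentence right after a number, so it is still worth glancing at the output.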
NOTE:
Remember that an expression like [0-9]* will also match zero digits,
that is, the empty string
(because * means "zero or more of the preceding character").
That expression can make xgrep run for a very long time!
The following expression, which matches one or more digits,
is probably what you want instead:

xgrep "[0-9][0-9]*" files | wc -l
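A quick way to see the zero-length match for yourself is a sed substitution on a line that has no digits at all. The function names below are made up for this sketch:

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# [0-9]* matches the empty string at the start of the line,
# so the substitution happens even though there are no digits.
zero_or_more() {
    echo 'abc' | sed 's/[0-9]*/X/'
}

# [0-9][0-9]* needs at least one digit, so nothing is changed.
one_or_more() {
    echo 'abc' | sed 's/[0-9][0-9]*/X/'
}

zero_or_more    # prints: Xabc
one_or_more     # prints: abc
```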
The xgrep shell script runs the sed commands below,
replacing $re with the regular expression from the command line
and $x with a CTRL-b character (which is used as a delimiter).
We've shown the sed commands numbered, like 5>;
these are only for reference and aren't part of the script:

1>  \$x$re$x!d
2>  s//$x&$x/g
3>  s/[^$x]*$x//
4>  s/$x[^$x]*$x/\
    /g
5>  s/$x.*//

Command 1 deletes all input lines that don't contain a match.
On the remaining lines (which do match), command 2 surrounds the
matching text with CTRL-b delimiter characters.
Command 3 removes all characters (including the first delimiter)
before the first match on a line.
When there's more than one match on a line, command 4 breaks the
multiple matches onto separate lines.
Command 5 removes the last delimiter, and any text after it, from
every output line.
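The five commands can be put together in a shell function roughly as follows. This is a sketch, not the original xgrep script: the name xgrep_sketch, the use of printf to produce the CTRL-b character, and the way the commands are passed to a single sed are all assumptions made for illustration.

```shell
#!/bin/sh
# xgrep_sketch RE [file ...]: print only the text matched by RE,
# one match per line. (Hypothetical reconstruction for illustration.)
xgrep_sketch() {
    re=$1; shift
    x=$(printf '\002')    # CTRL-b, assumed not to occur in the input

    # Commands 1 to 5, in the order described above. The "!d" is
    # single-quoted so an interactive shell will not treat "!" specially.
    # In command 4, "\\" inside double quotes leaves one backslash
    # in front of the literal newline that sed needs.
    sed -e "\\${x}${re}${x}"'!d' \
        -e "s//${x}&${x}/g" \
        -e "s/[^${x}]*${x}//" \
        -e "s/${x}[^${x}]*${x}/\\
/g" \
        -e "s/${x}.*//" "$@"
}

printf 'a 12 b 345 c\nno digits here\n' | xgrep_sketch '[0-9][0-9]*'
# prints 12 and 345 on separate lines
```

Because command 2 uses an empty regular expression (//), sed reuses the expression from command 1, so the expression only has to appear once and may itself contain slashes.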
Greg Ubben revised showmatch and wrote xgrep.