9.7 Regular Expressions and the re Module
A regular
expression is a string that represents a
pattern. With regular expression functionality, you can compare that
pattern to another string and see if any part of the string matches
the pattern.
The re module
supplies all of Python's regular expression
functionality. The compile function builds a
regular expression object from a pattern string and optional flags.
The methods of a regular expression object look for matches of the
regular expression in a string and/or perform substitutions. Module
re also exposes functions equivalent to a regular
expression's methods, but with the regular
expression's pattern string as their first argument.
Regular expressions can be difficult to master, and this book does
not purport to teach them—I cover only the ways in which you
can use them in Python. For general coverage of regular expressions,
I recommend the book Mastering
Regular Expressions, by
Jeffrey Friedl (O'Reilly). Friedl's
book offers thorough coverage of regular expressions at both the
tutorial and advanced levels.
9.7.1 Pattern-String Syntax
The pattern string representing a
regular expression follows a specific syntax:
Alphabetic and numeric characters stand for themselves. A regular
expression whose pattern is a string of letters and digits matches
the same string.
Many alphanumeric characters acquire special meaning in a pattern
when they are preceded by a backslash
(\).
Punctuation works the other way around. A punctuation character is
self-matching when escaped, and has a special meaning when
unescaped.
The backslash character itself is matched by a repeated backslash
(i.e., the pattern \\).
Since regular expression patterns often contain backslashes, you
generally want to specify them using raw-string syntax (covered in
Chapter 4). Pattern elements (e.g.,
r'\t', which is equivalent to the non-raw string
literal '\\t') do match the corresponding special
characters (e.g., the tab character '\t').
Therefore, you can use raw-string syntax even when you do need a
literal match for some such special character.
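As a small illustration of the point above, the following sketch checks that the raw and non-raw spellings denote the same pattern string, and that the pattern element \t matches a literal tab. The sample strings and variable names are made up for this example.

```python
import re

# r'\t' and '\\t' are the same two-character pattern string:
# a backslash followed by the letter t.
assert r'\t' == '\\t'
assert len(r'\t') == 2

# The re module interprets that pattern element as a tab character,
# so it matches the tab inside the target string.
tab_match = re.search(r'\t', 'name\tvalue')
assert tab_match is not None
assert tab_match.start() == 4

# To match a literal backslash, the pattern needs two backslashes:
# r'\\' in raw-string syntax, or '\\\\' in a non-raw literal.
assert r'\\' == '\\\\'
backslash_match = re.search(r'\\', 'C:\\temp')
assert backslash_match.start() == 2
```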
Table 9-2 lists the special elements in regular
expression pattern syntax. The exact meanings of some pattern
elements change when you use optional flags, together with the
pattern string, to build the regular expression object. The optional
flags are covered later in this chapter.
Table 9-2. Regular expression pattern syntax
Element | Meaning
. | Matches any character except \n (if DOTALL, also matches \n)
^ | Matches start of string (if MULTILINE, also matches after \n)
$ | Matches end of string (if MULTILINE, also matches before \n)
* | Matches zero or more cases of the previous regular expression; greedy (match as many as possible)
+ | Matches one or more cases of the previous regular expression; greedy (match as many as possible)
? | Matches zero or one case of the previous regular expression; greedy (match one if possible)
*?, +?, ?? | Non-greedy versions of *, +, and ? (match as few as possible)
{m,n} | Matches m to n cases of the previous regular expression (greedy)
{m,n}? | Matches m to n cases of the previous regular expression (non-greedy)
[...] | Matches any one of a set of characters contained within the brackets
| | Matches expression either preceding it or following it
(...) | Matches the regular expression within the parentheses and also indicates a group
(?iLmsux) | Alternate way to set optional flags; no effect on match
(?:...) | Like (...), but does not indicate a group
(?P<id>...) | Like (...), but the group also gets the name id
(?P=id) | Matches whatever was previously matched by group named id
(?#...) | Content of parentheses is just a comment; no effect on match
(?=...) | Lookahead assertion; matches if regular expression ... matches what comes next, but does not consume any part of the string
(?!...) | Negative lookahead assertion; matches if regular expression ... does not match what comes next, and does not consume any part of the string
(?<=...) | Lookbehind assertion; matches if there is a match for regular expression ... ending at the current position (... must match a fixed length)
(?<!...) | Negative lookbehind assertion; matches if there is no match for regular expression ... ending at the current position (... must match a fixed length)
\number | Matches whatever was previously matched by group numbered number (groups are automatically numbered from 1 up to 99)
\A | Matches an empty string, but only at the start of the whole string
\b | Matches an empty string, but only at the start or end of a word (a maximal sequence of alphanumeric characters; see also \w)
\B | Matches an empty string, but not at the start or end of a word
\d | Matches one digit, like the set [0-9]
\D | Matches one non-digit, like the set [^0-9]
\s | Matches a whitespace character, like the set [ \t\n\r\f\v]
\S | Matches a non-white character, like the set [^ \t\n\r\f\v]
\w | Matches one alphanumeric character; unless LOCALE or UNICODE is set, \w is like [a-zA-Z0-9_]
\W | Matches one non-alphanumeric character, the reverse of \w
\Z | Matches an empty string, but only at the end of the whole string
\\ | Matches one backslash character
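A few of the elements in Table 9-2 are easier to grasp from a concrete run. The sketch below uses made-up sample strings and variable names; the comments give the result of each call.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .* grabs as much as possible, so the match runs to the last '>'.
greedy = re.search(r'<.*>', text).group()      # the whole of text
# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.search(r'<.*?>', text).group()       # '<b>'

# {m,n}: between two and three digits, as many as possible.
digits = re.search(r'\d{2,3}', 'id 12345').group()   # '123'

# Lookahead: match 'foo' only when 'bar' follows; 'bar' is not consumed.
ahead = re.search(r'foo(?=bar)', 'foobar')     # matches 'foo', ends at index 3

# Named group plus a named back reference: finds a doubled word.
doubled = re.search(r'(?P<word>\w+) (?P=word)', 'it is is here')
# doubled.group() is 'is is'; doubled.group('word') is 'is'
```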
9.7.2 Common Regular Expression Idioms
'.*'
as a substring of a regular expression's pattern
string means "any number of repetitions (zero or
more) of any character." In other words,
'.*' matches any substring of a target string,
including the empty substring. '.+' is similar,
but it matches only a non-empty substring. For example:
'pre.*post'
matches a string containing a substring 'pre'
followed by a later substring 'post', even if the
latter is adjacent to the former (e.g., it matches both
'prepost' and 'pre23post'). On
the other hand:
'pre.+post'
matches only if 'pre' and
'post' are not adjacent (e.g., it matches
'pre23post' but does not match
'prepost'). Both patterns also match strings that
continue after the 'post'.
To constrain a pattern to match only strings that end with
'post', end the pattern with
\Z. For example:
r'pre.*post\Z'
matches 'prepost', but not
'preposterous'. Note that we need to express the
pattern with raw-string syntax (or escape the backslash
\ by doubling it into \\), as
it contains a backslash. Using raw-string syntax for all regular
expression pattern literals is good practice in Python, as
it's the simplest way to ensure
you'll never fail to escape a backslash.
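The three patterns just discussed can be checked side by side. This is only a sketch: the variable names are invented here, and it uses the match method (covered later in this chapter) to test each target string.

```python
import re

any_between = re.compile('pre.*post')
some_between = re.compile('pre.+post')
must_end = re.compile(r'pre.*post\Z')

# .* accepts zero characters between 'pre' and 'post'; .+ needs at least one.
assert any_between.match('prepost')
assert any_between.match('pre23post')
assert not some_between.match('prepost')
assert some_between.match('pre23post')

# Without \Z, text after 'post' is allowed; with \Z it is not.
assert any_between.match('preposterous')
assert must_end.match('prepost')
assert not must_end.match('preposterous')
```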
Another frequently used element in regular expression patterns is
\b, which matches a word boundary. If you want to
match the word 'his' only as a whole word and not
its occurrences as a substring in such words as
'this' and 'history', the
regular expression pattern is:
r'\bhis\b'
with word boundaries both before and after. To match the beginning of
any word starting with 'her', such as
'her' itself but also
'hermetic', but not words that just contain
'her' elsewhere, such as
'ether', use:
r'\bher'
with a word boundary before, but not after, the relevant string. To
match the end of any word ending with 'its', such
as 'its' itself but also
'fits', but not words that contain
'its' elsewhere, such as
'itsy', use:
r'its\b'
with a word boundary after, but not before, the relevant string. To
match whole words thus constrained, rather than just their beginning
or end, add a pattern element \w* to match zero or
more word characters. For example, to match any full word starting
with 'her', use:
r'\bher\w*'
And to match any full word ending with 'its', use:
r'\w*its\b'
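To see these word-boundary patterns at work, the sketch below runs them over one sample sentence (invented for this example) with function findall, covered later in this chapter.

```python
import re

text = 'this is his history; her hermetic ether; its fits itsy'

# 'his' only as a whole word: not inside 'this' or 'history'.
whole_his = re.findall(r'\bhis\b', text)     # ['his']

# Full words starting with 'her': 'ether' is skipped.
her_words = re.findall(r'\bher\w*', text)    # ['her', 'hermetic']

# Full words ending with 'its': 'itsy' is skipped.
its_words = re.findall(r'\w*its\b', text)    # ['its', 'fits']
```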
9.7.3 Sets of Characters
You denote sets of characters in a
pattern by listing the characters within brackets ([
]). In addition to listing single characters, you can
denote a range by giving the first and last characters of the range
separated by a hyphen (-). The last character of
the range is included in the set, which is different from other
Python ranges. Within a set, special characters stand for themselves,
except \, ], and
-, which you must escape (by preceding them with a
backslash) when their position is such that, unescaped, they would
form part of the set's syntax. In a set, you can
also denote a class of characters by escaped-letter notation, such as
\d or \S. However,
\b in a set denotes a backspace character, not a
word boundary. If the first character in the set's
pattern, right after the [, is a caret
(^), the set is complemented.
In other words, the set matches any character except those that
follow ^ in the set pattern
notation.
A frequent use of character sets is to match a word, using a
definition of what characters can make up a word that differs from
\w's default (letters, digits, and the underscore). To match a word of one or more characters, each of which can
be a letter, an apostrophe, or a hyphen, but not a digit (e.g.,
'Finnegan-O'Hara'), use:
r"[a-zA-Z'\-]+"
It's not strictly necessary to escape the hyphen
with a backslash in this case, since its position makes it
syntactically unambiguous. However, the backslash makes the pattern
somewhat more readable, by visually distinguishing the hyphen that
you want to have as a character in the set from those used to denote
ranges.
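A short sketch of character sets in use follows. The sample strings are invented for illustration; the comments show what each call returns.

```python
import re

# Letters, apostrophes, and hyphens, but no digits.
name_word = re.compile(r"[a-zA-Z'\-]+")
found = name_word.findall("Finnegan-O'Hara met R2D2 in 1999")
# found == ["Finnegan-O'Hara", 'met', 'R', 'D', 'in']

# A leading caret complements the set: runs of characters that are not digits.
non_digits = re.findall(r'[^0-9]+', 'ab12cd')
# non_digits == ['ab', 'cd']

# Inside a set, \b denotes a backspace character, not a word boundary.
backspace = re.match(r'[\b]', '\x08')
assert backspace is not None
```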
9.7.4 Alternatives
A
vertical bar (|) in a regular expression pattern,
used to specify alternatives, has low precedence. Unless parentheses
change the grouping, | applies to the whole
pattern on either side, up to the start or end of the string, or to
another |. A pattern can be made up of any number
of subpatterns joined by |. To match such a
regular expression, the first subpattern is tried first, and if it
matches, the others are skipped. If the first subpattern does not
match, the second subpattern is tried, and so on.
| is neither greedy nor non-greedy, as it
doesn't take into consideration the length of the
match.
If you have a list L of words, a regular
expression pattern that matches any of the words is:
'|'.join([r'\b%s\b' % word for word in L])
If the items of L can be more-general
strings, not just words, you need to escape each of them with
function re.escape, covered later in this chapter,
and you probably don't want the
\b word boundary markers on either side. In this
case, use the regular expression pattern:
'|'.join(map(re.escape,L))
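Both ways of building an alternation from a list can be tried on small, made-up lists. The names words_pat, items, and items_pat in this sketch are hypothetical.

```python
import re

L = ['cat', 'dog']                      # hypothetical list of words
words_pat = '|'.join([r'\b%s\b' % word for word in L])
assert words_pat == r'\bcat\b|\bdog\b'
# Only whole words match: 'catalog' and 'hotdog' are skipped.
assert re.findall(words_pat, 'cat catalog dog hotdog') == ['cat', 'dog']

items = ['a+b', '1.5']                  # strings containing special characters
items_pat = '|'.join(map(re.escape, items))
# After escaping, + and . match only themselves.
assert re.findall(items_pat, 'a+b = 1.5, not aab or 125') == ['a+b', '1.5']

# The first alternative that matches wins, regardless of match length.
first_wins = re.match('a|ab', 'ab').group()
assert first_wins == 'a'
```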
9.7.5 Groups
A
regular expression can contain any number of groups, from none up to
99 (any number is allowed, but only the first 99 groups are fully
supported). Parentheses in a pattern string indicate a group. Element
(?P<id>...)
also indicates a group, and in addition gives the group a name,
id, that can be any Python identifier. All
groups, named and unnamed, are numbered from left to right, 1 to 99,
with group number 0 indicating the whole regular expression.
For any match of the regular expression with a string, each group
matches a substring (possibly an empty one). When the regular
expression uses |, some of the groups may not
match any substring, although the regular expression as a whole does
match the string. When a group doesn't match any
substring, we say that the group does not
participate in the match. An empty string
'' is used to represent the matching substring for
a group that does not participate in a match, except where otherwise
indicated later in this chapter.
For example:
r'(.+)\1+\Z'
matches a string made up of two or more repetitions of any non-empty
substring. The (.+) part of the pattern matches
any non-empty substring (any character, one or more times), and
defines a group thanks to the parentheses. The \1+
part of the pattern matches one or more repetitions of the group, and
the \Z anchors the match to end-of-string.
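The repetition pattern above can be exercised as follows. This is a minimal sketch with invented sample strings; it uses the match and group methods that are covered later in this chapter.

```python
import re

repeated = re.compile(r'(.+)\1+\Z')

mo = repeated.match('abcabcabc')
assert mo is not None
assert mo.group(1) == 'abc'          # the substring captured by group 1

# (.+) is greedy, so it captures the longest unit that still repeats.
assert repeated.match('aaaa').group(1) == 'aa'

# A trailing partial repetition, or a single occurrence, does not match.
assert repeated.match('abcab') is None
assert repeated.match('abc') is None
```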
9.7.6 Optional Flags
A regular expression pattern element
with one or more of the letters
"iLmsux" between
(? and ) lets you set regular
expression options within the regular expression's
pattern, rather than by the flags argument
to function compile of module
re. Options apply to the whole regular expression,
no matter where the options element occurs in the pattern. For
clarity, options should always be at the start of the pattern.
Placement at the start is mandatory if x is among
the options, since x changes the way Python parses
the pattern.
Using the explicit
flags argument is more readable than
placing an options element within the pattern. The
flags argument to function
compile is a coded integer, built by bitwise ORing
(with Python's bitwise OR operator,
|) one or more of the following attributes of
module re. Each attribute has both a short name
(one uppercase letter), for convenience, and a long name (an
uppercase multiletter identifier), which is more readable and thus
normally preferable:
- I or IGNORECASE
-
Makes matching case-insensitive
- L or LOCALE
-
Causes \w, \W,
\b, and \B matches to depend on
what the current locale deems alphanumeric
- M or MULTILINE
-
Makes the special characters ^ and
$ match at the start and end of each line (i.e.,
right after/before a newline), as well as at the start and end of the
whole string
- S or DOTALL
-
Causes the special character . to match any
character, including a newline
- U or UNICODE
-
Makes \w, \W,
\b, and \B matches depend on
what Unicode deems alphanumeric
- X or VERBOSE
-
Causes whitespace in the pattern to be ignored, except when escaped
or in a character set, and makes a # character in
the pattern begin a comment that lasts until the end of the line
For example, here are three ways to define equivalent regular
expressions with function compile, covered later
in this chapter. Each of these regular expressions matches the word
"hello" in any mix of upper- and
lowercase letters:
import re
r1 = re.compile(r'(?i)hello')
r2 = re.compile(r'hello', re.I)
r3 = re.compile(r'hello', re.IGNORECASE)
The third approach is clearly the most readable, and thus the most
maintainable, even though it is slightly more verbose. Note that the
raw-string form is not necessary here, since the patterns do not
include backslashes. However, using raw strings is still innocuous,
and is the recommended style for clarity.
Option re.VERBOSE (or re.X)
lets you make patterns more readable and understandable by
appropriate use of whitespace and comments. Complicated and verbose
regular expression patterns are generally best represented by strings
that take up more than one line, and therefore you normally want to
use the triple-quoted raw-string format for such pattern strings. For
example:
repat_num1 = r'(0[0-7]*|0x[\da-fA-F]+|[1-9]\d*)L?\Z'
repat_num2 = r'''(?x)            # pattern matching integer numbers
    (0 [0-7]*        |   # octal: leading 0, then 0+ octal digits
     0x [\da-fA-F]+  |   # hex: 0x, then 1+ hex digits
     [1-9] \d* )         # decimal: leading non-0, then 0+ digits
    L?\Z                 # optional trailing L, then end of string
    '''
The two patterns defined in this example are equivalent, but the
second one is made somewhat more readable by the comments and the
free use of whitespace to group portions of the pattern in logical
ways.
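One way to gain confidence that a compact pattern and its verbose rewrite really are equivalent is to run both over the same samples. The sketch below restates the two integer patterns and compares them on a hypothetical sample list; agreement on a few samples is evidence, not proof, of equivalence.

```python
import re

compact = re.compile(r'(0[0-7]*|0x[\da-fA-F]+|[1-9]\d*)L?\Z')
verbose = re.compile(r'''(?x)    # pattern matching integer numbers
    (0 [0-7]*        |   # octal: leading 0, then 0+ octal digits
     0x [\da-fA-F]+  |   # hex: 0x, then 1+ hex digits
     [1-9] \d* )         # decimal: leading non-0, then 0+ digits
    L?\Z                 # optional trailing L, then end of string
    ''')

# Hypothetical samples: some valid integer literals, some not.
samples = ['0', '017', '0x1F', '42', '42L', '08', '0x', 'x1', '4 2']
for sample in samples:
    # Both patterns must agree on whether the sample matches.
    assert bool(compact.match(sample)) == bool(verbose.match(sample))

accepted = [sample for sample in samples if compact.match(sample)]
# accepted == ['0', '017', '0x1F', '42', '42L']
```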
9.7.7 Match Versus Search
So far, we've
been using regular expressions to match strings. For example, the
regular expression with pattern r'box' matches
strings such as 'box' and
'boxes', but not 'inbox'. In
other words, a regular expression match can be considered as
implicitly anchored at the start of the target string, as if the
regular expression's pattern started with
\A.
Often, you're interested in locating possible
matches for a regular expression anywhere in the string, without any
anchoring (e.g., find the r'box' match inside such
strings as 'inbox', as well as in
'box' and 'boxes'). In this
case, the Python term for the operation is a
search, as opposed to a match. For such
searches, you use the search method of a regular
expression object, while the match method only
deals with matching from the start. For example:
import re
r1 = re.compile(r'box')
if r1.match('inbox'): print 'match succeeds'
else: print 'match fails'                       # prints: match fails
if r1.search('inbox'): print 'search succeeds'  # prints: search succeeds
else: print 'search fails'
9.7.8 Anchoring at String Start and End
The pattern elements
ensuring that a regular expression search (or match) is anchored at
string start and string end are \A and
\Z respectively. More traditionally, elements
^ for start and $ for end are
also used in similar roles. ^ is the same as
\A, and $ is nearly the same as
\Z (unlike \Z, $ also matches just before a newline that ends the
string), for regular expression objects that are not
multiline (i.e., that do not contain pattern element
(?m) and are not compiled with the flag
re.M or re.MULTILINE). For a
multiline regular expression object, however, ^
anchors at the start of any line (i.e., either at the start of the
whole string or at the position right after a newline character
\n). Similarly, with a multiline regular
expression, $ anchors at the end of any line
(i.e., either at the end of the whole string or at the position right
before \n). On the other hand,
\A and \Z anchor at the start
and end of the whole string whether the regular expression object is
multiline or not. For example, here's how to check
if a file has any lines that end with digits:
import re
digatend = re.compile(r'\d$', re.MULTILINE)
if digatend.search(open('afile.txt').read( )): print "some lines end with digits"
else: print "no lines end with digits"
A pattern of r'\d\n' would be almost equivalent,
but in that case the search would fail if the very last character of
the file were a digit not followed by a terminating end-of-line
character. With the example above, the search succeeds if a digit is
at the very end of the file's contents, as well as
in the more usual case where a digit is followed by an end-of-line
character.
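The difference between the two families of anchors shows up clearly on a multiline string. The following sketch uses an in-memory sample string rather than a file; the names are made up for the example, and findall is covered later in this chapter.

```python
import re

text = 'total 7\nno digits here\nversion 3'

# Non-multiline: $ only looks at the end of the whole string.
end_only = re.compile(r'\d$')
# Multiline: $ also anchors just before each newline.
each_line = re.compile(r'\d$', re.MULTILINE)
assert end_only.findall(text) == ['3']
assert each_line.findall(text) == ['7', '3']

# \Z ignores MULTILINE and anchors only at the very end of the string.
whole_end = re.compile(r'\d\Z', re.MULTILINE)
assert whole_end.findall(text) == ['3']

# ^ with MULTILINE anchors at each line start; \A only at the string start.
line_starts = re.compile(r'^\w+', re.MULTILINE).findall(text)
string_start = re.compile(r'\A\w+', re.MULTILINE).findall(text)
assert line_starts == ['total', 'no', 'version']
assert string_start == ['total']
```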
9.7.9 Regular Expression Objects
A regular expression object
r has the following read-only attributes
detailing how r was built (by function
compile of module re, covered
later in this chapter):
- flags
-
The flags argument passed to
compile, or 0 when
flags is omitted
- groupindex
-
A dictionary whose keys are group names as defined by elements
(?P<id>);
the corresponding values are the named groups'
numbers
- pattern
-
The pattern string from which r is compiled
These attributes make it easy to get back from a compiled regular
expression object to its pattern string and flags, so you never have
to store those separately.
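A quick sketch of these attributes, using a made-up pattern with two named groups. It tests the flag with the & operator rather than comparing flags for equality, on the assumption that an implementation may set additional bits besides the ones passed to compile.

```python
import re

r = re.compile(r'(?P<key>\w+)=(?P<value>\d+)', re.IGNORECASE)

# pattern: the exact string the object was compiled from.
assert r.pattern == r'(?P<key>\w+)=(?P<value>\d+)'

# flags: the IGNORECASE bit passed to compile is set.
assert r.flags & re.IGNORECASE

# groupindex: maps each group name to its group number.
assert r.groupindex['key'] == 1
assert r.groupindex['value'] == 2
```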
A regular expression object r also
supplies methods to locate matches for
r's regular expression
within a string, as well as to perform substitutions on such matches.
Matches are generally represented by special objects, covered in the
later Section 9.7.10.
r.findall(s)
When r has no groups,
findall returns a list of strings, each a
substring of s that is a non-overlapping
match with r. For example,
here's how to print out all words in a file, one per
line:
import re
reword = re.compile(r'\w+')
for aword in reword.findall(open('afile.txt').read( )):
    print aword
When r has one group,
findall also returns a list of strings, but each
is the substring of s matching
r's group. For example,
if you want to print only words that are followed by whitespace (not
punctuation), you need to change only one statement in the previous
example:
reword = re.compile(r'(\w+)\s')
When r has n
groups (where n is greater than
1), findall returns a list of
tuples, one per non-overlapping match with
r. Each tuple has
n items, one per group of
r, the substring of
s matching the group. For example,
here's how to print the first and last word of each
line that has at least two words:
import re
first_last = re.compile(r'^\W*(\w+)\b.*\b(\w+)\W*$',
                        re.MULTILINE)
for first, last in \
        first_last.findall(open('afile.txt').read( )):
    print first, last
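The three shapes of findall's result (strings for zero or one group, tuples for more) can be seen without a file. This sketch uses short in-memory strings invented for the purpose.

```python
import re

text = 'one two, three\nfour!'

# No groups: a list of whole matches.
words = re.compile(r'\w+').findall(text)
assert words == ['one', 'two', 'three', 'four']

# One group: a list of what the group matched
# (here, only words directly followed by whitespace).
spaced = re.compile(r'(\w+)\s').findall(text)
assert spaced == ['one', 'three']

# Two groups: a list of 2-tuples, one tuple per match.
pairs = re.compile(r'(\w+)=(\d+)').findall('a=1, b=22')
assert pairs == [('a', '1'), ('b', '22')]
```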
r.match(s,start=0,end=sys.maxint)
Returns an appropriate match object when a substring of
s, starting at index
start and not reaching as far as index
end, matches r.
Otherwise, match returns None.
Note that match is implicitly anchored at the
starting position start in
s. To search for a match with
r through s,
from start onwards, call
r.search, not
r.match. For example,
here's how to print all lines in a file that start
with digits:
import re
digs = re.compile(r'\d+')
for line in open('afile.txt'):
    if digs.match(line): print line,
r.search(s,start=0,end=sys.maxint)
Returns an appropriate match object for the leftmost substring of
s, starting not before index
start and not reaching as far as index
end, that matches
r. When no such substring exists,
search returns None. For
example, to print all lines containing digits, one simple approach is
as follows:
import re
digs = re.compile(r'\d+')
for line in open('afile.txt'):
    if digs.search(line): print line,
r.split(s,maxsplit=0)
Returns a list L of the splits of
s by r (i.e.,
the substrings of s that are separated by
non-overlapping, non-empty matches with
r). For example, to eliminate all
occurrences of substring 'hello' from a string, in
any mix of lowercase and uppercase letters, one way is:
import re
rehello = re.compile(r'hello', re.IGNORECASE)
astring = ''.join(rehello.split(astring))
When r has n
groups, n more items are interleaved in
L between each pair of splits. Each of the
n extra items is the substring of
s matching
r's corresponding group
in that match, or None if that group did not
participate in the match. For example, here's one
way to remove whitespace only when it occurs between a colon and a
digit:
import re
re_col_ws_dig = re.compile(r'(:)\s+(\d)')
astring = ''.join(re_col_ws_dig.split(astring))
If maxsplit is greater than
0, at most
maxsplit splits are in
L, each followed by
n items as above, while the trailing
substring of s after
maxsplit matches of
r, if any, is
L's last item. For
example, to remove only the first occurrence of substring
'hello' rather than all of them, change the last
statement in the first example above to:
astring = ''.join(rehello.split(astring, 1))
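The effect of groups and of maxsplit on the list that split returns is easiest to see on a tiny input. The sample strings in this sketch are invented for illustration.

```python
import re

# No groups: only the pieces between matches are returned.
plain = re.compile(r',\s*').split('a, b,c')
assert plain == ['a', 'b', 'c']

# One group: the text matched by the group is interleaved with the pieces.
with_group = re.compile(r'(,)\s*').split('a, b,c')
assert with_group == ['a', ',', 'b', ',', 'c']

# maxsplit of 1: split once; the rest of the string is the last item.
once = re.compile(r',\s*').split('a, b,c', 1)
assert once == ['a', 'b,c']

# Whitespace between a colon and a digit is dropped, because the two
# groups (the colon and the digit) are kept and the \s+ between is not.
re_col_ws_dig = re.compile(r'(:)\s+(\d)')
cleaned = ''.join(re_col_ws_dig.split('at 10:  30, to: noon'))
assert cleaned == 'at 10:30, to: noon'
```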
r.sub(repl,s,count=0)
Returns a copy of s where non-overlapping
matches with r are replaced by
repl, which can be either a string or a
callable object, such as a function. An empty match is replaced only
when not adjacent to the previous match. When
count is greater than
0, only the first count
matches of r within
s are replaced. When
count equals 0, all
matches of r within
s are replaced. For example,
here's another way to remove only the first
occurrence of substring 'hello' in any mix of
cases:
import re
rehello = re.compile(r'hello', re.IGNORECASE)
astring = rehello.sub('', astring, 1)
Without the final 1 argument to method
sub, this example would remove all occurrences of
'hello'.
When repl is a callable object,
repl must accept a single argument (a
match object) and return a string to use as the replacement for the
match. In this case, sub calls
repl, with a suitable match-object
argument, for each match with r that
sub is replacing. For example, to uppercase all
occurrences of words starting with 'h' and ending
with 'o' in any mix of cases, you can use the
following:
import re
h_word = re.compile(r'\bh\w+o\b', re.IGNORECASE)
def up(mo): return mo.group(0).upper( )
astring = h_word.sub(up, astring)
Method sub is a good way to get a callback to a
callable you supply for every non-overlapping match of
r in s, without
an explicit loop, even when you don't need to
perform any substitution. The following example shows this by using
the sub method to build a function that works just
like method findall for a regular expression
without groups:
import re
def findall(r, s):
    result = [ ]
    def foundOne(mo): result.append(mo.group( ))
    r.sub(foundOne, s)
    return result
The example needs Python 2.2, not just because it uses lexically
nested scopes, but because in Python 2.2 re
tolerates repl returning
None and treats it as if it returned
'', while in Python 2.1 re was
more pedantic and insisted on repl
returning a string.
When repl is a string,
sub uses repl itself as
the replacement, except that it expands back references. A
back reference is a substring of
repl of the form
\g<id>,
where id is the name of a group in
r (as established by syntax
(?P<id>)
in r's pattern string),
or \dd, where
dd is one or two digits, taken as a group
number. Each back reference, whether named or numbered, is replaced
with the substring of s matching the group
of r that the back reference indicates.
For example, here's how to enclose every word in
braces:
import re
grouped_word = re.compile(r'(\w+)')
astring = grouped_word.sub(r'{\1}', astring)
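The different forms of repl can be compared on short literal strings. In this sketch the pattern named pair and the function named shout are hypothetical names introduced only for the example.

```python
import re

# Numbered back reference: wrap each word in braces.
grouped_word = re.compile(r'(\w+)')
braced = grouped_word.sub(r'{\1}', 'hi there')
assert braced == '{hi} {there}'

# count of 1: only the first match is replaced.
first_only = grouped_word.sub(r'{\1}', 'hi there', 1)
assert first_only == '{hi} there'

# Named back references: turn "key=value" into "value=key".
pair = re.compile(r'(?P<key>\w+)=(?P<value>\w+)')
swapped = pair.sub(r'\g<value>=\g<key>', 'a=1 b=2')
assert swapped == '1=a 2=b'

# Callable repl: receives a match object, returns the replacement string.
def shout(mo):
    return mo.group(0).upper()

h_word = re.compile(r'\bh\w+o\b', re.IGNORECASE)
loud = h_word.sub(shout, 'Hello, halo!')
assert loud == 'HELLO, HALO!'
```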
r.subn(repl,s,count=0)
subn is the same as sub, except
that subn returns a pair
(new_string,
n) where
n is the number of substitutions that
subn has performed. For example, to count the
number of occurrences of substring 'hello' in any
mix of cases, one way is:
import re
rehello = re.compile(r'hello', re.IGNORECASE)
junk, count = rehello.subn('', astring)
print 'Found', count, 'occurrences of "hello"'
9.7.10 Match Objects
Match objects are created and returned by
methods match and search of a
regular expression object. They are also implicitly created by
methods sub and subn when
argument repl is callable, since in that
case a suitable match object is passed as the actual argument on each
call to repl. A match object
m supplies the following attributes
detailing how m was
created:
- pos
-
The start argument that was passed to
search or match (i.e., the
index into s where the search for a match
began)
- endpos
-
The end argument that was passed to
search or match (i.e., the
index into s before which the matching
substring of s had to end)
- lastgroup
-
The name of the last-matched group (None if the
last-matched group has no name, or if no group participated in the
match)
- lastindex
-
The integer index (1 and up) of the last-matched group
(None if no group participated in the match)
- re
-
The regular expression object r whose
method created m
- string
-
The string s passed to
match, search,
sub, or subn
A match object m also supplies several
methods.
m.end(groupid=0)
m.span(groupid=0)
m.start(groupid=0)
These methods return the delimiting indices, within
m.string, of the
substring matching the group identified by
groupid, where
groupid can be a group number or name.
When the matching substring is
m.string[i:j],
m.start returns
i,
m.end returns
j, and
m.span returns
(i,
j). When the group did
not participate in the match, i and
j are -1.
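The attributes and the index-returning methods of a match object can be inspected on a small example. The pattern, group names, and sample string in this sketch are invented for illustration; the third group is optional so that it can fail to participate in the match.

```python
import re

r = re.compile(r'(?P<low>\d+)-(?P<high>\d+)(\s*units)?')
s = 'range 10-25'
m = r.search(s, 3)              # start looking at index 3

assert m.pos == 3               # the start argument passed to search
assert m.re is r and m.string is s

assert m.span() == (6, 11)      # whole match: s[6:11] is '10-25'
assert (m.start('low'), m.end('low')) == (6, 8)
assert m.span(2) == (9, 11)     # group 2 is the group named 'high'
assert m.span(3) == (-1, -1)    # group 3 did not participate

# The last group that matched is number 2, named 'high'.
assert m.lastindex == 2 and m.lastgroup == 'high'
```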
m.expand(s)
Returns a copy of s where escape sequences
and back references are replaced in the same way as for method
r.sub, covered in the
previous section.
m.group(groupid=0,*groupids)
When called with a single argument groupid
(a group number or name), group returns the
substring matching the group identified by
groupid, or None if
that group did not participate in the match. The common idiom
m.group( ), also
spelled m.group(0),
returns the whole matched substring, since group number
0 implicitly means the whole regular expression.
When group is called with multiple arguments, each
argument must be a group number or name. group
then returns a tuple with one item per argument, the substring
matching the corresponding group, or None if that
group did not participate in the match.
m.groups(default=None)
Returns a tuple with one item per group in
r. Each item is the substring matching the
corresponding group, or default if that
group did not participate in the match.
m.groupdict(default=None)
Returns a dictionary whose keys are the names of all named groups in
r. The value for each name is the
substring matching the corresponding group, or
default if that group did not participate
in the match.
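The following sketch contrasts the methods that extract matched substrings, using a made-up pattern whose last named group is optional and does not participate in the match.

```python
import re

r = re.compile(r'(?P<user>\w+)@(?P<host>\w+)(?:\.(?P<tld>\w+))?')
m = r.match('guido@python')

# group: whole match, one group, or a tuple for several groups.
assert m.group() == 'guido@python'
assert m.group('user') == 'guido'
assert m.group(1, 'host') == ('guido', 'python')
assert m.group('tld') is None            # did not participate

# groups: one item per group, with a default for non-participants.
assert m.groups() == ('guido', 'python', None)
assert m.groups('') == ('guido', 'python', '')

# groupdict: the same information for named groups, keyed by name.
assert m.groupdict('n/a') == {'user': 'guido', 'host': 'python', 'tld': 'n/a'}

# expand: fills back references in a template from this match.
assert m.expand(r'\g<host>/\1') == 'python/guido'
```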
9.7.11 Functions of Module re
The re module supplies
the attributes listed in the earlier Section 9.7.6. It also provides a function
that corresponds to each method of a regular expression object
(findall, match,
search, split,
sub, and subn), each with an
additional first argument, a pattern string that the function
implicitly compiles into a regular expression object.
It's generally preferable to compile pattern strings
into regular expression objects explicitly and call the regular
expression object's methods, but sometimes, for a
one-off use of a regular expression pattern, calling functions of
module re can be slightly handier. For example, to
count the number of occurrences of substring
'hello' in any mix of cases, one function-based
way is:
import re
junk, count = re.subn(r'(?i)hello', '', astring)
print 'Found', count, 'occurrences of "hello"'
In cases such as this one, regular expression options (here, for
example, case insensitivity) must be encoded as regular expression
pattern elements (here, (?i)), since, in the Python versions covered
here, function subn of module re does not accept a
flags argument (more recent Python versions do accept one).
Module re also supplies error,
the class of exceptions raised upon errors (generally, errors in the
syntax of a pattern string), and two additional functions.
compile(pattern,flags=0)
Creates and returns a regular expression object, parsing string
pattern as per the syntax covered in Section 9.7.1, and using
integer flags as in Section 9.7.6, both earlier in this
chapter.
escape(s)
Returns a copy of string s where each
non-alphanumeric character is escaped (i.e., preceded by a backslash
\). This is handy when you need to match string
s literally as part (or all) of a regular
expression pattern string.
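A brief sketch of escape and of the error exception class follows. The sample text is invented. The example deliberately does not show the escaped string itself, since exactly which characters get a backslash is a detail that may vary between Python versions.

```python
import re

literal = 'cost: $5.00 (approx.)'     # hypothetical text with special characters

# Used directly as a pattern, the special characters keep their regular
# expression meaning ($ anchors at the end), so the text fails to match itself.
assert re.search(literal, literal) is None

# Escaped, every special character stands for itself.
escaped = re.compile(re.escape(literal))
found = escaped.search('total ' + literal + ' today')
assert found.group() == literal

# The escaped dot no longer matches an arbitrary character.
assert escaped.search('cost: $5x00 (approx.)') is None

# A syntactically invalid pattern raises re.error.
try:
    re.compile('(unclosed')
    raised = False
except re.error:
    raised = True
assert raised
```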