NAME
regexp - regular expression and pattern matching notation definitions
DESCRIPTION
A
Regular Expression
is a mechanism supported by many utilities
for locating and manipulating patterns in text.
Pattern Matching Notation
is used by shells and other utilities for file name expansion.
This manual entry defines two forms of regular expressions:
Basic Regular Expressions
and
Extended Regular Expressions;
and the one form of
Pattern Matching Notation.
BASIC REGULAR EXPRESSIONS
Basic regular expression (RE) notation
and construction rules apply to utilities defined as using basic REs.
Any exceptions to the following rules are noted
in the descriptions of the specific utilities that use REs.
REs Matching a Single Character
The following REs
match a single character or a single collating element:
Ordinary Characters
An ordinary character is an RE
that matches itself.
An ordinary character is any character in the supported character set
except newline and the regular expression special characters
listed in
Special Characters
below.
An ordinary character preceded by a backslash
(\)
is treated as the ordinary character itself,
except when the character is
(,
),
{,
or
},
or the digits
1
through
9
(see
REs Matching Multiple Characters).
Matching is based on the bit pattern used for encoding the character,
not on the graphic representation of the character.
Special Characters
A regular expression special character preceded by a backslash
is a regular expression that matches the special character itself.
When not preceded by a backslash,
such characters have special meaning in the specification of REs.
Regular expression special characters
and the contexts in which they have special meaning are:
- . [ \
The period, left square bracket, and backslash are special
except when used in a bracket expression (see
RE Bracket Expression).
- *
The asterisk is special except when used in a bracket expression,
as the first character of a regular expression,
or as the first character following the character pair
\(
(see
REs Matching Multiple Characters).
- ^
The circumflex is special when used as the first character
of an entire RE (see
Expression Anchoring)
or as the first character of a bracket expression.
- $
The dollar sign is special when used as the last character of an entire RE
(see
Expression Anchoring).
- delimiter
Any character used to bound (i.e., delimit) an entire RE
is special for that RE.
Period
A period
(.),
when used outside of a bracket expression, is an RE
that matches any printable or nonprintable character except newline.
RE Bracket Expression
A bracket expression enclosed in square brackets
([ ])
is an RE
that matches a single collating element contained in the nonempty set
of collating elements represented by the bracket expression. The following rules apply to bracket expressions:
- bracket expression
A bracket expression is either a
matching list expression
or a
non-matching list expression,
and consists of one or more expressions in any order.
Expressions can be:
collating elements, collating symbols, noncollating characters,
equivalence classes, range expressions, or character classes.
The right bracket
(])
loses its special meaning and represents itself in a bracket expression
if it occurs first in the list (after an initial
^,
if any).
Otherwise, it terminates the bracket expression
(unless it is the ending right bracket for a valid collating symbol,
equivalence class, or character class, or it is the collating element
within a collating symbol or equivalence class expression).
The special characters
(period, asterisk, left bracket, and backslash) lose their special meaning
within a bracket expression.
The character sequences
[. [= [:
(left-bracket followed by a period, equal-sign or colon) are special inside a
bracket expression and are used to delimit collating symbols, equivalence class
expressions and character class expressions.
These symbols must be followed by
a valid expression and the matching terminating
.],
=],
or
:].
- matching list
A matching list expression specifies a list that matches any one of the
characters represented in the list.
The first character in the list
cannot be the circumflex.
For example,
[abc]
is an RE that matches any of
a,
b,
or
c.
- non-matching list
A
non-matching list
expression begins with a circumflex
(^),
and specifies a list that matches any character or collating element
except
newline and the characters represented in the list.
For example,
[^abc]
is an RE that matches any character except newline or
a,
b,
or
c.
The circumflex has this special meaning
only
when it occurs first in the list,
immediately following the left square bracket.
- collating element
A
collating element
is a sequence of one or more characters that represents a single element
in the collating sequence as identified
via the most current setting of the locale variable
LC_COLLATE
(see
setlocale(3C)).
- collating symbol
A
collating symbol
is a collating element enclosed within bracket-period
([. .])
delimiters.
Multicharacter collating elements must be represented as collating symbols
to distinguish them from single-character collating elements.
For example, if the string
ch
is a valid collating element, then
[[.ch.]]
is treated as an element matching the same string of characters,
while
[ch]
is treated as a simple list of the characters
c
and
h.
If the string within the bracket-period delimiters
is not a valid collating element
in the current collating sequence definition,
the symbol is treated as an invalid expression.
- noncollating character
A
noncollating character
is a character that is ignored for collating purposes.
By definition, such characters cannot participate
in equivalence classes or range expressions.
- equivalence class
An
equivalence class
expression represents the set of collating elements
belonging to an equivalence class.
It is expressed by enclosing any one of the collating elements
in the equivalence class within bracket-equal
([= =])
delimiters.
For example, if
a
and
A
belong to the same equivalence class, then
[[=a=]b]
and
[[=A=]b]
are each equivalent to
[aAb].
- range expression
A
range expression
represents the set of collating elements that fall between
two elements in the current collation sequence
as defined via the most current setting of the locale variable
LC_COLLATE
(see
setlocale(3C)).
It is expressed as the starting point and
the ending point separated by a hyphen
(-). The starting range point and the ending range point
must be a collating element, collating symbol,
or equivalence class expression.
An
equivalence class expression
used as an end point of a range expression is interpreted such
that all collating elements within the equivalence class
are included in the range.
For example, if the collating order is
A,
a,
B,
b,
C,
c,
ch,
D,
and
d
and
the characters
A
and
a
belong to the same equivalence class, then the expression
[[=a=]-D]
is treated as
[AaBbCc[.ch.]D]. Both starting and ending range points must be valid collating elements,
collating symbols, or equivalence class expressions,
and the ending range point must collate equal to or higher than
the starting range point;
otherwise the expression is invalid.
For example, with the above collating order and assuming that
E
is a noncollating character, then both the expressions
[[=A=]-E]
and
[d-a]
are invalid. An ending range point can also be the starting range point in a subsequent
range expression.
Each such range expression is evaluated separately.
For example, the bracket expression
[a-m-o]
is treated as
[a-mm-o]. The hyphen character is treated as itself if it occurs first (after an
initial
^,
if any) or last in the list, or as the rightmost symbol
in a range expression.
As examples, the expressions
[-ac]
and
[ac-]
are equivalent and match any of the characters
a,
c,
or
-;
the expressions
[^-ac]
and
[^ac-]
are equivalent and match any characters except newline,
a,
c,
or
-;
the expression
[%--]
matches any of the characters in the defined collating sequence between
%
and
-
inclusive;
the expression
[--@]
matches any of the characters in the defined collating sequence between
-
and
@
inclusive;
and the expression
[a--@]
is invalid, assuming
-
precedes
a
in the collating sequence. If a bracket expression must specify both
-
and
],
the
]
must be placed first (after the
^,
if any) and the
-
last within the bracket expression.
- character class
A character class expression represents the set of characters belonging
to a character class, as defined via the most current setting of the
locale variable
LC_CTYPE.
It is expressed as a character class name enclosed within bracket-colon
([: :])
delimiters.
Standard character class expressions supported in all locales are:
- [:alpha:]
letters
- [:upper:]
upper-case letters
- [:lower:]
lower-case letters
- [:digit:]
decimal digits
- [:xdigit:]
hexadecimal digits
- [:alnum:]
letters or decimal digits
- [:space:]
characters producing white-space in displayed text
- [:print:]
printing characters
- [:punct:]
punctuation characters
- [:graph:]
characters with a visible representation
- [:cntrl:]
control characters
- [:blank:]
blank characters
For example, if the locale variable
LC_CTYPE
is set to
C,
the expression
[[:upper:]]
is equivalent
to
[A-Z].
Similarly the expression
[[:digit:]]
is the same as
[0-9].
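As a rough illustration of the bracket expression rules above, the following Python sketch may help. It is not a POSIX implementation: Python's re module has no collating symbols, equivalence classes, or [: :] character classes, and its non-matching lists also match newline. The names POSIX_CLASSES and expand_posix_classes are hypothetical helpers introduced for this example, and the table assumes LC_CTYPE is set to C.

```python
import re
import string

# Members of each standard class, assuming LC_CTYPE is "C" (ASCII only).
POSIX_CLASSES = {
    "alpha": string.ascii_letters,
    "upper": string.ascii_uppercase,
    "lower": string.ascii_lowercase,
    "digit": string.digits,
    "xdigit": string.hexdigits,
    "alnum": string.ascii_letters + string.digits,
    "space": " \t\n\r\v\f",
    "blank": " \t",
    "punct": string.punctuation,
    "graph": string.ascii_letters + string.digits + string.punctuation,
    "print": string.ascii_letters + string.digits + string.punctuation + " ",
    "cntrl": "".join(map(chr, range(32))) + chr(127),
}


def expand_posix_classes(pattern):
    """Replace each [:name:] with its member characters, escaped for re."""
    def members(match):
        return re.escape(POSIX_CLASSES[match.group(1)])
    return re.sub(r"\[:([a-z]+):\]", members, pattern)


# Matching and non-matching lists.
print(re.findall(r"[abc]", "xaybzc"))
print(re.findall(r"[^abc]", "xaybzc"))

# A hyphen that is first or last in the list stands for itself.
print(re.findall(r"[-ac]", "a-b-c") == re.findall(r"[ac-]", "a-b-c"))

# To list both ] and -, place ] first and - last.
print(re.findall(r"[]-]", "x]y-z"))

# In the C locale, [[:upper:]] lists the same characters as [A-Z].
print(expand_posix_classes("[[:upper:]]"))
```

Expanding the classes into explicit lists keeps the sketch independent of any locale support in the regular expression engine, at the cost of being correct only for the C locale.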
REs Matching Multiple Characters
The following rules may be used to construct REs
matching multiple characters
from REs matching a single character:
- RERE
The concatenation of REs is an RE
that matches the first encountered concatenation
of the strings matched by each component of the RE.
For example, the RE
bc
matches the second and third characters of the string
abcdefabcdef.
- RE*
An RE matching a single character followed by an asterisk
(*)
is an RE that matches zero or more occurrences of the RE
preceding the asterisk.
The first encountered string that permits a match is chosen,
and the matched string will encompass
the maximum number of characters permitted by the RE.
For example, in the string
abbbcdeabbbbbbcde,
both the RE
b*c
and the RE
bbb*c
are matched by the substring
bbbc
in the second through fifth positions.
An asterisk as the first character of an RE
loses this special meaning and is treated as itself.
- \(RE\)
A subexpression can be defined within an RE
by enclosing it between the character pairs
\(
and
\).
Such a subexpression matches whatever it would have matched without the
\(
and
\).
Subexpressions can be arbitrarily nested.
An asterisk immediately following the
\(
loses its special meaning and is treated as itself.
An asterisk immediately following the
\)
is treated as an invalid character.
- \n
The expression
\n
matches the same string of characters as was matched
by a subexpression enclosed between
\(
and
\)
preceding the
\n.
The character
n
must be a digit from
1
through
9,
specifying the
n-th
subexpression (the one that begins with the
n-th
\(
and ends with the corresponding paired
\)).
For example, the expression
^\(.*\)\1$
matches a line consisting of two adjacent appearances of the same string. If the
\n
is followed by an asterisk,
it matches zero or more occurrences of the subexpression referred to.
For example, the expression
\(ab\(cd\)ef\)Z\2*Z\1
matches the string
abcdefZcdcdZabcdef.
- RE\{m,n\}
An RE matching a single character followed by
\{m\},
\{m,\},
or
\{m,n\}
is an RE that matches repeated occurrences of the RE.
The values of
m
and
n
must be decimal integers in the range 0 through 255, with
m
specifying the exact or minimum number of occurrences and
n
specifying the maximum number of occurrences.
\{m\}
matches exactly
m
occurrences of the preceding RE,
\{m,\}
matches at least
m
occurrences, and
\{m,n\}
matches any number of occurrences between
m
and
n,
inclusive. The first encountered string that matches the expression is chosen;
it will contain as many occurrences of the RE as possible.
For example, in the string
abbbbbbbc
the RE
b\{3\}
is matched by characters two through four, the RE
b\{3,\}
is matched by characters two through eight, and the RE
b\{3,5\}c
is matched by characters four through nine.
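The match positions quoted in the examples above can be checked with a short Python sketch. Python's re module does not accept basic RE notation, so each pattern is given twice: as written in this entry and in the Python spelling, where \( \) and \{ \} lose their backslashes. The helper name positions and the EXAMPLES table are assumptions made for this illustration.

```python
import re


def positions(pattern, text):
    """Return the 1-based (first, last) positions of the first match, or None."""
    m = re.search(pattern, text)
    return (m.start() + 1, m.end()) if m else None


# (basic RE as written in this entry, Python spelling, subject string)
EXAMPLES = [
    (r"bc",        r"bc",      "abcdefabcdef"),
    (r"b*c",       r"b*c",     "abbbcdeabbbbbbcde"),
    (r"bbb*c",     r"bbb*c",   "abbbcdeabbbbbbcde"),
    (r"b\{3\}",    r"b{3}",    "abbbbbbbc"),
    (r"b\{3,\}",   r"b{3,}",   "abbbbbbbc"),
    (r"b\{3,5\}c", r"b{3,5}c", "abbbbbbbc"),
]

for bre, python_re, text in EXAMPLES:
    print(bre, "in", text, "->", positions(python_re, text))

# Back-references: ^\(.*\)\1$ becomes ^(.*)\1$ in the Python spelling.
print(bool(re.search(r"^(.*)\1$", "abcabc")))

# \(ab\(cd\)ef\)Z\2*Z\1 becomes (ab(cd)ef)Z\2*Z\1.
print(bool(re.fullmatch(r"(ab(cd)ef)Z\2*Z\1", "abcdefZcdcdZabcdef")))
```

Positions are reported as inclusive character numbers counted from one, to line up with phrases such as "characters two through four" in the text.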
Expression Anchoring
An RE can be limited to matching strings that begin or end
a line (i.e., anchored) according to the following rules:
A circumflex
(^)
as the first character of an RE
anchors the expression to the beginning of a line;
only strings starting at the first character of a line are matched by the RE.
For example, the RE
^ab
matches the string
ab
in the line
abcdef,
but not the same string in the line
cdefab. A dollar sign
($)
as the last character of an RE
anchors the expression to the end of a line;
only strings ending at the last character of a line are matched by the RE.
For example, the RE
ab$
matches the string
ab
in the line
cdefab,
but not the same string in the line
abcdef. An RE anchored by both
^
and
$
matches only strings that are lines.
For example, the RE
^abcdef$
matches only lines consisting of the string
abcdef.
The use of duplication characters (+,*) following anchors is illegal.
EXTENDED REGULAR EXPRESSIONS
The extended regular expression (ERE) notation and construction rules
apply to utilities defined as using extended REs.
Any exceptions to the following rules are noted
in the descriptions of the specific utilities using EREs.
EREs Matching a Single Character
The following EREs match a single character or a single collating element:
Ordinary Characters
An ordinary character is an ERE that matches itself.
An ordinary character is any character in the supported character set
except newline and the regular expression special characters
listed in Special Characters below.
An ordinary character preceded by a backslash
(\)
is treated as the ordinary character itself.
Matching is based on the bit pattern used for encoding the character,
not on the graphic representation of the character.
Special Characters
A regular expression special character preceded by a backslash
is a regular expression that matches the special character itself.
When not preceded by a backslash, such characters have special meaning
in the specification of EREs.
The extended regular expression special characters
and the contexts in which they have their special meaning are:
- . [ \ ( ) * + ? $ |
The period, left square bracket, backslash, left parenthesis,
right parenthesis, asterisk, plus sign, question mark, dollar sign,
and vertical bar are special except when used in a bracket expression
(see
ERE Bracket Expression).
- ^
The circumflex is special except when used
in a bracket expression in a non-leading position.
- delimiter
Any character used to bound (i.e., delimit) an entire ERE
is special for that ERE.
Period
A period
(.),
when used outside of a bracket expression, is an ERE
that matches any printable or nonprintable character except newline.
ERE Bracket Expression
The syntax and rules for ERE bracket expressions are the same as for RE
bracket expressions found above.
EREs Matching Multiple Characters
The following rules may be used to construct EREs
matching multiple characters from EREs
matching a single character:
- EREERE
A concatenation of EREs
matches the first encountered concatenation of
the strings matched by each component of the ERE.
Such a concatenation of EREs
enclosed in parentheses matches whatever the concatenation
without the parentheses matches.
For example, both the ERE
bc
and the ERE
(bc)
match the second and third characters of the string
abcdefabcdef.
The longest overall string is matched.
- ERE+
The special character plus
(+),
when following an ERE
matching a single character, or a concatenation of EREs
enclosed in parentheses, is an ERE
that matches one or more occurrences of the ERE
preceding the plus sign.
The string matched will contain as many occurrences as possible.
For example, the ERE
b+c
matches the fourth through seventh characters in the string
acabbbcde.
- ERE*
The special character asterisk
(*),
when following an ERE
matching a single character, or a concatenation of EREs
enclosed in parentheses, is an ERE
that matches zero or more occurrences of the ERE
preceding the asterisk.
For example, the ERE
b*c
matches the first character in the string
cabbbcde.
If there is any choice, the longest left-most string
that permits a match is chosen.
For example, the ERE
b*cd
matches the third through seventh characters in the string
cabbbcdebbbbbbcdbc.
- ERE?
The special character question mark
(?),
when following an ERE
matching a single character, or a concatenation of EREs
enclosed in parentheses, is an ERE
that matches zero or one occurrence of the ERE
preceding the question mark.
The string matched will contain as many occurrences as possible.
For example, the ERE
b?c
matches the second character in the string
acabbbcde.
- ERE{m,n}
An interval expression. It functions the same way
as the basic regular expression syntax
RE\{m,n\},
but is written without the backslashes.
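The ERE examples above can be tried in Python, whose re syntax is close to ERE notation for these operators. This is only a sketch: Python uses a backtracking matcher rather than a longest-match rule, although the results coincide for the patterns shown here. The helper name first_match is an assumption made for this illustration.

```python
import re


def first_match(ere, text):
    """Return the 1-based (first, last) positions of the first match, or None."""
    m = re.search(ere, text)
    return (m.start() + 1, m.end()) if m else None


# Parentheses group without changing what is matched.
print(first_match(r"bc", "abcdefabcdef"), first_match(r"(bc)", "abcdefabcdef"))

# One or more: the fourth through seventh characters.
print(first_match(r"b+c", "acabbbcde"))

# Zero or more: the first character, then the third through seventh.
print(first_match(r"b*c", "cabbbcde"))
print(first_match(r"b*cd", "cabbbcdebbbbbbcdbc"))

# Zero or one: the second character.
print(first_match(r"b?c", "acabbbcde"))

# Interval expression, written without backslashes in an ERE.
print(first_match(r"b{3,5}c", "abbbbbbbc"))

# A duplication operator applied to a parenthesized concatenation.
print(first_match(r"(ab)+", "xababab"))
```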
Alternation
Two EREs separated by the special character vertical bar
(|)
match a string that is matched by either ERE.
For example, the ERE
((ab)|c)d
matches the string
abd
and the string
cd.
A vertical bar
(|):
- may not appear first or last in an ERE.
- may not appear immediately following a vertical bar.
- may not appear after a left parenthesis.
- may not appear immediately preceding a right parenthesis.
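A small Python sketch of alternation follows. Python's re module is used as a stand-in for an ERE matcher here. It is more permissive than the restrictions listed above, so it should not be used to judge whether an ERE is valid. The helper name whole_matches is an assumption made for this illustration.

```python
import re


def whole_matches(ere, candidates):
    """Return the candidates that the ERE matches in their entirety."""
    return [s for s in candidates if re.fullmatch(ere, s)]


# ((ab)|c)d matches abd and cd, and nothing else in this list.
print(whole_matches(r"((ab)|c)d", ["abd", "cd", "ad", "abcd"]))

# Without parentheses the vertical bar separates whole concatenations.
print(whole_matches(r"abba|cde", ["abba", "cde", "abbade", "abbcde"]))

# Parentheses are needed to alternate a single character inside a string.
print(whole_matches(r"abb(a|c)de", ["abba", "cde", "abbade", "abbcde"]))
```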
Precedence
The order of precedence is as follows, from high to low:
- [ ]
square brackets
- * + ?
asterisk, plus sign, question mark
- ^ $
anchoring
- concatenation (no operator character)
concatenation
- |
alternation
For example, the ERE
abba|cde
is interpreted as "match either
abba
or
cde".
It does not mean "match
abb
followed by
a
or
c
followed in turn by
de"
(because concatenation has a higher order of precedence than alternation).
Expression Anchoring
An ERE can be limited to matching strings that begin or end a line
(i.e., anchored) according to the following rules:
A circumflex
(^)
matches the beginning of a line
(anchors the expression to the beginning of a line).
For example, the ERE
^ab
matches the string
ab
in the line
abcdef,
but not the same string in the line
cdefab. A dollar sign
($)
matches the end of a line (anchors the expression to the end of a line).
For example, the ERE
ab$
matches the string
ab
in the line
cdefab,
but not the same string in the line
abcdef. An ERE anchored by both
^
and
$
matches only strings that are lines.
For example, the ERE
^abcdef$
matches only lines consisting of the string
abcdef.
Only empty lines match the ERE
^$.
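The anchoring rules can be illustrated with a few lines of Python. This is a sketch under the assumption that each candidate line is supplied without its trailing newline. Python's $ can also match just before a final newline, which the rules above do not describe. The helper name matching_lines is hypothetical.

```python
import re

LINES = ["abcdef", "cdefab", "ab", ""]


def matching_lines(ere, lines):
    """Return the lines in which the ERE finds a match."""
    return [line for line in lines if re.search(ere, line)]


print(matching_lines(r"^ab", LINES))        # lines that begin with ab
print(matching_lines(r"ab$", LINES))        # lines that end with ab
print(matching_lines(r"^abcdef$", LINES))   # lines that are exactly abcdef
print(matching_lines(r"^$", LINES))         # empty lines only
```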
The use of duplication characters (+,*) following anchors is illegal.
PATTERN MATCHING NOTATION
The following rules apply to pattern matching notation except as noted
in the descriptions of the specific utilities using pattern matching.
Patterns Matching a Single Character
The following patterns match a single character or a single collating element:
Ordinary Characters
An ordinary character is a pattern that matches itself.
An ordinary character is any character in the supported character set
except newline and the pattern matching special characters
listed in Special Characters below.
Matching is based on the bit pattern used for encoding the character,
not on the graphic representation of the character.
Special Characters
A pattern matching special character preceded by a backslash
(\)
is a pattern that matches the special character itself.
When not preceded by a backslash, such characters have special meaning
in the specification of patterns.
The pattern matching special characters and the contexts
in which they have their special meaning are:
- ? * [
The question mark, asterisk, and left square bracket are special except when
used in a bracket expression (see
Pattern Bracket Expression).
Question Mark
A question mark
(?),
when used outside of a bracket expression, is a pattern
that matches any printable or nonprintable character except newline.
Pattern Bracket Expression
The syntax and rules for pattern bracket expressions are the same as for RE
bracket expressions found above with the following exceptions:
- The exclamation point character
(!)
replaces the circumflex character
(^)
in its role in a non-matching list in the regular expression notation.
- The backslash is used as an escape character within bracket expressions.
Patterns Matching Multiple Characters
The following rules may be used to construct patterns matching
multiple characters from patterns matching a single character:
- *
The asterisk
(*)
is a pattern that matches any string, including the null string.
- RERE
The concatenation of patterns matching a single character
is a valid pattern that matches the concatenation of the single characters
or collating elements matched by each of the concatenated patterns.
For example, the pattern
a[bc]
matches the strings
ab
and
ac. The concatenation of one or more patterns matching a single character with
one or more asterisks is a valid pattern.
In such patterns, each asterisk matches a string of zero or more characters,
up to the first character that matches the character
following the asterisk in the pattern. For example, the pattern
a*d
matches the strings
ad,
abd,
and
abcd;
but not the string
abc.
When an asterisk is the first or last character in a pattern,
it matches zero or more characters that precede
or follow the characters matched by the remainder of the pattern.
For example, the pattern
a*d*
matches the strings
ad,
abcd,
abcdef,
aaaad,
and
adddd;
the pattern
*a*d
matches the strings
ad,
abcd,
efabcd,
aaaad,
and
adddd.
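The pattern examples above can be checked with Python's fnmatch module, which implements a similar notation. It is only an approximation of the rules in this entry: for instance, it does not treat the backslash as an escape character. The helper name matches is an assumption made for this illustration.

```python
from fnmatch import fnmatchcase


def matches(pattern, names):
    """Return the names matched in full by the pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# A bracket expression matches one character from the list.
print(matches("a[bc]", ["ab", "ac", "ad"]))

# An exclamation point first in the list makes it a non-matching list.
print(matches("a[!bc]", ["ab", "ac", "ad"]))

# A question mark matches exactly one character.
print(matches("a?c", ["abc", "ac", "abbc"]))

# An asterisk matches any string, including the null string.
print(matches("a*d", ["ad", "abd", "abcd", "abc"]))
print(matches("a*d*", ["ad", "abcd", "abcdef", "aaaad", "adddd"]))
print(matches("*a*d", ["ad", "abcd", "efabcd", "aaaad", "adddd"]))
```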
Rule Qualification for Patterns Used for Filename Expansion
The rules described above for pattern matching
are qualified by the following rules when the pattern matching notation
is used for filename expansion by
sh(1),
csh(1),
ksh(1),
and
make(1).
If a filename (including the component of a pathname that follows the
slash
(/)
character) begins with a period
(.),
the period must be explicitly matched
by using a period as the first character of the pattern;
it cannot be matched by either the asterisk special character,
the question mark special character, or a bracket expression.
This rule does not apply to
make(1). The slash character in a pathname must be explicitly matched
by using a slash in the pattern;
it cannot be matched by either the asterisk special character,
the question mark special character, or a bracket expression.
For
make(1)
only the part of the pathname following the last slash character
can be matched by a special character.
That is, all special characters preceding the last slash character
lose their special meaning. Specified patterns are matched
against existing filenames and pathnames, as appropriate.
If the pattern matches any existing filenames or pathnames,
the pattern is replaced with those filenames and pathnames,
sorted according to the collating sequence in effect.
If the pattern does not match any existing filenames or pathnames,
the pattern string is left unchanged. If the pattern begins with a tilde
(~)
character, all of the ordinary characters preceding the first slash
(or all characters if there is no slash)
are treated as a possible login name.
If the login name is null
(i.e., the pattern contains only the tilde or the tilde is
immediately followed by a slash),
the tilde is replaced by a pathname of the process's home directory,
followed by a slash.
Otherwise, the combination of tilde and login name
is replaced by a pathname of the home directory
associated with the login name, followed by a slash.
If the system cannot identify the login name,
the result is implementation-defined.
This rule does not apply to
sh(1)
or
make(1). If the pattern contains a
$
character, variable substitution can take place.
Environment variables can be embedded within patterns as:
$name
or
${name}.
Braces are used to guarantee that characters following
name
are not interpreted as belonging to
name.
Substitution occurs in the order specified only once;
that is, the resulting string is not examined again for new names
that occurred because of the substitution.
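Several of the filename expansion rules above can be observed with Python's glob module, used here as a stand-in for a shell. This is a sketch under stated assumptions: the helper name expand, the temporary directory contents, and the variable GR_DEMO_DIR are all made up for the example, and results are sorted by character code rather than by the locale's collating sequence.

```python
import glob
import os
import tempfile


def expand(pattern, directory):
    """Expand pattern within directory; return it unchanged if nothing matches."""
    full = os.path.join(glob.escape(directory), pattern)
    found = sorted(os.path.relpath(p, directory) for p in glob.glob(full))
    return found or [pattern]


with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, "sub"))
    for name in ("a.txt", "b.txt", ".hidden", os.path.join("sub", "c.txt")):
        open(os.path.join(demo_dir, name), "w").close()

    all_names = expand("*", demo_dir)       # the leading period is not matched
    dot_names = expand(".h*", demo_dir)     # an explicit period is required
    txt_names = expand("*.txt", demo_dir)   # the asterisk does not cross a slash
    nested = expand("*/*.txt", demo_dir)    # the slash is matched explicitly
    unmatched = expand("*.c", demo_dir)     # no match: pattern left unchanged

print(all_names, dot_names, txt_names, nested, unmatched)

# Braces keep the characters that follow from being read as part of the name.
os.environ["GR_DEMO_DIR"] = "src"           # hypothetical variable for the demo
with_braces = os.path.expandvars("${GR_DEMO_DIR}files/*.c")
# Without braces the name is read as GR_DEMO_DIRfiles, which is not set;
# Python leaves such a reference in place.
without_braces = os.path.expandvars("$GR_DEMO_DIRfiles/*.c")
print(with_braces, without_braces)
```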
Rule Qualification for Patterns Used in the case Command
The rules described above for pattern matching
are qualified by the following rule when the pattern matching notation
is used in the case command of
sh(1)
and
ksh(1).
Multiple alternative patterns in a single clause
can be specified by separating individual patterns
with the vertical bar character
(|);
strings matching any of the patterns separated this way will cause the
corresponding command list to be selected.
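The selection rule for the case command can be sketched in Python as follows. This is a simplified model and not the way any shell implements it: the names select_clause and CLAUSES are hypothetical, command lists are represented by plain strings, and the split on the vertical bar does not handle a quoted or escaped bar.

```python
from fnmatch import fnmatchcase


def select_clause(word, clauses):
    """Return the command list of the first clause with a matching pattern."""
    for patterns, command_list in clauses:
        # Alternative patterns in one clause are separated by vertical bars.
        if any(fnmatchcase(word, p) for p in patterns.split("|")):
            return command_list
    return None


CLAUSES = [
    ("*.c|*.h", "compile"),
    ("[Mm]akefile", "build"),
    ("*", "ignore"),
]

for word in ("main.c", "util.h", "Makefile", "README"):
    print(word, "->", select_clause(word, CLAUSES))
```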
STANDARDS CONFORMANCE
<regexp.h>: AES, SVID2, SVID3, XPG2, XPG3, XPG4