
regexp(5)

HP-UX 11i Version 3: February 2007

NAME

regexp — regular expression and pattern matching notation definitions

DESCRIPTION

A Regular Expression is a mechanism supported by many utilities for locating and manipulating patterns in text. Pattern Matching Notation is used by shells and other utilities for file name expansion. This manual entry defines two forms of regular expressions, Basic Regular Expressions and Extended Regular Expressions, and one form of Pattern Matching Notation.

BASIC REGULAR EXPRESSIONS

Basic regular expression (RE) notation and construction rules apply to utilities defined as using basic REs. Any exceptions to the following rules are noted in the descriptions of the specific utilities that use REs.

REs Matching a Single Character

The following REs match a single character or a single collating element:

Ordinary Characters

An ordinary character is an RE that matches itself. An ordinary character is any character in the supported character set except newline and the regular expression special characters listed in Special Characters below. An ordinary character preceded by a backslash (\) is treated as the ordinary character itself, except when the character is (, ), {, or }, or the digits 1 through 9 (see REs Matching Multiple Characters). Matching is based on the bit pattern used for encoding the character, not on the graphic representation of the character.

Special Characters

A regular expression special character preceded by a backslash is a regular expression that matches the special character itself. When not preceded by a backslash, such characters have special meaning in the specification of REs. Regular expression special characters and the contexts in which they have special meaning are:

. [ \

The period, left square bracket, and backslash are special except when used in a bracket expression (see RE Bracket Expression).

*

The asterisk is special except when used in a bracket expression, as the first character of a regular expression, or as the first character following the character pair \( (see REs Matching Multiple Characters).

^

The circumflex is special when used as the first character of an entire RE (see Expression Anchoring) or as the first character of a bracket expression.

$

The dollar sign is special when used as the last character of an entire RE (see Expression Anchoring).

delimiter

Any character used to bound (i.e., delimit) an entire RE is special for that RE.

Period

A period (.), when used outside of a bracket expression, is an RE that matches any printable or nonprintable character except newline.

RE Bracket Expression

A bracket expression enclosed in square brackets ([ ]) is an RE that matches a single collating element contained in the nonempty set of collating elements represented by the bracket expression.

The following rules apply to bracket expressions:

bracket expression

A bracket expression is either a matching list expression or a non-matching list expression, and consists of one or more expressions in any order. Expressions can be: collating elements, collating symbols, noncollating characters, equivalence classes, range expressions, or character classes. The right bracket (]) loses its special meaning and represents itself in a bracket expression if it occurs first in the list (after an initial ^, if any). Otherwise, it terminates the bracket expression (unless it is the ending right bracket for a valid collating symbol, equivalence class, or character class, or it is the collating element within a collating symbol or equivalence class expression). The special characters

. * [ \

(period, asterisk, left bracket, and backslash) lose their special meaning within a bracket expression.

The character sequences:

[. [= [:

(left-bracket followed by a period, equal-sign or colon) are special inside a bracket expression and are used to delimit collating symbols, equivalence class expressions and character class expressions. These symbols must be followed by a valid expression and the matching terminating .], =], or :].

matching list

A matching list expression specifies a list that matches any one of the characters represented in the list. The first character in the list cannot be the circumflex. For example, [abc] is an RE that matches any of a, b, or c.

non-matching list

A non-matching list expression begins with a circumflex (^), and specifies a list that matches any character or collating element except newline and the characters represented in the list. For example, [^abc] is an RE that matches any character except newline or a, b, or c. The circumflex has this special meaning only when it occurs first in the list, immediately following the left square bracket.

collating element

A collating element is a sequence of one or more characters that represents a single element in the collating sequence as identified via the most current setting of the locale variable LC_COLLATE (see setlocale(3C)).

collating symbol

A collating symbol is a collating element enclosed within bracket-period ([. .]) delimiters. Multicharacter collating elements must be represented as collating symbols to distinguish them from single-character collating elements. For example, if the string ch is a valid collating element, then [[.ch.]] is treated as an element matching the same string of characters, while [ch] is treated as a simple list of the characters c and h. If the string within the bracket-period delimiters is not a valid collating element in the current collating sequence definition, the symbol is treated as an invalid expression.

noncollating character

A noncollating character is a character that is ignored for collating purposes. By definition, such characters cannot participate in equivalence classes or range expressions.

equivalence class

An equivalence class expression represents the set of collating elements belonging to an equivalence class. It is expressed by enclosing any one of the collating elements in the equivalence class within bracket-equal ([= =]) delimiters. For example, if a and A belong to the same equivalence class, then [[=a=]b] and [[=A=]b] are each equivalent to [aAb].

range expression

A range expression represents the set of collating elements that fall between two elements in the current collation sequence as defined via the most current setting of the locale variable LC_COLLATE (see setlocale(3C)). It is expressed as the starting point and the ending point separated by a hyphen (-).

The starting range point and the ending range point must be a collating element, collating symbol, or equivalence class expression. An equivalence class expression used as an end point of a range expression is interpreted such that all collating elements within the equivalence class are included in the range. For example, if the collating order is A, a, B, b, C, c, ch, D, and d and the characters A and a belong to the same equivalence class, then the expression [[=a=]-D] is treated as [AaBbCc[.ch.]D].

Both starting and ending range points must be valid collating elements, collating symbols, or equivalence class expressions, and the ending range point must collate equal to or higher than the starting range point; otherwise the expression is invalid. For example, with the above collating order and assuming that E is a noncollating character, then both the expressions [[=A=]-E] and [d-a] are invalid.

An ending range point can also be the starting range point in a subsequent range expression. Each such range expression is evaluated separately. For example, the bracket expression [a-m-o] is treated as [a-mm-o].

The hyphen character is treated as itself if it occurs first (after an initial ^, if any) or last in the list, or as the rightmost symbol in a range expression. As examples, the expressions [-ac] and [ac-] are equivalent and match any of the characters a, c, or -; the expressions [^-ac] and [^ac-] are equivalent and match any characters except newline, a, c, or -; the expression [%--] matches any of the characters in the defined collating sequence between % and - inclusive; the expression [--@] matches any of the characters in the defined collating sequence between - and @ inclusive; and the expression [a--@] is invalid, assuming - precedes a in the collating sequence.

If a bracket expression must specify both - and ], the ] must be placed first (after the ^, if any) and the - last within the bracket expression.
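The placement rules for the hyphen and the right bracket can be tried out with a short sketch. It uses Python's re module purely as an illustration, which is an assumption of this example: re is not an implementation of the REs defined here (its non-matching lists also match newline, and it has no collating symbols or equivalence classes), but it treats a leading ] and a leading or trailing - as ordinary list members in the same way. The helper name members is hypothetical.

```python
import re

def members(bracket_expr, candidates):
    """Return the candidate characters matched by a one-character bracket expression."""
    return "".join(c for c in candidates if re.fullmatch(bracket_expr, c))

CANDIDATES = "abc-]"

# A hyphen that occurs first or last in the list stands for itself.
print(members("[-ac]", CANDIDATES))
print(members("[ac-]", CANDIDATES))

# After an initial ^ the hyphen is still literal; the list is negated.
print(members("[^-ac]", CANDIDATES))

# To specify both ] and -, place ] first and - last.
print(members("[]a-]", CANDIDATES))
```

The first two calls report the same three characters, which is the equivalence the text describes for [-ac] and [ac-].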

character class

A character class expression represents the set of characters belonging to a character class, as defined via the most current setting of the locale variable LC_CTYPE. It is expressed as a character class name enclosed within bracket-colon ([: :]) delimiters.

Standard character class expressions supported in all locales are:

[:alpha:]

letters

[:upper:]

upper-case letters

[:lower:]

lower-case letters

[:digit:]

decimal digits

[:xdigit:]

hexadecimal digits

[:alnum:]

letters or decimal digits

[:space:]

characters producing white-space in displayed text

[:print:]

printing characters

[:punct:]

punctuation characters

[:graph:]

characters with a visible representation

[:cntrl:]

control characters

[:blank:]

blank characters

For example, if the locale variable LC_CTYPE is set to C, the expression [[:upper:]] is equivalent to [A-Z]. Similarly, the expression [[:digit:]] is the same as [0-9].
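The C-locale equivalences can be made concrete with a small sketch that rewrites character class expressions as explicit ranges. The table and the helper expand_classes are hypothetical names introduced for illustration; the ranges hold only for the C locale, and only a few classes are listed. Python is used here as an assumption of the example, since its re module does not itself understand [: :] delimiters.

```python
import re

# Explicit equivalents of some character classes, valid in the C locale only.
C_LOCALE_CLASSES = {
    "upper": "A-Z",
    "lower": "a-z",
    "alpha": "A-Za-z",
    "digit": "0-9",
    "alnum": "A-Za-z0-9",
    "xdigit": "0-9A-Fa-f",
}

def expand_classes(bracket_expr):
    """Rewrite each [:name:] inside a bracket expression as C-locale ranges."""
    return re.sub(r"\[:([a-z]+):\]",
                  lambda m: C_LOCALE_CLASSES[m.group(1)],
                  bracket_expr)

print(expand_classes("[[:upper:]]"))
print(expand_classes("[[:digit:]]"))

# A pattern built from two classes: a letter or underscore, then letters,
# digits, or underscores.
identifier = expand_classes("[[:alpha:]_][[:alnum:]_]*")
print(identifier, bool(re.fullmatch(identifier, "LC_CTYPE")))
```

A class name missing from the table raises KeyError, which keeps the sketch from silently guessing at locale-dependent classes.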

REs Matching Multiple Characters

The following rules may be used to construct REs matching multiple characters from REs matching a single character:

RERE

The concatenation of REs is an RE that matches the first encountered concatenation of the strings matched by each component of the RE. For example, the RE bc matches the second and third characters of the string abcdefabcdef.

RE*

An RE matching a single character followed by an asterisk (*) is an RE that matches zero or more occurrences of the RE preceding the asterisk. The first encountered string that permits a match is chosen, and the matched string will encompass the maximum number of characters permitted by the RE. For example, in the string abbbcdeabbbbbbcde, both the RE b*c and the RE bbb*c are matched by the substring bbbc in the second through fifth positions. An asterisk as the first character of an RE loses this special meaning and is treated as itself.

\(RE\)

A subexpression can be defined within an RE by enclosing it between the character pairs \( and \). Such a subexpression matches whatever it would have matched without the \( and \). Subexpressions can be arbitrarily nested. An asterisk immediately following the \( loses its special meaning and is treated as itself. An asterisk immediately following the \) is treated as an invalid character.

\n

The expression \n matches the same string of characters as was matched by a subexpression enclosed between \( and \) preceding the \n. The character n must be a digit from 1 through 9, specifying the n-th subexpression (the one that begins with the n-th \( and ends with the corresponding paired \)). For example, the expression ^\(.*\)\1$ matches a line consisting of two adjacent appearances of the same string.

If the \n is followed by an asterisk, it matches zero or more occurrences of the subexpression referred to. For example, the expression \(ab\(cd\)ef\)Z\2*Z\1 matches the string abcdefZcdcdZabcdef.

RE\{m,n\}

An RE matching a single character followed by \{m\}, \{m,\}, or \{m,n\} is an RE that matches repeated occurrences of the RE. The values of m and n must be decimal integers in the range 0 through 255, with m specifying the exact or minimum number of occurrences and n specifying the maximum number of occurrences. \{m\} matches exactly m occurrences of the preceding RE, \{m,\} matches at least m occurrences, and \{m,n\} matches any number of occurrences between m and n, inclusive.

The first encountered string that matches the expression is chosen; it will contain as many occurrences of the RE as possible. For example, in the string abbbbbbbc the RE b\{3\} is matched by characters two through four, the RE b\{3,\} is matched by characters two through eight, and the RE b\{3,5\}c is matched by characters four through nine.
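The examples in this section can be checked with a short sketch. It assumes Python's re module, which is not a basic RE implementation: grouping and intervals are written without backslashes there, so \(...\) becomes (...) and \{m,n\} becomes {m,n}. For the expressions below the matched strings coincide with those described above. The helper positions is a hypothetical name; it reports 1-based, inclusive character positions to match the wording of the text.

```python
import re

def positions(pattern, text):
    """Return the 1-based (first, last) positions of the first match, or None."""
    m = re.search(pattern, text)
    return (m.start() + 1, m.end()) if m else None

# RE*: the first string that permits a match, as long as possible.
print(positions(r"b*c", "abbbcdeabbbbbbcde"))
print(positions(r"bbb*c", "abbbcdeabbbbbbcde"))

# \n: the basic RE ^\(.*\)\1$ is spelled ^(.*)\1$ here.
print(bool(re.search(r"^(.*)\1$", "abcabc")))
print(bool(re.search(r"^(.*)\1$", "abcabd")))

# \n followed by an asterisk: the basic RE \(ab\(cd\)ef\)Z\2*Z\1.
print(bool(re.fullmatch(r"(ab(cd)ef)Z\2*Z\1", "abcdefZcdcdZabcdef")))

# Intervals: the basic REs b\{3\}, b\{3,\} and b\{3,5\}c.
print(positions(r"b{3}", "abbbbbbbc"))
print(positions(r"b{3,}", "abbbbbbbc"))
print(positions(r"b{3,5}c", "abbbbbbbc"))
```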

Expression Anchoring

An RE can be limited to matching strings that begin or end a line (i.e., anchored) according to the following rules:

  • A circumflex (^) as the first character of an RE anchors the expression to the beginning of a line; only strings starting at the first character of a line are matched by the RE. For example, the RE ^ab matches the string ab in the line abcdef, but not the same string in the line cdefab.

  • A dollar sign ($) as the last character of an RE anchors the expression to the end of a line; only strings ending at the last character of a line are matched by the RE. For example, the RE ab$ matches the string ab in the line cdefab, but not the same string in the line abcdef.

  • An RE anchored by both ^ and $ matches only strings that are lines. For example, the RE ^abcdef$ matches only lines consisting of the string abcdef.

The use of duplication characters (+,*) following anchors is illegal.

EXTENDED REGULAR EXPRESSIONS

The extended regular expression (ERE) notation and construction rules apply to utilities defined as using extended REs. Any exceptions to the following rules are noted in the descriptions of the specific utilities using EREs.

EREs Matching a Single Character

The following EREs match a single character or a single collating element:

Ordinary Characters

An ordinary character is an ERE that matches itself. An ordinary character is any character in the supported character set except newline and the regular expression special characters listed in Special Characters below. An ordinary character preceded by a backslash (\) is treated as the ordinary character itself. Matching is based on the bit pattern used for encoding the character, not on the graphic representation of the character.

Special Characters

A regular expression special character preceded by a backslash is a regular expression that matches the special character itself. When not preceded by a backslash, such characters have special meaning in the specification of EREs. The extended regular expression special characters and the contexts in which they have their special meaning are:

. [ \ ( ) * + ? $ |

The period, left square bracket, backslash, left parenthesis, right parenthesis, asterisk, plus sign, question mark, dollar sign, and vertical bar are special except when used in a bracket expression (see ERE Bracket Expression).

^

The circumflex is special except when used in a bracket expression in a non-leading position.

delimiter

Any character used to bound (i.e., delimit) an entire ERE is special for that ERE.

Period

A period (.), when used outside of a bracket expression, is an ERE that matches any printable or nonprintable character except newline.

ERE Bracket Expression

The syntax and rules for ERE bracket expressions are the same as for RE bracket expressions found above.

EREs Matching Multiple Characters

The following rules may be used to construct EREs matching multiple characters from EREs matching a single character:

EREERE

A concatenation of EREs matches the first encountered concatenation of the strings matched by each component of the ERE. Such a concatenation of EREs enclosed in parentheses matches whatever the concatenation without the parentheses matches. For example, both the ERE bc and the ERE (bc) match the second and third characters of the string abcdefabcdef. The longest overall string is matched.

ERE+

The special character plus (+), when following an ERE matching a single character, or a concatenation of EREs enclosed in parentheses, is an ERE that matches one or more occurrences of the ERE preceding the plus sign. The string matched will contain as many occurrences as possible. For example, the ERE b+c matches the fourth through seventh characters in the string acabbbcde.

ERE*

The special character asterisk (*), when following an ERE matching a single character, or a concatenation of EREs enclosed in parentheses, is an ERE that matches zero or more occurrences of the ERE preceding the asterisk. For example, the ERE b*c matches the first character in the string cabbbcde. If there is any choice, the longest left-most string that permits a match is chosen. For example, the ERE b*cd matches the third through seventh characters in the string cabbbcdebbbbbbcdbc.

ERE?

The special character question mark (?), when following an ERE matching a single character, or a concatenation of EREs enclosed in parentheses, is an ERE that matches zero or one occurrence of the ERE preceding the question mark. The string matched will contain as many occurrences as possible. For example, the ERE b?c matches the second character in the string acabbbcde.

ERE{m,n}

An interval expression that functions the same way as the basic regular expression syntax RE\{m,n\}, written without the backslashes.
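The duplication examples for +, *, and ? can be checked with a short sketch. It assumes Python's re module, whose syntax is not the ERE notation defined here but which agrees with it for these simple expressions. The helper positions is a hypothetical name that reports 1-based, inclusive character positions.

```python
import re

def positions(ere, text):
    """Return the 1-based (first, last) positions of the first match, or None."""
    m = re.search(ere, text)
    return (m.start() + 1, m.end()) if m else None

# Parentheses do not change what a concatenation matches.
print(positions(r"bc", "abcdefabcdef"))
print(positions(r"(bc)", "abcdefabcdef"))

# ERE+: one or more occurrences, as many as possible.
print(positions(r"b+c", "acabbbcde"))

# ERE*: zero or more occurrences, so the leading c alone is a match.
print(positions(r"b*c", "cabbbcde"))
print(positions(r"b*cd", "cabbbcdebbbbbbcdbc"))

# ERE?: zero or one occurrence.
print(positions(r"b?c", "acabbbcde"))
```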

Alternation

Two EREs separated by the special character vertical bar (|) match a string that is matched by either ERE. For example, the ERE ((ab)|c)d matches the string abd and the string cd. A vertical bar (|):

  • may not appear first or last in an ERE.

  • may not appear immediately following a vertical bar.

  • may not appear immediately following a left parenthesis.

  • may not appear immediately preceding a right parenthesis.

Precedence

The order of precedence is as follows, from high to low:

[ ]

square brackets

* + ?

asterisk, plus sign, question mark

^ $

anchoring

concatenation

|

alternation

For example, the ERE abba|cde is interpreted as "match either abba or cde". It does not mean "match abb followed by a or c followed in turn by de" (because concatenation has a higher order of precedence than alternation).
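The effect of precedence on alternation can be seen in a brief sketch. It assumes Python's re module, which binds concatenation more tightly than alternation just as described here, although it is not an ERE implementation in other respects. The helper matches is a hypothetical name that tests whether an expression matches an entire string.

```python
import re

def matches(ere, text):
    """Return True if the expression matches the whole of text."""
    return re.fullmatch(ere, text) is not None

# Alternation inside parentheses, followed by d.
print(matches(r"((ab)|c)d", "abd"))
print(matches(r"((ab)|c)d", "cd"))

# Without parentheses the alternatives are the whole strings abba and cde.
print(matches(r"abba|cde", "abba"))
print(matches(r"abba|cde", "cde"))
print(matches(r"abba|cde", "abbade"))

# Parentheses are needed to get the other reading.
print(matches(r"abb(a|c)de", "abbade"))
```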

Expression Anchoring

An ERE can be limited to matching strings that begin or end a line (i.e., anchored) according to the following rules:

  • A circumflex (^) matches the beginning of a line (anchors the expression to the beginning of a line). For example, the ERE ^ab matches the string ab in the line abcdef, but not the same string in the line cdefab.

  • A dollar sign ($) matches the end of a line (anchors the expression to the end of a line). For example, the ERE ab$ matches the string ab in the line cdefab, but not the same string in the line abcdef.

  • An ERE anchored by both ^ and $ matches only strings that are lines. For example, the ERE ^abcdef$ matches only lines consisting of the string abcdef. Only empty lines match the ERE ^$.

The use of duplication characters (+,*) following anchors is illegal.
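The anchoring rules can be illustrated by filtering lines, much as a line-oriented utility would. This is a sketch that assumes Python's re module rather than an ERE implementation; the text is split into lines first so that each line is tested on its own. The helper matching_lines and the sample text are hypothetical.

```python
import re

def matching_lines(ere, text):
    """Return the lines of text that contain a match for the expression."""
    return [line for line in text.split("\n") if re.search(ere, line)]

# Four lines: abcdef, cdefab, an empty line, and ab.
TEXT = "abcdef\ncdefab\n\nab"

print(matching_lines(r"^ab", TEXT))       # lines that begin with ab
print(matching_lines(r"ab$", TEXT))       # lines that end with ab
print(matching_lines(r"^abcdef$", TEXT))  # lines that are exactly abcdef
print(matching_lines(r"^$", TEXT))        # empty lines only
```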

PATTERN MATCHING NOTATION

The following rules apply to pattern matching notation except as noted in the descriptions of the specific utilities using pattern matching.

Patterns Matching a Single Character

The following patterns match a single character or a single collating element:

Ordinary Characters

An ordinary character is a pattern that matches itself. An ordinary character is any character in the supported character set except newline and the pattern matching special characters listed in Special Characters below. Matching is based on the bit pattern used for encoding the character, not on the graphic representation of the character.

Special Characters

A pattern matching special character preceded by a backslash (\) is a pattern that matches the special character itself. When not preceded by a backslash, such characters have special meaning in the specification of patterns. The pattern matching special characters and the contexts in which they have their special meaning are:

? * [

The question mark, asterisk, and left square bracket are special except when used in a bracket expression (see Pattern Bracket Expression).

Question Mark

A question mark (?), when used outside of a bracket expression, is a pattern that matches any printable or nonprintable character except newline.

Pattern Bracket Expression

The syntax and rules for pattern bracket expressions are the same as for RE bracket expressions found above with the following exceptions:

  • The exclamation point character (!) replaces the circumflex character (^) in its role in a non-matching list in the regular expression notation.

  • The backslash is used as an escape character within bracket expressions.

Patterns Matching Multiple Characters

The following rules may be used to construct patterns matching multiple characters from patterns matching a single character:

*

The asterisk (*) is a pattern that matches any string, including the null string.

RERE

The concatenation of patterns matching a single character is a valid pattern that matches the concatenation of the single characters or collating elements matched by each of the concatenated patterns. For example, the pattern a[bc] matches the strings ab and ac.

The concatenation of one or more patterns matching a single character with one or more asterisks is a valid pattern. In such patterns, each asterisk matches a string of zero or more characters, up to the first character that matches the character following the asterisk in the pattern.

For example, the pattern a*d matches the strings ad, abd, and abcd; but not the string abc. When an asterisk is the first or last character in a pattern, it matches zero or more characters that precede or follow the characters matched by the remainder of the pattern. For example, the pattern a*d* matches the strings ad, abcd, abcdef, aaaad, and adddd; the pattern *a*d matches the strings ad, abcd, efabcd, aaaad, and adddd.
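The pattern examples above can be tried with a short sketch. It assumes Python's fnmatch module, which implements a similar shell-style notation (?, *, bracket expressions, and ! for non-matching lists) but is not the notation defined in this entry: for instance, it does not treat backslash as an escape character inside bracket expressions. The helper matching is a hypothetical name.

```python
from fnmatch import fnmatchcase

def matching(pattern, strings):
    """Return the strings that the pattern matches in full (case-sensitive)."""
    return [s for s in strings if fnmatchcase(s, pattern)]

# Bracket expressions, with ! introducing a non-matching list.
print(matching("a[bc]", ["ab", "ac", "ad"]))
print(matching("a[!bc]", ["ab", "ac", "ad"]))

# ? matches exactly one character.
print(matching("a?d", ["ad", "abd", "abcd"]))

# * matches any string, including the null string.
print(matching("a*d", ["ad", "abd", "abcd", "abc"]))
print(matching("a*d*", ["ad", "abcd", "abcdef", "aaaad", "adddd"]))
print(matching("*a*d", ["ad", "abcd", "efabcd", "aaaad", "adddd"]))
```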

Rule Qualification for Patterns Used for Filename Expansion

The rules described above for pattern matching are qualified by the following rules when the pattern matching notation is used for filename expansion by sh(1), csh(1), ksh(1), and make(1).

  • If a filename (including the component of a pathname that follows the slash (/) character) begins with a period (.), the period must be explicitly matched by using a period as the first character of the pattern; it cannot be matched by the asterisk special character, the question mark special character, or a bracket expression. This rule does not apply to make(1).

  • The slash character in a pathname must be explicitly matched by using a slash in the pattern; it cannot be matched by the asterisk special character, the question mark special character, or a bracket expression. For make(1), only the part of the pathname following the last slash character can be matched by a special character. That is, all special characters preceding the last slash character lose their special meaning.

  • Specified patterns are matched against existing filenames and pathnames, as appropriate. If the pattern matches any existing filenames or pathnames, the pattern is replaced with those filenames and pathnames, sorted according to the collating sequence in effect. If the pattern does not match any existing filenames or pathnames, the pattern string is left unchanged.

  • If the pattern begins with a tilde (~) character, all of the ordinary characters preceding the first slash (or all characters if there is no slash) are treated as a possible login name. If the login name is null (i.e., the pattern contains only the tilde or the tilde is immediately followed by a slash), the tilde is replaced by a pathname of the process's home directory, followed by a slash. Otherwise, the combination of tilde and login name is replaced by a pathname of the home directory associated with the login name, followed by a slash. If the system cannot identify the login name, the result is implementation-defined. This rule does not apply to sh(1) or make(1).

  • If the pattern contains a $ character, variable substitution can take place. Environment variables can be embedded within patterns as:

    $name

    or:

    ${name}

    Braces are used to guarantee that characters following name are not interpreted as belonging to name. Substitution occurs in the order specified only once; that is, the resulting string is not examined again for new names that occurred because of the substitution.
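Two of the filename expansion rules can be sketched outside a shell. The example assumes Python's glob and os.path modules, which follow similar conventions but are not the shells named above; in particular, glob returns an empty list for a pattern that matches nothing, whereas this entry states that the pattern string is left unchanged. The file names, the helper expand, and the variable REGEXP_DEMO_DIR are hypothetical.

```python
import glob
import os
import tempfile

def expand(pattern, directory):
    """Expand a pattern against the names in directory and sort the result."""
    paths = glob.glob(os.path.join(directory, pattern))
    return sorted(os.path.basename(p) for p in paths)

with tempfile.TemporaryDirectory() as demo_dir:
    for name in (".profile", "a.txt", "b.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    visible = expand("*", demo_dir)    # a leading period is not matched by *
    hidden = expand(".*", demo_dir)    # the period is matched explicitly
    nothing = expand("*.c", demo_dir)  # no existing name matches

print(visible, hidden, nothing)

# Braces keep the characters after the name from being read as part of it.
os.environ["REGEXP_DEMO_DIR"] = "src"
os.environ.pop("REGEXP_DEMO_DIR_old", None)  # make sure the longer name is unset
braced = os.path.expandvars("${REGEXP_DEMO_DIR}_old/*.c")
unbraced = os.path.expandvars("$REGEXP_DEMO_DIR_old/*.c")
print(braced)
print(unbraced)
```

Without braces the name is read as REGEXP_DEMO_DIR_old, which is unset, and os.path.expandvars leaves a reference to an unset variable as it is.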

Rule Qualification for Patterns Used in the case Command

The rules described above for pattern matching are qualified by the following rule when the pattern matching notation is used in the case command of sh(1) and ksh(1).

  • Multiple alternative patterns in a single clause can be specified by separating individual patterns with the vertical bar character (|); strings matching any of the patterns separated this way will cause the corresponding command list to be selected.
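Clause selection with alternative patterns can be modeled in a few lines. This is a sketch under stated assumptions, not how sh(1) or ksh(1) implement the case command: it uses Python's fnmatch module for the individual patterns and splits clauses on every vertical bar, so an escaped vertical bar is not handled. The names select_clause and CLAUSES are hypothetical.

```python
from fnmatch import fnmatchcase

def select_clause(word, clauses):
    """Return the command list of the first clause with a matching pattern."""
    for patterns, command_list in clauses:
        # Alternatives within one clause are separated by vertical bars.
        if any(fnmatchcase(word, p) for p in patterns.split("|")):
            return command_list
    return None

CLAUSES = [
    ("y|Y|yes", "proceed"),
    ("n|N|no", "abort"),
    ("*", "ask again"),
]

print(select_clause("yes", CLAUSES))
print(select_clause("N", CLAUSES))
print(select_clause("maybe", CLAUSES))
```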

STANDARDS CONFORMANCE

<regexp.h>: AES, SVID2, SVID3, XPG2, XPG3, XPG4

© 1983-2007 Hewlett-Packard Development Company, L.P.