Writing Apache Modules with Perl and C

By:	Lincoln Stein and Doug MacEachern
Published:	O'Reilly & Associates, Inc. - March 1999

Chapter 11 - C API Reference Guide, Part II / String and URI Manipulation
String Comparison, Pattern Matching, and Transformation

The following group of functions provides string pattern matching, substitution, and transformation operations similar to (but more limited than) Perl's built-in operators.

Most of these functions are declared in httpd.h. The few exceptions are listed separately.

int ap_fnmatch (const char *pattern, const char *string, int flags)

(Declared in the header file fnmatch.h.) The ap_fnmatch() function is based on the POSIX.2 fnmatch() function. You provide a search pattern, a string to search, and a bit mask of option flags. The function returns 0 if a match is found, or the nonzero constant FNM_NOMATCH otherwise. Note that this is the reverse of what you might expect: a successful match yields a false value. The convention follows strcmp(), which also returns 0 when its arguments are equal. It may be less confusing to compare the function result to the constant FNM_NOMATCH than to test for zero.
The pattern you provide is not a regular expression, but a shell-style glob pattern. In addition to the wildcard characters * and ?, patterns containing both string sets like foo.{h,c,cc} and character ranges like .[a-zA-Z]* are allowed. The flags argument is the bitwise combination of zero or more of the following constants (defined in fnmatch.h):
FNM_NOESCAPE
If set, treat the backslash character as an ordinary character instead of as an escape.
FNM_PATHNAME
If set, allow a slash in string to match only a slash in pattern and never a wildcard character or character range.
FNM_PERIOD
If this flag is set, a leading period in string must match exactly with a period in pattern. A period is considered to be leading if it is the first character in string or if FNM_PATHNAME is set and the period immediately follows a slash.
FNM_CASE_BLIND
If this flag is set, then a case-insensitive comparison is performed. This is an Apache extension and not part of the POSIX.2 standard.
Typically you will use ap_fnmatch() to match filename patterns. In fact, this function is used internally for matching glob-style patterns in configuration sections such as <Files> and <Location> (the FilesMatch and LocationMatch variants take regular expressions instead). Example:

if(ap_fnmatch("*.html", filename, FNM_PATHNAME|FNM_CASE_BLIND)
      != FNM_NOMATCH) {
  ...
}
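The return-value convention is easy to get backwards, so here is a minimal sketch that wraps it in a boolean helper. It assumes the POSIX fnmatch() function as a stand-in for ap_fnmatch(), so that it compiles outside the Apache source tree; glob_matches() is a hypothetical name, and the Apache-only FNM_CASE_BLIND flag is not used.

```c
#include <fnmatch.h>

/* Hypothetical helper: returns 1 if name matches the glob pattern,
 * 0 otherwise. POSIX fnmatch() stands in for ap_fnmatch() here; both
 * return 0 on a successful match. */
static int glob_matches(const char *pattern, const char *name, int flags)
{
    return fnmatch(pattern, name, flags) == 0;
}
```

With FNM_PATHNAME set, glob_matches("*.html", "docs/index.html", FNM_PATHNAME) is false because the wildcard may not cross the slash, whereas "docs/*.html" matches the same string.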

int ap_is_fnmatch (const char *pattern)

(Declared in the header file fnmatch.h.) This function returns true if pattern contains glob characters, false otherwise. It is useful in deciding whether to perform an ap_fnmatch() pattern search or an ordinary string comparison.

if (ap_is_fnmatch(target)) {
  /* the pattern comes first, then the string to test */
  file_matches = !ap_fnmatch(target, filename, FNM_PATHNAME);
}
else {
  file_matches = !strcmp(filename, target);
}

int ap_strcmp_match (const char *string, const char *pattern)

Just to add to the confusion, ap_strcmp_match() provides functionality similar to ap_fnmatch() but only recognizes the * and ? wildcards. The function returns 0 if a match is found, nonzero otherwise. This is an older function, and there is no particular reason to prefer it. However, you'll see it used in some standard modules, including in mod_autoindex where it is called on to determine what icon applies to a filename.

if(!ap_strcmp_match(filename, "*.html")) {
  ...
}
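To make the matching rules concrete, the following is a rough sketch of a matcher that understands only * and ? and follows the same return convention (0 for a match, nonzero otherwise). It is an illustrative reimplementation under the hypothetical name wild_match(), not Apache's own code.

```c
/* Hypothetical sketch of a '*' and '?' matcher with strcmp()-style
 * results: 0 means the string matches the pattern, nonzero means it
 * does not. */
static int wild_match(const char *str, const char *pat)
{
    for (; *pat; ++pat, ++str) {
        if (*pat == '*') {
            while (*pat == '*')          /* collapse runs of '*' */
                ++pat;
            if (*pat == '\0')            /* trailing '*' swallows the rest */
                return 0;
            for (; *str; ++str)          /* try every possible suffix */
                if (wild_match(str, pat) == 0)
                    return 0;
            return 1;
        }
        if (*str == '\0')                /* string ran out before pattern */
            return 1;
        if (*pat != '?' && *pat != *str) /* '?' matches any one character */
            return 1;
    }
    return *str != '\0';                 /* match only if both are used up */
}
```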

int ap_strcasecmp_match (const char *str, const char *exp)

ap_strcasecmp_match is the same as ap_strcmp_match but case-insensitive.

int ap_is_matchexp (const char *string)

This function returns true if the string contains either of the wildcard characters * and ?, false otherwise. It is useful for testing whether a user-provided configuration string should be treated as a pattern to be passed to ap_strcmp_match() or as an ordinary string. Example:

if (ap_is_matchexp(target)) {
  file_matches = !ap_strcmp_match(filename, target);
}
else {
  file_matches = !strcmp(filename, target);
}

int ap_checkmask (const char *string, const char *mask)

(Declared in the header file util_date.h.) The ap_checkmask() function will attempt to match the given string against the character mask. Unlike the previous string matching functions, ap_checkmask() will return true (nonzero) for a successful match, false (zero) if the match fails.
The mask is constructed from the following characters:
@ uppercase letter
$ lowercase letter
& hex digit
# digit
~ digit or space
* swallow remaining characters
x exact match for any other character
For example, ap_parseHTTPdate() uses this function to determine the date format, such as RFC 1123:

if (ap_checkmask(date, "## @$$ #### ##:##:## *")) {
     ...
}

Because it was originally written to support date and time parsing routines, this function is declared in util_date.h.
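The mask characters listed above can be illustrated with a small standalone sketch. check_mask() is a hypothetical name, and the code is written from the description in this section rather than copied from Apache.

```c
#include <ctype.h>

/* Hypothetical sketch of a character-mask matcher. Returns 1 if data
 * fits the mask, 0 otherwise. */
static int check_mask(const char *data, const char *mask)
{
    for (;; ++data, ++mask) {
        unsigned char d = (unsigned char)*data;
        switch (*mask) {
        case '\0': return d == '\0';   /* both must end together */
        case '*':  return 1;           /* swallow remaining characters */
        case '@':  if (!isupper(d)) return 0; break;
        case '$':  if (!islower(d)) return 0; break;
        case '&':  if (!isxdigit(d)) return 0; break;
        case '#':  if (!isdigit(d)) return 0; break;
        case '~':  if (d != ' ' && !isdigit(d)) return 0; break;
        default:   if (*mask != *data) return 0; break; /* exact match */
        }
    }
}
```

Called with the RFC 1123 mask shown earlier and the string "06 Nov 1994 08:49:37 GMT", the sketch reports a match.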

int ap_ind (const char *s, char c)

This function is similar to the standard C library index() function. It will scan the character string s from left to right until it finds the character c, returning the location of the first occurrence of c, or -1 if the character is not found. Note that the function result is the integer index of the located character, not a string pointer as in the standard C function.

int ap_rind (const char *s, char c)

ap_rind() behaves like ap_ind(), except that it scans the string from right to left, returning the index of the rightmost occurrence of character c, or -1 if the character is not found. It is handy for tasks such as locating the last slash in a path.
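The difference between an integer index and a pointer result is shown in the sketch below. index_of() and rindex_of() are hypothetical stand-ins that mimic the behavior described for ap_ind() and ap_rind().

```c
#include <string.h>

/* Hypothetical stand-in for ap_ind(): index of the first occurrence
 * of c in s, or -1 if c does not occur. */
static int index_of(const char *s, char c)
{
    int i;
    for (i = 0; s[i] != '\0'; ++i)
        if (s[i] == c)
            return i;
    return -1;
}

/* Hypothetical stand-in for ap_rind(): index of the last occurrence
 * of c in s, or -1 if c does not occur. */
static int rindex_of(const char *s, char c)
{
    int i;
    for (i = (int)strlen(s) - 1; i >= 0; --i)
        if (s[i] == c)
            return i;
    return -1;
}
```

For the path "/usr/local/apache", index_of() with '/' gives 0 and rindex_of() gives 10, so the final path component starts at offset 11.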

regex_t *ap_pregcomp (pool *p, const char *pattern, int cflags)
void ap_pregfree (pool *p, regex_t *reg)

Apache supports regular expression matching using the system library's regular expression routines regcomp(), regexec(), regerror(), and regfree(). If these functions are not available, then Apache uses its own package of regular expression routines. Documentation for the regular expression routines can be found in your system manual pages. If your system does not support these routines, the documentation for Apache's regular expression package can be found in the regex/ subdirectory of the Apache source tree.
We won't try to document the complexities of regular expression matching here, except to remind you that regular expression matching occurs in two phases. In the first phase, you call regcomp() to compile a regular expression pattern string into a compiled form. In the second phase, you pass the compiled pattern to regexec() to match the search pattern against a source string. In the course of performing its regular expression match, regexec() writes the offsets of each matched parenthesized subexpression into an array named pmatch[]. The significance of this array will become evident in the next section when we discuss ap_pregsub().
For your convenience, Apache provides wrapper routines around regcomp() and regfree() that make working with regular expressions somewhat simpler. ap_pregcomp() works like regcomp() to compile a regular expression string, except that it automatically allocates memory for the compiled expression from the provided resource pool pointer. pattern contains the string to compile, and cflags is a bit mask of flags that control the type of regular expression to perform. The full list of flags can be found in the regcomp() manual page.
In addition to allocating the regular expression, ap_pregcomp() automatically installs a cleanup handler that calls regfree() to release the memory used by the compiled regular expression when the pool is destroyed, which for a request pool happens when the transaction is finished. This relieves you of the responsibility of doing this bit of cleanup yourself.
Speaking of which, the cleanup handler installed by ap_pregcomp() is ap_pregfree(). It frees the regular expression by calling regfree() and then removes itself from the cleanup handler list to ensure that it won't be called twice. You may call ap_pregfree() yourself if, for some unlikely reason, you need to free up the memory used by the regular expression before the cleanup would have been performed normally.

char *ap_pregsub (pool *p, const char *input, const char *source, size_t nmatch, regmatch_t pmatch[])

After performing a regular expression match with regexec(), you may use ap_pregsub() to perform a series of string substitutions based on subexpressions that were matched during the operation. The function is broadly similar in concept to what happens in the right half of a Perl s/// operation.
This function uses the pmatch[] array, which regexec() populates with the start and end positions of all the parenthesized subexpressions matched by the regular expression. You provide ap_pregsub() with p, a resource pool pointer, input, a character string describing the substitutions to perform, source, the source string used for the regular expression match, nmatch, the size of the pmatch array, and pmatch itself.
input is any arbitrary string containing the expressions $1 through $9. ap_pregsub() replaces these expressions with the corresponding matched subexpressions from the source string. $0 is also available for your use: it corresponds to the entire matched string.
The return value will be a newly allocated string formed from the substituted input string.
The following example shows ap_pregsub() being used to replace the .htm and .HTM filename extensions with .html. We begin by calling ap_pregcomp() to compile the desired regular expression and return the compiled pattern in memory allocated from the resource pool. We specify flags that cause the match to be case-insensitive and to use the modern regular expression syntax. We proceed to initialize the pmatch[] array to hold two regmatch_t elements. Two elements are needed: the first which corresponds to $0 and the second for the single parenthesized subexpression in the pattern. Next we call regexec() with the compiled pattern, the requested filename, the pmatch[] array, and its length. The last argument to regexec(), which is used for passing various additional option flags, is set to zero. If regexec() returns zero, we go on to call ap_pregsub() to interpolate the matched subexpression (the filename minus its extension) into the string $1.html, effectively replacing the extension.

regmatch_t pmatch[2];
regex_t *cpat = ap_pregcomp(r->pool, "(.+)\\.htm$", REG_EXTENDED|REG_ICASE);
if (regexec(cpat, r->filename, cpat->re_nsub+1, pmatch, 0) == 0) {
   r->filename = ap_pregsub(r->pool, "$1.html",
                            r->filename, cpat->re_nsub+1, pmatch);
}
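For readers who want to experiment outside Apache, the same two-phase match and substitution can be sketched with only the POSIX regex routines. This is a simplified illustration, not the Apache implementation: expand_subst() and htm_to_html() are hypothetical names, memory comes from malloc() rather than a resource pool, and only the $0 through $9 expressions are handled.

```c
#include <ctype.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ap_pregsub(): copies input, replacing $0..$9
 * with the matching spans of source. Caller frees the result. */
static char *expand_subst(const char *input, const char *source,
                          size_t nmatch, const regmatch_t pmatch[])
{
    size_t refs = 0;
    const char *s;
    char *out, *dst;

    /* Count $N references to get a safe upper bound on the output size. */
    for (s = input; *s; ++s)
        if (s[0] == '$' && isdigit((unsigned char)s[1]))
            ++refs;

    out = malloc(strlen(input) + refs * strlen(source) + 1);
    if (out == NULL)
        return NULL;

    for (s = input, dst = out; *s; ++s) {
        if (s[0] == '$' && isdigit((unsigned char)s[1])) {
            size_t n = (size_t)(s[1] - '0');
            if (n < nmatch && pmatch[n].rm_so >= 0) {
                size_t len = (size_t)(pmatch[n].rm_eo - pmatch[n].rm_so);
                memcpy(dst, source + pmatch[n].rm_so, len);
                dst += len;
            }
            ++s;                        /* skip the digit */
        } else {
            *dst++ = *s;
        }
    }
    *dst = '\0';
    return out;
}

/* Returns a new string with a trailing .htm (any case) replaced by .html,
 * or NULL if the name does not end in .htm. */
static char *htm_to_html(const char *filename)
{
    regex_t re;
    regmatch_t pmatch[2];               /* $0 plus one subexpression */
    char *result = NULL;

    if (regcomp(&re, "(.+)\\.htm$", REG_EXTENDED | REG_ICASE) != 0)
        return NULL;
    if (regexec(&re, filename, 2, pmatch, 0) == 0)
        result = expand_subst("$1.html", filename, 2, pmatch);
    regfree(&re);                       /* no pool cleanup here, so free by hand */
    return result;
}
```

Unlike the pool-based version, this sketch must call regfree() and free() explicitly, which is exactly the bookkeeping that ap_pregcomp() and the resource pool take off your hands.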

char *ap_escape_shell_cmd (pool *p, const char *string)

If you must pass a user-provided string to a shell command, you should first use ap_escape_shell_cmd() to escape characters that might otherwise be interpreted as shell metacharacters. The function inserts backslashes in front of the potentially unsafe characters and returns the result as a new string.
Unsafe characters include the following:

& ; ` ' " | * ? ~ < > ^ ( ) [ ] { } $ \n

Example:

char *escaped_cmd = ap_escape_shell_cmd(r->pool, command);

Do not rely only on this function to make your shell commands safe. The commands themselves may behave unpredictably if presented with unreasonable input, even if the shell behaves well. The best policy is to use a regular expression match to sanity-check the contents of all user-provided data before passing it on to external programs.
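One way to apply that advice is to accept only input that matches a conservative allow-list pattern. The sketch below uses the POSIX regex routines directly. The function name is_safe_argument() and the particular pattern (1 to 64 letters, digits, dots, underscores, or hyphens, not starting with a hyphen or dot-free restriction beyond the first character being alphanumeric) are assumptions chosen for illustration; a real module would pick a pattern suited to the data it expects.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical allow-list check: returns 1 only if arg begins with a
 * letter or digit and contains at most 64 characters drawn from
 * letters, digits, '.', '_' and '-'. Anything else is rejected. */
static int is_safe_argument(const char *arg)
{
    regex_t re;
    int ok;

    if (regcomp(&re, "^[A-Za-z0-9][A-Za-z0-9._-]{0,63}$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                       /* treat compile failure as unsafe */
    ok = (regexec(&re, arg, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

Requiring an alphanumeric first character also rejects strings such as "-rf" that an external program might interpret as an option.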

char *ap_escape_quotes (pool *p, const char *string)

This function behaves similarly to the previous one but only escapes double quotes.

char *escaped_string = ap_escape_quotes(r->pool, string);

void ap_str_tolower (char *string)

This function converts all uppercase characters in the given string to lowercase characters, modifying the string in place.

ap_str_tolower(string);

char *ap_escape_html (pool *p, const char *string)

The ap_escape_html() function takes a character string and returns a modified copy in which the special characters <, >, and & are replaced with their HTML entities. This makes the string safe to display inside an HTML page. For example, after the following example is run, the resulting string will read &lt;h1&gt;Header Level 1 Example&lt;/h1&gt;:

char *display_html = ap_escape_html(p, "<h1>Header Level 1 Example</h1>");
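The transformation itself is simple enough to sketch in standalone C. The following is an illustration under the hypothetical name escape_html(), using malloc() instead of a resource pool and handling only the three characters <, >, and &.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of HTML escaping: returns a malloc'd copy of s
 * with <, > and & replaced by entities. Caller frees the result. */
static char *escape_html(const char *s)
{
    size_t len = 0;
    const char *p;
    char *out, *dst;

    /* First pass: work out how long the escaped copy will be. */
    for (p = s; *p; ++p)
        len += (*p == '<' || *p == '>') ? 4 : (*p == '&') ? 5 : 1;

    out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: copy, expanding the special characters. */
    for (p = s, dst = out; *p; ++p) {
        switch (*p) {
        case '<': memcpy(dst, "&lt;", 4);  dst += 4; break;
        case '>': memcpy(dst, "&gt;", 4);  dst += 4; break;
        case '&': memcpy(dst, "&amp;", 5); dst += 5; break;
        default:  *dst++ = *p;             break;
        }
    }
    *dst = '\0';
    return out;
}
```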

char *ap_uuencode (pool *p, const char *string)

This function takes a string, base64-encodes it, and returns the encoded version in a new string allocated from the provided resource pool. Despite the function name, the output is MIME base64 rather than the format produced by the classic uuencode program. Base64 is widely used by the MIME system for packaging binary email enclosures.

char *encoded = ap_uuencode(p, string);
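To show what the encoded output looks like, here is a minimal sketch of a base64 encoder in standalone C. base64_encode() is a hypothetical name, the result is allocated with malloc() rather than from a pool, and the code is written from the base64 definition rather than taken from Apache.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of base64 encoding for a NUL-terminated string.
 * Returns a malloc'd result that the caller must free. */
static char *base64_encode(const char *src)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    const unsigned char *in = (const unsigned char *)src;
    size_t len = strlen(src);
    size_t i;
    char *out = malloc(4 * ((len + 2) / 3) + 1);
    char *dst = out;

    if (out == NULL)
        return NULL;

    /* Each full group of 3 input bytes becomes 4 output characters. */
    for (i = 0; i + 2 < len; i += 3) {
        *dst++ = tbl[in[i] >> 2];
        *dst++ = tbl[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
        *dst++ = tbl[((in[i + 1] & 0x0f) << 2) | (in[i + 2] >> 6)];
        *dst++ = tbl[in[i + 2] & 0x3f];
    }

    /* A final group of 1 or 2 bytes is padded with '=' characters. */
    if (i < len) {
        *dst++ = tbl[in[i] >> 2];
        if (i + 1 < len) {
            *dst++ = tbl[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
            *dst++ = tbl[(in[i + 1] & 0x0f) << 2];
        } else {
            *dst++ = tbl[(in[i] & 0x03) << 4];
            *dst++ = '=';
        }
        *dst++ = '=';
    }
    *dst = '\0';
    return out;
}
```

Encoding the string "Aladdin:open sesame" this way yields "QWxhZGRpbjpvcGVuIHNlc2FtZQ==", the form in which a username and password pair travels in an HTTP Basic authentication header.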

char *ap_uudecode (pool *p, char *string)

ap_uudecode() reverses the effect of the previous function, transforming a base64-encoded string into its original representation.

char *decoded = ap_uudecode(p, encoded);
