Writing Apache Modules with Perl and C

By:	Lincoln Stein and Doug MacEachern
Published:	O'Reilly & Associates, Inc. - March 1999

Chapter 11 - C API Reference Guide, Part II
String and URI Manipulation

In this section...

Introduction

String Parsing Functions

String Comparison, Pattern Matching, and Transformation

Type Checking Macros

URI Parsing and Manipulation

Introduction

The Apache API provides an extensive set of functions for parsing and manipulating strings and URIs. Some of these routines are functionally identical to standard C library functions but provide either a performance boost or enhanced safety. Other routines provide completely new functionality.

String Parsing Functions

While Apache's library for parsing and manipulating character strings is not nearly as rich as Perl's text processing abilities, it is a vast improvement over what's available in the impoverished standard C library.

Most of the string parsing routines belong to the ap_getword* family, which together provide functionality similar to the Perl split() function. Each member of this family is able to extract a word from a string, splitting the text on delimiters such as whitespace or commas. Unlike Perl split(), in which the entire string is split at once and the pieces are returned in a list, the ap_getword* functions operate on one word at a time. The function returns the next word each time it's called and keeps track of where it's been by bumping up a pointer.

All of the ap_getword* routines are declared in httpd.h. The original declarations in httpd.h refer to the second argument as char **line. In the function prototypes that follow, we've changed the name of this argument to char **string in order to avoid the implication that the argument must always correspond to a single line of text.

char *ap_getword (pool *p, const char **string, char stop)

ap_getword() is the most frequently used member of this family. It takes a pointer to a char* and splits it into words at the delimiter given by the stop character. Each time the function is called it returns the next word, allocating a new string from the resource pool pointer p to hold the word. The char** is updated after each call so that it points to the place where the previous call left off.

Here is an example of using ap_getword() to split a URL query string into its component key/value pairs. ap_getword() is called in two different contexts. First it's called repeatedly to split the query string into words delimited by the & character. Then, each time through the loop, the function is called once again to split the word into its key/value components at the = delimiter. The names and values are then placed into a table to return to the caller:

while(*data && (val = ap_getword(r->pool, &data, '&'))) {
  key = ap_getword(r->pool, &val, '=');

  ap_unescape_url((char *)key);
  ap_unescape_url((char *)val);
  ap_table_merge(tab, key, val);
}
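
To experiment with the pointer-bumping behavior outside of Apache, the following is a minimal sketch of a stand-in for ap_getword(). The name sketch_getword() is hypothetical, and malloc() takes the place of the resource pool, so the caller must free() each word. It is meant only to illustrate the calling convention described above, not to reproduce Apache's implementation.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ap_getword(): returns a malloc()ed copy of the
 * next word and advances *string past it. Returns NULL if malloc() fails. */
char *sketch_getword(const char **string, char stop)
{
    const char *start = *string;
    const char *end = strchr(start, stop);
    size_t len = end ? (size_t)(end - start) : strlen(start);
    char *word = malloc(len + 1);

    if (word == NULL)
        return NULL;
    memcpy(word, start, len);
    word[len] = '\0';

    /* Bump the caller's pointer past the word and any run of delimiters. */
    start += len;
    while (stop != '\0' && *start == stop)
        ++start;
    *string = start;
    return word;
}
```

Calling sketch_getword(&data, '&') on the string a=1&b=2 returns a=1 and leaves data pointing at b=2; calling it again on the returned word with '=' as the delimiter separates the key from the value.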

This API also makes parsing HTTP cookies a breeze. In the following code fragment, util_parse_cookie() fetches the incoming HTTP cookies and parses them into a table. The incoming HTTP Cookie field, if present, contains one or more cookies separated by semicolons. Each cookie has the format name=value1&value2&value3, where the cookie's name is separated from a list of values by the = sign, and each value is, in turn, delimited by the & character. The values are escaped using the URI escaping rules in much the same way that CGI parameters are.

The code begins by retrieving the value of the Cookie header. It then splits it into individual name=value pairs using the ap_getword() function. After trimming whitespace, ap_getword() is called once more to split each cookie into its name and value parts and again a third time to split out the individual values. The values are unescaped with ap_unescape_url(), and the parsed name and values are then added to a growing table:

table *util_parse_cookie(request_rec *r)
{
   const char *data = ap_table_get(r->headers_in, "Cookie");
   table *cookies;
   const char *pair;
   if(!data) return NULL;

   cookies = ap_make_table(r->pool, 4);
   while(*data && (pair = ap_getword(r->pool, &data, ';'))) {
       const char *name, *value;
       if(*data == ' ') ++data;
       name = ap_getword(r->pool, &pair, '=');
       while(*pair && (value = ap_getword(r->pool, &pair, '&'))) {
           ap_unescape_url((char *)value);
           ap_table_add(cookies, name, value);
       }
   }

   return cookies;
}
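
The same three-level split (semicolons, then =, then &) can be tried without Apache. The sketch below is an illustration under stated assumptions: sketch_cookie_value_count() is a hypothetical helper, it works on a raw Cookie header string using strcspn() spans rather than the ap_getword() family, and it does not unescape anything. It returns the number of values carried by the named cookie, or -1 if the cookie is absent.

```c
#include <string.h>

/* Hypothetical helper: counts the '&'-separated values of the cookie called
 * name in a raw header value such as "a=1&2; b=3". Returns -1 if absent. */
int sketch_cookie_value_count(const char *header, const char *name)
{
    size_t name_len = strlen(name);

    while (*header) {
        size_t pair_len = strcspn(header, ";");   /* one name=values pair */
        size_t key_len = strcspn(header, "=;");   /* name ends at '=' */

        if (key_len == name_len && strncmp(header, name, name_len) == 0) {
            const char *v = header + key_len;
            const char *end = header + pair_len;
            int count = 0;

            if (v < end)
                ++v;                              /* step over '=' */
            while (v < end) {
                v += strcspn(v, "&;");            /* span of one value */
                ++count;
                if (v < end)
                    ++v;                          /* step over '&' */
            }
            return count;
        }

        header += pair_len;
        if (*header == ';')
            ++header;
        while (*header == ' ')
            ++header;                             /* trim space after ';' */
    }
    return -1;
}
```

A real module would copy and unescape each value, as util_parse_cookie() does; this version only shows where the boundaries fall.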

char *ap_getword_nc (pool *p, char **string, char stop)

This function is exactly the same as ap_getword(), but it accepts a non-const string pointer. Internally this routine shares all its code with ap_getword() and is simply provided as a convenience for avoiding a typecast.

char *ap_getword_nulls (pool *p, const char **string, char stop)

Unlike ap_getword(), which will skip multiple occurrences of the stop delimiter, ap_getword_nulls() preserves empty entries; that is, if the delimiter is a comma and the string looks like this:
larry,,curly
then ap_getword() ignores the empty entry between the first and last words, while ap_getword_nulls() will return an empty string the second time it is called.
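
The difference can be seen in a small standalone sketch. Both helper names below are hypothetical, and a caller-supplied buffer replaces the resource pool; the skip_runs flag selects between the ap_getword() behavior (nonzero) and the ap_getword_nulls() behavior (zero) as described above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the next word into buf (truncating to fit;
 * buflen must be at least 1) and advances *string. stop must not be '\0'. */
void sketch_next_word(const char **string, char stop, int skip_runs,
                      char *buf, size_t buflen)
{
    const char *start = *string;
    const char *end = strchr(start, stop);
    size_t len = end ? (size_t)(end - start) : strlen(start);
    size_t n = (len < buflen - 1) ? len : buflen - 1;

    memcpy(buf, start, n);
    buf[n] = '\0';

    start += len;
    if (*start != '\0') {
        ++start;                      /* step over the delimiter */
        while (skip_runs && *start == stop)
            ++start;                  /* ap_getword() style: swallow the run */
    }
    *string = start;
}

/* Counts how many words the usual while(*data ...) loop would produce. */
int sketch_count_words(const char *s, char stop, int skip_runs)
{
    char buf[64];
    int n = 0;

    while (*s) {
        sketch_next_word(&s, stop, skip_runs, buf, sizeof buf);
        ++n;
    }
    return n;
}
```

With the string larry,,curly and a comma delimiter, the loop produces two words when skip_runs is nonzero and three words, the middle one empty, when it is zero.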

char *ap_getword_nulls_nc (pool *p, char **string, char stop)

This function is the same as ap_getword_nulls(), except that it accepts a nonconstant string pointer.

char *ap_getword_white (pool *p, const char **string)

Because it is so common for a string of words to be delimited by variable amounts of whitespace, the ap_getword_white() function is provided for your use. In this case the delimiter is any run of whitespace characters: spaces, tabs, form-feeds, newlines, carriage returns, or vertical tabs. This function is particularly useful for processing whitespace-delimited configuration directives.

while(*data && (val = ap_getword_white(r->pool, &data))) {
  ...
}
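
As an illustration of whitespace splitting, here is a minimal sketch that relies on the standard isspace() test. sketch_word_white() is a hypothetical stand-in, not Apache's routine, and it writes into a caller-supplied buffer instead of allocating from a pool.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for ap_getword_white(): copies the next
 * whitespace-delimited word into buf (truncating to fit; buflen must be at
 * least 1) and advances *string past the whitespace that follows it. */
void sketch_word_white(const char **string, char *buf, size_t buflen)
{
    const char *s = *string;
    size_t n = 0;

    while (*s && !isspace((unsigned char)*s)) {
        if (n + 1 < buflen)
            buf[n++] = *s;
        ++s;
    }
    buf[n] = '\0';

    while (isspace((unsigned char)*s))
        ++s;                          /* swallow the whole run of whitespace */
    *string = s;
}
```

Applied repeatedly to a directive line such as "Alias   /icons/ /usr/local/icons/", it yields Alias, /icons/, and /usr/local/icons/ in turn, regardless of how many spaces or tabs separate them.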

char *ap_getword_white_nc (pool *p, char **string)

This function is exactly the same as ap_getword_white(), but it accepts a nonconstant string pointer.

char *ap_getword_conf (pool *p, const char **string)

This function is much like ap_getword_white(), but it takes single- and double-quoted strings into account as well as whitespace escaped with backslashes. This is the routine used internally to process Apache's configuration files.
During processing, the quotes and backslashes are stripped from the word. For example, given the following string, ap_getword_conf() will return Hello World on the first pass and Example on the second:
"Hello World" Example
If a backslash were present before the space preceding Example, the entire string would be treated as a single word and returned on the first call to ap_getword_conf().
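
The quote handling can be sketched in a few lines of standalone C. This is a deliberately simplified illustration: sketch_word_conf() is a hypothetical name, it understands only single and double quotes plus whitespace, and backslash escapes are not modeled at all.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for ap_getword_conf(). Copies the next
 * word into buf (truncating to fit; buflen must be at least 1), stripping
 * surrounding quotes. Backslash escapes are NOT handled in this sketch. */
void sketch_word_conf(const char **string, char *buf, size_t buflen)
{
    const char *s = *string;
    size_t n = 0;

    while (isspace((unsigned char)*s))
        ++s;                              /* skip leading whitespace */

    if (*s == '"' || *s == '\'') {
        char quote = *s++;                /* remember which quote opened */
        while (*s && *s != quote) {
            if (n + 1 < buflen)
                buf[n++] = *s;
            ++s;
        }
        if (*s == quote)
            ++s;                          /* drop the closing quote */
    } else {
        while (*s && !isspace((unsigned char)*s)) {
            if (n + 1 < buflen)
                buf[n++] = *s;
            ++s;
        }
    }
    buf[n] = '\0';

    while (isspace((unsigned char)*s))
        ++s;                              /* skip trailing whitespace */
    *string = s;
}
```

Given the string "Hello World" Example, the first call fills the buffer with Hello World and the second with Example.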

char *ap_getword_conf_nc (pool *p, char **string)

This function is exactly the same as ap_getword_conf(), but it accepts a nonconstant string pointer.

char *ap_get_token (pool *p, const char **string, int accept_white)

This function is generally used to parse multivalued HTTP headers, which are delimited by commas or semicolons. If the accept_white parameter is nonzero, then whitespace will also be treated as a delimiter. Substrings enclosed in quotes are treated as single words, and, like ap_getword_conf(), the quotes are stripped from the return value. However, unlike ap_getword_conf(), backslashes are not honored. Regardless of the setting of accept_white, leading and trailing spaces are always stripped from the return value.
The mod_negotiation module makes heavy use of this function to parse Accept and Accept-language headers.
Here is a typical example of using this function to extract all the words in the string stored in data:

while(*data && (val = ap_get_token(r->pool, &data, 0))) {
  ...
}

int ap_find_token (pool *p, const char *string, const char *tok)
int ap_find_last_token (pool *p, const char *string, const char *tok)

These two functions are used for searching for particular tokens within HTTP header fields. A token is defined by RFC 2068 as a case-insensitive word delimited by the following separators:

      separators     = "(" | ")" | "<" | ">" | "@"
                    | "," | ";" | ":" | "\" | <">
                    | "/" | "[" | "]" | "?" | "="
                    | "{" | "}" | SP | HT

ap_find_token() will return true if any token in the specified string matches the third argument, tok. ap_find_last_token() will return true if the last token in the string matches tok. Both functions match the token substring in a case-insensitive manner. This is useful if you want to search HTTP headers that contain multiple values, without having to parse through the whitespace, quotation marks, and other delimiter characters on your own. For example, this code fragment shows one way to detect the presence of a gzip token in the HTTP header Accept-encoding:

if(ap_find_token(r->pool, ap_table_get(r->headers_in, "Accept-encoding"), "gzip")) {
  /* we could do some on-the-fly compression */
}
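
To make the token rules concrete, here is a minimal standalone sketch of a case-insensitive token search over the separator list quoted above. The names is_separator() and sketch_find_token() are hypothetical, no pool argument is needed because nothing is allocated, and the sketch assumes a token must match tok over its full length; it is not a copy of Apache's implementation.

```c
#include <ctype.h>
#include <string.h>

/* True if c is one of the RFC 2068 separators listed above (SP and HT
 * included). NUL is excluded explicitly because strchr() would match it. */
int is_separator(unsigned char c)
{
    return c != '\0' && strchr("()<>@,;:\\\"/[]?={} \t", c) != NULL;
}

/* Hypothetical stand-in for ap_find_token(): returns 1 if any token in line
 * equals tok, ignoring case; returns 0 otherwise or if line is NULL. */
int sketch_find_token(const char *line, const char *tok)
{
    size_t tok_len = strlen(tok);

    if (line == NULL)
        return 0;

    while (*line) {
        const char *start;
        size_t len, i;

        while (is_separator((unsigned char)*line))
            ++line;                       /* skip separators before a token */
        start = line;
        while (*line && !is_separator((unsigned char)*line))
            ++line;                       /* find the end of the token */
        len = (size_t)(line - start);

        if (len > 0 && len == tok_len) {
            for (i = 0; i < len; ++i) {
                if (tolower((unsigned char)start[i]) !=
                    tolower((unsigned char)tok[i]))
                    break;
            }
            if (i == len)
                return 1;
        }
    }
    return 0;
}
```

Because the hyphen is not a separator, a header value of x-gzip does not match a search for gzip, while compress;q=0.5, GZIP;q=1.0 does.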
