
Writing Apache Modules with Perl and C
By:   Lincoln Stein and Doug MacEachern
Published:   O'Reilly & Associates, Inc.  - March 1999

Copyright © 1999 by O'Reilly & Associates, Inc.



Chapter 11 - C API Reference Guide, Part II / String and URI Manipulation
URI Parsing and Manipulation

In addition to the general string manipulation routines described above, Apache provides specific routines for manipulating URIs. With these routines you can break a URI into its components and put it back together again.

The main data structure used by these routines is the uri_components struct. The typedef for uri_components is found in the util_uri.h header file and reproduced in Example 11-5. For your convenience, a preparsed uri_components struct is contained in every incoming request, in the field parsed_uri. The various fields of the parsed URI are as follows:

char *scheme

This field contains the URI's scheme. Possible values include http, https, ftp, and file.

char *hostinfo

This field contains the part of the URI between the pair of initial slashes and the beginning of the document path. It is often just the hostname for the request, but its full form includes the port and the username/password combination needed to gain access under certain protocols (such as nonanonymous FTP). Here's an example hostinfo string that shows all the optional parts:

user:passwd@host:port

char *user

This field contains the username part of the hostinfo field or an empty string if absent.

char *password

This field contains the password part of the hostinfo field or an empty string if absent.

char *port_str

This field contains the string representation of the port. You can fetch the numeric representation from the port field.

char *path

This field corresponds to the path portion of the URI, namely everything after the hostinfo. Neither the query string (the optional text that follows the ? symbol) nor the optional #anchor names that appear at the ends of many HTTP URLs are part of the path. It is equivalent to r->uri.

char *query

The query field holds the query string, that is, everything after the ? in the path but not including the #anchor fragment, if any. It is equivalent to r->args.

char *fragment

This field contains the #anchor fragment, if any. The # symbol itself is omitted.

unsigned short port

port holds the port number of the URI, in integer form. For the same information in text form, see port_str.

The other fields in the uri_components record are for internal use only and are not to be relied on.

Example 11-5. The uri_components Data Type

typedef struct {
  char *scheme;       /* scheme ("http"/"ftp"/...) */
  char *hostinfo;     /* combined [user[:password]@]host[:port] */
  char *user;         /* user name, as in http://user:passwd@host:port/ */
  char *password;     /* password, as in http://user:passwd@host:port/ */
  char *hostname;     /* hostname from URI (or from Host: header) */
  char *port_str;     /* port string (integer representation is in "port") */
  char *path;         /* the request path
                        (or "/" if only scheme://host was given) */
  char *query;        /* Everything after a '?' in the path, if present */
  char *fragment;     /* Trailing "#fragment" string, if present */
  struct hostent *hostent;
  unsigned short port;  /* The port number, numeric,
                           valid only if port_str != NULL */
  unsigned is_initialized:1;
  unsigned dns_looked_up:1;
  unsigned dns_resolved:1;
} uri_components;
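To make the relationship between hostinfo and the user, password, hostname, and port_str fields concrete, here is a minimal, self-contained sketch that splits a string of the form [user[:password]@]host[:port]. The hostinfo_parts struct and the split_hostinfo() function are hypothetical names invented for this illustration. They are not part of the Apache API, and Apache's own parser may treat edge cases differently.

```c
#include <string.h>

/* Hypothetical holder for the pieces of a hostinfo string (not an Apache type). */
struct hostinfo_parts {
    char user[64];
    char password[64];
    char hostname[128];
    char port_str[16];
};

/* Copy at most dstsize-1 bytes of src[0..len) into dst and terminate it. */
static void copy_span(char *dst, size_t dstsize, const char *src, size_t len)
{
    if (len >= dstsize)
        len = dstsize - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split "[user[:password]@]host[:port]"; absent parts become empty strings. */
void split_hostinfo(const char *hostinfo, struct hostinfo_parts *out)
{
    const char *at = strrchr(hostinfo, '@');   /* last '@' ends the user info */
    const char *host = at ? at + 1 : hostinfo;
    const char *colon;

    memset(out, 0, sizeof *out);

    if (at) {
        /* Look for the user/password separator before the '@' only. */
        colon = memchr(hostinfo, ':', (size_t)(at - hostinfo));
        if (colon) {
            copy_span(out->user, sizeof out->user,
                      hostinfo, (size_t)(colon - hostinfo));
            copy_span(out->password, sizeof out->password,
                      colon + 1, (size_t)(at - colon - 1));
        } else {
            copy_span(out->user, sizeof out->user,
                      hostinfo, (size_t)(at - hostinfo));
        }
    }

    colon = strchr(host, ':');                 /* host/port separator */
    if (colon) {
        copy_span(out->hostname, sizeof out->hostname,
                  host, (size_t)(colon - host));
        copy_span(out->port_str, sizeof out->port_str,
                  colon + 1, strlen(colon + 1));
    } else {
        copy_span(out->hostname, sizeof out->hostname, host, strlen(host));
    }
}
```

In a real module you would not need any of this, because the request record's parsed_uri field already holds the separated components.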

In addition to the uri_components record located in the request record's parsed_uri field, you can access Apache's URI parsing and manipulation package using a series of routines variously declared in httpd.h and util_uri.h:

int ap_unescape_url (char *url)

(Declared in the header file httpd.h.) This routine will unescape URI hex escapes. The unescaping is performed in place, replacing the original string. During the unescaping process, Apache performs some basic consistency checking on the URI and returns the result of this check as the function result code. The function will return HTTP_BAD_REQUEST if it encounters an invalid hex escape (for example, %1g), and HTTP_NOT_FOUND if replacing a hex escape with its text equivalent results in either the character / or \0. If the URI passes these checks, the function returns OK.

if (ap_unescape_url(url) != OK) {
   ap_log_error(APLOG_MARK, APLOG_NOERRNO|APLOG_ERR,
                r->server, "bad URI during unescaping");
}
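The checks described above can be sketched in standalone C. The following is an illustration of the documented behavior, not Apache's source code. The function name sketch_unescape_url() and the SK_* constants are hypothetical stand-ins for ap_unescape_url() and for OK, HTTP_BAD_REQUEST, and HTTP_NOT_FOUND. Giving an invalid escape priority over a forbidden character is an assumption of the sketch.

```c
#include <ctype.h>

/* Hypothetical stand-ins for Apache's OK, HTTP_BAD_REQUEST, and HTTP_NOT_FOUND. */
enum { SK_OK = 0, SK_BAD_REQUEST = 400, SK_NOT_FOUND = 404 };

/* Convert one hex digit, already validated with isxdigit(), to its value. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower((unsigned char)c) - 'a' + 10;
}

/* Unescape %XX sequences in place and report problems as the text describes. */
int sketch_unescape_url(char *url)
{
    char *src = url, *dst = url;
    int bad_escape = 0, bad_path = 0;

    while (*src) {
        if (*src != '%') {
            *dst++ = *src++;
        } else if (!isxdigit((unsigned char)src[1]) ||
                   !isxdigit((unsigned char)src[2])) {
            bad_escape = 1;          /* e.g. "%1g": keep the '%' literally */
            *dst++ = *src++;
        } else {
            char c = (char)(hex_value(src[1]) * 16 + hex_value(src[2]));
            if (c == '/' || c == '\0')
                bad_path = 1;        /* "%2F" or "%00" */
            *dst++ = c;
            src += 3;
        }
    }
    *dst = '\0';

    if (bad_escape)
        return SK_BAD_REQUEST;
    return bad_path ? SK_NOT_FOUND : SK_OK;
}
```

Because the unescaped text is never longer than the escaped text, the sketch can safely write its result over the input buffer, which is why the routine works in place.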

char *ap_os_escape_path (pool *p, const char *path, int partial)

(Declared in the header file httpd.h.) ap_os_escape_path() takes a filesystem pathname in path and converts it into a properly escaped URI in an operating system-dependent way, returning the new string as its function result. If the partial flag is false, then the function will add a / to the beginning of the URI if the path does not already begin with one. If the partial flag is true, the function will not add the slash.

char *escaped = ap_os_escape_path(p, url, 1);

int ap_is_url (const char *string)

(Declared in the header file httpd.h.) This function returns true if string is a fully qualified URI (including scheme and hostname), false otherwise. Among other things it is handy when processing configuration directives that are expected to accept URIs.

if (ap_is_url(string)) { /* treat string as a fully qualified URI */ }

char *ap_construct_url (pool *p, const char *uri, const request_rec *r)

This function builds a fully qualified URI string from the path specified by uri, using the information stored in the request record r to determine the server name and port. The port number is not included in the string if it is the same as the default port 80.

For example, imagine that the current request is directed to the virtual server www.modperl.com at port 80. Then the following call will return the string http://www.modperl.com/index.html:

char *url = ap_construct_url(r->pool, "/index.html", r);

char *ap_construct_server (pool *p, const char *hostname, unsigned port, const request_rec *r)

(Declared in the header file httpd.h.) The ap_construct_server() function builds the hostname:port part of a URI and returns it as a new string. The port will not be included in the string if it is the same as the default. You provide a resource pool in p, the name of the host in hostname, the port number in port, and the current request record in r. The request record is used to determine the default port number only and is not otherwise involved in constructing the string.

For example, assuming that hostname holds the string www.modperl.com, the following code will return www.modperl.com:8001:

char *server = ap_construct_server(r->pool, hostname, 8001, r);
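The rule that the port is left out when it matches the default can be shown with a small standalone sketch. The function sketch_construct_server() is a hypothetical name. Unlike ap_construct_server(), it writes into a caller-supplied buffer instead of a resource pool, and it takes the default port as an explicit argument rather than deriving it from a request record.

```c
#include <stdio.h>

/* Write "hostname" or "hostname:port" into buf, omitting the port when it
 * equals default_port. Returns buf for convenience. */
char *sketch_construct_server(char *buf, size_t bufsize, const char *hostname,
                              unsigned port, unsigned default_port)
{
    if (port == default_port)
        snprintf(buf, bufsize, "%s", hostname);
    else
        snprintf(buf, bufsize, "%s:%u", hostname, port);
    return buf;
}
```

Called with the hostname www.modperl.com, port 8001, and a default of 80, the sketch produces www.modperl.com:8001. With port 80 it produces just www.modperl.com.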

unsigned short ap_default_port_for_scheme (const char *scheme)

(Declared in the header file util_uri.h.) This handy routine returns the default port number for the given URL scheme. The scheme you provide is compared in a case-insensitive manner to an internal list maintained by Apache. For example, here's how to determine the default port for the secure HTTPS scheme:

unsigned short port = ap_default_port_for_scheme("https");
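A case-insensitive scheme lookup of this kind might look like the following standalone sketch. The table contents, the function names, and the choice to return 0 for an unknown scheme are all assumptions made for illustration. Apache maintains its own internal list, which is not reproduced here.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical scheme table; Apache maintains its own internal list. */
static const struct { const char *name; unsigned short port; } sketch_schemes[] = {
    { "ftp", 21 }, { "http", 80 }, { "https", 443 }
};

/* Case-insensitive string equality using only standard C. */
static int equal_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings ended together */
}

/* Return the default port for scheme, or 0 if the scheme is unknown
 * (the 0 return is an assumption of this sketch). */
unsigned short sketch_default_port_for_scheme(const char *scheme)
{
    size_t i;
    for (i = 0; i < sizeof sketch_schemes / sizeof sketch_schemes[0]; i++) {
        if (equal_nocase(scheme, sketch_schemes[i].name))
            return sketch_schemes[i].port;
    }
    return 0;
}
```

With this table, both "https" and "HTTPS" map to port 443.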

unsigned short ap_default_port_for_request (const request_rec *r)

(Declared in the header file util_uri.h.) The ap_default_port_for_request() function looks up the scheme from the request record argument, then calls ap_default_port_for_scheme() to return the default port for that scheme. It is almost exactly equivalent to calling ap_default_port_for_scheme(r->parsed_uri.scheme).

unsigned short port = ap_default_port_for_request(r);

struct hostent * ap_pgethostbyname (pool *p, const char *hostname)

(Declared in the header file util_uri.h.) This function is a wrapper around the standard gethostbyname() function. The struct hostent pointer normally returned by the standard function lives in static storage space, so ap_pgethostbyname() copies the structure into memory allocated from the passed resource pool, where later lookups cannot overwrite it. This allows the call to be thread-safe.

int ap_parse_uri_components (pool *p, const char *uri, uri_components *uptr)

(Declared in the header file util_uri.h.) Given a pool pointer p, a URI uri, and a uri_components structure pointer uptr, this routine will parse the URI and place the extracted components in the appropriate fields of uptr. The return value is either HTTP_OK (integer 200, not to be confused with the usual OK which is integer 0) to indicate parsing success or HTTP_BAD_REQUEST to indicate that the string did not look like a valid URI.

uri_components uri;
int rc = ap_parse_uri_components(p, "http://www.modperl.com/index.html", &uri);
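To show roughly what such a parse involves, and why the result code must be compared against HTTP_OK rather than OK, here is a simplified standalone sketch. The sketch_uri struct, the sketch_parse_uri() function, and the SK_* constants are hypothetical. The sketch handles only absolute URIs of the form scheme://hostinfo/path?query#fragment and uses fixed-size buffers, so it is far less general than ap_parse_uri_components().

```c
#include <string.h>

/* Hypothetical stand-ins for Apache's HTTP_OK and HTTP_BAD_REQUEST. */
enum { SK_HTTP_OK = 200, SK_HTTP_BAD_REQUEST = 400 };

/* Hypothetical, fixed-size cousin of uri_components (not an Apache type). */
struct sketch_uri {
    char scheme[16];
    char hostinfo[128];
    char path[256];
    char query[256];
    char fragment[64];
};

/* Copy src[0..len) into dst; return 0 if it does not fit. */
static int take(char *dst, size_t dstsize, const char *src, size_t len)
{
    if (len >= dstsize)
        return 0;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return 1;
}

/* Split "scheme://hostinfo/path?query#fragment" into its parts. */
int sketch_parse_uri(const char *uri, struct sketch_uri *out)
{
    const char *sep = strstr(uri, "://");
    const char *rest;
    size_t n;

    memset(out, 0, sizeof *out);
    if (!sep || sep == uri ||
        !take(out->scheme, sizeof out->scheme, uri, (size_t)(sep - uri)))
        return SK_HTTP_BAD_REQUEST;

    rest = sep + 3;
    n = strcspn(rest, "/?#");            /* hostinfo ends where the path starts */
    if (n == 0 || !take(out->hostinfo, sizeof out->hostinfo, rest, n))
        return SK_HTTP_BAD_REQUEST;
    rest += n;

    n = strcspn(rest, "?#");             /* path ends at '?' or '#' */
    if (n == 0)
        strcpy(out->path, "/");          /* only scheme://host was given */
    else if (!take(out->path, sizeof out->path, rest, n))
        return SK_HTTP_BAD_REQUEST;
    rest += n;

    if (*rest == '?') {
        rest++;
        n = strcspn(rest, "#");          /* query stops before the fragment */
        if (!take(out->query, sizeof out->query, rest, n))
            return SK_HTTP_BAD_REQUEST;
        rest += n;
    }
    if (*rest == '#' &&
        !take(out->fragment, sizeof out->fragment, rest + 1, strlen(rest + 1)))
        return SK_HTTP_BAD_REQUEST;

    return SK_HTTP_OK;                   /* 200, not 0 */
}
```

A caller that tests the result with `if (rc != 0)` would treat every successful parse as a failure, which is the confusion the text warns about.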

char *ap_unparse_uri_components (pool *p, const uri_components *uptr, unsigned flags)

(Declared in the header file util_uri.h.) The interesting ap_unparse_uri_components() routine reverses the effect of the previous call, using a populated uri_components record to create a URI string, which is returned as the function result. The flags argument is a bit mask of options that modify the constructed URI string. Possible values for flags include:

UNP_OMITSITEPART

Suppress the scheme and hostinfo parts from the constructed URI.

UNP_OMITUSER

Suppress the username from the hostinfo part of the URI.

UNP_OMITPASSWORD

Suppress the password from the hostinfo part of the URI.

UNP_REVEALPASSWORD

For security reasons, unless the UNP_REVEALPASSWORD bit is explicitly set, the password part of the URI will be replaced with a series of X characters.

UNP_OMITPATHINFO

If this bit is set, completely suppress the path part of the URI, including the query string.

UNP_OMITQUERY

Suppress the query string and the fragment, if any.

The following example will re-create the URI without the username and password parts.

char *string = ap_unparse_uri_components(p, &uri,
                                         UNP_OMITUSER|UNP_OMITPASSWORD);
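The way a bit mask of options shapes the rebuilt URI can be illustrated with a standalone sketch. The SK_* flag bits and the sketch_unparse() function are hypothetical and are not Apache's UNP_* values or its implementation. The sketch covers only the user and password options, and the run of eight X characters is an arbitrary choice.

```c
#include <stdio.h>

/* Hypothetical flag bits for this sketch; they are not Apache's UNP_* values. */
enum {
    SK_OMITUSER       = 1 << 0,
    SK_OMITPASSWORD   = 1 << 1,
    SK_REVEALPASSWORD = 1 << 2
};

/* Build "scheme://[user[:password]@]host/path" into buf according to flags.
 * Empty user or password strings are treated as absent. */
char *sketch_unparse(char *buf, size_t bufsize, const char *scheme,
                     const char *user, const char *password,
                     const char *host, const char *path, unsigned flags)
{
    int show_user = user[0] && !(flags & SK_OMITUSER);
    int show_pass = password[0] && !(flags & SK_OMITPASSWORD);
    /* Mask the password unless the caller explicitly asks to reveal it. */
    const char *shown = (flags & SK_REVEALPASSWORD) ? password : "XXXXXXXX";

    snprintf(buf, bufsize, "%s://%s%s%s%s%s%s", scheme,
             show_user ? user : "",
             show_pass ? ":" : "",
             show_pass ? shown : "",
             (show_user || show_pass) ? "@" : "",
             host, path);
    return buf;
}
```

Passing SK_OMITUSER|SK_OMITPASSWORD drops the whole user:password@ prefix, while passing no flags keeps the username and masks the password.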