

11.14. Parsing a Web Server Log File

11.14.2. Solution

Open the file and parse each line with a regular expression that matches the log file format. This regular expression matches the NCSA Combined Log Format:

$pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) ' .
    '([0-9\-]+) "(.*)" "(.*)"$/';
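As a quick check, the pattern can be tried against a single line with preg_match( ). This is only a minimal sketch: the pattern is written on one line, and the sample line is hypothetical (192.0.2.1 is a made-up documentation address, not taken from a real log).

```php
<?php
// Combined Log Format pattern, written as a single-line string
$pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) ([0-9\-]+) "(.*)" "(.*)"$/';

// hypothetical sample line; 192.0.2.1 is a made-up address
$line = '192.0.2.1 - david [20/Jul/2001:13:05:02 -0400] "GET /sklar.css HTTP/1.0" '
      . '200 278 "-" "Mozilla/4.77 [en] (WinNT; U)"';

if (preg_match($pattern, $line, $matches)) {
    // $matches[0] is the whole line; the captured fields start at index 1
    printf("host=%s method=%s uri=%s status=%s\n",
           $matches[1], $matches[5], $matches[6], $matches[8]);
} else {
    print "no match\n";
}
```

The pattern has eleven capturing groups, so a successful match fills $matches with twelve elements, the first being the entire line.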

11.14.3. Discussion

This program parses the NCSA Combined Log Format lines and displays a list of pages sorted by the number of requests for each page:

$log_file = '/usr/local/apache/logs/access.log';
$pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) ' .
    '([0-9\-]+) "(.*)" "(.*)"$/';

$fh = fopen($log_file,'r') or die("Can't open $log_file");
$i = 1;
$requests = array();
while (! feof($fh)) {
    // read each line and trim off leading/trailing whitespace
    if ($s = trim(fgets($fh,16384))) {
        // match the line to the pattern
        if (preg_match($pattern,$s,$matches)) {
            /* put each part of the match in an appropriately-named
             * variable */
            list($whole_match,$remote_host,$logname,$user,$time,
                 $method,$request,$protocol,$status,$bytes,$referer,
                 $user_agent) = $matches;
            // keep track of the count of each request
            if (! isset($requests[$request])) {
                $requests[$request] = 0;
            }
            $requests[$request]++;
        } else {
            // complain if the line didn't match the pattern
            error_log("Can't parse line $i: $s");
        }
    }
    // advance the line number used in error messages
    $i++;
}
fclose($fh) or die("Can't close $log_file");

// sort the array (in reverse) by number of requests
arsort($requests);

// print formatted results
foreach ($requests as $request => $accesses) {
    printf("%6d   %s\n",$accesses,$request);
}

The pattern used in preg_match( ) matches Combined Log Format lines such as:

    - david [20/Jul/2001:13:05:02 -0400] "GET /sklar.css HTTP/1.0" 200 278 "-" "Mozilla/4.77 [en] (WinNT; U)"
    - - [14/Mar/2002:13:31:37 -0500] "GET /php-cookbook/colors.html HTTP/1.1" 200 460 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"

The first field of each line is the IP address that the request came from. Depending on the server configuration, this could be a hostname instead. When the $matches array is assigned to the list of separate variables, this value is stored in $remote_host. The hyphen (-) that follows means that the remote host didn't supply a username via identd,[7] so $logname is set to -.

[7]identd, defined in RFC 1413, is supposed to be a good way to identify users remotely. However, it's not very secure or reliable. A good explanation of why is at http://www.clock.org/~fair/opinion/identd.html.

The string david is a username provided by the browser using HTTP Basic Authentication and is put in $user. The date and time of the request, stored in $time, is in brackets. This date and time format isn't understood by strtotime( ), so if you want to do calculations based on request date and time, you have to do some further processing to extract each piece of the formatted time string. Next, in quotes, is the first line of the request. This is composed of the method (GET, POST, HEAD, etc.), which is stored in $method; the requested URI, which is stored in $request; and the protocol, which is stored in $protocol. For GET requests, the query string is part of the URI. For POST requests, the request body that contains the variables isn't logged.
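One way to do that further processing, assuming PHP 5.3 or later (newer than the program above requires), is DateTime::createFromFormat( ). The sketch below is an illustration rather than part of the original recipe, and the helper name parse_log_time( ) is hypothetical.

```php
<?php
// Hypothetical helper: turn a bracketed log timestamp into a DateTime object.
// Returns false if the string isn't in the expected format.
function parse_log_time($time) {
    // drop the surrounding [ and ] captured by the pattern
    $bare = trim($time, '[]');
    // d/M/Y:H:i:s is day/month abbreviation/year:hour:minute:second;
    // O is a UTC offset such as -0400
    return DateTime::createFromFormat('d/M/Y:H:i:s O', $bare);
}

$when = parse_log_time('[20/Jul/2001:13:05:02 -0400]');
if ($when !== false) {
    // date and time as logged, in the log's own UTC offset
    print $when->format('Y-m-d H:i:s') . "\n";
    // Unix timestamp, convenient for comparisons and date ranges
    print $when->getTimestamp() . "\n";
}
```

Because the result carries the offset from the log line, timestamps written under different offsets can be compared directly through getTimestamp( ).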

After the request comes the request status, stored in $status. Status 200 means the request was successful. After the status is the size in bytes of the response, stored in $bytes. The last two elements of the line, each in quotes, are the referring page, if any, stored in $referer,[8] and the user agent string identifying the browser that made the request, stored in $user_agent.

[8]The correct way to spell this word is "referrer." However, since the original HTTP specification (RFC 1945) misspelled it as "referer," the three-R spelling is frequently used in context.

Once the log file line has been parsed into distinct variables, you can do the needed calculations. In this case, just keep a counter in the $requests array of how many times each URI is requested. After looping through all lines in the file, print out a sorted, formatted list of requests and counts.
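The counting step can be sketched on its own, which also shows how the same counter works for any captured field. This is an illustration under stated assumptions: count_field( ) is a hypothetical helper, the pattern is written on one line, and the sample lines (including the 192.0.2.x addresses and the ExampleAgent string) are made up.

```php
<?php
$pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) ([0-9\-]+) "(.*)" "(.*)"$/';

// Hypothetical helper: count how often each value of one captured field occurs.
// $index is a position in $matches (6 = requested URI, 8 = status).
function count_field($lines, $pattern, $index) {
    $counts = array();
    foreach ($lines as $line) {
        if (preg_match($pattern, trim($line), $matches)) {
            $key = $matches[$index];
            // start the counter at zero the first time a value is seen
            if (! isset($counts[$key])) {
                $counts[$key] = 0;
            }
            $counts[$key]++;
        }
    }
    // highest counts first, keeping each value attached to its count
    arsort($counts);
    return $counts;
}

// made-up sample lines; the addresses and user agent are placeholders
$lines = array(
    '192.0.2.1 - - [14/Mar/2002:13:31:37 -0500] "GET /a.html HTTP/1.1" 200 460 "-" "ExampleAgent/1.0"',
    '192.0.2.2 - - [14/Mar/2002:13:31:40 -0500] "GET /b.html HTTP/1.1" 404 210 "-" "ExampleAgent/1.0"',
    '192.0.2.1 - - [14/Mar/2002:13:32:02 -0500] "GET /a.html HTTP/1.1" 200 460 "-" "ExampleAgent/1.0"'
);

foreach (count_field($lines, $pattern, 6) as $request => $accesses) {
    printf("%6d   %s\n", $accesses, $request);
}
```

Passing 8 instead of 6 tallies requests by status code. Note that PHP converts numeric strings such as "200" to integer array keys.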

Calculating statistics this way from web server access logs is easy, but it's not very flexible. The program needs to be modified for different kinds of reports, restricted date ranges, report formatting, and many other features. A better solution for comprehensive web site statistics is to use a program such as analog, available for free at http://www.analog.cx. It has many types of reports and configuration options that should satisfy just about every need you may have.


Copyright © 2003 O'Reilly & Associates. All rights reserved.