Web Automation (PHP Cookbook)

11.1. Introduction

Most of the time, PHP is part of a web server, sending content to browsers. Even when you run it from the command line, it usually performs a task and then prints some output. PHP can also play the role of a web browser, however, retrieving URLs and then operating on their content. Most recipes in this chapter cover retrieving URLs and processing the results, although a few cover other tasks, such as using templates and processing server logs.

There are four ways to retrieve a remote URL in PHP. Which one you choose depends on your needs for simplicity, control, and portability. The four methods are fopen( ), fsockopen( ), the cURL extension, and the HTTP_Request class from PEAR.

Using fopen( ) is simple and convenient. We discuss it in Recipe 11.2. The fopen( ) function automatically follows redirects, so if you use this function to retrieve the directory http://www.example.com/people and the server redirects you to http://www.example.com/people/, you'll get the contents of the directory index page, not a message telling you that the URL has moved. The fopen( ) function also works with both HTTP and FTP. The fopen( ) function has some downsides, though: it can handle only HTTP GET requests (not HEAD or POST), you can't send additional headers or cookies with the request, and you can retrieve only the response body, not the response headers.
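As a minimal sketch of the fopen( ) approach, the following helper reads everything a URL returns into a string. The name fetch_with_fopen( ) is our own invention for illustration, and the code assumes a current PHP version with the allow_url_fopen setting enabled; it isn't necessarily how Recipe 11.2 does it.

```php
<?php
// fetch_with_fopen() is a hypothetical helper name, not a PHP built-in.
// It returns the body found at $url, which may also be a local file path.
function fetch_with_fopen(string $url): string
{
    // @ hides the warning fopen() raises on failure; we test the result instead.
    $fh = @fopen($url, 'r');
    if ($fh === false) {
        throw new RuntimeException("Can't open $url");
    }
    // Read until end of stream. Only the body is available this way.
    $body = stream_get_contents($fh);
    fclose($fh);
    if ($body === false) {
        throw new RuntimeException("Can't read from $url");
    }
    return $body;
}

// Example call (needs network access, so it is left commented out):
// print fetch_with_fopen('http://www.example.com/people');
```

Because fopen( ) treats URLs and filenames alike, the same helper also reads a local file, which is handy when testing without a network connection.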

Using fsockopen( ) requires more work but gives you more flexibility. We use fsockopen( ) in Recipe 11.3. After opening a socket with fsockopen( ), you need to print the appropriate HTTP request to that socket and then read and parse the response. This lets you add headers to the request and gives you access to all the response headers. However, you need to have additional code to properly parse the response and take any appropriate action, such as following a redirect.
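The steps above (open a socket, print a request, read and parse the response) might be sketched as follows. All three function names are hypothetical, and the sketch assumes a plain HTTP/1.0 request on port 80 so that the server closes the connection when the response is complete. It does not follow redirects; that remains the caller's job.

```php
<?php
// Build a minimal HTTP/1.0 GET request. Extra headers are name => value pairs.
function build_get_request(string $host, string $path, array $headers = []): string
{
    $request = "GET $path HTTP/1.0\r\nHost: $host\r\n";
    foreach ($headers as $name => $value) {
        $request .= "$name: $value\r\n";
    }
    return $request . "\r\n"; // a blank line ends the header section
}

// Split a raw response into its status line, headers, and body.
function parse_response(string $raw): array
{
    $parts = explode("\r\n\r\n", $raw, 2);
    $lines = explode("\r\n", $parts[0]);
    $status = array_shift($lines);
    $headers = [];
    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip anything that isn't "Name: value"
        }
        [$name, $value] = explode(':', $line, 2);
        // Lowercase the names so lookups are case-insensitive.
        // A repeated header keeps only its last value in this sketch.
        $headers[strtolower(trim($name))] = trim($value);
    }
    return ['status' => $status, 'headers' => $headers, 'body' => $parts[1] ?? ''];
}

// Tie the pieces together: connect, send the request, read the reply.
function fetch_with_fsockopen(string $host, string $path, array $headers = []): array
{
    $fp = @fsockopen($host, 80, $errno, $errstr, 15);
    if ($fp === false) {
        throw new RuntimeException("Can't connect to $host: $errstr ($errno)");
    }
    fwrite($fp, build_get_request($host, $path, $headers));
    $raw = (string) stream_get_contents($fp);
    fclose($fp);
    return parse_response($raw);
}

// Example call (needs network access, so it is left commented out):
// $r = fetch_with_fsockopen('www.example.com', '/people', ['Cookie' => 'a=b']);
// print $r['status'];
```

If the status line reports a redirect, the new address is in $r['headers']['location'], and the program has to issue a second request itself.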

If you have access to the cURL extension or PEAR's HTTP_Request class, you should use those rather than fsockopen( ). cURL supports a number of different protocols (including HTTPS, discussed in Recipe 11.6) and gives you access to response headers. We use cURL in most of the recipes in this chapter. To use cURL, you must have the cURL library installed, available at http://curl.haxx.se. Also, PHP must be built with the --with-curl configuration option.
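For comparison, here is a rough sketch of a cURL-based fetch that follows redirects and returns the response headers along with the body. The helper name fetch_with_curl( ) is hypothetical, and the code assumes the cURL extension is loaded; the recipes in this chapter may set different options.

```php
<?php
// fetch_with_curl() is a hypothetical helper name, not part of the extension.
function fetch_with_curl(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return data, don't print it
    curl_setopt($ch, CURLOPT_HEADER, true);         // include response headers
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects

    $response = curl_exec($ch);
    if ($response === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    // With redirects followed, the header block covers every response in the
    // chain; CURLINFO_HEADER_SIZE is the total length of all those headers.
    $header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return [
        'status'  => $status,
        'headers' => substr($response, 0, $header_size),
        'body'    => substr($response, $header_size),
    ];
}

// Example call (needs network access, so it is left commented out):
// $r = fetch_with_curl('http://www.example.com/people');
// print $r['headers'];
```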

PEAR's HTTP_Request class, which we use in Recipe 11.3, Recipe 11.4, and Recipe 11.5, doesn't support HTTPS, but does give you access to headers and can use any HTTP method. If this PEAR module isn't installed on your system, you can download it from http://pear.php.net/get/HTTP_Request. As long as the module's files are in your include_path, you can use it, making it a very portable solution.

Recipe 11.7 helps you go behind the scenes of an HTTP request to examine the headers in a request and response. If a request you're making from a program isn't giving you the results you're looking for, examining the headers often provides clues as to what's wrong.

Once you've retrieved the contents of a web page into a program, use Recipe 11.8 through Recipe 11.12 to help you manipulate those page contents. Recipe 11.8 demonstrates how to mark up certain words in a page with blocks of color. This technique is useful for highlighting search terms, for example. Recipe 11.9 provides a function to find all the links in a page. This is an essential building block for a web spider or a link checker. Converting between plain ASCII and HTML is covered in Recipe 11.10 and Recipe 11.11. Recipe 11.12 shows how to remove all HTML and PHP tags from a web page.
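To give a flavor of these manipulations, the sketch below pulls links out of a page with a regular expression, strips tags, and converts between plain text and HTML using PHP's built-in functions. The extract_links( ) helper and its pattern are our own simplification for illustration; they handle only quoted href attributes and are not the function from Recipe 11.9.

```php
<?php
// Hypothetical helper: return the href value of every <a> tag in $html.
// The pattern is deliberately simple and handles only quoted attributes.
function extract_links(string $html): array
{
    preg_match_all('/<a\s[^>]*href\s*=\s*["\']([^"\']+)["\']/i', $html, $matches);
    return $matches[1];
}

// A made-up page fragment to work on.
$page = <<<'HTML'
<p>See <a href="http://www.example.com/people/">people</a> &amp; <a class="x" href='/about'>about</a>.</p>
HTML;

$links = extract_links($page);                 // the two href values
$text = strip_tags($page);                     // tags removed, entities kept
$encoded = htmlspecialchars('a < b & c');      // plain text to HTML
$decoded = html_entity_decode('&lt;b&gt;');    // HTML entities to plain text
```

Note that strip_tags( ) removes markup but leaves entities such as &amp; untouched, so a full conversion to plain text needs a decoding step as well.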

Another kind of page manipulation is using a templating system. Discussed in Recipe 11.13, templates give you freedom to change the look and feel of your web pages without changing the PHP plumbing that populates the pages with dynamic data. Similarly, you can make changes to the code that drives the pages without affecting the look and feel. Recipe 11.14 discusses a common server administration task — parsing your web server's access log files.
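The separation of look and feel from PHP plumbing can be illustrated with a very small placeholder-substitution function. This is only a sketch under our own assumptions: the {name} placeholder syntax and the render_template( ) helper are invented here, and the templating system in Recipe 11.13 may work quite differently.

```php
<?php
// Hypothetical helper: replace each {name} placeholder in $template with the
// matching entry from $values, escaping it for safe display in HTML.
function render_template(string $template, array $values): string
{
    $replacements = [];
    foreach ($values as $name => $value) {
        $replacements['{' . $name . '}'] = htmlspecialchars((string) $value);
    }
    // strtr() makes all replacements in one pass, so a substituted value
    // is never scanned again for placeholders.
    return strtr($template, $replacements);
}

// The template holds the look and feel ...
$template = '<h1>{title}</h1><p>Hello, {name}!</p>';

// ... and the code supplies the dynamic data.
$html = render_template($template, ['title' => 'Welcome', 'name' => 'Tom & Jerry']);
```

A designer can change the markup in $template without touching the code, and the code can change where its values come from without touching the template.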

Two sample programs use the link extractor from Recipe 11.9. The program in Recipe 11.15 scans the links in a page and reports which are still valid, which have been moved, and which no longer work. The program in Recipe 11.16 reports on the freshness of links. It tells you when a linked-to page was last modified and if it's been moved.
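The core decisions these two programs make can be sketched as a pair of small functions: one that sorts an HTTP status code into valid, moved, or broken, and one that reads a Last-Modified header. Both function names are hypothetical, and the status ranges are a simplification (every 3xx code counts as "moved" here); the actual programs in Recipe 11.15 and Recipe 11.16 may classify responses differently.

```php
<?php
// Hypothetical helper: classify a link by the HTTP status code it returned.
function classify_link(int $status): string
{
    if ($status >= 200 && $status < 300) {
        return 'valid';
    }
    if ($status >= 300 && $status < 400) {
        return 'moved'; // simplification: treats every 3xx code as a move
    }
    return 'broken';
}

// Hypothetical helper: return the Last-Modified time as a Unix timestamp,
// or null if the header is missing or unparseable. Expects header names
// to be lowercased already.
function last_modified(array $headers): ?int
{
    if (!isset($headers['last-modified'])) {
        return null;
    }
    $timestamp = strtotime($headers['last-modified']);
    return $timestamp === false ? null : $timestamp;
}

// Made-up response headers for illustration.
$headers = ['last-modified' => 'Wed, 21 Oct 2015 07:28:00 GMT'];
$report = classify_link(301) . ', last modified ' . gmdate('Y-m-d', last_modified($headers));
```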
