


6.2. Regular Expression Techniques

Web pages are designed to be easy for humans to read, not for programs. Humans are very flexible in what they can read, and they can easily adapt to a new look and feel of the web page. But if the underlying HTML changes, a program written to extract information from the page will no longer work. Your challenge when writing a data-extraction program is to get a feel for the amount of natural variation between pages you'll want to download.

The following is a set of techniques to use when creating regular expressions to extract data from web pages. If you're an experienced Perl programmer, you probably know most or all of them and can skip ahead to Section 6.3, "Troubleshooting".

6.2.8. Use Multiple Steps

A common conceit among programmers is to try to do everything with one regular expression. Don't be afraid to use two or more. This has the same advantages as building your regular expression from components: by attempting to solve only one part of the problem at each step, you end up with a solution that is easier to read, debug, and maintain.

For example, the front page of http://www.oreillynet.com/ has several articles on it. Inspecting the HTML with View Source on the browser shows that each story looks like this:

<!-- itemtemplate -->
<p class="medlist"><b><a
href="/pub/a/dotnet/2002/03/04/rotor.html">Uncovering Rotor -- A Shared Source CLI</a></b>&nbsp;^M
 Recently, David Stutz and Stephen Walli hosted an informal, unannounced BOF at 
BSDCon 2002 about Microsoft's Shared Source implementation of the ECMA CLI, also 
known as Rotor. Although the source code for the Shared Source CLI wasn't yet 
available, the BOF offered a preview of what's to come, as well as details about its 
implementation and the motivation behind it. &nbsp;[<a
href="http://www.oreillynet.com/dotnet/">.NET DevCenter</a>]</p>

That is, the article starts with the itemtemplate comment and ends with the </p> tag. This suggests a main loop of:

while ($html =~ m{<!-- itemtemplate -->(.*?)</p>}gs) {
  $chunk = $1;
  # extract URL, title, and summary from $chunk
}
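
To try the main loop without fetching anything, you can run it over a hand-made string. The following is a minimal sketch: the sample in $html and the @chunks array are made up for illustration, and only the pattern comes from the loop above.

```perl
use strict;
use warnings;

# Hypothetical sample: two items laid out like the stories on the page.
my $html = <<'END_HTML';
<!-- itemtemplate -->
<p class="medlist"><b><a href="/a/one.html">First story</a></b>&nbsp;
 Summary of the first story. &nbsp;[<a href="/x/">X DevCenter</a>]</p>
<!-- itemtemplate -->
<p class="medlist"><b><a href="/a/two.html">Second story</a></b>&nbsp;
 Summary of the second story. &nbsp;[<a href="/y/">Y DevCenter</a>]</p>
END_HTML

my @chunks;
while ($html =~ m{<!-- itemtemplate -->(.*?)</p>}gs) {
  push @chunks, $1;   # text between the comment and the next </p>
}

printf "%d chunks found\n", scalar @chunks;
```

The /g flag makes each pass through the loop resume where the previous match ended, and the non-greedy (.*?) stops at the first </p> rather than the last one on the page.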

It's surprisingly common to see HTML comments indicating the structure of the HTML. Most dynamic web sites are generated from templates, and the comments help the people who maintain the templates keep track of the various sections.

Extracting the URL, title, and summary is straightforward. It's even a simple matter to use the standard Text::Wrap module to reformat the summary to make it easy to read:

use Text::Wrap;

while ($html =~ m{<!-- itemtemplate -->(.*?)</p>}gs) {
  $chunk = $1;
  ($URL, $title, $summary) =
     $chunk =~ m{href="(.*?)">(.*?)</a></b>\s*&nbsp;\s*(.*?)\[}i
     or next;
  $summary =~ s{&nbsp;}{ }g;
  print "$URL\n$title\n", wrap("  ", "  ", $summary), "\n\n";
}

Running this, however, shows HTML still in the summary. Remove the tags with:

$summary =~ s{<.*?>}{}sg;
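
A quick way to check the tag-stripping substitution is to run it on a small string. The text in $summary below is made up for this sketch. The /s flag matters because a tag can span a line break. Treat the pattern as a heuristic: a > inside a quoted attribute value would end the match early.

```perl
use strict;
use warnings;

# Made-up summary; the <a ...> tag deliberately spans a line break.
my $summary = qq{Read the <a\nhref="/pub/a/x.html">full <i>story</i></a> today.};

$summary =~ s{<.*?>}{}sg;   # /s lets . match the newline inside <a ...>

print "$summary\n";         # Read the full story today.
```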

The complete program is shown in Example 6-3.

Example 6-3. orn-summary

#!/usr/bin/perl -w

use LWP::Simple;
use Text::Wrap;

$html = get("http://www.oreillynet.com/")
  or die "Couldn't fetch http://www.oreillynet.com/\n";

while ($html =~ m{<!-- itemtemplate -->(.*?)</p>}gs) {
  $chunk = $1;
  ($URL, $title, $summary) =
     $chunk =~ m{href="(.*?)">(.*?)</a></b>\s*&nbsp;\s*(.*?)\[}i
     or next;
  $summary =~ s{&nbsp;}{ }g;
  $summary =~ s{<.*?>}{}sg;
  print "$URL\n$title\n", wrap("  ", "  ", $summary), "\n\n";
}
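
Because the page layout can change at any time, it can help to keep the extraction separate from the download so that you can test it against saved HTML. The following is one possible sketch, not part of the example above: extract_stories is a hypothetical helper name, and the sample page in $sample is made up. The two patterns and the two substitutions are the same ones used in Example 6-3.

```perl
use strict;
use warnings;

# Hypothetical helper: takes page HTML, returns one hash ref per story.
sub extract_stories {
  my ($html) = @_;
  my @stories;
  while ($html =~ m{<!-- itemtemplate -->(.*?)</p>}gs) {
    my $chunk = $1;
    my ($url, $title, $summary) =
      $chunk =~ m{href="(.*?)">(.*?)</a></b>\s*&nbsp;\s*(.*?)\[}i
      or next;                   # skip chunks that don't fit the layout
    $summary =~ s{&nbsp;}{ }g;   # entity to plain space
    $summary =~ s{<.*?>}{}sg;    # drop any remaining tags
    push @stories, { url => $url, title => $title, summary => $summary };
  }
  return @stories;
}

# Made-up page: one well-formed item and one with no link at all.
my $sample = <<'END_HTML';
<!-- itemtemplate -->
<p class="medlist"><b><a href="/a/one.html">First story</a></b>&nbsp;
 A <i>short</i> summary. &nbsp;[<a href="/x/">X DevCenter</a>]</p>
<!-- itemtemplate -->
<p class="medlist">No link in this one.</p>
END_HTML

my @stories = extract_stories($sample);
print "$_->{url}\n$_->{title}\n  $_->{summary}\n\n" for @stories;
```

The second item is dropped by the "or next", which is the behavior you want when a chunk doesn't match the expected layout. To use the helper on the live page, pass it the string returned by get().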



Copyright © 2002 O'Reilly & Associates. All rights reserved.