<!-- New titles -->
<h3>New Titles</h3>
<ul><li><a href="netwinformian/">.NET Windows Forms in a
Nutshell</a> <em>(March)</em></li><li><a href="actscrptpr/">
ActionScript for Flash MX Pocket Reference</a> <em>(March)</em>
</li><li><a href="abcancer/">After Breast Cancer</a> <em>(March)
...
<li><a href="samba2/">Using Samba, 2nd Edition</a> <em>(February)
</em></li><li><a href="vbscriptian2/">VBScript in a Nutshell, 2nd
Edition</a> <em>(March)</em></li><li><a href="tpj2/">Web, Graphics
& Perl/Tk</a> <em>(March)</em></li></ul></td>
<td valign="top">
<!-- Upcoming titles -->
In fact, it's even uglier than this at the time of this
writing: there are no newlines in the list of new books, so it's
all on one long line. Fortunately, this turns out to be comparatively
simple to match. First we extract the HTML for the new titles, and
then we extract the individual book links, using the <li> list-item
tags to anchor the regular expression:
($new_titles) = $html =~ m{<!-- New titles -->(.*?)<!-- Upcoming titles -->}s
    or die "Couldn't find new titles HTML";
# match against the extracted fragment, not $_
while ($new_titles =~ m{<li>      # list item
                        <a\ href="
                        ([^\"]+)  # link to book = $1 = everything to next quote
                        \">
                        ([^<]+)   # book title = $2 = everything up to </a>
                        </a>\ <em>\(
                        ([^)]+)   # month = $3 = everything in the parentheses
                       }gx) {
    printf("%-10s%s\n", $3, $2);  # could use $1 if we wanted
}
March     .NET Windows Forms in a Nutshell
March     ActionScript for Flash MX Pocket Reference
March     After Breast Cancer
...
February  Using Samba, 2nd Edition
March     VBScript in a Nutshell, 2nd Edition
March     Web, Graphics & Perl/Tk
Regular expressions are difficult for this problem because they force
you to work at the level of characters. The CPAN module
HTML::TokeParser treats your HTML file as a series of HTML-y things:
starting tags, closing tags, text, comments, etc. It decodes entities
for you automatically, so you don't have to worry about converting
&amp; back into & in
your code.
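As a rough sketch of what token-level parsing might look like for the new-titles list, the following assumes HTML::TokeParser is installed from CPAN and works on a shortened, hypothetical copy of the page held in $html. The helper name extract_titles is illustrative only and is not part of the module.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical, shortened copy of the page fragment shown earlier.
my $html = <<'END_HTML';
<!-- New titles -->
<h3>New Titles</h3>
<ul><li><a href="samba2/">Using Samba, 2nd Edition</a> <em>(February)
</em></li><li><a href="tpj2/">Web, Graphics
&amp; Perl/Tk</a> <em>(March)</em></li></ul>
<!-- Upcoming titles -->
END_HTML

# Return a list of [month, title, link] triples, one per <li> entry.
sub extract_titles {
    my ($doc) = @_;
    my $parser = HTML::TokeParser->new(\$doc)
        or die "Couldn't create parser";
    my @books;
    while ($parser->get_tag("li")) {
        my $anchor = $parser->get_tag("a") or last;
        my $link   = $anchor->[1]{href};                # attribute hash
        my $title  = $parser->get_trimmed_text("/a");   # entities are decoded
        $parser->get_tag("em") or last;
        my $month  = $parser->get_trimmed_text("/em");
        $month =~ s/^\(|\)$//g;                         # strip the parentheses
        push @books, [ $month, $title, $link ];
    }
    return @books;
}

printf("%-10s%s\n", $_->[0], $_->[1]) for extract_titles($html);
```

Because get_trimmed_text collapses runs of whitespace, the line break inside the second title does not appear in the result, and &amp; comes back as a plain ampersand.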
Starting tags have four more values in the token array: the tag name
(in lowercase), a reference to a hash of attributes (lowercased
attribute name as key), a reference to an array containing lowercased
attribute names in the order they appeared in the tag, and a string
containing the opening tag as it appeared in the text of the
document. Parsing the following HTML:
<IMg SRc="/perl6.jpg" ALT="Steroidal Camel">
creates a token like this:
[ 'S',
  'img',
  { "src" => "/perl6.jpg",
    "alt" => "Steroidal Camel"
  },
  [ "src", "alt" ],
  '<IMg SRc="/perl6.jpg" ALT="Steroidal Camel">'
]
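To see that structure for yourself, a minimal sketch along these lines should work, assuming HTML::TokeParser is installed. The helper first_token is a hypothetical name introduced here for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the first token of an HTML fragment
# as an array reference.
sub first_token {
    my ($fragment) = @_;
    my $parser = HTML::TokeParser->new(\$fragment)
        or die "Couldn't create parser";
    return $parser->get_token;
}

my $token = first_token('<IMg SRc="/perl6.jpg" ALT="Steroidal Camel">');
my ($type, $tag, $attr, $attrseq, $raw) = @$token;

print "type: $type\n";                        # token type, "S" for a start tag
print "tag:  $tag\n";                         # lowercased tag name
print "$_ => $attr->{$_}\n" for @$attrseq;    # attributes in source order
print "raw:  $raw\n";                         # tag text as written in the document
```

Walking @$attrseq rather than keys %$attr preserves the order in which the attributes were written, which a hash alone cannot do.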
Since ending tags have fewer possibilities than opening tags, it
follows that their tokens have a simpler structure. A token for an
end tag contains "E" (identifying it as an end
tag), the lowercased name of the tag being closed (e.g.,
"body"), and the tag as it appeared in the source
(e.g., "</BODY>").
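The three-element shape of end-tag tokens can be checked with a short loop. This is only a sketch, assuming HTML::TokeParser is installed. The helper end_tags and the sample markup are hypothetical.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return [name, raw] pairs for every end tag,
# in the order the tags appear in the document.
sub end_tags {
    my ($doc) = @_;
    my $parser = HTML::TokeParser->new(\$doc)
        or die "Couldn't create parser";
    my @ends;
    while (my $token = $parser->get_token) {
        next unless $token->[0] eq "E";      # skip everything but end tags
        my (undef, $name, $raw) = @$token;   # type, lowercased name, source text
        push @ends, [ $name, $raw ];
    }
    return @ends;
}

printf("%-6s%s\n", @$_) for end_tags('<HTML><BODY><p>Hi</P></BODY></HTML>');
```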
A token for text has three values: "T" (to
identify it as a text token), the text, and a flag that is true when
the text is literal data that must not be entity-decoded (decode
entities only if this flag is false).
use HTML::Entities qw(decode_entities);
if ($token->[0] eq "T") {
    $text = $token->[1];
    decode_entities($text) unless $token->[2];
    # do something with $text
}
Even simpler, a comment token contains only "C"
(to indicate that it is a comment) followed by the comment text.
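Putting the four token types together, a dispatch loop might look like the following sketch. It assumes HTML::TokeParser and HTML::Entities are installed. The helper classify_tokens and its sample input are hypothetical, and other token types such as declarations are simply ignored.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: return [kind, value] pairs for the start-tag,
# end-tag, text, and comment tokens in a document.
sub classify_tokens {
    my ($doc) = @_;
    my $parser = HTML::TokeParser->new(\$doc)
        or die "Couldn't create parser";
    my @seen;
    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq "S") {
            push @seen, [ start => $token->[1] ];
        }
        elsif ($type eq "E") {
            push @seen, [ end => $token->[1] ];
        }
        elsif ($type eq "T") {
            my $text = $token->[1];
            decode_entities($text) unless $token->[2];
            push @seen, [ text => $text ] if $text =~ /\S/;   # skip blank text
        }
        elsif ($type eq "C") {
            push @seen, [ comment => $token->[1] ];
        }
    }
    return @seen;
}

printf("%-9s%s\n", @$_)
    for classify_tokens('<!-- New titles --><h3>Web, Graphics &amp; Perl/Tk</h3>');
```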
For a detailed introduction to parsing with tokens, see
Perl & LWP by Sean Burke (O'Reilly).