6.6. Example: Extracting Links from Arbitrary HTML

Suppose that the links we want to check are in a remote HTML file that's not quite as rigidly formatted as my local bookmark file. Suppose, in fact, that a representative section looks like this:

    <p>Dear Diary, <br>I was listening to <a href="http://www.freshair.com">Fresh
    Air</a> the other day and they had <a href
    ="http://www.cs.Helsinki.FI/u/torvalds/">Linus Torvalds</a> on, and he was
    going on about how he wrote some kinda <a href="http://www.linux.org/">program</a>
    or something.  If he's so smart, why didn't he write something useful, like
    <a href="why_I_love_tetris.html">Tetris</a> or <a href="../minesweeper_hints/"
    >Minesweeper</a>, huh?

In the case of the bookmarks, we noted that links were each alone on a line, all absolute, and each capturable with m/ HREF="([^"\s]+)" /. But none of those things are true here! Some links (such as href="why_I_love_tetris.html") are relative, some lines have more than one link in them, and one link even has a newline between its href attribute name and its ="..." attribute value.

Regexps are still usable, though; it's just a matter of applying them to a whole document (instead of to individual lines) and also making the regexp a bit more permissive:

    while ( $document =~ m/\s+href\s*=\s*"([^"\s]+)"/gi ) {
      my $url = $1;
      ...
    }

(The /g modifier ("g" originally for "globally") on the regexp tries to match the pattern as many times as it can, each time picking up where the last match left off.)

Example 6-5 shows this basic idea fleshed out to include support for fetching a remote document, matching each link in it, making each absolute, and calling a checker routine (currently a placeholder) on it.

Example 6-5.
diary-link-checker

    #!/usr/bin/perl -w
    # diary-link-checker - check links from diary page

    use strict;
    use LWP;
    use URI;

    my $doc_url = "http://chichi.diaries.int/stuff/diary.html";
    my $document;
    my $browser;
    init_browser( );

    {  # Get the page whose links we want to check:
      my $response = $browser->get($doc_url);
      die "Couldn't get $doc_url: ", $response->status_line
        unless $response->is_success;
      $document = $response->content;
      $doc_url = $response->base;
      # In case we need to resolve relative URLs later
    }

    # Match every href="..." in the whole document, one link per pass:
    while ($document =~ m/href\s*=\s*"([^"\s]+)"/gi) {
      my $absolute_url = absolutize($1, $doc_url);
      check_url($absolute_url);
    }

    sub absolutize {
      my($url, $base) = @_;
      return URI->new_abs($url, $base)->canonical;
    }

    sub init_browser {
      $browser = LWP::UserAgent->new;
      # ...And any other initialization we might need to do...
      return $browser;
    }

    sub check_url {
      # A temporary placeholder...
      print "I should check $_[0]\n";
    }

When run, this prints:

    I should check http://www.freshair.com/
    I should check http://www.cs.helsinki.fi/u/torvalds/
    I should check http://www.linux.org/
    I should check http://chichi.diaries.int/stuff/why_I_love_tetris.html
    I should check http://chichi.diaries.int/minesweeper_hints/

So our while (regexp) loop is indeed successfully matching all five links in the document. (Note that our absolutize routine is correctly making the URLs absolute, turning why_I_love_tetris.html into http://chichi.diaries.int/stuff/why_I_love_tetris.html and ../minesweeper_hints/ into http://chichi.diaries.int/minesweeper_hints/, by using the URI class that we explained in Chapter 4, "URLs". The call to canonical also normalizes each URL, which is why http://www.freshair.com gains a trailing slash and the hostname www.cs.Helsinki.FI is printed in lowercase.)

Now that we're satisfied that our program is matching and absolutizing links correctly, we can drop in the check_url routine from Example 6-4, and it will actually check the URLs that our placeholder check_url routine promised we'd check.

Copyright © 2002 O'Reilly & Associates. All rights reserved.
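The whole-document matching loop can be tried on its own, without any network access. The following is only a minimal sketch: the extract_links name and the $sample string are made up for illustration (they are not part of Example 6-5), and the sample deliberately includes an href split across a newline and an uppercase HREF with spaces around the equals sign.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the example program): return every
# double-quoted href value found in $html, in document order.
sub extract_links {
    my ($html) = @_;
    my @urls;
    # With /g in scalar context, each pass resumes where the previous
    # match ended. /i accepts HREF as well as href, and \s* tolerates
    # whitespace (including a newline) on either side of the "=".
    while ( $html =~ m/href\s*=\s*"([^"\s]+)"/gi ) {
        push @urls, $1;
    }
    return @urls;
}

# Made-up sample text, modeled on the diary page quoted in this section.
my $sample = <<'END_HTML';
<p>Dear Diary, <br>I was listening to <a href="http://www.freshair.com">Fresh
Air</a> the other day and they had <a href
="http://www.cs.Helsinki.FI/u/torvalds/">Linus Torvalds</a> on.
Why didn't he write <a HREF = "why_I_love_tetris.html">Tetris</a>?
END_HTML

print "$_\n" for extract_links($sample);
# prints:
#   http://www.freshair.com
#   http://www.cs.Helsinki.FI/u/torvalds/
#   why_I_love_tetris.html
```

The captured values come back exactly as written in the HTML, so relative links are still relative at this stage; making them absolute is a separate step.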
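The absolutizing step can also be checked in isolation. This sketch reuses the absolutize routine from the example program and feeds it a few of the links from the diary page; the fixed $base value and the list of test links are assumptions chosen for illustration, and the URI module must be installed (it comes along with LWP).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Same routine as in the example program: resolve $url against $base,
# then normalize the result.
sub absolutize {
    my ($url, $base) = @_;
    return URI->new_abs($url, $base)->canonical;
}

# Assumed base URL: the address of the diary page itself.
my $base = "http://chichi.diaries.int/stuff/diary.html";

for my $link ( "why_I_love_tetris.html",
               "../minesweeper_hints/",
               "http://www.linux.org/" ) {
    print "$link => ", absolutize($link, $base), "\n";
}
# prints:
#   why_I_love_tetris.html => http://chichi.diaries.int/stuff/why_I_love_tetris.html
#   ../minesweeper_hints/ => http://chichi.diaries.int/minesweeper_hints/
#   http://www.linux.org/ => http://www.linux.org/
```

A link that is already absolute passes through unchanged apart from normalization, so the same routine can be applied to every captured href without first testing whether it is relative.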