6.6. Example: Extracting Links from Arbitrary HTML

Suppose that the links we want to check are in a remote HTML file that's not quite as rigidly formatted as my local bookmark file. Suppose, in fact, that a representative section looks like this:

    <p>Dear Diary, <br>I was listening to <a href="http://www.freshair.com">Fresh
    Air</a> the other day and they had <a href
    ="http://www.cs.Helsinki.FI/u/torvalds/">Linus Torvalds</a> on, and he
    was going on about how he wrote some kinda <a
    href="http://www.linux.org/">program</a> or something. If he's so smart,
    why didn't he write something useful, like <a
    href="why_I_love_tetris.html">Tetris</a> or <a href="../minesweeper_hints/"
    >Minesweeper</a>, huh?

In the case of the bookmarks, we noted that links were each alone on a line, all absolute, and each capturable with m/ HREF="([^"\s]+)" /. But none of those things are true here! Some links (such as href="why_I_love_tetris.html") are relative, some lines have more than one link in them, and one link even has a newline between its href attribute name and its ="..." attribute value.

Regexps are still usable, though; it's just a matter of applying them to a whole document (instead of to individual lines) and also making the regexp a bit more permissive:
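A minimal sketch of such a loop might look like the following. It is an illustration, not necessarily the pattern used in Example 6-5: the $html variable and its sample contents are hypothetical stand-ins for the fetched document, and the regexp simply allows optional whitespace (including newlines) around the equals sign and ignores case.

```perl
use strict;
use warnings;

# Hypothetical sample input; a real program would use the fetched page here.
my $html = <<'END_HTML';
<p>I like <a href="http://www.freshair.com">Fresh Air</a> and <a href
="http://www.linux.org/">Linux</a> and
<a HREF = "why_I_love_tetris.html" >Tetris</a>.
END_HTML

my @links;
# /g resumes after the previous match; /i accepts HREF as well as href.
# \s* permits spaces or newlines on either side of the equals sign.
while ( $html =~ m/href\s*=\s*"([^"\s]+)"/gi ) {
    push @links, $1;    # $1 holds the captured attribute value
}

print "$_\n" for @links;
```

Because the pattern is applied to the whole string rather than line by line, the link whose href and ="..." sit on different lines is matched like any other.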
(The /g modifier ("g" originally for "globally") on the regexp tries to match the pattern as many times as it can, each time picking up where the last match left off.)

Example 6-5 shows this basic idea fleshed out to include support for fetching a remote document, matching each link in it, making each absolute, and calling a checker routine (currently a placeholder) on it.

Example 6-5. diary-link-checker
When run, this prints:

    I should check http://www.freshair.com/
    I should check http://www.cs.Helsinki.FI/u/torvalds/
    I should check http://www.linux.org/
    I should check http://chichi.diaries.int/stuff/why_I_love_tetris.html
    I should check http://chichi.diaries.int/minesweeper_hints/

So our while (regexp) loop is indeed successfully matching all five links in the document. (Note that our absolutize routine is correctly making the URLs absolute, turning why_I_love_tetris.html into http://chichi.diaries.int/stuff/why_I_love_tetris.html and ../minesweeper_hints/ into http://chichi.diaries.int/minesweeper_hints/ by using the URI class that we explained in Chapter 4, "URLs".)

Now that we're satisfied that our program is matching and absolutizing links correctly, we can drop in the check_url routine from Example 6-4, and it will actually check the URLs that our placeholder check_url routine promised we'd check.
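To make the absolutizing step concrete, here is a small sketch of how a routine like absolutize could be written with the URI class. It is an assumption-laden illustration rather than the listing from Example 6-5: the base URL's file name (diary.html) is hypothetical, chosen only so that relative links resolve under http://chichi.diaries.int/stuff/, and the placeholder message mimics the output shown above.

```perl
use strict;
use warnings;
use URI;

# Assumed base URL; the file name diary.html is hypothetical.
my $base = 'http://chichi.diaries.int/stuff/diary.html';

# Resolve a possibly relative URL against a base URL and normalize it.
sub absolutize {
    my ($url, $base) = @_;
    return URI->new_abs($url, $base)->canonical;
}

my @absolute = map { absolutize($_, $base)->as_string }
    ( 'why_I_love_tetris.html', '../minesweeper_hints/', 'http://www.linux.org/' );

# Stand-in for a real checker routine.
print "I should check $_\n" for @absolute;
```

A URL that is already absolute passes through new_abs unchanged, so the same routine can be applied to every captured link without first testing whether it is relative.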
Copyright © 2002 O'Reilly & Associates. All rights reserved.