<A HREF="http://www.perl.com/">Home page</A>
<IMG SRC="images/big.gif" LOWSRC="images/big-lowres.gif">
would return a data structure like this:
[
    [ 'a',   href   => "http://www.perl.com/" ],
    [ 'img', src    => "images/big.gif",
             lowsrc => "images/big-lowres.gif" ]
]
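As a minimal sketch of where that structure comes from, the following feeds the two tags above to HTML::LinkExtor and prints each element's attribute pairs. The heredoc and the variable names are illustrative only. No base URL is passed here, so the attribute values come back as plain, unresolved strings.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# The sample HTML from above, held in a string for illustration.
my $html = <<'END_HTML';
<A HREF="http://www.perl.com/">Home page</A>
<IMG SRC="images/big.gif" LOWSRC="images/big-lowres.gif">
END_HTML

# No callback and no base URL: links are collected internally
# and attribute values are left as written in the document.
my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;
my @links = $parser->links;

# Each entry is [ element_name, attr_name => attr_value, ... ].
for my $linkarray (@links) {
    my ($elt_type, @pairs) = @$linkarray;
    print "$elt_type: @pairs\n";
}
```

Element and attribute names are reported in lowercase regardless of how they were written in the HTML.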
Here's an example of how to use $elt_type and
$attr_name to print out anchors and images:
if ($elt_type eq 'a' && $attr_name eq 'href') {
    print "ANCHOR: $attr_value\n"
        if $attr_value->scheme =~ /^(?:https?|ftp)$/;
}
if ($elt_type eq 'img' && $attr_name eq 'src') {
    print "IMAGE: $attr_value\n";
}
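The fragment above assumes a surrounding loop that supplies $elt_type, $attr_name, and $attr_value. One way to write that loop is sketched here against a hard-coded @links array (hypothetical data standing in for what the parser's links method returns). Because these values are plain strings rather than URL objects, the sketch checks the scheme with a pattern match instead of calling a scheme method; the @report array is likewise an illustrative name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the list returned by $parser->links.
my @links = (
    [ 'a',   href => 'http://www.perl.com/' ],
    [ 'a',   href => 'mailto:webmaster@example.com' ],
    [ 'img', src  => 'images/big.gif', lowsrc => 'images/big-lowres.gif' ],
);

my @report;
foreach my $linkarray (@links) {
    my ($elt_type, @pairs) = @$linkarray;
    # Peel off attribute name/value pairs two at a time.
    while (my ($attr_name, $attr_value) = splice(@pairs, 0, 2)) {
        if ($elt_type eq 'a' && $attr_name eq 'href') {
            # Plain strings have no scheme method, so match the prefix.
            push @report, "ANCHOR: $attr_value"
                if $attr_value =~ m{^(?:https?|ftp):}i;
        }
        if ($elt_type eq 'img' && $attr_name eq 'src') {
            push @report, "IMAGE: $attr_value";
        }
    }
}
print "$_\n" for @report;
```

The mailto link and the lowsrc attribute are skipped, leaving one anchor line and one image line.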
To extract links only to MP3 files, you'd say:
foreach my $linkarray (@links) {
    my ($elt_type, %attrs) = @$linkarray;
    if ($elt_type eq 'a' && $attrs{'href'} =~ /\.mp3$/i) {
        # do something with $attrs{'href'}, the URL of the mp3 file
    }
}
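To make the "do something" step concrete, the same test can be wrapped in a small helper that collects the matching URLs. This is only a sketch: the mp3_links name and the sample @links data are invented for illustration and are not part of any module.

```perl
use strict;
use warnings;

# Return the href of every <a> link whose URL ends in .mp3
# (case-insensitive). Takes a list of link arrays as arguments.
sub mp3_links {
    my @found;
    foreach my $linkarray (@_) {
        my ($elt_type, %attrs) = @$linkarray;
        next unless $elt_type eq 'a' && defined $attrs{href};
        push @found, $attrs{href} if $attrs{href} =~ /\.mp3$/i;
    }
    return @found;
}

# Hypothetical sample data in the shape returned by $parser->links.
my @links = (
    [ 'a',   href => 'http://example.com/music/song.MP3' ],
    [ 'a',   href => 'http://example.com/index.html' ],
    [ 'img', src  => 'http://example.com/cover.mp3' ],
);
print "$_\n" for mp3_links(@links);
```

Only the first entry is printed: the second is not an MP3, and the third is an image rather than an anchor.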
Example 20-2. xurl
#!/usr/bin/perl -w
# xurl - extract unique, sorted list of links from URL
use HTML::LinkExtor;
use LWP::Simple;

$base_url = shift or die "usage: $0 URL\n";
$parser = HTML::LinkExtor->new(undef, $base_url);
defined($html = get($base_url)) or die "Couldn't fetch $base_url\n";
$parser->parse($html)->eof;
@links = $parser->links;
foreach $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;           # element name; not needed here
    while (@element) {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        $seen{$attr_value}++;                # hash keys deduplicate URLs
    }
}
for (sort keys %seen) { print $_, "\n" }
This program has one limitation: if fetching $base_url involves a
redirection, relative links are resolved against the original URL
rather than the URL after the redirection. To fix this, fetch the
document with LWP::UserAgent and examine the response to find out
whether a redirection occurred. Once you know the post-redirection
URL (if any), construct the HTML::LinkExtor object accordingly.
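One possible shape for that fix is sketched below. It relies on the response object's base method, which reports the URL the content finally came from (or a base the document itself declares), so no explicit check of the status code is needed. The subroutine names fetch_with_base and unique_links are made up for this sketch, and the network fetch runs only when a URL is given on the command line.

```perl
#!/usr/bin/perl
# Sketch: resolve links against the post-redirection URL.
use strict;
use warnings;
use HTML::LinkExtor;
use LWP::UserAgent;

# Fetch $url, following redirects, and return (content, final base URL).
sub fetch_with_base {
    my ($url) = @_;
    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get($url);
    die "Couldn't fetch $url: ", $response->status_line, "\n"
        unless $response->is_success;
    # base() reflects the last request in a redirect chain,
    # unless the document or headers declare a different base.
    return ($response->decoded_content, $response->base);
}

# Return the unique, sorted link URLs in $html, resolved against $base.
sub unique_links {
    my ($html, $base) = @_;
    my $parser = HTML::LinkExtor->new(undef, $base);
    $parser->parse($html);
    $parser->eof;
    my %seen;
    for my $linkarray ($parser->links) {
        my ($elt_type, %attrs) = @$linkarray;
        $seen{$_}++ for values %attrs;   # URL objects stringify as hash keys
    }
    return sort keys %seen;
}

if (@ARGV) {
    my ($html, $base) = fetch_with_base(shift @ARGV);
    print "$_\n" for unique_links($html, $base);
}
```

Keeping the parsing in its own subroutine means the link extraction can be exercised on a literal HTML string without touching the network.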
Here's an example run:
% xurl http://www.perl.com/CPAN
ftp://ftp@ftp.perl.com/CPAN/CPAN.html
http://language.perl.com/misc/CPAN.cgi
http://language.perl.com/misc/cpan_module
http://language.perl.com/misc/getcpan
http://www.perl.com/index.html
http://www.perl.com/gifs/lcb.xbm