20.3. Extracting URLs

Problem

You want to extract all URLs from an HTML file.

Solution

Use the HTML::LinkExtor module from CPAN:

    use HTML::LinkExtor;

    $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse_file($filename);
    @links = $parser->links;
    foreach $linkarray (@links) {
        my @element  = @$linkarray;
        my $elt_type = shift @element;    # element type

        # possibly test whether this is an element we're interested in
        while (@element) {
            # extract the next attribute and its value
            my ($attr_name, $attr_value) = splice(@element, 0, 2);
            # ... do something with them ...
        }
    }

Discussion
You can use HTML::LinkExtor in two different ways: either to call
The HTML

    <A HREF="http://www.perl.com/">Home page</A>
    <IMG SRC="images/big.gif" LOWSRC="images/big-lowres.gif">

would return a data structure like this:

    [
      [ a,   href => "http://www.perl.com/" ],
      [ img, src  => "images/big.gif",
             lowsrc => "images/big-lowres.gif" ]
    ]
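To make the shape of that structure concrete, here is a minimal sketch that walks it without touching the parser at all. The @links array is a hand-built stand-in for the structure shown above, and flatten_links is a hypothetical helper name introduced only for this illustration; neither comes from the module itself.

```perl
use strict;
use warnings;

# Hand-built stand-in for the structure shown above (an assumption for
# illustration; in a real program it would come from $parser->links).
my @links = (
    [ 'a',   href => 'http://www.perl.com/' ],
    [ 'img', src  => 'images/big.gif', lowsrc => 'images/big-lowres.gif' ],
);

# flatten_links is a hypothetical helper: it turns each inner array into
# one "element attribute value" string per attribute/value pair.
sub flatten_links {
    my @rows;
    for my $linkarray (@_) {
        # First item is the element type; the rest are name/value pairs.
        my ($elt_type, @pairs) = @$linkarray;
        while (my ($attr_name, $attr_value) = splice(@pairs, 0, 2)) {
            push @rows, "$elt_type $attr_name $attr_value";
        }
    }
    return @rows;
}

print "$_\n" for flatten_links(@links);
```

Each inner array yields as many rows as it has attribute pairs, so the img element contributes two rows, one for src and one for lowsrc.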
Here's an example of how you would use the $elt_type and $attr_name variables:

    if ($elt_type eq 'a' && $attr_name eq 'href') {
        print "ANCHOR: $attr_value\n"
            if $attr_value->scheme =~ /http|ftp/;
    }
    if ($elt_type eq 'img' && $attr_name eq 'src') {
        print "IMAGE:  $attr_value\n";
    }

Example 20.2 is a complete program that takes as its argument a URL, like file:///tmp/testing.html or http://www.ora.com/, and produces on standard output an alphabetically sorted list of unique URLs.

Example 20.2: xurl

    #!/usr/bin/perl -w
    # xurl - extract unique, sorted list of links from URL
    use HTML::LinkExtor;
    use LWP::Simple;

    $base_url = shift;
    $parser = HTML::LinkExtor->new(undef, $base_url);
    # fetch the document and feed it to the parser in one go
    $parser->parse(get($base_url))->eof;
    @links = $parser->links;
    foreach $linkarray (@links) {
        my @element  = @$linkarray;
        my $elt_type = shift @element;
        while (@element) {
            my ($attr_name, $attr_value) = splice(@element, 0, 2);
            $seen{$attr_value}++;    # hash keys keep each URL only once
        }
    }
    for (sort keys %seen) { print $_, "\n" }
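The uniqueness and sorting in Example 20.2 both come from the %seen hash. As a rough sketch of that one step in isolation, the snippet below applies the same idiom to a hard-coded list. The unique_sorted helper and the sample URLs in @found are assumptions made up for this illustration; they are not part of xurl.

```perl
use strict;
use warnings;

# unique_sorted is a hypothetical helper: it counts each value in a hash,
# so duplicates collapse into one key, then returns the keys sorted.
sub unique_sorted {
    my %seen;
    $seen{$_}++ for @_;
    my @sorted = sort keys %seen;
    return @sorted;
}

# Sample values invented for the sketch; one URL appears twice.
my @found = (
    'http://www.perl.com/',
    'http://www.perl.com/images/big.gif',
    'http://www.perl.com/',
    'ftp://ftp.example.com/pub/',
);

print "$_\n" for unique_sorted(@found);
```

Because the keys are compared as plain strings, the order is the default string sort order, which is what the sort keys %seen loop in xurl uses as well.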
This program does have a limitation: if the

Here's an example of the run:

    % xurl http://www.perl.com/CPAN

Often in mail or Usenet messages, you'll see URLs written as:

    <URL:http://www.perl.com>

This is supposed to make it easy to pick URLs from messages:

    @URLs = ($message =~ /<URL:(.*?)>/g);

See Also

The documentation for the CPAN modules LWP::Simple, HTML::LinkExtor, and HTML::Entities; Recipe 20.1