Chapter 4. URLs

Now that you've seen how LWP models HTTP requests and responses, let's study the facilities it provides for working with URLs. A URL tells you how to get to something: "use HTTP with this host and request this," "connect via FTP to this host and retrieve this file," or "send email to this address."

The great variety inherent in URLs is both a blessing and a curse. On one hand, you can stretch the URL syntax to address almost any type of network resource. On the other hand, this very flexibility means that attempts to parse arbitrary URLs with regular expressions rapidly run into a quagmire of special cases.

The LWP suite of modules provides the URI class to manage URLs. This chapter describes how to create objects that represent URLs, extract information from those objects, and convert between absolute and relative URLs. This last task is particularly useful for link checkers and spiders, which take partial URLs from HTML links and turn those into absolute URLs to request.

4.1. Parsing URLs

Rather than attempt to pull apart URLs with regular expressions, which is difficult to do in a way that works with all the many types of URLs, you should use the URI class. When you create an object representing a URL, it has attributes for each part of a URL (scheme, username, hostname, port, etc.). Make method calls to get and set these attributes.

Example 4-1 creates a URI object representing a complex URL, then calls methods to discover the various components of the URL.

Example 4-1. Decomposing a URL

    use URI;
    my $url = URI->new('http://user:pass@example.int:4345/hello.php?user=12');
    print "Scheme: ",   $url->scheme( ),   "\n";
    print "Userinfo: ", $url->userinfo( ), "\n";
    print "Hostname: ", $url->host( ),     "\n";
    print "Port: ",     $url->port( ),     "\n";
    print "Path: ",     $url->path( ),     "\n";
    print "Query: ",    $url->query( ),    "\n";

Example 4-1 prints:

    Scheme: http
    Userinfo: user:pass
    Hostname: example.int
    Port: 4345
    Path: /hello.php
    Query: user=12

Besides reading the parts of a URL, methods such as host( ) can also alter the parts of a URL, using the familiar convention that $object->method reads an attribute's value and $object->method(newvalue) alters an attribute:

    use URI;
    my $uri = URI->new("http://www.perl.com/I/like/pie.html");
    $uri->host('testing.perl.com');
    print $uri,"\n";

    http://testing.perl.com/I/like/pie.html

Now let's look at the methods in more depth.

4.1.1. Constructors

An object of the URI class represents a URL. (Actually, a URI object can also represent a kind of URL-like string called a URN, but you're unlikely to run into one of those any time soon.) To create a URI object from a string containing a URL, use the new( ) constructor:

    $url = URI->new(url [, scheme ]);

If url is a relative URL (a fragment such as staff/alicia.html), scheme determines the scheme you plan for this URL to have (http, ftp, etc.). But in most cases, you call URI->new only when you know you won't have a relative URL; for relative URLs or URLs that just might be relative, use the URI->new_abs method, discussed below.

The URI module strips surrounding quotes, angle brackets, and whitespace from the new URL. So these statements all create identical URI objects:

    $url = URI->new('<http://www.oreilly.com/>');
    $url = URI->new('"http://www.oreilly.com/"');
    $url = URI->new(' http://www.oreilly.com/');
    $url = URI->new('http://www.oreilly.com/ ');

The URI class automatically escapes any characters that the URL standard (RFC 2396) says can't appear in a URL.
So these two are equivalent:

    $url = URI->new('http://www.oreilly.com/bad page');
    $url = URI->new('http://www.oreilly.com/bad%20page');

If you already have a URI object, the clone( ) method will produce another URI object with identical attributes:

    $copy = $url->clone( );

Example 4-2 clones a URI object and changes an attribute.

Example 4-2. Cloning a URI

    use URI;
    my $url = URI->new('http://www.oreilly.com/catalog/');
    my $dup = $url->clone( );
    $url->path('/weblogs');
    print "Changed path: ", $url->path( ), "\n";
    print "Original path: ", $dup->path( ), "\n";

When run, Example 4-2 prints:

    Changed path: /weblogs
    Original path: /catalog/

4.1.2. Output

Treat a URI object as a string and you'll get the URL:

    $url = URI->new('http://www.example.int');
    $url->path('/search.cgi');
    print "The URL is now: $url\n";

    The URL is now: http://www.example.int/search.cgi

You might find it useful to normalize the URL before printing it:

    $url = $url->canonical( );

The canonical( ) method returns the normalized URL rather than changing the object in place. Exactly what normalization does depends on the specific type of URL, but it typically converts the hostname to lowercase, removes the port if it's the default port (for example, http://www.eXample.int:80 becomes http://www.example.int), makes escape sequences uppercase (e.g., %2e becomes %2E), and unescapes characters that don't need to be escaped (e.g., %41 becomes A).

In Chapter 12, "Spiders", we'll walk through a program that harvests data but avoids harvesting the same URL more than once. It keeps track of the URLs it's visited in a hash called %seen_url_before; if there's an entry for a given URL, it's been harvested. The trick is to call canonical on all URLs before entering them into that hash and before checking whether one exists in that hash. If not for calling canonical, you might have visited http://www.example.int:80 in the past, and might be planning to visit http://www.EXample.int, and you would see no duplication there.
But when you call canonical on both, they both become http://www.example.int, so you can tell you'd be harvesting the same URL twice. If you think such duplication problems might arise in your programs, call canonical as soon as you construct the URL, like so:

    $url = URI->new('http://www.example.int')->canonical;

4.1.3. Comparison

To compare two URLs, use the eq( ) method:

    if ($url_one->eq($url_two)) { ... }

For example:

    use URI;
    my $url_one = URI->new('http://www.example.int');
    my $url_two = URI->new('http://www.example.int/search.cgi');
    $url_one->path('/search.cgi');
    if ($url_one->eq($url_two)) {
      print "The two URLs are equal.\n";
    }

    The two URLs are equal.

Two URLs are equal if they are represented by the same string when normalized. The eq( ) method is faster than the eq string operator:

    if ($url_one eq $url_two) { ... } # inefficient!

The eq operator also compares the URLs as written, without normalizing them first, so it can report a difference between two URLs that eq( ) treats as equal.

To see if two values refer not just to the same URL, but to the same URI object, use the == operator:

    if ($url_one == $url_two) { ... }

For example:

    use URI;
    my $url = URI->new('http://www.example.int');
    my $that_one = $url;
    if ($that_one == $url) {
      print "Same object.\n";
    }

    Same object.

4.1.4. Components of a URL

A generic URL looks like Figure 4-1.

Figure 4-1. Components of a URL

The URI class provides methods to access each component. Some components are available only on some schemes (for example, mailto: URLs do not support the userinfo, host, or port components). In addition to the obvious scheme( ), userinfo( ), host( ), port( ), path( ), query( ), and fragment( ) methods, there are some useful but less-intuitive ones.
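For example, path_segments( ) returns the path as a list of its segments:

    use URI;
    my $url = URI->new('http://www.example.int/eye/sea/ewe.cgi');
    my @bits = $url->path_segments( );
    for (my $i = 0; $i < @bits; $i++) {
      print "$i {$bits[$i]}\n";
    }
    print "\n\n";

    0 {}
    1 {eye}
    2 {sea}
    3 {ewe.cgi}

The first segment is empty because the path begins with a slash.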
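As a minimal sketch of putting path_segments( ) to work, the following picks out the last segment of a URL's path, which is often a filename. The helper name last_segment is hypothetical (it is not part of the URI class), and the sketch assumes a URL type that has a path, such as http:.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper, not part of URI: return the last path segment
# of a URL, or an empty string if the path is empty or ends in a slash.
sub last_segment {
    my ($url_string) = @_;
    my $url  = URI->new($url_string);
    my @bits = $url->path_segments;    # list context: one item per segment
    return @bits ? $bits[-1] : '';
}

print last_segment('http://www.example.int/eye/sea/ewe.cgi'), "\n";   # ewe.cgi
```

A path that ends in a slash, such as /eye/sea/, has an empty final segment, so the helper returns an empty string for it.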
For a URL that simply lacks one of those parts, the method for that part generally returns undef:

    use URI;
    my $uri = URI->new("http://stuff.int/things.html");
    my $query = $uri->query;
    print defined($query) ? "Query: <$query>\n" : "No query\n";

    No query

However, some kinds of URLs can't have certain components. For example, a mailto: URL doesn't have a host component, so code that calls host( ) on a mailto: URL will die. For example:

    use URI;
    my $uri = URI->new('mailto:hey-you@mail.int');
    print $uri->host;

    Can't locate object method "host" via package "URI::mailto"

This has real-world implications. Consider extracting all the URLs in a document and going through them like this:

    foreach my $url (@urls) {
      $url = URI->new($url);
      my $hostname = $url->host;
      next if $Hosts_to_ignore{$hostname};   # skip hosts we want to ignore
      ...otherwise ...
    }

This will die on a mailto: URL, which doesn't have a host( ) method. You can avoid this by using can( ) to see if you can call a given method:

    foreach my $url (@urls) {
      $url = URI->new($url);
      next unless $url->can('host');
      my $hostname = $url->host;
      ...

or a bit less directly:

    foreach my $url (@urls) {
      $url = URI->new($url);
      unless('http' eq $url->scheme) {
        print "Odd, $url is not an http url! Skipping.\n";
        next;
      }
      my $hostname = $url->host;
      ...and so forth...

Because all URIs offer a scheme method, and all http: URIs provide a host( ) method, this is assuredly safe. For the curious, the documentation for the URI class, as well as the documentation for some specific subclasses like URI::ldap, explains which components each URI scheme supports.
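The can( ) check described above can be wrapped in a small function so that the rest of a program never calls host( ) on a URL type that lacks it. This is only a sketch: the helper name host_of and the sample URLs are made up for illustration, and lowercasing the hostname is a choice of this sketch, not something the URI class does for you.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper, not part of URI: return the lowercased hostname
# of a URL string, or undef if this kind of URL has no host component.
sub host_of {
    my ($url_string) = @_;
    my $url = URI->new($url_string);
    return unless $url->can('host');    # e.g., mailto: URLs have no host( )
    my $host = $url->host;
    return defined $host ? lc $host : undef;
}

# Sample URLs, invented for this sketch.
my @urls = (
    'http://www.example.int/search.cgi',
    'mailto:hey-you@mail.int',
    'ftp://ftp.example.int/pub/',
);

foreach my $u (@urls) {
    my $host = host_of($u);
    print defined $host ? "$u -> $host\n" : "$u has no host\n";
}
```

Returning undef instead of dying lets the caller decide how to treat URLs without a host, for instance by skipping them with next.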
4.1.5. Queries

The URI class has two methods for dealing with query data above and beyond the query( ) and path_query( ) methods we've already discussed.

In the very early days of the web, queries were simply text strings. Spaces were encoded as plus (+) characters:

    http://www.example.int/search?i+like+pie

The query_keywords( ) method works with these types of queries, accepting and returning a list of keywords:

    @words = $url->query_keywords([keywords, ...]);

For example:

    use URI;
    my $url = URI->new('http://www.example.int/search?i+like+pie');
    my @words = $url->query_keywords( );
    print $words[-1], "\n";

    pie

More modern queries accept a list of named values. A name and its value are separated by an equals sign (=), and such pairs are separated from each other with ampersands (&):

    http://www.example.int/search?food=pie&action=like

The query_form( ) method lets you treat each such query as a list of keys and values:

    @params = $url->query_form([key,value,...]);

For example:

    use URI;
    my $url = URI->new('http://www.example.int/search?food=pie&action=like');
    my @params = $url->query_form( );
    for (my $i = 0; $i < @params; $i++) {
      print "$i {$params[$i]}\n";
    }

    0 {food}
    1 {pie}
    2 {action}
    3 {like}

Copyright © 2002 O'Reilly & Associates. All rights reserved.