18.1 URL Access

A URL identifies a resource on the Internet. A URL is a string composed of several optional parts, called components: scheme, location, path, query, and fragment. A URL with all its parts looks something like:

scheme://lo.ca.ti.on/pa/th?query#fragment

For example, in http://www.python.org:80/faq.cgi?src=fie, the scheme is http, the location is www.python.org:80, the path is /faq.cgi, the query is src=fie, and there is no fragment. Some of the punctuation characters form a part of one of the components they separate, while others are just separators and are part of no component. Omitting punctuation implies missing components. For example, in mailto:me@you.com, the scheme is mailto, the path is me@you.com, and there is no location, query, or fragment. The missing // means the URL has no location part, the missing ? means it has no query part, and the missing # means it has no fragment part.

18.1.1 The urlparse Module

The urlparse module supplies functions to analyze and synthesize URL strings. In Python 2.2, the most frequently used functions of module urlparse are urljoin, urlsplit, and urlunsplit.

urljoin

urljoin(base_url_string,relative_url_string)

Returns a URL string u, obtained by joining relative_url_string, which may be relative, with base_url_string. The joining procedure that urljoin performs to obtain its result u may be summarized as follows:

When either of the argument strings is empty, u is the other argument.
When relative_url_string explicitly specifies a scheme different from that of base_url_string, u is relative_url_string. Otherwise, u's scheme is that of base_url_string.
When the scheme does not allow relative URLs (e.g., mailto), or relative_url_string explicitly specifies a location (even when it is the same as the location of base_url_string), all other components of u are those of relative_url_string. Otherwise, u's location is that of base_url_string.
u's path is obtained by joining the paths of base_url_string and relative_url_string according to standard syntax for absolute and relative URL paths. For example:
```
import urlparse
urlparse.urljoin(
    'http://somehost.com/some/path/here',
    '../other/path')
# Result is: 'http://somehost.com/some/other/path'
```
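
The joining rules summarized above can be checked one at a time. The following is a minimal sketch: the host names and paths are made up for illustration, and the try/except import is an assumption added so that the snippet also runs on Python 3, where the same functions live in urllib.parse.

```python
try:
    import urlparse                      # Python 2, as used in this chapter
except ImportError:
    import urllib.parse as urlparse      # assumption: Python 3 fallback

base = 'http://somehost.com/some/path/here'   # hypothetical base URL

# An empty argument yields the other argument unchanged.
same_as_base = urlparse.urljoin(base, '')

# A different explicit scheme: the second argument is the result.
other_scheme = urlparse.urljoin(base, 'ftp://ftp.somehost.com/pub')

# An explicit location (leading //) keeps only the scheme of the base.
other_host = urlparse.urljoin(base, '//otherhost.com/index.html')

# An absolute path replaces the base's path; a relative one is joined to it.
absolute_path = urlparse.urljoin(base, '/top.html')
relative_path = urlparse.urljoin(base, 'there')
# absolute_path is 'http://somehost.com/top.html'
# relative_path is 'http://somehost.com/some/path/there'
```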

urlsplit

urlsplit(url_string,default_scheme='',allow_fragments=True)

Analyzes url_string and returns a tuple with five string items: scheme, location, path, query, and fragment. default_scheme is the first item when the url_string lacks a scheme. When allow_fragments is False, the tuple's last item is always '', whether or not url_string has a fragment. Items corresponding to missing parts are always ''. For example:

urlparse.urlsplit(
    'http://www.python.org:80/faq.cgi?src=fie')
# Result is: 
# ('http','www.python.org:80','/faq.cgi','src=fie','')
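
The two optional arguments, and the effect of a missing //, can be seen in a short sketch such as the following. The URLs are illustrative only, and the import fallback is an assumption that lets the snippet run on Python 3 as well. The results are wrapped in tuple( ) so that they compare equal to plain tuples on any version.

```python
try:
    import urlparse                      # Python 2, as used in this chapter
except ImportError:
    import urllib.parse as urlparse      # assumption: Python 3 fallback

# default_scheme fills in the first item only when the URL has no scheme.
no_scheme = tuple(urlparse.urlsplit('//www.python.org/faq.cgi', 'http'))
# no_scheme is ('http', 'www.python.org', '/faq.cgi', '', '')

# With allow_fragments false, the last item is '' and the '#top' text
# is not split off: it stays at the end of the path item.
no_frag = tuple(urlparse.urlsplit('http://www.python.org/faq.cgi#top',
                                  allow_fragments=False))

# A URL without // has an empty location item.
mail = tuple(urlparse.urlsplit('mailto:me@you.com'))
# mail is ('mailto', '', 'me@you.com', '', '')
```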

urlunsplit

urlunsplit(url_tuple)

url_tuple is a tuple with exactly five items, all strings. For example, any return value from a urlsplit call is an acceptable argument for urlunsplit. urlunsplit returns a URL string with the given components and the needed separators, but with no redundant separators (e.g., there is no # in the result when the fragment, url_tuple's last item, is ''). For example:

urlparse.urlunsplit(('http','www.python.org:80',
    '/faq.cgi','src=fie',''))
# Result is: 'http://www.python.org:80/faq.cgi?src=fie'

urlunsplit(urlsplit(x)) returns a normalized form of URL string x, not necessarily equal to x because x need not be normalized. For example:

urlparse.urlunsplit(
    urlparse.urlsplit('http://a.com/path/a?'))
# Result is: 'http://a.com/path/a'

In this case, the normalization ensures that redundant separators, such as the trailing ? in the argument to urlsplit, are not present in the result.
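
A common reason to pair the two functions is to change a single component of a URL. The sketch below is one possible way to do it, not the only one: it turns the split result into a list, rebinds the query and fragment items, and reassembles the URL. The new query and fragment values are made up, and the import fallback is an assumption for Python 3.

```python
try:
    import urlparse                      # Python 2, as used in this chapter
except ImportError:
    import urllib.parse as urlparse      # assumption: Python 3 fallback

parts = list(urlparse.urlsplit('http://www.python.org:80/faq.cgi?src=fie'))
parts[3] = 'src=foo'     # index 3 is the query
parts[4] = 'top'         # index 4 is the fragment

# urlunsplit adds the ? and # separators that the new items require.
new_url = urlparse.urlunsplit(tuple(parts))
# new_url is 'http://www.python.org:80/faq.cgi?src=foo#top'
```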

Module urlparse also supplies functions urlparse and urlunparse. In Python 2.1, urlparse did not supply urlsplit and urlunsplit, so you had to use urlparse and urlunparse instead. urlparse and urlunparse are akin to urlsplit and urlunsplit, but are based on six components rather than five. The parse functions insert a parameters component between path and query using an older standard for URLs, where parameters applied to the entire path. According to the current standard, parameters apply to each part of the path separately. Therefore, the path URL component may now include parameters to subdivide in further phases of the analysis. For example:

urlparse.urlsplit('http://a.com/path;with/some;params?anda=query')
# Result is: ('http','a.com','/path;with/some;params','anda=query','')
urlparse.urlparse('http://a.com/path;with/some;params?anda=query')
# Result is: ('http','a.com','/path;with/some','params','anda=query','')

In this code, urlparse is able to split off the ';params' part of the parameters, but considers the '/path;with/some' substring to be the path. urlsplit considers the entire '/path;with/some;params' to be the path, returned as the third item in the result tuple. Should you then need to separate the 'with' and 'params' parameters parts of the path component, you can perform further string processing on the third item of urlsplit's return tuple, such as splitting on / and then on ;. In practice, very few URLs on the Net make use of parameters, so you may not care about this subtle distinction.
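
The further string processing mentioned above (splitting on / and then on ;) might look like the following sketch. The helper name split_path_params is hypothetical, not part of module urlparse, and the import fallback is an assumption for Python 3.

```python
try:
    import urlparse                      # Python 2, as used in this chapter
except ImportError:
    import urllib.parse as urlparse      # assumption: Python 3 fallback

def split_path_params(path):
    """Return one (segment, [params...]) pair per /-separated path segment."""
    result = []
    for piece in path.split('/'):
        items = piece.split(';')
        result.append((items[0], items[1:]))
    return result

url = 'http://a.com/path;with/some;params?anda=query'
path = urlparse.urlsplit(url)[2]         # third item: the whole path
segments = split_path_params(path)
# segments is [('', []), ('path', ['with']), ('some', ['params'])]
```

The first pair has an empty segment because the path starts with /.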

18.1.2 The urllib Module

The urllib module supplies simple functions to read data from URLs. urllib supports the following protocols (schemes): http, https, ftp, gopher, and file. file indicates a local file. urllib uses file as the default scheme for URLs that lack an explicit scheme. You can find simple, typical examples of urllib use in Chapter 22 and Chapter 23, where urllib.urlopen is used to fetch HTML and XML pages that various examples parse and analyze.

18.1.2.1 Functions

Module urllib supplies a number of functions, with urlopen being the most frequently used.

quote

quote(str,safe='/')

Returns a copy of str where special characters are changed into Internet-standard quoted form %xx. Does not quote alphanumeric characters, any of the characters '_,.-', nor any of the characters in string safe. A space counts as a special character and becomes %20.

quote_plus

quote_plus(str, safe='')

Like quote, but also changes spaces into plus signs, and safe defaults to the empty string, so / is quoted too unless you pass it in safe.

unquote

unquote(str)

Returns a copy of str where each quoted form %xx is changed into the corresponding character.

unquote_plus

unquote_plus(str)

Like unquote, but also changes plus signs into spaces.
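
The four quoting functions can be compared side by side in a small sketch like this one. The sample strings are made up, and the import fallback is an assumption so that the snippet also runs on Python 3, where these functions live in urllib.parse.

```python
try:
    from urllib import quote, quote_plus, unquote, unquote_plus        # Python 2
except ImportError:
    from urllib.parse import quote, quote_plus, unquote, unquote_plus  # assumption: Python 3

path = '/some path/x?y'
quoted = quote(path)              # '/' is safe by default
quoted_all = quote(path, '')      # empty safe string: '/' is quoted too
# quoted is '/some%20path/x%3Fy'
# quoted_all is '%2Fsome%20path%2Fx%3Fy'

field = 'name=A B&C'
plussed = quote_plus(field)       # the space becomes '+'
# plussed is 'name%3DA+B%26C'

# unquote leaves '+' alone; unquote_plus turns it back into a space.
plain = unquote(plussed)          # 'name=A+B&C'
restored = unquote_plus(plussed)  # 'name=A B&C'
```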

urlcleanup

urlcleanup(  )

Clears the cache of function urlretrieve, covered later in this section.

urlencode

urlencode(query,doseq=False)

Returns a string with the URL-encoded form of query. query can be either a sequence of (name, value) pairs, or a mapping, in which case the resulting string encodes the mapping's (key, value) pairs. For example:

urllib.urlencode([('ans',42),('key','val')])
# 'ans=42&key=val'
urllib.urlencode({'ans':42, 'key':'val'})
# 'key=val&ans=42'

Remember that the order of items in a dictionary is not defined: if you need the URL-encoded form to have the key/value pairs in a specific order, use a sequence as the query argument, as in the first call in this example.

When doseq is true, any value in query that is a sequence is encoded as separate parameters, one per item in value. For example:

urllib.urlencode([('K',('x','y','z'))],1)
# 'K=x&K=y&K=z'
urllib.urlencode([('K',('x','y','z'))],0)
# 'K=%28%27x%27%2C+%27y%27%2C+%27z%27%29'

When doseq is false (the default), each value is encoded as the quote_plus of its string form given by built-in str, whether the value is a sequence or not.
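
A typical use of urlencode is to build the query part of a URL for a GET request. The following sketch assumes a made-up set of query fields and appends them to the example URL used earlier in this chapter. The import fallback is an assumption for Python 3.

```python
try:
    from urllib import urlencode          # Python 2, as used in this chapter
except ImportError:
    from urllib.parse import urlencode    # assumption: Python 3 fallback

base = 'http://www.python.org/faq.cgi'

# A sequence of pairs keeps the fields in a predictable order.
query = urlencode([('src', 'fie'), ('q', 'two words'), ('page', 2)])
get_url = base + '?' + query
# get_url is 'http://www.python.org/faq.cgi?src=fie&q=two+words&page=2'
```

The same string, passed as the data argument of urlopen rather than appended to the URL, would serve as the body of a POST request.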

urlopen

urlopen(urlstring,data=None)

Accesses the given URL and returns a read-only file-like object f. f supplies file-like methods read, readline, readlines, and close, as well as two others:

f.geturl( ): Returns the URL of f. This may differ from urlstring both because of normalization (as mentioned for function urlunsplit earlier) and because the server may issue HTTP redirects (i.e., indications that the requested data is located elsewhere). urllib supports redirects transparently, and method geturl lets you check for them if you want.
f.info( ): Returns an instance m of class Message of module mimetools, covered in Chapter 21. The main use of m is as a container of headers holding metadata about f. For example, m['Content-Type'] is the MIME type and subtype of the data in f. You can also access this information by calling m's methods m.gettype( ), m.getmaintype( ), and m.getsubtype( ).

When data is None and urlstring's scheme is http, urlopen sends a GET request. When data is not None, urlstring's scheme must be http, and urlopen sends a POST request. data must then be in URL-encoded form, and you normally prepare it with function urlencode, covered earlier in this section.

urlopen can transparently use proxies that do not require authentication. Set environment variables http_proxy, ftp_proxy, and gopher_proxy to the proxies' URLs to exploit this. You normally perform such settings in your system's environment, in platform-dependent ways, before you start Python. On the Macintosh only, urlopen transparently and implicitly retrieves proxy URLs from your Internet configuration settings. urlopen does not support proxies that require authentication—for such advanced needs, use the richer and more complicated library module urllib2, covered in a moment.
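
Because file is one of the supported schemes, you can try out the file-like object that urlopen returns without any network access. The sketch below is only an illustration under stated assumptions: it writes a small temporary file to stand in for a remote resource, uses pathname2url (a real urllib function not covered in this section) to build the URL, needs Python 2.6 or later for the b'...' literals, and falls back to urllib.request on Python 3.

```python
import os
import tempfile

try:
    from urllib import urlopen, pathname2url            # Python 2
except ImportError:
    from urllib.request import urlopen, pathname2url    # assumption: Python 3

# A small local file stands in for a remote resource.
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'first line\nsecond line\n')
os.close(fd)

f = urlopen('file:' + pathname2url(path))
try:
    first = f.readline()      # file-like methods work as usual
    rest = f.read()
    actual_url = f.geturl()   # the URL actually used
    headers = f.info()        # header container with metadata about f
finally:
    f.close()
    os.remove(path)
```

On Python 3 the data read from f is a bytes string, which is why the sketch writes and compares bytes.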

urlretrieve

urlretrieve(urlstring,filename=None,reporthook=None,data=None)

Similar to urlopen(urlstring,data), but instead returns a pair (f,m). f is a string that specifies the path to a file on the local filesystem. m is an instance of class Message of module mimetools, like the result of method info called on the result value of urlopen, covered earlier in this section.

When filename is None, urlretrieve copies retrieved data to a temporary local file, and f is the path to the temporary local file. When filename is not None, urlretrieve copies retrieved data to the file named filename, and f is filename. When reporthook is not None, it must be a callable with three arguments, as in the function:

def reporthook(block_count, block_size, file_size):
    print block_count

urlretrieve calls reporthook zero or more times while retrieving data. At each call, it passes block_count, the number of blocks of data retrieved so far; block_size, the size in bytes of each block; and file_size, the total size of the file in bytes. urlretrieve passes file_size as -1 when unable to determine file size, which depends on the protocol involved and on how completely the server implements that protocol. The purpose of reporthook is to let your program give graphical or textual feedback to the user about the progress of the file retrieval operation that urlretrieve performs.
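
As a sketch of what a more useful reporthook could do, the following code turns the three arguments into a percentage. The helper percent_done and the list progress are hypothetical names introduced for illustration. The loop at the end only simulates the calls that urlretrieve would make, with made-up sizes, so the snippet runs without any network access.

```python
def percent_done(block_count, block_size, file_size):
    """Return progress as an integer percentage, or None if size is unknown."""
    if file_size <= 0:                 # urlretrieve passes -1 when unknown
        return None
    done = min(block_count * block_size, file_size)
    return 100 * done // file_size

progress = []    # a real program might update a progress bar instead

def reporthook(block_count, block_size, file_size):
    progress.append(percent_done(block_count, block_size, file_size))

# Simulated calls for a 20000-byte file fetched in 8192-byte blocks.
for count in range(4):
    reporthook(count, 8192, 20000)
# progress is [0, 40, 81, 100]
```

To use the hook for a real retrieval, you would pass it as the reporthook argument of urlretrieve.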

18.1.2.2 The FancyURLopener class

You normally use module urllib through the functions it supplies (most often urlopen). To customize urllib's functionality, however, you can subclass urllib's FancyURLopener class and bind an instance of your subclass to attribute _urlopener of module urllib. The customizable aspects of an instance f of a subclass of FancyURLopener are the following.

prompt_user_passwd

f.prompt_user_passwd(host,realm)

Returns a pair (user,password) to use to authenticate access to host in the security realm. The default implementation in class FancyURLopener prompts the user for this data in interactive text mode. Your subclass can override this method for such purposes as interacting with the user via a GUI or fetching authentication data from persistent storage.

version

f.version

The string that f uses to identify itself to the server, for example via the User-Agent header in the HTTP protocol. You can override this attribute by subclassing, or rebind it directly on an instance of FancyURLopener.

18.1.3 The urllib2 Module

The urllib2 module is a rich, highly customizable superset of module urllib. urllib2 lets you work directly with rather advanced aspects of protocols such as HTTP. For example, you can send requests with customized headers as well as URL-encoded POST bodies, and handle authentication in various realms, in both Basic and Digest forms, directly or via HTTP proxies.

In the rest of this section, I cover only the ways in which urllib2 lets your program customize these advanced aspects of URL retrieval. I do not try to impart the advanced knowledge of HTTP and other network protocols, independent of Python, that you need to make full use of urllib2's rich functionality. As an HTTP tutorial, I recommend Python Web Programming, by Steve Holden (New Riders): it offers good coverage of HTTP basics with examples coded in Python, and a good bibliography if you need further details about network protocols.

18.1.3.1 Functions

urllib2 supplies a function urlopen basically identical to urllib's urlopen. To customize urllib2's behavior, you can install, before calling urlopen, any number of handlers grouped into an opener using the build_opener and install_opener functions.

You can also optionally pass to urlopen an instance of class Request instead of a URL string. Such an instance may include both a URL string and supplementary information on how to access it, as covered shortly in Section 18.1.3.2.

build_opener

build_opener(*handlers)

Creates and returns an instance of class OpenerDirector, covered later in this chapter, with the given handlers. Each handler can be a subclass of class BaseHandler, instantiable without arguments, or an instance of such a subclass, however instantiated. build_opener adds instances of various handler classes provided by module urllib2 in front of the handlers you specify, to handle proxies, unknown schemes, the http, file, and https schemes, HTTP errors, and HTTP redirects. However, if you have instances or subclasses of said classes in handlers, this indicates that you want to override these defaults.

install_opener

install_opener(opener)

Installs opener as the opener for further calls to urlopen. opener can be an instance of class OpenerDirector, such as the result of a call to function build_opener, or any signature-compatible object.
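
The following sketch shows build_opener and install_opener working together, and how a subclass of a standard handler class overrides the default. The class FewRedirects is hypothetical. It relies on the max_redirections class attribute of HTTPRedirectHandler, which exists in the library source but is not covered in this section, so treat that detail as an assumption. The import fallback is also an assumption, for Python 3.

```python
try:
    import urllib2                        # Python 2, as used in this chapter
except ImportError:
    import urllib.request as urllib2      # assumption: Python 3 fallback

class FewRedirects(urllib2.HTTPRedirectHandler):
    """Hypothetical handler that gives up after fewer redirects."""
    max_redirections = 3

# Passing the class is enough: build_opener instantiates it without
# arguments and leaves out the default HTTPRedirectHandler.
opener = urllib2.build_opener(FewRedirects)

# From here on, urllib2.urlopen goes through this opener.
urllib2.install_opener(opener)

redirect_handlers = [h for h in opener.handlers
                     if isinstance(h, urllib2.HTTPRedirectHandler)]
```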

urlopen

urlopen(url,data=None)

Almost identical to the urlopen function in module urllib. However, you customize behavior via the opener and handler classes of urllib2, covered later in this chapter, rather than via class FancyURLopener as in module urllib. Argument url can be a URL string, as for the urlopen function in module urllib. Alternatively, url can be an instance of class Request, covered in the next section.

18.1.3.2 The Request class

You can optionally pass to function urlopen an instance of class Request instead of a URL string. Such an instance can embody both a URL and, optionally, other information on how to access the target URL.

Request

class Request(urlstring,data=None,headers={})

urlstring is the URL that this instance of class Request embodies. For example, if there are no data and headers, calling:

urllib2.urlopen(urllib2.Request(urlstring))

is just like calling:

urllib2.urlopen(urlstring)

When data is not None, the Request constructor implicitly calls on the new instance r its method r.add_data(data). headers must be a mapping of header names to header values. The Request constructor executes the equivalent of the loop:

for k,v in headers.items(  ): r.add_header(k,v)

An instance r of class Request supplies the following methods.

add_data

r.add_data(data)

Sets data as r's data. Calling urlopen(r) then becomes like calling urlopen(r,data), i.e., it requires r's scheme to be http, and uses a POST request with a body of data, which must be a URL-encoded string.

Despite its name, method add_data does not necessarily add the data. If r already had data, set in r's constructor or by previous calls to r.add_data, the latest call to r.add_data replaces the previous value of r's data with the new given one. In particular, r.add_data(None) removes r's previous data, if any.

add_header

r.add_header(key,value)

Adds a header with the given key and value to r's headers. If r's scheme is http, r's headers are sent as part of the request. When you add more than one header with the same key, later additions overwrite previous ones, so out of all headers with one given key, only the one given last matters.
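
A Request instance can be built and inspected without sending anything, as in this minimal sketch. The header values are made up, get_header is a real Request method that this section does not cover, and the import fallback is an assumption for Python 3.

```python
try:
    import urllib2                        # Python 2, as used in this chapter
except ImportError:
    import urllib.request as urllib2      # assumption: Python 3 fallback

r = urllib2.Request('http://www.python.org/faq.cgi?src=fie',
                    headers={'User-Agent': 'ExampleClient/0.1'})  # made-up agent

r.add_header('Accept', 'text/html')
r.add_header('Accept', 'text/plain')      # overwrites the previous Accept

full_url = r.get_full_url()
accept = r.get_header('Accept')           # only the last value remains
# accept is 'text/plain'
```

Passing r to urlopen would then send the request with these headers.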

get_data

r.get_data(  )

Returns the data of r, either None or a URL-encoded string.

get_full_url

r.get_full_url(  )

Returns the URL of r, as given in the constructor for r.

get_host

r.get_host(  )

Returns the host component of r's URL.

get_selector

r.get_selector(  )

Returns the selector components of r's URL (i.e., the path and all following components).

get_type

r.get_type(  )

Returns the scheme component of r's URL (i.e., the protocol).

has_data

r.has_data(  )

Returns a true value when r has data, like the expression r.get_data( ) is not None.

set_proxy

r.set_proxy(host,scheme)

Sets r to use a proxy at the given host and scheme for accessing r's URL.

18.1.3.3 The OpenerDirector class

An instance d of class OpenerDirector collects instances of handler classes and orchestrates their use to open URLs of various schemes and to handle errors. Normally, you create d by calling function build_opener, and then install it by calling function install_opener. For advanced uses, you may also access various attributes and methods of d, but this is a rare need and I do not cover it further in this book.

18.1.3.4 Handler classes

Module urllib2 supplies a class BaseHandler to use as the superclass of any custom handler classes you write. urllib2 also supplies many concrete subclasses of BaseHandler that handle schemes gopher, ftp, http, https, and file, as well as authentication, proxies, redirects, and errors. Writing custom handlers is an advanced topic and I do not cover it further in this book.

18.1.3.5 Handling authentication

urllib2's default opener does not include authentication handlers. To get authentication, call build_opener to build an opener that includes instances of classes HTTPBasicAuthHandler, ProxyBasicAuthHandler, HTTPDigestAuthHandler, and/or ProxyDigestAuthHandler, depending on whether you need the authentication to be directly in HTTP or to a proxy, and on whether you need Basic or Digest authentication.

To instantiate each of these authentication handlers, use an instance x of class HTTPPasswordMgrWithDefaultRealm as the only argument to the authentication handler's constructor. You normally use the same x to instantiate all the authentication handlers you need. To record users and passwords for given authentication realms and URLs, call x.add_password one or more times.

add_password

x.add_password(realm,URLs,user,password)

Records in x the pair (user,password) as the authentication in the given realm of applicable URLs, as determined by argument URLs. realm is either a string, the name of an authentication realm, or None, to apply this authentication as the default for any realm not specifically recorded. URLs is a URL string or a sequence of URL strings. A URL u is deemed applicable for this authentication if there is an item u1 of URLs such that the location components of u and u1 are equal, and the path component of u1 is a prefix of that of u. Note that other components (scheme, query, and fragment) don't matter to applicability for authentication purposes.

The following example shows how to use urllib2 with basic HTTP authentication:

import urllib2

x = urllib2.HTTPPasswordMgrWithDefaultRealm(  )
x.add_password(None, 'http://myhost.com/', 'auser',
               'apassword')
auth = urllib2.HTTPBasicAuthHandler(x)
opener = urllib2.build_opener(auth)
urllib2.install_opener(opener)

flob = urllib2.urlopen('http://myhost.com/index.html')
for line in flob.readlines(  ): print line,
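
The applicability rule for add_password (same location, and a path that starts with the recorded path) can be checked without contacting any server. The sketch below uses find_user_password, a real method of the password manager that the handlers call and that this section does not otherwise cover. The realm name and URLs are made up, and the import fallback is an assumption for Python 3.

```python
try:
    import urllib2                        # Python 2, as used in this chapter
except ImportError:
    import urllib.request as urllib2      # assumption: Python 3 fallback

x = urllib2.HTTPPasswordMgrWithDefaultRealm()
x.add_password(None, 'http://myhost.com/private/', 'auser', 'apassword')

# Same location, path under /private/: the default-realm entry applies.
inside = x.find_user_password('Any Realm',
                              'http://myhost.com/private/page.html')

# Same location, but the recorded path is not a prefix: no match.
outside = x.find_user_password('Any Realm',
                               'http://myhost.com/public/page.html')

# Different location: no match.
elsewhere = x.find_user_password('Any Realm',
                                 'http://otherhost.com/private/')
# inside is ('auser', 'apassword'); the other two are (None, None)
```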