18.1 URL Access
A URL identifies a resource on the Internet. A URL is a string composed
of several optional parts, called components, known as scheme,
location, path, query, and fragment. A URL with all its parts looks
something like:
scheme://lo.ca.ti.on/pa/th?query#fragment
For example, in http://www.python.org:80/faq.cgi?src=fie, the
scheme is http, the location is
www.python.org:80, the path is
/faq.cgi, the query is
src=fie, and there is no fragment. Some of the
punctuation characters form a part of one of the components they
separate, while others are just separators and are part of no
component. Omitting punctuation implies missing components. For
example, in mailto:me@you.com,
the scheme is mailto, the path is
me@you.com, and there is no location, query, or
fragment. The missing // means the URL has no
location part, the missing ? means it has no
query part, and the missing # means it has no
fragment part.
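As a quick check of the component rules above, the following sketch splits both example URLs with function urlsplit of module urlparse. The fallback import for Python 3, where the same function lives in urllib.parse, is an assumption added so that the snippet runs on either version.

```python
try:
    from urlparse import urlsplit  # Python 2, as used in this chapter
except ImportError:
    from urllib.parse import urlsplit  # assumption: Python 3 equivalent

# Each result holds scheme, location, path, query, and fragment, in order.
http_parts = tuple(urlsplit('http://www.python.org:80/faq.cgi?src=fie'))
mail_parts = tuple(urlsplit('mailto:me@you.com'))

print(http_parts)
print(mail_parts)
```

Missing components show up as empty strings, so the mailto URL yields an empty location, query, and fragment.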
18.1.1 The urlparse Module
The urlparse module
supplies functions to analyze and synthesize URL strings. In Python
2.2, the most frequently used functions of module
urlparse are urljoin,
urlsplit, and urlunsplit.
urljoin(base_url_string,relative_url_string)
Returns a URL string u, obtained by
joining relative_url_string, which may be
relative, with base_url_string. The
joining procedure that urljoin performs to obtain
its result u may be summarized as follows:
When either of the argument strings is empty,
u is the other argument.
When relative_url_string explicitly
specifies a scheme different from that of
base_url_string,
u is
relative_url_string. Otherwise,
u's scheme is that of
base_url_string.
When the scheme does not allow relative URLs (e.g.,
mailto), or
relative_url_string explicitly specifies a
location (even when it is the same as the location of
base_url_string), all other components of
u are those of
relative_url_string. Otherwise,
u's location is that of
base_url_string.
u's path is obtained by
joining the paths of base_url_string and
relative_url_string according to standard
syntax for absolute and relative URL paths. For example:
import urlparse
urlparse.urljoin(
    'http://somehost.com/some/path/here',
    '../other/path')
# Result is: 'http://somehost.com/some/other/path'
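The joining rules summarized above can be exercised one at a time. This is only a sketch: the host elsewhere.org and the relative paths are made-up values, and the Python 3 fallback import is an assumption added for portability.

```python
try:
    from urlparse import urljoin  # Python 2, as used in this chapter
except ImportError:
    from urllib.parse import urljoin  # assumption: Python 3 equivalent

base = 'http://somehost.com/some/path/here'

# An empty argument yields the other argument.
empty_relative = urljoin(base, '')
# A different explicit scheme means the second argument is the result.
other_scheme = urljoin(base, 'ftp://elsewhere.org/pub/file')
# An explicit location keeps only the scheme of the base.
other_location = urljoin(base, '//elsewhere.org/index.html')
# Otherwise the paths are joined; a leading slash restarts at the root.
sibling = urljoin(base, 'there')
from_root = urljoin(base, '/top')

for result in (empty_relative, other_scheme, other_location,
               sibling, from_root):
    print(result)
```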
urlsplit(url_string,default_scheme='',allow_fragments=True)
Analyzes url_string and returns a tuple
with five string items: scheme, location, path, query, and fragment.
default_scheme is the first item when the
url_string lacks a scheme. When
allow_fragments is
False, the tuple's last item is
always '', whether or not
url_string has a fragment. Items
corresponding to missing parts are always ''. For
example:
urlparse.urlsplit(
'http://www.python.org:80/faq.cgi?src=fie')
# Result is:
# ('http','www.python.org:80','/faq.cgi','src=fie','')
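The two optional arguments are easiest to understand by example. The sketch below passes them positionally; the URLs are illustrative, and the Python 3 fallback import is an assumption added for portability.

```python
try:
    from urlparse import urlsplit  # Python 2, as used in this chapter
except ImportError:
    from urllib.parse import urlsplit  # assumption: Python 3 equivalent

# The URL string has no scheme, so the second argument supplies the first item.
with_default = tuple(urlsplit('//www.python.org/faq.cgi', 'http'))

# With fragments disallowed, the last item is '' even though the
# URL string contains '#top'.
no_fragment = tuple(urlsplit('http://www.python.org/faq.cgi#top', '', False))

print(with_default)
print(no_fragment)
```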
urlunsplit(url_tuple)
url_tuple is a tuple with exactly five
items, all strings. For example, any return value from a
urlsplit call is an acceptable argument for
urlunsplit. urlunsplit returns
a URL string with the given components and the needed separators, but
with no redundant separators (e.g., there is no #
in the result when the fragment,
url_tuple's last item, is
''). For example:
urlparse.urlunsplit(('http','www.python.org:80',
'/faq.cgi','src=fie',''))
# Result is: 'http://www.python.org:80/faq.cgi?src=fie'
urlunsplit(urlsplit(x))
returns a normalized form of URL string x,
not necessarily equal to x because
x need not be normalized. For example:
urlparse.urlunsplit(
urlparse.urlsplit('http://a.com/path/a?'))
# Result is: 'http://a.com/path/a'
In this case, the normalization ensures that redundant separators,
such as the trailing ? in the argument to
urlsplit, are not present in the result.
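A common use of the urlsplit and urlunsplit pair is to change one component and leave the others alone. The helper replace_query below is hypothetical (it is not part of module urlparse), and the Python 3 fallback import is an assumption added for portability.

```python
try:
    from urlparse import urlsplit, urlunsplit  # Python 2
except ImportError:
    from urllib.parse import urlsplit, urlunsplit  # assumption: Python 3

def replace_query(url_string, new_query):
    """Return url_string with its query component replaced by new_query."""
    scheme, location, path, query, fragment = urlsplit(url_string)
    return urlunsplit((scheme, location, path, new_query, fragment))

original = 'http://www.python.org:80/faq.cgi?src=fie'
changed = replace_query(original, 'src=other')
# An empty query leaves no trailing '?' in the result.
cleared = replace_query(original, '')

print(changed)
print(cleared)
```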
Module urlparse also supplies functions
urlparse and urlunparse. In
Python 2.1, urlparse did not supply
urlsplit and urlunsplit, so you
had to use urlparse and
urlunparse instead. urlparse
and urlunparse are akin to
urlsplit and urlunsplit, but
are based on six components rather than five. The
parse functions insert a
parameters component between
path and query
using an older standard for URLs, where parameters applied to the
entire path. According to the current standard, parameters apply to
each part of the path separately. Therefore, the path URL component
may now include parameters to subdivide in further phases of the
analysis. For example:
urlparse.urlsplit('http://a.com/path;with/some;params?anda=query')
# Result is: ('http','a.com','/path;with/some;params','anda=query','')
urlparse.urlparse('http://a.com/path;with/some;params?anda=query')
# Result is: ('http','a.com','/path;with/some','params','anda=query','')
In this code, urlparse is able to split off the
';params' part of the parameters, but considers
the '/path;with/some' substring to be the path.
urlsplit considers the entire
'/path;with/some;params' to be the path, returned
as the third item in the result tuple. Should you then need to
separate the 'with' and
'params' parameters parts of the path component,
you can perform further string processing on the third item of
urlsplit's return tuple, such as
splitting on / and then on ;.
In practice, very few URLs on the Net make use of parameters, so you
may not care about this subtle distinction.
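The further string processing mentioned above might look like the following sketch. The function name split_path_parameters is hypothetical, and the input is the path that urlsplit returned in the preceding example.

```python
def split_path_parameters(path):
    """Split a URL path on '/' and then split each segment on ';'.

    Each item of the result is a list whose first element is the segment
    name and whose remaining elements are that segment's parameters.
    """
    return [segment.split(';') for segment in path.split('/')]

segments = split_path_parameters('/path;with/some;params')
# The leading '/' produces an empty first segment.
print(segments)
```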
18.1.2 The urllib Module
The
urllib module supplies simple functions to read
data from URLs. urllib supports the following
protocols (schemes): http,
https, ftp,
gopher, and file.
file indicates a local file.
urllib uses file as the
default scheme for URLs that lack an explicit scheme. You can find
simple, typical examples of urllib use in Chapter 22 and Chapter 23, where
urllib.urlopen is used to fetch HTML and XML pages
that various examples parse and analyze.
18.1.2.1 Functions
Module urllib supplies a number of functions, with
urlopen being the most frequently used.
quote(str,safe='/')
Returns a copy of str where special
characters are changed into Internet-standard quoted form
%xx. Does not quote
alphanumeric characters, any of the characters
'_,.-', nor any of the characters in string
safe. A space is not exempt: quote changes it into %20.
quote_plus(str, safe='/')
Like quote, but also changes spaces into plus
signs.
unquote(str)
Returns a copy of str where each quoted
form %xx is changed
into the corresponding character.
unquote_plus(str)
Like unquote, but also changes plus signs into
spaces.
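The four quoting functions pair up as shown in this sketch. The sample string is arbitrary, and the Python 3 fallback import (where these functions live in urllib.parse) is an assumption added for portability.

```python
try:
    from urllib import quote, quote_plus, unquote, unquote_plus  # Python 2
except ImportError:
    # assumption: Python 3 equivalents
    from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = 'a b&c'

quoted = quote(text)            # the space becomes %20
plus_quoted = quote_plus(text)  # the space becomes a plus sign

# Each unquoting function reverses the matching quoting function.
round_trip = unquote(quoted)
plus_round_trip = unquote_plus(plus_quoted)

# unquote leaves plus signs alone; only unquote_plus maps them to spaces.
plus_untouched = unquote('a+b')

print(quoted)
print(plus_quoted)
```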
urlcleanup( )
Clears the cache of function urlretrieve, covered
later in this section.
urlencode(query,doseq=False)
Returns a string with the URL-encoded form of
query. query
can be either a sequence of
(name,
value) pairs, or a
mapping, in which case the resulting string encodes the
mapping's
(key,
value) pairs. For
example:
urllib.urlencode([('ans',42),('key','val')])
# 'ans=42&key=val'
urllib.urlencode({'ans':42, 'key':'val'})
# 'key=val&ans=42'
Remember that the order of items in a dictionary is not defined: if
you need the URL-encoded form to have the key/value pairs in a
specific order, use a sequence as the
query argument, as in the first call in
this example.
When doseq is true, any
value in query
that is a sequence is encoded as separate parameters, one per item in
value. For example:
urllib.urlencode([('K',('x','y','z'))],1)
# 'K=x&K=y&K=z'
urllib.urlencode([('K',('x','y','z'))],0)
# 'K=%28%27x%27%2C+%27y%27%2C+%27z%27%29'
When doseq is false (the default), each
value is encoded as the quote_plus of its string
form given by built-in str, whether the value is a
sequence or not.
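A typical use of urlencode is to build the query part of a URL for a GET request. The helper build_query_url below is hypothetical, the parameter names are made up, and the Python 3 fallback import is an assumption added for portability.

```python
try:
    from urllib import urlencode  # Python 2, as used in this chapter
except ImportError:
    from urllib.parse import urlencode  # assumption: Python 3 equivalent

def build_query_url(base_url, pairs):
    """Append the URL-encoded pairs to base_url as its query component."""
    return base_url + '?' + urlencode(pairs)

# A sequence of pairs keeps the parameters in a predictable order.
url = build_query_url('http://www.python.org/faq.cgi',
                      [('src', 'fie'), ('note', 'two words')])
print(url)
```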
urlopen(urlstring,data=None)
Accesses the given URL and returns a read-only file-like object
f. f supplies
file-like methods read,
readline, readlines, and
close, as well as two
others:
- f.geturl( )
Returns the URL of
f. This may differ from
urlstring both because of normalization
(as mentioned for function urlunsplit earlier) and
because the server may issue HTTP redirects (i.e., indications that
the requested data is located elsewhere). urllib
supports redirects transparently, and method
geturl lets you check for them if you
want.
- f.info( )
Returns an instance m of class
Message of module mimetools,
covered in Chapter 21. The main use of
m is as a container of headers holding
metadata about f. For example,
m['Content-Type'] is
the MIME type and subtype of the data in
f. You can also access this information by
calling m's methods
m.gettype( ),
m.getmaintype( ), and
m.getsubtype( ).
When data is None and
urlstring's scheme is
http, urlopen sends a
GET request. When data
is not None,
urlstring's scheme must
be http, and urlopen sends a
POST request. data must then be in
URL-encoded form, and you normally prepare it with function
urlencode, covered earlier in this section.
urlopen can transparently use proxies that do not
require authentication. Set environment variables
http_proxy, ftp_proxy, and
gopher_proxy to the proxies' URLs
to exploit this. You normally perform such settings in your
system's environment, in platform-dependent ways,
before you start Python. On the Macintosh only,
urlopen transparently and implicitly retrieves
proxy URLs from your Internet configuration settings.
urlopen does not support proxies that require
authentication—for such advanced needs, use the richer and more
complicated library module urllib2, covered in a
moment.
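The following sketch shows the file-like object that urlopen returns, together with its geturl and info methods. To stay independent of any network, it reads a temporary local file through the file scheme; it assumes a POSIX-style path (so that 'file://' plus the path forms a valid URL), and the helper name fetch and the Python 3 fallback import are likewise assumptions.

```python
import os
import tempfile

try:
    from urllib import urlopen  # Python 2, as used in this chapter
except ImportError:
    from urllib.request import urlopen  # assumption: Python 3 equivalent

def fetch(urlstring):
    """Return (final URL, headers object, body) for urlstring."""
    f = urlopen(urlstring)
    try:
        return f.geturl(), f.info(), f.read()
    finally:
        f.close()

# Write a small temporary file, then read it back through a file: URL.
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'hello from a local file\n')
os.close(fd)
try:
    final_url, headers, body = fetch('file://' + path)
finally:
    os.remove(path)

print(final_url)
print(headers['Content-Type'])
```

The same fetch function works unchanged with an http URL, in which case geturl reflects any redirects that were followed.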
urlretrieve(urlstring,filename=None,reporthook=None,data=None)
Similar to
urlopen(urlstring,data),
but instead returns a pair
(f,m).
f is a string that specifies the path to a
file on the local filesystem. m is an
instance of class Message of module
mimetools, like the result of method
info called on the result value of
urlopen, covered earlier in this section.
When filename is None,
urlretrieve copies retrieved data to a temporary
local file, and f is the path to the
temporary local file. When filename is not
None, urlretrieve copies
retrieved data to the file named filename,
and f is
filename. When
reporthook is not None,
it must be a callable with three arguments, as in the function:
def reporthook(block_count, block_size, file_size):
    print block_count
urlretrieve calls
reporthook zero or more times while
retrieving data. At each call, it passes
block_count, the number of blocks of data
retrieved so far; block_size, the size in
bytes of each block; and file_size, the
total size of the file in bytes. urlretrieve
passes file_size as -1
when unable to determine file size, which depends on the protocol
involved and on how completely the server implements that protocol.
The purpose of reporthook is to let your
program give graphical or textual feedback to the user about the
progress of the file retrieval operation that
urlretrieve performs.
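As an illustration of textual feedback, here is a minimal sketch of a hook that reports a percentage when the file size is known. The helper percent_done is hypothetical, and the URL and filename in the final comment are placeholders.

```python
def percent_done(block_count, block_size, file_size):
    """Return an integer percentage, or None when file_size is unknown."""
    if file_size <= 0:
        # urlretrieve passes -1 when it cannot determine the size.
        return None
    received = block_count * block_size
    # The last block is usually partial, so cap the estimate at 100.
    return min(100, received * 100 // file_size)

def reporthook(block_count, block_size, file_size):
    percent = percent_done(block_count, block_size, file_size)
    if percent is None:
        print('%d blocks received' % block_count)
    else:
        print('%d%% received' % percent)

# Hypothetical usage:
# urllib.urlretrieve('http://www.example.com/big.dat', 'big.dat', reporthook)
```

Keeping the arithmetic in a separate function makes it easy to reuse from a GUI progress bar instead of print.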
18.1.2.2 The FancyURLopener class
You normally use module urllib through the
functions it supplies (most often urlopen). To
customize urllib's functionality,
however, you can subclass
urllib's
FancyURLopener class and bind an instance of your
subclass to attribute _urlopener of module
urllib. The customizable aspects of an instance
f of a subclass of
FancyURLopener are the
following.
f.prompt_user_passwd(host,realm)
Returns a pair
(user,password)
to use to authenticate access to host in
the security realm. The default
implementation in class FancyURLopener prompts the
user for this data in interactive text mode. Your subclass can
override this method for such purposes as interacting with the user
via a GUI or fetching authentication data from persistent storage.
f.version
The string that f uses to identify itself to
the server, for example via the User-Agent header in the HTTP
protocol. You can override this attribute by subclassing, or rebind
it directly on an instance of
FancyURLopener.
18.1.3 The urllib2 Module
The urllib2 module is
a rich, highly customizable superset of module
urllib. urllib2 lets you work
directly with rather advanced aspects of protocols such as HTTP. For
example, you can send requests with customized headers as well as
URL-encoded POST bodies, and handle authentication in various realms,
in both Basic and Digest forms, directly or via HTTP proxies.
In the rest of this section, I cover only the ways in which
urllib2 lets your program customize these advanced
aspects of URL retrieval. I do not try to impart the advanced
knowledge of HTTP and other network protocols, independent of Python,
that you need to make full use of
urllib2's rich functionality. As
an HTTP tutorial, I recommend Python Web
Programming, by Steve Holden (New Riders): it offers good
coverage of HTTP basics with examples coded in Python, and a good
bibliography if you need further details about network protocols.
18.1.3.1 Functions
urllib2 supplies a function
urlopen basically identical to
urllib's
urlopen. To customize
urllib2's behavior, you can
install, before calling urlopen, any number of
handlers grouped into an opener using the
build_opener and install_opener
functions.
You can also optionally pass to urlopen an
instance of class Request instead of a URL string.
Such an instance may include both a URL string and supplementary
information on how to access it, as covered shortly in Section 18.1.3.2.
build_opener(*handlers)
Creates and returns an instance of class
OpenerDirector, covered later in this chapter,
with the given handlers. Each handler can
be a subclass of class BaseHandler, instantiable
without arguments, or an instance of such a subclass, however
instantiated. build_opener adds instances of
various handler classes provided by module urllib2
in front of the handlers you specify, to handle proxies, unknown
schemes, the http, file,
and https schemes, HTTP errors, and HTTP
redirects. However, if you have instances or subclasses of said
classes in handlers, this indicates that
you want to override these defaults.
install_opener(opener)
Installs opener as the opener for further
calls to urlopen.
opener can be an instance of class
OpenerDirector, such as the result of a call to
function build_opener, or any signature-compatible
object.
urlopen(url,data=None)
Almost identical to the urlopen function in module
urllib. However, you customize behavior via the
opener and handler classes of urllib2, covered
later in this chapter, rather than via class
FancyURLopener as in module
urllib. Argument url
can be a URL string, like for the urlopen function
in module urllib. Alternatively,
url can be an instance of class
Request, covered in the next section.
18.1.3.2 The Request class
You can optionally pass to function urlopen an
instance of class Request instead of a URL string.
Such an instance can embody both a URL and, optionally, other
information on how to access the target
URL.
class Request(urlstring,data=None,headers={})
urlstring is the URL that this instance of
class Request embodies. For example, if there are
no data and
headers, calling:
urllib2.urlopen(urllib2.Request(urlstring))
is just like calling:
urllib2.urlopen(urlstring)
When data is not None,
the Request constructor implicitly calls on the
new instance r its method
r.add_data(data).
headers must be a mapping of header names
to header values. The Request constructor executes
the equivalent of the loop:
for k,v in headers.items( ): r.add_header(k,v)
An instance r of class
Request supplies the following methods.
r.add_data(data)
Sets data as
r's data. Calling
urlopen(r)
then becomes like calling
urlopen(r,data),
i.e., it requires r's
scheme to be http, and uses a POST request with
a body of data, which must be a
URL-encoded string.
Despite its name, method add_data does not
necessarily add the data. If
r already had data, set in
r's constructor or by
previous calls to
r.add_data, the latest
call to r.add_data
replaces the previous value of
r's data with the new
given one. In particular,
r.add_data(None)
removes r's previous
data, if any.
r.add_header(key,value)
Adds a header with the given key and
value to
r's headers. If
r's scheme is
http,
r's headers are sent as
part of the request. When you add more than one header with the same
key, later additions overwrite previous
ones, so out of all headers with one given
key, only the one given last matters.
r.get_data( )
Returns the data of r, either
None or a URL-encoded string.
r.get_full_url( )
Returns the URL of r, as given in the
constructor for r.
r.get_host( )
Returns the host component of
r's URL.
r.get_selector( )
Returns the selector components of
r's URL (i.e., the path
and all following components).
r.get_type( )
Returns the scheme component of
r's URL (i.e., the
protocol).
r.has_data( )
Like r.get_data( ) is not None.
r.set_proxy(host,scheme)
Sets r to use a proxy at the given
host and scheme
for accessing r's URL.
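The sketch below builds a Request without opening it, to show how headers given to the constructor and to add_header interact. The header name X-Demo is made up, inspecting the headers attribute relies on an implementation detail rather than on a method covered above, and the Python 3 fallback import (class Request of urllib.request) is an assumption added for portability.

```python
try:
    from urllib2 import Request  # Python 2, as used in this chapter
except ImportError:
    from urllib.request import Request  # assumption: Python 3 equivalent

# Pass the headers mapping as the third constructor argument.
r = Request('http://www.python.org/faq.cgi', None, {'X-Demo': 'first'})

# Adding a header with the same key overwrites the earlier value.
r.add_header('X-Demo', 'second')

full_url = r.get_full_url()
# r.headers is the mapping in which the instance keeps its headers
# (an implementation detail, used here only for inspection).
header_values = list(r.headers.values())

print(full_url)
print(header_values)
```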
18.1.3.3 The OpenerDirector class
An
instance d of class
OpenerDirector collects instances of handler
classes and orchestrates their use to open URLs of various schemes
and to handle errors. Normally, you create
d by calling function
build_opener, and then install it by calling
function install_opener. For advanced uses, you
may also access various attributes and methods of
d, but this is a rare need and I do not
cover it further in this book.
18.1.3.4 Handler classes
Module
urllib2 supplies a class
BaseHandler to use as the superclass of any custom
handler classes you write. urllib2 also supplies
many concrete subclasses of BaseHandler that
handle schemes gopher, ftp,
http, https, and
file, as well as authentication, proxies,
redirects, and errors. Writing custom handlers is an advanced topic
and I do not cover it further in this book.
18.1.3.5 Handling authentication
urllib2's default opener does not include authentication handlers. To get
authentication, call build_opener to build an
opener that includes instances of classes
HTTPBasicAuthHandler,
ProxyBasicAuthHandler,
HTTPDigestAuthHandler, and/or
ProxyDigestAuthHandler, depending on whether you
need the authentication to be directly in HTTP or to a proxy, and on
whether you need Basic or Digest authentication.
To instantiate each of these
authentication handlers, use an instance x
of class HTTPPasswordMgrWithDefaultRealm as the
only argument to the authentication handler's
constructor. You normally use the same x
to instantiate all the authentication handlers you need. To record
users and passwords for given authentication realms and URLs, call
x.add_password one or
more times.
x.add_password(realm,URLs,user,password)
Records
in x the pair
(user,password)
as the authentication in the given realm
of applicable URLs, as determined by argument
URLs. realm is
either a string, the name of an authentication realm, or
None, to apply this authentication as the default
for any realm not specifically recorded.
URLs is a URL string or a sequence of URL
strings. A URL u is deemed applicable for
this authentication if there is an item u1
of URLs such that the
location components of
u and u1 are
equal, and the path component of
u1 is a prefix of that of
u. Note that other components (scheme,
query, and fragment) don't matter to applicability
for authentication purposes.
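The applicability rule just described can be written out as a small function. This is only an illustration of the rule as stated here, not the code that urllib2 itself runs; the name is_applicable is hypothetical, and the Python 3 fallback import is an assumption added for portability.

```python
try:
    from urlparse import urlsplit  # Python 2, as used in this chapter
except ImportError:
    from urllib.parse import urlsplit  # assumption: Python 3 equivalent

def is_applicable(u, urls):
    """True if some u1 in urls has u's location and a path prefixing u's."""
    if isinstance(urls, str):
        urls = [urls]  # a single URL string is accepted, too
    u_location, u_path = urlsplit(u)[1:3]
    for u1 in urls:
        u1_location, u1_path = urlsplit(u1)[1:3]
        # Scheme, query, and fragment are deliberately ignored.
        if u_location == u1_location and u_path.startswith(u1_path):
            return True
    return False

print(is_applicable('http://myhost.com/index.html', 'http://myhost.com/'))
```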
The following example shows how to use urllib2
with basic HTTP authentication:
import urllib2
x = urllib2.HTTPPasswordMgrWithDefaultRealm( )
x.add_password(None, 'http://myhost.com/', 'auser',
'apassword')
auth = urllib2.HTTPBasicAuthHandler(x)
opener = urllib2.build_opener(auth)
urllib2.install_opener(opener)
flob = urllib2.urlopen('http://myhost.com/index.html')
for line in flob.readlines( ): print line,