
Chapter 12. Server-Side Scripting

12.1 "Oh What a Tangled Web We Weave"

This chapter is the third part of our look at Python Internet programming. In the last two chapters, we explored sockets and basic client-side programming interfaces such as FTP and email. In this chapter, our main focus will be on writing server-side scripts in Python -- a type of program usually referred to as CGI scripts. Server-side scripting and its derivatives are at the heart of much of what happens on the Web these days.

As we'll see, Python makes an ideal language for writing scripts to implement and customize web sites, due to both its ease of use and its library support. In the following two chapters, we will use the basics we learn in this chapter to implement full-blown web sites. After that, we will wrap up with a chapter that looks at other Internet-related topics and technologies. Here, our goal is to understand the fundamentals of server-side scripting, before exploring systems that build upon that basic model.

A House upon the Sand

As you read the next three chapters of this book, please keep in mind that they are intended only as an introduction to server-side scripting with Python. The webmaster domain is large and complex, changes continuously, and often prescribes many ways to accomplish a given goal -- some of which can vary from browser to browser and server to server. For instance, the password encryption scheme of the next chapter may be unnecessary under certain scenarios, and special HTML tags may sometimes obviate some work we'll do here.

Given such a large and shifting knowledge base, this part of the book does not even pretend to be a complete look at the server-side scripting domain. To become truly proficient in this area, you should study other texts for additional webmaster-y details and tricks (e.g., O'Reilly's HTML & XHTML: The Definitive Guide). Here, you will meet Python's CGI toolset and learn enough to start writing substantial web sites of your own in Python. But you should not take this text as the final word on the subject.

12.2 What's a Server-Side CGI Script?

Simply put, CGI scripts implement much of the interaction you typically experience on the Web. They are a standard and widely used mechanism for programming web site interaction. There are other ways to add interactive behavior to web sites with Python, including client-side solutions (e.g., JPython applets and Active Scripting) and server-side technologies that build upon the basic CGI model (e.g., Active Server Pages and Zope); we will discuss these briefly at the end of Chapter 15. But by and large, CGI server-side scripts are used to program much of the activity on the Web.

12.2.1 The Script Behind the Curtain

Formally speaking, CGI scripts are programs that run on a server machine and adhere to the Common Gateway Interface -- a model for browser/server communications, from which CGI scripts take their name. Perhaps a more useful way to understand CGI, though, is in terms of the interaction it implies.

Most people take this interaction for granted when browsing the Web and pressing buttons in web pages, but there is a lot going on behind the scenes of every transaction on the Web. From the perspective of a user, it's a fairly familiar and simple process:

1. Submission. When you visit a web site to purchase a product or submit information online, you generally fill in a form in your web browser, press a button to submit your information, and begin waiting for a reply.

2. Response. Assuming all is well with both your Internet connection and the computer you are contacting, you eventually get a reply in the form of a new web page. It may be a simple acknowledgement (e.g., "Thanks for your order") or a new form that must be filled out and submitted again.

And, believe it or not, that simple model is what makes most of the Web hum. But internally, it's a bit more complex. In fact, there is a subtle client/server socket-based architecture at work -- your web browser running on your computer is the client, and the computer you contact over the Web is the server. Let's examine the interaction scenario again, with all the gory details that users usually never see.

Submission

When you fill out a form page in a web browser and press a submission button, behind the scenes your web browser sends your information across the Internet to the server machine specified as its receiver. The server machine is usually a remote computer that lives somewhere else in both cyberspace and reality. It is named in the URL you access (the Internet address string that appears at the top of your browser). The target server and file can be named in a URL you type explicitly, but more typically they are specified in the HTML that defines the submission page itself -- either in a hyperlink, or in the "action" attribute of a form's HTML tag. However the server is specified, the browser running on your computer ultimately sends your information to the server as bytes over a socket, using techniques we saw in the last two chapters. On the server machine, a program called an HTTP server runs perpetually, listening on a socket for incoming data from browsers, usually on port number 80.

Processing

When your information shows up at the server machine, the HTTP server program notices it first and decides how to handle the request. If the requested URL names a simple web page (e.g., a URL ending in .html), the HTTP server opens the named HTML file on the server machine and sends its text back to the browser over a socket. On the client, the browser reads the HTML and uses it to construct the next page you see. But if the URL requested by the browser names an executable program instead (e.g., a URL ending in .cgi), the HTTP server starts the named program on the server machine to process the request and redirects the incoming browser data to the spawned program's stdin input stream and environment variables. That program is usually a CGI script -- a program run on the remote server machine somewhere in cyberspace, not on your computer. The CGI script is responsible for handling the request from this point on; it may store your information in a database, charge your credit card, and so on.

Response

Ultimately, the CGI script prints HTML to generate a new response page in your browser. When a CGI script is started, the HTTP server takes care to connect the script's stdout standard output stream to a socket that the browser is listening to. Because of this, HTML code printed by the CGI script is sent over the Internet, back to your browser, to produce a new page. The HTML printed back by the CGI script works just as if it had been stored and read in from an HTML file; it can define a simple response page or a brand new form coded to collect additional information.

In other words, CGI scripts are something like callback handlers for requests generated by web browsers that require a program to be run dynamically; they are automatically run on the server machine in response to actions in a browser. Although CGI scripts ultimately receive and send standard structured messages over sockets, CGI is more like a higher-level procedural convention for sending and receiving information between a browser and a server.
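The processing step described above can be sketched in a few lines of Python. This is only an illustration of the decision an HTTP server makes, not the code of any real server: the function name classify_request is made up here, and dispatching purely on the .html and .cgi filename suffixes is an assumption borrowed from the examples in the text (real servers are configured in many different ways). The sketch is written for Python 3.

```python
import os

def classify_request(url_path):
    """Decide how a (hypothetical) HTTP server would treat a requested path."""
    # Drop any ?name=value parameters before looking at the filename.
    filename = url_path.split("?", 1)[0]
    suffix = os.path.splitext(filename)[1].lower()
    if suffix == ".cgi":
        # An executable program: start it and route its stdout to the browser.
        return "run program"
    # Anything else is treated as a simple file whose text is sent back.
    return "send file"

for path in ("/~lutz/Basics/test0.html", "/~lutz/Basics/test0.cgi"):
    print(path, "=>", classify_request(path))
```

Everything interesting about CGI happens in the "run program" branch; the rest of this chapter is about what the started program sees and what it must print.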

12.2.2 Writing CGI Scripts in Python

If all of the above sounds complicated, relax -- Python, as well as the resident HTTP server, automates most of the tricky bits. CGI scripts are written as fairly autonomous programs, and they assume that startup tasks have already been accomplished. The HTTP web server program, not the CGI script, implements the server-side of the HTTP protocol itself. Moreover, Python's library modules automatically dissect information sent up from the browser and give it to the CGI script in an easily digested form. The upshot is that CGI scripts may focus on application details like processing input data and producing a result page.

As mentioned earlier, in the context of CGI scripts, the stdin and stdout streams are automatically tied to sockets connected to the browser. In addition, the HTTP server passes some browser information to the CGI script in the form of shell environment variables. To CGI programmers, that means:

· Input data sent from the browser to the server shows up as a stream of bytes in the stdin input stream, along with shell environment variables.

· Output is sent back from the server to the client by simply printing properly formatted HTML to the stdout output stream.
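To make the two points above concrete, here is a minimal Python 3 sketch of what a script sees at this level, before any library help. The environment variable names REQUEST_METHOD, CONTENT_LENGTH, and QUERY_STRING come from the CGI convention itself; the function names read_raw_input and write_reply are invented for this illustration, and the sketch treats the input as plain text for simplicity. In practice you would let Python's library modules do this work, as described next.

```python
import io
import sys

def read_raw_input(environ, stdin):
    """Return the still-unparsed input string sent by the browser."""
    if environ.get("REQUEST_METHOD", "GET") == "POST":
        # Form data arrives on stdin; CONTENT_LENGTH says how much to read.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        return stdin.read(length)
    # Parameters appended to the URL arrive in an environment variable.
    return environ.get("QUERY_STRING", "")

def write_reply(html, stdout):
    """Print a header line, a blank line, and then the HTML itself."""
    stdout.write("Content-type: text/html\n\n")
    stdout.write(html + "\n")

# Simulate the server: a fake environment and a fake stdin stream.
fake_environ = {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": "8"}
raw = read_raw_input(fake_environ, io.StringIO("name=bob"))
write_reply("<P>Raw input: %s</P>" % raw, sys.stdout)
```

Passing the environment and streams in as arguments, rather than reading os.environ and sys.stdin directly, is a design choice made here so the functions can be tried out from a command line without a web server.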

The most complex parts of this scheme include parsing all the input information sent up from the browser and formatting information in the reply sent back. Happily, Python's standard library largely automates both tasks:

Input

With the Python cgi module, inputs typed into a web browser form or appended to a URL string show up as values in a dictionary-like object in Python CGI scripts. Python parses the data itself and gives us an object with one key:value pair per input sent by the browser that is fully independent of transmission style (form or URL).

Output

The cgi module also has tools for automatically escaping strings so that they are legal to use in HTML (e.g., replacing embedded <, >, and & characters with HTML escape codes). Module urllib provides other tools for formatting text inserted into generated URL strings (e.g., adding %XX and + escapes).
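As a rough preview of both ideas, the following Python 3 sketch parses an input string into a dictionary and escapes a value before inserting it into HTML. Note the assumptions: recent Python 3 releases no longer ship the cgi module this chapter uses, so the sketch substitutes urllib.parse.parse_qs and html.escape from the current standard library. The helper names parse_inputs and reply_paragraph are made up for the example, and this is not the interface presented later in the chapter.

```python
import html
from urllib.parse import parse_qs

def parse_inputs(raw):
    """Turn 'name=bob&job=hacker' into a dictionary of single values."""
    # parse_qs maps each key to a list of values and undoes %XX escapes.
    parsed = parse_qs(raw, keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}

def reply_paragraph(inputs):
    """Build one HTML paragraph, escaping anything the user typed."""
    name = html.escape(inputs.get("name", "(unknown)"))
    return "<P>Hello, %s</P>" % name

inputs = parse_inputs("name=%3Cbob%3E&job=hacker")
print(inputs)
print(reply_paragraph(inputs))
```

The same dictionary results whether the string came from a submitted form or from parameters typed at the end of a URL, which is the independence from transmission style mentioned above.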

We'll study both of these interfaces in detail later in this chapter. For now, keep in mind that although any language can be used to write CGI scripts, Python's standard modules and language attributes make it a snap.

Less happily, CGI scripts are also intimately tied to the syntax of HTML, since they must generate it to create a reply page. In fact, it can be said that Python CGI scripts embed HTML, which is an entirely distinct language in its own right. As we'll also see, the fact that CGI scripts create a user interface by printing HTML syntax means that we have to take special care with the text we insert into a web page's code (e.g., escaping HTML operators). Worse, CGI scripts require at least a cursory knowledge of HTML forms, since that is where the inputs and target script's address are typically specified. This book won't teach HTML in-depth; if you find yourself puzzled by some of the arcane syntax of the HTML generated by scripts here, you should glance at an HTML introduction, such as O'Reilly's HTML and XHTML: The Definitive Guide.

12.2.3 Running Server-Side Examples

Like GUIs, web-based systems are highly interactive, and the best way to get a feel for some of these examples is to test-drive them live. Before we get into some code, it's worth noting that all you need to run the examples in the next few chapters is a web browser. That is, all the Web examples we will see here can be run from any web browser on any machine, whether you've installed Python on that machine or not. Simply type this URL at the top:[1]

http://starship.python.net/~lutz/PyInternetDemos.html

That address loads a launcher page with links to all the example files installed on a server machine whose domain name is starship.python.net (a machine dedicated to Python developers). The launcher page itself appears as shown in Figure 12-1, running under Internet Explorer. It looks similar in other browsers. Each major example has a link on this page, which runs when clicked.

Figure 12-1. The PyInternetDemos launcher page


The launcher page, and all the HTML files in this chapter, can also be loaded locally, from the book's example distribution directory on your machine. They can even be opened directly off the book's CD (view CD-ROM content online at http://examples.oreilly.com/python2) and may be opened by buttons on the top-level book demo launchers. However, the CGI scripts ultimately invoked by some of the example links must be run on a server, and thus require a live Internet connection. If you browse root pages locally on your machine, your browser will either display the scripts' source code or tell you when you need to connect to the Web to run a CGI script. On Windows, a connection dialog will likely pop up automatically, if needed.

12.2.3.1 Changing server-side examples

Of course, running scripts in your browser isn't quite the same as writing scripts on your own. If you do decide to change these CGI programs or write new ones from scratch, you must be able to access web server machines:

· To change server-side scripts, you need an account on a web server machine with an installed version of Python. A basic account on such a server is often enough. You can then edit scripts on your machine and upload them to the server by FTP.

· To type explicit command lines on a server machine or edit scripts on the server directly, you will also need shell access on the web server. Such access lets you telnet to that machine to get a command-line prompt.

Unlike the last chapter's examples, Python server-side scripts require both Python and a server. That is, you'll need access to a web server machine that supports CGI scripts in general and that either already has an installed Python interpreter or lets you install one of your own. Some Internet Service Providers (ISPs) are more supportive than others on this front, but there are many options here, both commercial and free (more on this later).

Once you've located a server to host your scripts, you may modify and upload the CGI source code file from this book's CD to your own server and site by FTP. If you do, you may also want to run two Python command-line scripts on your server after uploading: fixcgi.py and fixsitename.py, both presented later in this chapter. The former sets CGI script permissions, and the latter replaces any starship server name references in example links and forms with your own server's name. We'll study additional installation details later in this chapter, and explore a few custom server options at the end of Chapter 15.

12.2.3.2 Viewing server-side examples and output

The source code of examples in this part of the book is listed in the text and included on the book's CD (see http://examples.oreilly.com/python2). In all cases, if you wish to view the source code of an HTML file, or the HTML generated by a Python CGI script, you can also simply select your browser's View Source menu option while the corresponding web page is displayed.

Keep in mind, though, that your browser's View Source option lets you see the output of a server-side script after it has run, but not the source code of the script itself. There is no automatic way to view the Python source code of the CGI scripts themselves, short of finding them in this book or its CD.

To address this issue, later in this chapter we'll also write a CGI-based program called getfile, which allows the source code of any file on this book's web site (HTML, CGI script, etc.) to be downloaded and viewed. Simply type the desired file's name into a web page form referenced by the getfile.html link on the Internet demos launcher page, or add it to the end of an explicitly typed URL as a parameter like this:

http://.../getfile.cgi?filename=somefile.cgi

In response, the server will ship back the text of the named file to your browser. This process requires explicit interface steps, though, and much more knowledge than we've gained thus far, so see ahead for details.

12.3 Climbing the CGI Learning Curve

Okay, it's time to get into concrete programming details. This section introduces CGI coding one step at a time -- from simple, noninteractive scripts to larger programs that utilize all the common web page user input devices (what we called "widgets" in the Tkinter GUI chapters of Part II). We'll move slowly at first, to learn all the basics; the next two chapters will use the ideas presented here to build up larger and more realistic web site examples. For now, let's work through a simple CGI tutorial, with just enough HTML thrown in to write basic server-side scripts.

12.3.1 A First Web Page

As mentioned, CGI scripts are intimately bound up with HTML, so let's start with a simple HTML page. The file test0.html, shown in Example 12-1, defines a bona fide, fully functional web page -- a text file containing HTML code, which specifies the structure and contents of a simple web page.

Example 12-1. PP2E\Internet\Cgi-Web\Basics\test0.html
<HTML><BODY>
<TITLE>HTML 101</TITLE>
<H1>A First HTML page</H1>
<P>Hello, HTML World!</P>
</BODY></HTML>

If you point your favorite web browser to the Internet address of this file (or to its local path on your own machine), you should see a page like that shown in Figure 12-2. This figure shows the Internet Explorer browser at work; other browsers render the page similarly.

Figure 12-2. A simple web page from an HTML file


To truly understand how this little file does its work, you need to know something about permission rules, HTML syntax, and Internet addresses. Let's take a quick first look at each of these topics before we move on to larger examples.

12.3.1.1 HTML file permission constraints

First of all, if you want to install this code on a different machine, it's usually necessary to grant web page files and their directories world-readable permission. That's because they are loaded by arbitrary people over the Web (actually, by someone named "nobody", who we'll introduce in a moment). An appropriate chmod command can be used to change permissions on Unix-like machines. For instance, a chmod 755 filename shell command usually suffices; it makes filename readable and executable by everyone, and writable by you only.[2] These directory and file permission details are typical, but they can vary from server to server. Be sure to find out about the local server's conventions if you upload this file to your site.
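If you would rather set permissions from Python than from a shell, the standard os.chmod call does the same job as the chmod command. The Python 3 sketch below is an illustration only: it assumes a Unix-like machine, works on a throwaway temporary file rather than a real web page, and the helper names permission_string and make_world_readable are invented here. It also decodes the octal digits so you can see why 755 means "readable and executable by everyone, and writable by you only."

```python
import os
import stat
import tempfile

def permission_string(mode):
    """Render permission bits such as 0o755 in the familiar rwx notation."""
    text = ""
    for shift in (6, 3, 0):          # owner, group, everyone else
        bits = (mode >> shift) & 0o7
        text += "r" if bits & 4 else "-"
        text += "w" if bits & 2 else "-"
        text += "x" if bits & 1 else "-"
    return text

def make_world_readable(path):
    """Apply the equivalent of 'chmod 755 path' and return the new mode bits."""
    os.chmod(path, 0o755)
    return stat.S_IMODE(os.stat(path).st_mode)

# Try it on a temporary stand-in for a web page file.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    path = tmp.name
try:
    print(permission_string(make_world_readable(path)))
finally:
    os.remove(path)
```

On Windows, os.chmod controls little more than a file's read-only flag, so the result there will not match what a Unix server reports.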

12.3.1.2 HTML basics

I promised that I wouldn't teach much HTML in this book, but you need to know enough to make sense of examples. In short, HTML is a descriptive markup language, based on tags -- items enclosed in <> pairs. Some tags stand alone (e.g., <HR> specifies a horizontal rule). Others appear in begin/end pairs where the end tag includes an extra slash.

For instance, to specify the text of a level-1 header line, we write HTML code of the form <H1>text</H1>; the text between the tags shows up on the web page. Some tags also allow us to specify options. For example, a tag pair like <A href="address">text</A> specifies a hyperlink : pressing the link's text in the page directs the browser to access the Internet address (URL) listed in the href option.

It's important to keep in mind that HTML is used only to describe pages: your web browser reads it and translates its description to a web page with headers, paragraphs, links, and the like. Notably absent are both layout information -- the browser is responsible for arranging components on the page -- and syntax for programming logic -- there are no "if" statements, loops, and so on. There is also no Python code in this file anywhere to be found; raw HTML is strictly for defining pages, not for coding programs or specifying all user-interface details.

HTML's lack of user interface control and programmability is both a strength and a weakness. It's well-suited to describing pages and simple user interfaces at a high level. The browser, not you, handles physically laying out the page on your screen. On the other hand, HTML does not directly support full-blown GUIs and requires us to introduce CGI scripts (and other technologies) to web sites, in order to add dynamic programmability to otherwise static HTML.

12.3.1.3 Internet addresses (URLs)

Once you write an HTML file, you need to put it some place where the outside world can find it. Like all HTML files, test0.html must be stored in a directory on the server machine, from which the resident web server program allows browsers to fetch pages. On the server where this example lives, the page's file must be stored in or below the public_html directory of my personal home directory -- that is, somewhere in the directory tree rooted at /home/lutz/public_html. For this section, examples live in a Basics subdirectory, so the complete Unix pathname of this file on the server is:

/home/lutz/public_html/Basics/test0.html 

This path is different from its PP2E\Internet\Cgi-Web\Basics location on the book's CD (see http://examples.oreilly.com/python2), as given in the example file listing's title. When you reference this file on the client, though, you must specify its Internet address, sometimes called a URL, instead. To load the remote page, type the following text in your browser's address field (or click the example root page's test0.html hyperlink, which refers to the same address):

http://starship.python.net/~lutz/Basics/test0.html

This string is a URL composed of multiple parts:

Protocol name: http

The protocol part of this URL tells the browser to communicate with the HTTP server program on the server machine, using the HTTP message protocol. URLs used in browsers can also name different protocols -- for example, ftp:// to reference a file managed by the FTP protocol and server, telnet to start a Telnet client session, and so on.

Server machine name: starship.python.net

A URL also names the target server machine following the protocol type. Here, we list the domain name of the server machine where the examples are installed; the machine name listed is used to open a socket to talk to the server. For HTTP, the socket is usually connected to port number 80.

File path: ~lutz/Basics/test0.html

Finally, the URL gives the path to the desired file on the remote machine. The HTTP web server automatically translates the URL's file path to the file's true Unix pathname: on my server, ~lutz is automatically translated to the public_html directory in my home directory. URLs typically map to such files, but can reference other sorts of items as well.

Parameters (used in later examples)

URLs may also be followed by additional input parameters for CGI programs. When used, they are introduced by a ? and separated by & characters; for instance, a string of the form ?name=bob&job=hacker at the end of a URL passes parameters named name and job to the CGI script named earlier in the URL. These values are sometimes called URL query string parameters and are treated the same as form inputs. More on both forms and parameters in a moment.

For completeness, you should also know that URLs can contain additional information (e.g., the server name part can specify a port number following a :), but we'll ignore these extra formatting rules here. If you're interested in more details, you might start by reading the urlparse module's entry in Python's library manual, as well as its source code in the Python standard library. You might also notice that a URL you type to access a page looks a bit different after the page is fetched (spaces become + characters, %s are added, etc.). This is simply because browsers must also generally follow URL escaping (i.e., translation) conventions, which we'll explore later in this chapter.
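To see the parts of a URL from within Python, you can let the standard library take one apart. The sketch below is written for Python 3, where the urlparse module's tools live in urllib.parse. The function name split_url is made up for the example, and the URL it dissects simply glues the parameter string shown above onto this chapter's test0.cgi address for illustration.

```python
from urllib.parse import urlsplit, parse_qsl, quote_plus

def split_url(url):
    """Break a URL into the parts described in this section."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,          # e.g., http
        "server": parts.hostname,          # the machine's domain name
        "port": parts.port,                # None unless given after a ':'
        "path": parts.path,                # file path on the server
        "params": parse_qsl(parts.query),  # name/value pairs after the '?'
    }

url = "http://starship.python.net/~lutz/Basics/test0.cgi?name=bob&job=hacker"
for key, value in split_url(url).items():
    print(key, "=>", value)

# URL escaping: spaces become '+', and special characters become %XX codes.
print(quote_plus("a b&c"))
```

No port appears in this URL, so the port entry comes back as None and a client falls back on the protocol's default (80 for HTTP, as noted earlier).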

12.3.1.4 Using minimal URLs

Because browsers remember the prior page's Internet address, URLs embedded in HTML files can often omit the protocol and server names, as well as the file's directory path. If missing, the browser simply uses these components' values from the last page's address. This minimal syntax works both for URLs embedded in hyperlinks and form actions (we'll meet forms later in this chapter). For example, within a page that was fetched from directory dirpath on server www.server.com, minimal hyperlinks and form actions such as:

<A HREF="more.html">
<FORM ACTION="next.cgi"  ...>

are treated exactly as if we had specified a complete URL with explicit server and path components, like the following:

<A HREF="http://www.server.com/dirpath/more.html">
<FORM ACTION="http://www.server.com/dirpath/next.cgi"  ...>

The first minimal URL refers to file more.html on the same server and in the same directory that the page containing this hyperlink was fetched from; it is expanded to a complete URL within the browser. URLs can also employ Unix-style relative path syntax in the file path component. For instance, a hyperlink tag like <A HREF="../spam.gif"> names a GIF file on the same server machine, in the parent directory of the one that holds the page containing this link.
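The expansion a browser performs on minimal URLs can be imitated with the standard library's urljoin function, shown here in a short Python 3 sketch. The page name page.html is a made-up stand-in for whatever page contains the links; the server and directory names are the ones used in the example above.

```python
from urllib.parse import urljoin

# Address of the page that contains the links (page.html is hypothetical).
page = "http://www.server.com/dirpath/page.html"

for minimal in ("more.html", "next.cgi", "../spam.gif"):
    # urljoin fills in the protocol, server, and directory from the page's URL.
    print(minimal, "=>", urljoin(page, minimal))
```

This is only a model of what happens inside the browser, but it is handy for checking what a relative link in your HTML will resolve to before you upload it.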

Why all the fuss about shorter URLs? Besides extending the life of your keyboard and eyesight, the main advantage of such minimal URLs is that they don't need to be changed if you ever move your pages to a new directory or server -- the server and path are inferred when the page is used, not hardcoded into its HTML. The flipside of this can be fairly painful: examples that do include explicit site and pathnames in URLs embedded within HTML code cannot be copied to other servers without source code changes. Scripts can help here, but editing source code can be error-prone.[3]

The downside of minimal URLs is that they don't trigger automatic Internet connection when followed. This becomes apparent only when you load pages from local files on your computer. For example, we can generally open HTML pages without connecting to the Internet at all, by pointing a web browser to a page's file that lives on the local machine (e.g., by clicking on its file icon). When browsing a page locally like this, following a fully specified URL makes the browser automatically connect to the Internet to fetch the referenced page or script. Minimal URLs, though, are opened on the local machine again; usually, the browser simply displays the referenced page or script's source code.

The net effect is that minimal URLs are more portable, but tend to work better when running all pages live on the Internet. To make it easier to work with the examples in this book, they will often omit the server and path components in URLs they contain. In this book, to derive a page or script's true URL from a minimal URL, imagine that the string:

http://starship.python.net/~lutz/subdir 

appears before the filename given by the URL. Your browser will, even if you don't.

12.3.2 A First CGI Script

The HTML file we just saw is just that -- an HTML file, not a CGI script. When referenced by a browser, the remote web server simply sends back the file's text to produce a new page in the browser. To illustrate the nature of CGI scripts, let's recode the example as a Python CGI program, as shown in Example 12-2.

Example 12-2. PP2E\Internet\Cgi-Web\Basics\test0.cgi
#!/usr/bin/python
#######################################################
# runs on the server, prints html to create a new page;
# executable permissions, stored in ~lutz/public_html,
# url=http://starship.python.net/~lutz/Basics/test0.cgi
#######################################################
 
print "Content-type: text/html\n"
print "<TITLE>CGI 101</TITLE>"
print "<H1>A First CGI script</H1>"
print "<P>Hello, CGI World!</P>"

This file, test0.cgi, makes the same sort of page if you point your browser at it (simply replace .html with .cgi in the URL). But it's a very different kind of animal -- it's an executable program that is run on the server in response to your access request. It's also a completely legal Python program, in which the page's HTML is printed dynamically, rather than being precoded in a static file. In fact, there is little that is CGI-specific about this Python program at all; if run from the system command line, it simply prints HTML rather than generating a browser page:

C:\...\PP2E\Internet\Cgi-Web\Basics>python test0.cgi
Content-type: text/html
 
<TITLE>CGI 101</TITLE>
<H1>A First CGI script</H1>
<P>Hello, CGI World!</P>

When run by the HTTP server program on a web server machine, however, the standard output stream is tied to a socket read by the browser on the client machine. In this context, all the output is sent across the Internet to your browser. As such, it must be formatted per the browser's expectations. In particular, when the script's output reaches your browser, the first printed line is interpreted as a header, describing the text that follows. There can be more than one header line in the printed response, but there must always be a blank line between the headers and the start of the HTML code (or other data).

In this script, the first header line tells the browser that the rest of the transmission is HTML text (text/html). The newline character (\n) at the end of the first print statement adds one more line-feed to the one that print itself appends, which produces the required blank line after the header. The rest of this program's output is standard HTML and is used by the browser to generate a web page on a client, exactly as if the HTML lived in a static HTML file on the server.[4]
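The header-then-blank-line layout is easy to get wrong, so here is a small Python 3 sketch that builds the same reply as test0.cgi and then splits it back apart the way a reader of the stream must. The function names build_reply and split_reply are invented for this illustration; the sketch models only the layout of the script's output, not the full HTTP exchange between server and browser.

```python
def build_reply(header_lines, html_lines):
    """Join headers and HTML with the blank line that must separate them."""
    return "\n".join(header_lines) + "\n\n" + "\n".join(html_lines) + "\n"

def split_reply(reply):
    """Separate a reply into its header lines and its body text."""
    headers, body = reply.split("\n\n", 1)   # the first blank line ends the headers
    return headers.split("\n"), body

reply = build_reply(
    ["Content-type: text/html"],
    ["<TITLE>CGI 101</TITLE>",
     "<H1>A First CGI script</H1>",
     "<P>Hello, CGI World!</P>"],
)
print(reply, end="")
```

If the blank line is missing, everything the script prints looks like more header text, which is one of the more common reasons a first CGI script produces an error page instead of a greeting.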

CGI scripts are accessed just like HTML files: you either type the full URL of this script into your browser's address field, or click on the test0.cgi link line in the examples root page (which follows a minimal hyperlink that resolves to the script's full URL). Figure 12-3 shows the result page generated if you point your browser at this script to make it go.

Figure 12-3. A simple web page from a CGI script


12.3.2.1 Installing CGI scripts

Like HTML files, CGI scripts are simple text files that you can either create on your local machine and upload to the server by FTP, or write with a text editor running directly on the server machine (perhaps using a telnet client). However, because CGI scripts are run as programs, they have some unique installation requirements that differ from simple HTML files. In particular, they usually must be stored and named specially, and they must be configured as programs that are executable by arbitrary users. Depending on your needs, CGI scripts may also need help finding imported modules and may need to be converted to the server platform's text file format after being uploaded. Let's look at each install constraint in more depth:

Directory and filename conventions

First of all, CGI scripts need to be placed in a directory that your web server recognizes as a program directory, and they need to be given a name that your server recognizes as a CGI script. On the server where these examples reside, CGI scripts can be stored in each user's public_html directory just like HTML files, but must have a filename ending in a .cgi suffix, not .py. Some servers allow .py filename suffixes too, and may recognize other program directories (cgi-bin is common), but this varies widely, too, and can sometimes be configured per server or user.

Execution conventions

Because they must be executed by the web server on behalf of arbitrary users on the Web, CGI script files also need to be given executable file permissions to mark them as programs, and they must be made executable by others. Again, a shell command chmod 0755 filename does the trick on most servers. CGI scripts also generally need the special #! line at the top, to identify the Python interpreter that runs the file's code. The text after the #! in the first line simply gives the directory path to the Python executable on your server machine. See Chapter 2 for more details on this special first line, and be sure to check your server's conventions for more details on non-Unix platforms.

One subtlety is worth noting here. As we saw earlier in the book, the special first line in executable text files can normally contain either a hardcoded path to the Python interpreter (e.g., #!/usr/bin/python) or an invocation of the env program (e.g., #!/usr/bin/env python), which deduces where Python lives from environment variable settings (i.e., your $PATH). The env trick is less useful in CGI scripts, though, because their environment settings are those of user "nobody" (not your own), as explained in the next paragraph.

Module search path configuration (optional)

HTTP servers generally run CGI scripts with username "nobody" for security reasons (this limits the user's access to the server machine). That's why files you publish on the Web must have special permission settings that make them accessible to other users. It also means that CGI scripts can't rely on the Python module search path to be configured in any particular way. As we've seen, the module path is normally initialized from the user's PYTHONPATH setting plus defaults. But because CGI scripts are run by user "nobody", PYTHONPATH may be arbitrary when a CGI script runs.

Before you puzzle over this too hard, you should know that this is often not a concern in practice. Because Python usually searches the current directory for imported modules by default, this is not an issue if all of your scripts and any modules and packages they use are stored in your web directory (which is the installation structure on the book's site). But if a module lives elsewhere, you may need to tweak the sys.path list in your scripts to adjust the search path manually before imports (e.g., with sys.path.append(dirname) calls, index assignments, and so on).
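
The sys.path tweak just described can be sketched as follows. This is only a minimal illustration, written for a current Python 3 rather than the Python of this book's listings; the directory name, the add_module_dir helper, and the commented-out module name are all hypothetical.

```python
import os
import sys

# Hypothetical directory holding modules that live outside the web directory;
# substitute the real location on your server.
EXTRA_MODULE_DIR = '/home/yourname/pylib'

def add_module_dir(dirname, path=sys.path):
    """Append dirname to the search path unless it is already listed."""
    dirname = os.path.abspath(dirname)
    if dirname not in path:
        path.append(dirname)
    return path

add_module_dir(EXTRA_MODULE_DIR)      # must run before the imports that need it
# import mymodule                     # hypothetical module stored in that directory
```

Appending (rather than inserting at the front) keeps modules in the script's own directory ahead of the extra directory in the search order.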

End-of-line conventions (optional)

Finally, on some Unix (and Linux) servers, you might also have to make sure that your script text files follow the Unix end-of-line convention (\n), not DOS (\r\n). This isn't an issue if you edit and debug right on the server (or on another Unix machine) or FTP files one by one in text mode. But if you edit and upload your scripts from a PC to a Unix server in a tar file (or in FTP binary mode), you may need to convert end-of-lines after the upload. For instance, the server that was used to develop this text returns a default error page for scripts whose end-of-lines are in DOS format (see later in this chapter for a converter script).

This installation process may sound a bit complex at first glance, but it's not bad once you've worked through it on your own: it's only a concern at install time and can usually be automated to some extent with Python scripts run on the server. To summarize, most Python CGI scripts are text files of Python code, which:

·         Are named according to your web server's conventions (e.g., file.cgi)

·         Are stored in a directory recognized by your web server (e.g., cgi-bin/)

·         Are given executable file permissions (e.g., chmod 755 file.cgi)

·         Usually have the special #!pythonpath line at the top (but not env)

·         Configure sys.path only if needed to see modules in other directories

·         Use Unix end-of-line conventions, only if your server rejects DOS format

·         Print headers and HTML to generate a response page in the browser, if any

·         Use the cgi module to parse incoming form data, if any (more about forms later in this chapter)

Even if you must use a server machine configured by someone else, most of the machine's conventions should be easy to root out. For instance, on some servers you can rename this example to test0.py and it will continue to be run when accessed. On others, you might instead see the file's source code in a popped-up text editor when you access it. Try a .cgi suffix if the text is displayed rather than executed. CGI directory conventions can vary, too, but try the directory where you normally store HTML files first. As usual, you should consult the conventions for any machine that you plan to copy these example files to.
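
Part of the checklist above can be verified mechanically. The following is a rough sketch, assuming a Unix-like server and a current Python 3; the check_cgi_file function and the throwaway demo file are hypothetical, and the checks cover only the #! line, end-of-lines, and execute permission.

```python
import os
import stat
import tempfile

def check_cgi_file(fname):
    """Return a list of install problems found in one CGI script file."""
    problems = []
    with open(fname, 'rb') as f:          # binary mode: see raw end-of-lines
        data = f.read()
    if not data.startswith(b'#!'):
        problems.append('no #! line at the top')
    if b'\r\n' in data:
        problems.append('DOS end-of-lines present')
    mode = os.stat(fname).st_mode
    if not mode & stat.S_IXOTH:           # execute permission for "others"
        problems.append('not executable by other users')
    return problems

# Demo on a throwaway file: no #! line, DOS end-of-lines, not executable.
with tempfile.TemporaryDirectory() as tmpdir:
    demo = os.path.join(tmpdir, 'demo.cgi')
    with open(demo, 'wb') as f:
        f.write(b'print("hi")\r\n')
    os.chmod(demo, 0o644)
    found = check_cgi_file(demo)
print(found)
```

Whether DOS end-of-lines are really a problem depends on your server, so treat that check as advisory.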

12.3.2.2 Automating installation steps

But wait -- why do things the hard way? Before you start installing scripts by hand, remember that Python programs can usually do much of your work for you. It's easy to write Python scripts that automate some of the CGI installation steps using the operating system tools that we met earlier in the book.

For instance, while developing the examples in this chapter, I did all editing on my PC (it's generally more dependable than a telnet client). To install, I put all the examples in a tar file, which I upload to the Linux server by FTP in a single step. Unfortunately, my server expects CGI scripts to have Unix (not DOS) end-of-line markers; unpacking the tar file did not convert end-of-lines or retain executable permission settings. But rather than tracking down all the web CGI scripts and fixing them by hand, I simply run the Python script in Example 12-3 from within a Unix find command after each upload.

Example 12-3. PP2E\Internet\Cgi-Web\fixcgi.py
########################################################################
# run from a unix find command to automate some cgi script install steps;
# example:  find . -name "*.cgi" -print -exec python fixcgi.py \{} \;
# which converts all cgi scripts to unix line-feed format (needed on 
# starship) and gives all cgi files executable mode, else won't be run;
# do also: chmod 777 PyErrata/DbaseFiles/*, vi Extern/Email/mailconfig*;
# related: fixsitename.py, PyTools/fixeoln*.py, System/Filetools
########################################################################
 
# after: ungzip, untar, cp -r Cgi-Web/* ~/public_html
 
import sys, string, os
fname = sys.argv[1]
old   = open(fname, 'rb').read()
new   = string.replace(old, '\r\n', '\n')
open(fname, 'wb').write(new)
if fname[-3:] == 'cgi': os.chmod(fname, 0755)       # note octal int: rwx,sgo

This script is kicked off at the top of the Cgi-Web directory, using a Unix csh shell command to apply it to every CGI file in a directory tree, like this:

% find . -name "*.cgi" -print -exec python fixcgi.py \{} \; 
./Basics/languages-src.cgi
./Basics/getfile.cgi
./Basics/languages.cgi
./Basics/languages2.cgi
./Basics/languages2reply.cgi
./Basics/putfile.cgi
 ...more...

Recall from Chapter 2 that there are various ways to walk directory trees and find matching files in pure Python code, including the find module, os.path.walk, and one we'll use in the next section's script. For instance, a pure Python and more portable alternative could be kicked off like this:

C:\...\PP2E\Internet\Cgi-Web>python 
>>> import os 
>>> from PP2E.PyTools.find import find 
>>> for filename in find('*.cgi', '.'): 
...     print filename 
...     stat = os.system('python fixcgi.py ' + filename) 
...
.\Basics\getfile.cgi
.\Basics\languages-src.cgi
.\Basics\languages.cgi
.\Basics\languages2.cgi
 ...more...

The Unix find command simply does the same, but outside the scope of Python: the command line after -exec is run for each matching file found. For more details about the find command, see its manpage. Within the Python script, string.replace translates to Unix end-of-line markers, and os.chmod works just like a shell chmod command. There are other ways to translate end-of-lines, too; see Chapter 5.
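
The tree walk and the per-file fix can also be folded into one script. Here is a minimal sketch, assuming a current Python 3, where os.walk takes the place of os.path.walk and octal literals are spelled 0o755; the fix_cgi_tree function and the temporary demo tree are hypothetical, not part of the book's examples package.

```python
import os
import tempfile

def fix_cgi_tree(root='.'):
    """Convert *.cgi files under root to Unix end-of-lines and mark them executable."""
    fixed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.cgi'):
                path = os.path.join(dirpath, name)
                with open(path, 'rb') as f:
                    data = f.read()
                with open(path, 'wb') as f:
                    f.write(data.replace(b'\r\n', b'\n'))
                os.chmod(path, 0o755)     # rwx for owner, rx for group and others
                fixed.append(path)
    return fixed

# Demo on a throwaway tree holding one script with DOS end-of-lines.
with tempfile.TemporaryDirectory() as tmpdir:
    os.mkdir(os.path.join(tmpdir, 'Basics'))
    script = os.path.join(tmpdir, 'Basics', 'demo.cgi')
    with open(script, 'wb') as f:
        f.write(b'#!/usr/bin/python\r\nprint(1)\r\n')
    fixed = fix_cgi_tree(tmpdir)
    with open(script, 'rb') as f:
        fixed_data = f.read()
print(len(fixed), fixed_data)
```

Doing the walk in the same process avoids starting a new Python interpreter for every file, which the find and os.system versions both do.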

12.3.2.3 Automating site move edits

Speaking of installation tasks, a common pitfall of web programming is that hardcoded site names embedded in HTML code stop working the minute you relocate the site to a new server. Minimal URLs (just the filename) are more portable, but for various reasons are not always used. Somewhere along the way, I also grew tired of updating URLs in hyperlinks and form actions, and wrote a Python script to do it all for me (see Example 12-4).

Example 12-4. PP2E\Internet\Cgi-Web\fixsitename.py
#!/usr/bin/env python
###############################################################
# run this script in Cgi-Web dir after copying book web 
# examples to a new server--automatically changes all starship 
# server references in hyperlinks and form action tags to the 
# new server/site; warns about references that weren't changed
# (may need manual editing); note that starship references are 
# not usually needed or used--since browsers have memory, server 
# and path can usually be omitted from a URL in the prior page 
# if it lives at the same place (e.g., "file.cgi" is assumed to 
# be in the same server/path as a page that contains this name,
# with a real url like "http://lastserver/lastpath/file.cgi"),
# but a handful of URLs are fully specified in book examples;
# reuses the Visitor class developed in the system chapters,
# to visit and convert all files at and below current dir;
###############################################################
 
import os, string
from PP2E.PyTools.visitor import FileVisitor           # os.path.walk wrapper
 
listonly = 0
oldsite  = 'starship.python.net/~lutz'                 # server/rootdir in book
newsite  = 'XXXXXX/YYYYYY'                             # change to your site
warnof   = ['starship.python', 'lutz']                 # warn if left after fix
fixext   = ['.py', '.html', '.cgi']                    # file types to check
 
class FixStarship(FileVisitor):
    def __init__(self, listonly=0):                     # replace oldsite refs
        FileVisitor.__init__(self, listonly=listonly)   # in all web text files
        self.changed, self.warning = [], []             # need diff lists here
    def visitfile(self, fname):                         # or use find.find list
        FileVisitor.visitfile(self, fname)
        if self.listonly:
            return
        if os.path.splitext(fname)[1] in fixext:
            text = open(fname, 'r').read()
            if string.find(text, oldsite) != -1:    
                text = string.replace(text, oldsite, newsite)
                open(fname, 'w').write(text)
                self.changed.append(fname)
            for word in warnof:
                if string.find(text, word) != -1:
                    self.warning.append(fname); break
 
if __name__ == '__main__':
    # don't run auto if clicked
    go = raw_input('This script changes site in all web files; continue?') 
    if go != 'y':
        raw_input('Canceled - hit enter key')
    else:
        walker = FixStarship(listonly)
        walker.run()
        print 'Visited %d files and %d dirs' % (walker.fcount, walker.dcount)
 
        def showhistory(label, flist):
            print '\n%s in %d files:' % (label, len(flist))
            for fname in flist:
                print '=>', fname
        showhistory('Made changes', walker.changed)
        showhistory('Saw warnings', walker.warning)
 
        def edithistory(flist):
            for fname in flist:                      # your editor here
                os.system('vi ' + fname) 
        if raw_input('Edit changes?') == 'y':  edithistory(walker.changed)
        if raw_input('Edit warnings?') == 'y': edithistory(walker.warning)

This is a more complex script that reuses the visitor.py module we wrote in Chapter 5 to wrap the os.path.walk call. If you read that chapter, this script will make sense; if not, we won't repeat the details here. Suffice it to say that this program visits all source code files at and below the directory where it is run, globally changing all starship.python.net/~lutz appearances to whatever you've assigned to variable newsite within the script. On request, it will also launch your editor to view files changed, as well as files that contain potentially suspicious strings. As coded, it launches the Unix vi text editor at the end, but you can change this to start whatever editor you like (this is Python, after all):

C:\...\PP2E\Internet\Cgi-Web>python fixsitename.py 
This script changes site in all web files; continue?
. ...
1 => .\PyInternetDemos.html
2 => .\README.txt
3 => .\fixcgi.py
4 => .\fixsitename.py
5 => .\index.html
6 => .\python_snake_ora.gif
.\Basics ...
7 => .\Basics\mlutz.jpg
8 => .\Basics\languages.html
9 => .\Basics\languages-src.cgi
 ...more...
146 => .\PyMailCgi\temp\secret.doc.txt
Visited 146 files and 16 dirs
 
Made changes in 8 files:
=> .\fixsitename.py
=> .\Basics\languages.cgi
=> .\Basics\test3.html
=> .\Basics\test0.py
=> .\Basics\test0.cgi
=> .\Basics\test5c.html
=> .\PyMailCgi\commonhtml.py
=> .\PyMailCgi\sendurl.py
 
Saw warnings in 14 files:
=> .\PyInternetDemos.html
=> .\fixsitename.py
=> .\index.html
=> .\Basics\languages.cgi
 ...more...
=> .\PyMailCgi\pymailcgi.html
=> .\PyMailCgi\commonhtml.py
=> .\PyMailCgi\sendurl.py
Edit changes?
Edit warnings?

The net effect is that this script automates part of the site relocation task: running it will update all pages' URLs for the new site name automatically, which is considerably less aggravating than hunting down and editing each such reference by hand.

There aren't many hardcoded starship site references in web examples in this book (the script found and fixed eight above), but be sure to run this script in the Cgi-Web directory from a command line, after copying the book examples to your own site. To use this script for other site moves, simply set both oldsite and newsite as appropriate. The truly ambitious scriptmaster might even run such a script from within another that first copies a site's contents by FTP (see ftplib in the previous chapter).[5]
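
For other site moves, the same idea can be packaged as a function that takes the old and new site names as arguments. This is a simplified sketch in current Python 3 that uses os.walk directly instead of the Visitor class; the replace_site function, the www.example.com site name, and the demo page are hypothetical, and the warning and editing steps of Example 12-4 are left out.

```python
import os
import tempfile

FIX_EXTENSIONS = ('.py', '.html', '.cgi')     # same file types as Example 12-4

def replace_site(root, oldsite, newsite, listonly=False):
    """Replace oldsite with newsite in web text files under root; return changed paths."""
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1] not in FIX_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            # latin-1 accepts any byte; newline='' leaves end-of-lines untouched
            with open(path, 'r', encoding='latin-1', newline='') as f:
                text = f.read()
            if oldsite in text:
                changed.append(path)
                if not listonly:          # listonly=True only reports the files
                    with open(path, 'w', encoding='latin-1', newline='') as f:
                        f.write(text.replace(oldsite, newsite))
    return changed

# Demo on a throwaway page with one fully specified URL.
with tempfile.TemporaryDirectory() as tmpdir:
    page = os.path.join(tmpdir, 'page.html')
    with open(page, 'w', encoding='latin-1', newline='') as f:
        f.write('<a href="http://starship.python.net/~lutz/index.html">home</a>\r\n')
    preview = replace_site(tmpdir, 'starship.python.net/~lutz',
                           'www.example.com/~me', listonly=True)
    changed = replace_site(tmpdir, 'starship.python.net/~lutz', 'www.example.com/~me')
    with open(page, 'r', encoding='latin-1', newline='') as f:
        new_text = f.read()
print(changed, new_text)
```

Running once with listonly=True before the real pass gives a preview of which files would be touched.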

12.3.2.4 Finding Python on the server

One last install pointer: even though Python doesn't have to be installed on any clients in the context of a server-side web application, it does have to exist on the server machine where your CGI scripts are expected to run. If you are using a web server that you did not configure yourself, you must be sure that Python lives on that machine. Moreover, you need to find where it is on that machine so that you can specify its path in the #! line at the top of your script.

By now, Python is a pervasive tool, so this generally isn't as big a concern as it once was. As time goes by, it will become even more common to find Python as a standard component of server machines. But if you're not sure if or where Python lives on yours, here are some tips:

·         Especially on Unix systems, you should first assume that Python lives in a standard place (e.g., /usr/local/bin/python), and see if it works. Chances are that Python already lives on such machines. If you have Telnet access on your server, a Unix find command starting at /usr may help.

·         If your server runs Linux, you're probably set to go. Python ships as a standard part of Linux distributions these days, and many web sites and Internet Service Providers (ISPs) run the Linux operating system; at such sites, Python probably already lives at /usr/bin/python.

·         In other environments where you cannot control the server machine yourself, it may be harder to obtain access to an already-installed Python. If so, you can relocate your site to a server that does have Python installed, talk your ISP into installing Python on the machine you're trying to use, or install Python on the server machine yourself.

If your ISP is unsympathetic to your need for Python and you are willing to relocate your site to one that is, you can find lists of Python-friendly ISPs by searching http://www.python.org. And if you choose to install Python on your server machine yourself, be sure to check out the freeze tool shipped with the Python source distribution (in the Tools directory). With freeze, you can create a single executable program file that contains the entire Python interpreter, as well as all the standard library modules. Such a frozen interpreter can be uploaded to your web account by FTP in a single step, and it won't require a full-blown Python installation on the server.
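
If you can run a Python of any kind on the server, it can help locate the interpreter path for your #! line. The sketch below assumes Python 3.3 or later (for shutil.which); the find_python helper is hypothetical. Keep in mind that it searches the path of whoever runs it, which may differ from the environment of user "nobody".

```python
import shutil
import sys

def find_python(names=('python3', 'python')):
    """Return the first interpreter path found on the search path, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None

# Suggest a first line for CGI scripts; fall back on the running interpreter.
print('#!%s' % (find_python() or sys.executable))
```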

12.3.3 Adding Pictures and Generating Tables

Now let's get back to writing server-side code. As anyone who's ever surfed the Web knows, web pages usually consist of more than simple text. Example 12-5 is a Python CGI script that prints an <IMG> HTML tag in its output to produce a graphic image in the client browser. There's not much Python-specific about this example, but note that just as for simple HTML files, the image file (ppsmall.gif) lives on and is downloaded from the server machine when the browser interprets the output of this script.

Example 12-5. PP2E\Internet\Cgi-Web\Basics\test1.cgi
#!/usr/bin/python
 
text = """Content-type: text/html
 
<TITLE>CGI 101</TITLE>
<H1>A Second CGI script</H1>
<HR>
<P>Hello, CGI World!</P>
<IMG src="ppsmall.gif" BORDER=1 ALT=[image]>
<HR>
"""
 
print text

Notice the use of the triple-quoted string block here; the entire HTML string is sent to the browser in one fell swoop, with the print statement at the end. If client and server are both functional, a page that looks like Figure 12-4 will be generated when this script is referenced and run.

Figure 12-4. A page with an image generated by test1.cgi

figs/ppy2_1204.gif

So far, our CGI scripts have been putting out canned HTML that could have just as easily been stored in an HTML file. But because CGI scripts are executable programs, they can also be used to generate HTML on the fly, dynamically -- even, possibly, in response to a particular set of user inputs sent to the script. That's the whole purpose of CGI scripts, after all. Let's start using this to better advantage now, and write a Python script that builds up response HTML programmatically (see Example 12-6).

Example 12-6. PP2E\Internet\Cgi-Web\Basics\test2.cgi
#!/usr/bin/python
 
print """Content-type: text/html
 
<TITLE>CGI 101</TITLE>
<H1>A Third CGI script</H1>
<HR>
<P>Hello, CGI World!</P>
 
<table border=1>
"""
 
for i in range(5):
    print "<tr>"
    for j in range(4):
        print "<td>%d.%d</td>" % (i, j)
    print "</tr>"
 
print """
</table>
<HR>
"""

Despite all the tags, this really is Python code -- the test2.cgi script uses triple-quoted strings to embed blocks of HTML again. But this time, the script also uses nested Python for loops to dynamically generate part of the HTML that is sent to the browser. Specifically, it emits HTML to lay out a two-dimensional table in the middle of a page, as shown in Figure 12-5.

Figure 12-5. A page with a table generated by test2.cgi

figs/ppy2_1205.gif

Each cell in the table displays a "row.column" pair, as generated by the executing Python script. If you're curious how the generated HTML looks, select your browser's View Source option after you've accessed this page. It's a single HTML page composed of the HTML generated by the first print in the script, then the for loops, and finally the last print. In other words, the concatenation of this script's output is an HTML document with headers.

12.3.3.1 Table tags

This script generates HTML table tags. Again, we're not out to learn HTML here, but we'll take a quick look just so you can make sense of the example. Tables are declared by the text between <table> and </table> tags in HTML. Typically, a table's text in turn declares the contents of each table row between <tr> and </tr> tags and each column within a row between <td> and </td> tags. The loops in our script build up HTML to declare five rows of four columns each, by printing the appropriate tags, with the current row and column number as column values. For instance, here is part of the script's output, defining the first two rows:

<table border=1>
<tr>
<td>0.0</td>
<td>0.1</td>
<td>0.2</td>
<td>0.3</td>
</tr>
<tr>
<td>1.0</td>
<td>1.1</td>
<td>1.2</td>
<td>1.3</td>
</tr>
. . .
</table>

Other table tags and options let us specify a row title (<th>), layout borders, and so on. We'll see more table syntax put to use to lay out forms in a later section.
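
The table-building loops can also be wrapped in a function that returns the HTML as a string, which makes the row and column counts parameters. This is a small sketch in current Python 3; the make_table name is hypothetical.

```python
def make_table(rows, cols):
    """Return the HTML for a rows x cols table of "row.column" cells."""
    lines = ['<table border=1>']
    for i in range(rows):
        lines.append('<tr>')
        for j in range(cols):
            lines.append('<td>%d.%d</td>' % (i, j))
        lines.append('</tr>')
    lines.append('</table>')
    return '\n'.join(lines)

print(make_table(5, 4))       # five rows of four columns, as in test2.cgi
```

Building a list and joining it once keeps the HTML in one string, so a script can insert it into a larger page template before printing.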

12.3.4 Adding User Interaction

CGI scripts are great at generating HTML on the fly like this, but they are also commonly used to implement interaction with a user typing at a web browser. As described earlier in this chapter, web interactions usually involve a two-step process and two distinct web pages: you fill out a form page and press submit, and a reply page eventually comes back. In between, a CGI script processes the form input.

12.3.4.1 Submission

That description sounds simple enough, but the process of collecting user inputs requires an understanding of a special HTML tag, <form>. Let's look at the implementation of a simple web interaction to see forms at work. First off, we need to define a form page for the user to fill out, as shown in Example 12-7.

Example 12-7. PP2E\Internet\Cgi-Web\Basics\test3.html
<html><body>
<title>CGI 101</title>
<H1>A first user interaction: forms</H1>
<hr>
<form method=POST action="http://starship.python.net/~lutz/Basics/test3.cgi">
    <P><B>Enter your name:</B>
    <P><input type=text name=user>
    <P><input type=submit>
</form>
</BODY></HTML>

test3.html is a simple HTML file, not a CGI script (though its contents could be printed from a script as well). When this file is accessed, all the text between its <form> and </form> tags generates the input fields and Submit button shown in Figure 12-6.

Figure 12-6. A simple form page generated by test3.html

figs/ppy2_1206.gif

12.3.4.2 More on form tags

We won't go into all the details behind coding HTML forms, but a few highlights are worth underscoring. Within a form's HTML code:

·         The form's action option gives the URL of a CGI script that will be invoked to process submitted form data. This is the link from a form to its handler program -- in this case, a program called test3.cgi in my web home directory, on a server machine called starship.python.net. The action option is the moral equivalent of command options in Tkinter buttons -- it's where a callback handler (here, a remote handler) is registered to the browser.

·         Input controls are specified with nested <input> tags. In this example, input tags have two key options. The type option accepts values such as text for text fields and submit for a Submit button (which sends data to the server and is labeled "Submit Query" by default). The name option is the hook used to identify the entered value by key, once all the form data reaches the server. For instance, the server-side CGI script we'll see in a moment uses the string user as a key to get the data typed into this form's text field. As we'll see in later examples, other input tag options can specify initial values (value=X), display-only mode (readonly), and so on. Other input type option values may transmit hidden data (type=hidden), reinitialize fields (type=reset), or make multiple-choice buttons (type=checkbox).

·         Forms also include a method option to specify the encoding style to be used to send data over a socket to the target server machine. Here, we use the post style, which contacts the server and then ships it a stream of user input data in a separate transmission. An alternative get style ships input information to the server in a single transmission step, by adding user inputs to the end of the URL used to invoke the script, usually after a ? character (more on this soon). With get, inputs typically show up on the server in environment variables or as arguments in the command line used to start the script. With post, they must be read from standard input and decoded. Luckily, Python's cgi module transparently handles either encoding style, so our CGI scripts don't need to know or care which is used.
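
To make the two encoding styles described in the list above more concrete, here is a rough sketch of what a browser sends for a typical form, using the urllib.parse module of current Python 3. The field values are made up, and the age field is hypothetical for this form.

```python
from urllib.parse import urlencode

inputs = {'user': 'King Arthur', 'age': '42'}     # hypothetical field values
encoded = urlencode(inputs)
print(encoded)                                    # user=King+Arthur&age=42

# get style: the encoded inputs ride at the end of the URL, after a ?
get_url = 'http://starship.python.net/~lutz/Basics/test3.cgi?' + encoded

# post style: the same encoded text is sent as the body of the request instead
post_body = encoded.encode('ascii')
```

Either way, the inputs travel as name=value pairs joined by & characters, with spaces and special characters escaped.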

Notice that the action URL in this example's form spells out the full address for illustration. Because the browser remembers where the enclosing HTML page came from, it works the same with just the script's filename, as shown in Example 12-8.

Example 12-8. PP2E\Internet\Cgi-Web\Basics\test3-minimal.html
<html><body>
<title>CGI 101</title>
<H1>A first user interaction: forms</H1>
<hr>
<form method=POST action="test3.cgi">
    <P><B>Enter your name:</B>
    <P><input type=text name=user>
    <P><input type=submit>
</form>
</BODY></HTML>

It may help to remember that URLs embedded in form action tags and hyperlinks are directions to the browser first, not the script. The test3.cgi script itself doesn't care which URL form is used to trigger it -- minimal or complete. In fact, all parts of a URL through the script filename (and up to URL query parameters) are used in the conversation between browser and HTTP server, before a CGI script is ever spawned. As long as the browser knows which server to contact, the URL will work, but URLs outside of a page (e.g., typed into a browser's address field or sent to Python's urllib module) usually must be completely specified, because there is no notion of a prior page.
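
The way a browser completes a minimal URL from the address of the enclosing page can be imitated with urljoin from current Python 3's urllib.parse module. This is only an illustration of the idea, not a description of any particular browser's code.

```python
from urllib.parse import urljoin

page_url = 'http://starship.python.net/~lutz/Basics/test3-minimal.html'
action = 'test3.cgi'                      # minimal URL from the form's action

full = urljoin(page_url, action)          # server and path come from the page
print(full)     # http://starship.python.net/~lutz/Basics/test3.cgi

# A completely specified URL is used as is, whatever the prior page was.
same = urljoin(page_url, 'http://starship.python.net/~lutz/Basics/test3.cgi')
```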

12.3.4.3 Response

So far, we've created only a static page with an input field. But the Submit button on this page is loaded to work magic. When pressed, it triggers the remote program whose URL is listed in the form's action option, and passes this program the input data typed by the user, according to the form's method encoding style option. On the server, a Python script is started to handle the form's input data while the user waits for a reply on the client, as shown in Example 12-9.

Example 12-9. PP2E\Internet\Cgi-Web\Basics\test3.cgi
#!/usr/bin/python
#######################################################
# runs on the server, reads form input, prints html;
# url=http://server-name/root-dir/Basics/test3.cgi
#######################################################
 
import cgi
form = cgi.FieldStorage()              # parse form data
print "Content-type: text/html"      # plus blank line
 
html = """
<TITLE>test3.cgi</TITLE>
<H1>Greetings</H1>
<HR>
<P>%s</P>
<HR>"""
 
if not form.has_key('user'):
    print html % "Who are you?"
else:
    print html % ("Hello, %s." % form['user'].value)

As before, this Python CGI script prints HTML to generate a response page in the client's browser. But this script does a bit more: it also uses the standard cgi module to parse the input data entered by the user on the prior web page (see Figure 12-6). Luckily, this is all automatic in Python: a call to the cgi module's FieldStorage class automatically does all the work of extracting form data from the input stream and environment variables, regardless of how that data was passed -- in a post style stream or in get style parameters appended to the URL. Inputs sent in both styles look the same to Python scripts.

Scripts should call cgi.FieldStorage only once and before accessing any field values. When called, we get back an object that looks like a dictionary -- user input fields from the form (or URL) show up as values of keys in this object. For example, in the script, form['user'] is an object whose value attribute is a string containing the text typed into the form's text field. If you flip back to the form page's HTML, you'll notice that the input field's name option was user -- the name in the form's HTML has become a key we use to fetch the input's value from a dictionary. The object returned by FieldStorage supports other dictionary operations, too -- for instance, the has_key method may be used to check if a field is present in the input data.
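
To see roughly what this parsing step spares us, here is a hand-coded sketch of reading inputs sent in either style. It is not how the cgi module is implemented: it handles only simple URL-encoded forms, keeps just the first value of each field, and the read_form name is hypothetical. It is written for current Python 3 and uses urllib.parse rather than cgi, so it also runs on Python releases that no longer ship the cgi module.

```python
import io
from urllib.parse import parse_qs

def read_form(environ, stdin):
    """Return {name: value} for inputs sent in either get or post style."""
    if environ.get('REQUEST_METHOD', 'GET').upper() == 'POST':
        length = int(environ.get('CONTENT_LENGTH') or 0)
        query = stdin.read(length).decode('ascii')    # inputs arrive on standard input
    else:
        query = environ.get('QUERY_STRING', '')       # inputs arrive in the environment
    parsed = parse_qs(query)                          # {name: [value, ...]}
    return {name: values[0] for name, values in parsed.items()}

# Simulated requests; a real script would pass os.environ and sys.stdin.buffer.
get_form = read_form({'REQUEST_METHOD': 'GET', 'QUERY_STRING': 'user=Brian'},
                     io.BytesIO())
body = b'user=King+Arthur'
post_form = read_form({'REQUEST_METHOD': 'POST', 'CONTENT_LENGTH': str(len(body))},
                      io.BytesIO(body))
print(get_form, post_form)
```

Both calls produce a dictionary keyed by field name, which is the sense in which inputs sent in both styles look the same to a script.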

Before exiting, this script prints HTML to produce a result page that echoes back what the user typed into the form. Two string-formatting expressions (%) are used to insert the input text into a reply string, and the reply string into the triple-quoted HTML string block. The body of the script's output looks like this:

<TITLE>test3.cgi</TITLE>
<H1>Greetings</H1>
<HR>
<P>Hello, King Arthur.</P>
<HR>

In a browser, the output is rendered into a page like the one in Figure 12-7.

Figure 12-7. test3.cgi result for parameters in a form

figs/ppy2_1207.gif

12.3.4.4 Passing parameters in URLs

Notice that the URL address of the script that generated this page shows up at the top of the browser. We didn't type this URL itself -- it came from the action tag of the prior page's form HTML. However, there is nothing stopping us from typing the script's URL explicitly in our browser's address field to invoke the script, just as we did for our earlier CGI script and HTML file examples.

But there's a catch here: where does the input field's value come from if there is no form page? That is, if we type the CGI script's URL ourselves, how does the input field get filled in? Earlier, when we talked about URL formats, I mentioned that the get encoding scheme tacks input parameters onto the end of URLs. When we type script addresses explicitly, we can also append input values on the end of URLs, where they serve the same purpose as <input> fields in forms. Moreover, the Python cgi module makes URL and form inputs look identical to scripts.

For instance, we can skip filling out the input form page completely, and directly invoke our test3.cgi script by visiting a URL of the form:

http://starship.python.net/~lutz/Basics/test3.cgi?user=Brian

In this URL, a value for the input named user is specified explicitly, as if the user had filled out the input page. When called this way, the only constraint is that the parameter name user must match the name expected by the script (and hardcoded in the form's HTML). We use just one parameter here, but in general, URL parameters are typically introduced with a ? and followed by one or more name=value assignments, separated by & characters if there is more than one. Figure 12-8 shows the response page we get after typing a URL with explicit inputs.

Figure 12-8. test3.cgi result for parameters in a URL

figs/ppy2_1208.gif

In general, any CGI script can be invoked either by filling out and submitting a form page or by passing inputs at the end of a URL. When CGI scripts are invoked with explicit input parameters this way, it's difficult not to see their similarity to functions, albeit ones that live remotely on the Net. Passing data to scripts in URLs is similar to keyword arguments in Python functions, both operationally and syntactically. In fact, in Chapter 15 we will meet a system called Zope that makes the relationship between URLs and Python function calls even more literal (URLs become more direct function calls).
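
The keyword-argument analogy can be taken literally in a few lines. The sketch below, in current Python 3, pulls the parameters off the end of a URL and passes them to an ordinary function as keyword arguments; the greet function and its default are hypothetical stand-ins for a script's logic.

```python
from urllib.parse import urlsplit, parse_qsl

def greet(user='stranger'):               # hypothetical stand-in for test3.cgi
    return 'Hello, %s.' % user

url = 'http://starship.python.net/~lutz/Basics/test3.cgi?user=Brian'
kwargs = dict(parse_qsl(urlsplit(url).query))     # {'user': 'Brian'}
print(greet(**kwargs))                            # Hello, Brian.
```

As with a real function call, a parameter name that the function does not expect would raise an error here, much as a misspelled input name goes unrecognized by a script.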

Incidentally, if you clear out the name input field in the form input page (i.e., make it empty) and press submit, the user name field becomes empty. More accurately, the browser may not send this field along with the form data at all, even though it is listed in the form layout HTML. The CGI script detects such a missing field with the dictionary has_key method and produces the page captured in Figure 12-9 in response.

Figure 12-9. An empty name field produces an error page

figs/ppy2_1209.gif

In general, CGI scripts must check to see if any inputs are missing, partly because they might not be typed by a user in the form, but also because there may be no form at all -- input fields might not be tacked on to the end of an explicitly typed URL. For instance, if we type the script's URL without any parameters at all (i.e., omit the text ? and beyond), we get this same error response page. Since we can invoke any CGI through a form or URL, scripts must anticipate both scenarios.
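
Both missing-input scenarios can be tried without a browser. This sketch uses parse_qs from current Python 3's urllib.parse in place of the cgi module; by default it drops fields with blank values, so an emptied field and an absent one look alike to the code. The reply_for function is hypothetical.

```python
from urllib.parse import parse_qs

def reply_for(query):
    """Return the greeting line for a raw query string, allowing for missing inputs."""
    form = parse_qs(query)                # blank values are dropped by default
    if 'user' not in form:
        return 'Who are you?'
    return 'Hello, %s.' % form['user'][0]

print(reply_for('user=Brian'))    # Hello, Brian.
print(reply_for('user='))         # Who are you?   (field emptied in the form)
print(reply_for(''))              # Who are you?   (URL typed with no parameters)
```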

12.3.5 Using Tables to Lay Out Forms

Now let's move on to something a bit more realistic. In most CGI applications, input pages are composed of multiple fields. When there is more than one, input labels and fields are typically laid out in a table, to give the form a well-structured appearance. The HTML file in Example 12-10 defines a form with two input fields.

Example 12-10. PP2E\Internet\Cgi-Web\Basics\test4.html
<html><body>
<title>CGI 101</title>
<H1>A second user interaction: tables</H1>
<hr>
<form method=POST action="test4.cgi">
  <table>
    <TR>
      <TH align=right>Enter your name:
      <TD><input type=text name=user>
    <TR>
      <TH align=right>Enter your age:
      <TD><input type=text name=age>
    <TR>
      <TD colspan=2 align=center>
      <input type=submit value="Send">
  </table>
</form>
</body></html>

The <TH> tag defines a column like <TD>, but also tags it as a header column, which generally means it is rendered in a bold font. By placing the input fields and labels in a table like this, we get an input page like that shown in Figure 12-10. Labels and inputs are automatically lined up vertically in columns much as they were by the Tkinter GUI geometry managers we met earlier in this book.

Figure 12-10. A form laid out with table tags

figs/ppy2_1210.gif

When this form's Submit button (labeled "Send" by the page's HTML) is pressed, it causes the script in Example 12-11 to be executed on the server machine, with the inputs typed by the customer.

Example 12-11. PP2E\Internet\Cgi-Web\Basics\test4.cgi
#!/usr/bin/python
#######################################################
# runs on the server, reads form input, prints html;
# url http://server-name/root-dir/Basics/test4.cgi
#######################################################
 
import cgi, sys
sys.stderr = sys.stdout              # errors to browser
form = cgi.FieldStorage(  )            # parse form data
print "Content-type: text/html\n"    # plus blank line
 
# class dummy:
#     def __init__(self, s): self.value = s
# form = {'user': dummy('bob'), 'age':dummy('10')}
 
html = """
<TITLE>test4.cgi</TITLE>
<H1>Greetings</H1>
<HR>
<H4>%s</H4>
<H4>%s</H4>
<H4>%s</H4>
<HR>"""
 
if not form.has_key('user'):
    line1 = "Who are you?"
else:
    line1 = "Hello, %s." % form['user'].value
 
line2 = "You're talking to a %s server." % sys.platform
 
line3 = ""
if form.has_key('age'):
    try:
        line3 = "Your age squared is %d!" % (int(form['age'].value) ** 2)
    except:
        line3 = "Sorry, I can't compute %s ** 2." % form['age'].value
 
print html % (line1, line2, line3)

The table layout comes from the HTML file, not this Python CGI script. In fact, this script doesn't do much new -- it uses string formatting to plug input values into the response page's HTML triple-quoted template string as before, this time with one line per input field. There are, however, a few new tricks here worth noting, especially regarding CGI script debugging and security. We'll talk about them in the next two sections.

12.3.5.1 Converting strings in CGI scripts

Just for fun, the script echoes back the name of the server platform by fetching sys.platform, along with the square of the age input field. Notice that the age input's value must be converted to an integer with the built-in int function; in the CGI world, all inputs arrive as strings. We could also convert to an integer with the string module's string.atoi function or the built-in eval function. Conversion (and other) errors are trapped gracefully in a try statement to yield an error line, rather than letting our script die.

You should never use eval to convert strings that were sent over the Internet like the age field in this example, unless you can be absolutely sure that the string is not even potentially malicious code. For instance, if this example were available on the general Internet, it's not impossible that someone could type a value into the age field (or append an age parameter to the URL) with a value like: os.system('rm *'). When passed to eval, such a string might delete all the files in your server script directory!

We talk about ways to minimize this risk with Python's restricted execution mode (module rexec) in Chapter 15. But by default, strings read off the Net can be very dangerous things to run in CGI scripts. You should never pass them to dynamic coding tools like eval and exec, or to tools that run arbitrary shell commands such as os.popen and os.system, unless you can be sure that they are safe, or unless you enable Python's restricted execution mode in your scripts.
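
As a minimal sketch of the safer route, the function below converts the age text with int alone and falls back to an error line; the input is never handed to eval. It is written for a current Python 3 rather than the Python used in this chapter's listings, and the name square_age_line is hypothetical, not part of test4.cgi.

```python
def square_age_line(text):
    """Build the reply line for an age string without ever running it as code."""
    try:
        age = int(text)                  # accepts integer text only
    except ValueError:
        # anything else, including would-be code, is just reported back
        return "Sorry, I can't compute %s ** 2." % text
    return "Your age squared is %d!" % (age ** 2)

print(square_age_line("10"))
print(square_age_line("os.system('rm *')"))   # rejected, never executed
```

Because int only understands digits (plus an optional sign and surrounding spaces), a malicious string simply fails the conversion and ends up in the error branch.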

12.3.5.2 Debugging CGI scripts

Errors happen, even in the brave new world of the Internet. Generally speaking, debugging CGI scripts can be much more difficult than debugging programs that run on your local machine. Not only do errors occur on a remote machine, but scripts generally won't run without the context implied by the CGI model. The script in Example 12-11 demonstrates the following two common debugging tricks.

Error message trapping

This script assigns sys.stderr to sys.stdout so that Python error messages wind up being displayed in the response page in the browser. Normally, Python error messages are written to stderr. To route them to the browser, we must make stderr reference the same file object as stdout (which is connected to the browser in CGI scripts). If we don't do this assignment, Python errors, including program errors in our script, never show up in the browser.

Test case mock-up

The dummy class definition, commented out in this final version, was used to debug the script before it was installed on the Net. Besides not seeing stderr messages by default, CGI scripts also assume an enclosing context that does not exist if they are tested outside the CGI environment. For instance, if run from the system command line, this script has no form input data. Uncomment this code to test from the system command line. The dummy class masquerades as a parsed form field object, and form is assigned a dictionary containing two form field objects. The net effect is that form will be plug-and-play compatible with the result of a cgi.FieldStorage call. As usual in Python, object interfaces (not datatypes) are all we must adhere to.
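
The mock-up idea can be tried outside the CGI context entirely. The following is only a sketch in Python 3 syntax: Dummy mirrors the commented-out class in Example 12-11, while the greeting function is a hypothetical helper that isolates the script's first reply line so that it can be called with a plain dictionary.

```python
class Dummy:
    """Stand-in for a parsed form field: any object with a .value will do."""
    def __init__(self, s):
        self.value = s

def greeting(form):
    """Build the first reply line from any mapping of field objects."""
    if 'user' not in form:
        return "Who are you?"
    return "Hello, %s." % form['user'].value

# simulate a filled-in form and an empty one from the command line
print(greeting({'user': Dummy('bob'), 'age': Dummy('10')}))
print(greeting({}))
```

Nothing in greeting cares whether its argument came from the cgi module or from a test; it relies only on key lookup and a value attribute.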

Here are a few general tips for debugging your server-side CGI scripts:

Run the script from the command line.

It probably won't generate HTML as is, but running it standalone will detect any syntax errors in your code. Recall that a Python command line can run source code files regardless of their extension: e.g., python somescript.cgi works fine.

Assign sys.stderr to sys.stdout as early as possible in your script.

This will make the text of Python error messages and stack dumps appear in your client browser when accessing the script. In fact, short of wading through server logs, this may be the only way to see the text of error messages after your script aborts.

Mock up inputs to simulate the enclosing CGI context.

For instance, define classes that mimic the CGI inputs interface (as done with the dummy class in this script), so that you can view the script's output for various test cases by running it from the system command line.[6] Setting environment variables to mimic form or URL inputs sometimes helps, too (we'll see how later in this chapter).

Call utilities to display CGI context in the browser.

The CGI module includes utility functions that send a formatted dump of CGI environment variables and input values to the browser (e.g., cgi.test, cgi.print_form). Sometimes, this is enough to resolve connection problems. We'll use some of these in the mailer case study in the next chapter.

Show exceptions you catch.

If you catch an exception that Python raises, the Python error message won't be printed to stderr (printing it is simply the default behavior for exceptions that are not caught). In such cases, it's up to your script to display the exception's name and value in the response page; exception details are available in the built-in sys module. We'll use this in a later example, too.

Run it live.

Of course, once your script is at least half working, your best bet is likely to start running it live on the server, with real inputs coming from a browser.
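
To illustrate the "show exceptions you catch" tip, here is a small sketch in Python 3 syntax. It fetches the details of the exception being handled from the sys module and folds them into a reply line; describe_exception is a hypothetical helper name, and a real script would also escape the text before embedding it in HTML.

```python
import sys

def describe_exception():
    """Format the exception currently being handled as 'Name: value'."""
    exc_type, exc_value = sys.exc_info()[:2]
    return "%s: %s" % (exc_type.__name__, exc_value)

try:
    int("spam")                          # fails: not integer text
except ValueError:
    # the default stderr message is suppressed once we catch, so show it ourselves
    line = "Sorry, an error occurred (%s)" % describe_exception()

print(line)
```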

When this script is run by submitting the input form page, its output produces the new reply page shown in Figure 12-11.

Figure 12-11. Reply page generated by test4.cgi

figs/ppy2_1211.gif

As usual, we can pass parameters to this CGI script at the end of a URL, too. Figure 12-12 shows the page we get when passing a user and age explicitly in the URL. Notice that we have two parameters after the ? this time; we separate them with &. Also note that we've specified a blank space in the user value with +. This is a common URL encoding convention. On the server side, the + is automatically replaced with a space again. It's also part of the standard escape rule for URL strings, which we'll revisit later.
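
The decoding that happens on the server can be imitated by hand. This sketch assumes a current Python 3, where the standard urllib.parse module does the query string parsing; the user and age values in the URL are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# hypothetical URL: two parameters after the ?, joined by &, with + for a space
url = "http://server-name/root-dir/Basics/test4.cgi?user=Joe+Blow&age=30"

query = urlsplit(url).query              # the text after the ?
fields = parse_qs(query)                 # splits on &, turns + back into a space

print(fields)                            # {'user': ['Joe Blow'], 'age': ['30']}
print(parse_qs(""))                      # no parameters at all: {}
```

Each value comes back as a list of strings, and a URL with no query text yields an empty dictionary, which is why scripts must be ready for missing fields.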

Figure 12-12. Reply page generated by test4.cgi for parameters in URL

figs/ppy2_1212.gif

12.3.6 Adding Common Input Devices

So far, we've been typing inputs into text fields. HTML forms support a handful of input controls (what we'd call widgets in the traditional GUI world) for collecting user inputs. Let's look at a CGI program that shows all the common input controls at once. As usual, we define both an HTML file to lay out the form page and a Python CGI script to process its inputs and generate a response. The HTML file is presented in Example 12-12.

Example 12-12. PP2E\Internet\Cgi-Web\Basics\test5a.html
<HTML><BODY>
<TITLE>CGI 101</TITLE>
<H1>Common input devices</H1>
<HR>
<FORM method=POST action="test5.cgi">
  <H3>Please complete the following form and click Send</H3>
  <P><TABLE>
    <TR>
      <TH align=right>Name:
      <TD><input type=text name=name>
    <TR>
      <TH align=right>Shoe size:
      <TD><table>
      <td><input type=radio name=shoesize value=small>Small
      <td><input type=radio name=shoesize value=medium>Medium
      <td><input type=radio name=shoesize value=large>Large
      </table>
    <TR>
      <TH align=right>Occupation:
      <TD><select name=job>
        <option>Developer
        <option>Manager
        <option>Student
        <option>Evangelist
        <option>Other
      </select>
    <TR>
      <TH align=right>Political affiliations:
      <TD><table>
      <td><input type=checkbox name=language value=Python>Pythonista
      <td><input type=checkbox name=language value=Perl>Perlmonger
      <td><input type=checkbox name=language value=Tcl>Tcler 
      </table>
    <TR>
      <TH align=right>Comments:
      <TD><textarea name=comment cols=30 rows=2>Enter text here</textarea>
    <TR>
      <TD colspan=2 align=center>
      <input type=submit value="Send">
  </TABLE>
</FORM>
<HR>
</BODY></HTML>

When rendered by a browser, the page in Figure 12-13 appears.

Figure 12-13. Form page generated by test5a.html

figs/ppy2_1213.gif

This page contains a simple text field as before, but it also has radiobuttons, a pull-down selection list, a set of multiple-choice checkbuttons, and a multiple-line text input area. All have a name option in the HTML file, which identifies their selected value in the data sent from client to server. When we fill out this form and click the Send submit button, the script in Example 12-13 runs on the server to process all the input data typed or selected in the form.

Example 12-13. PP2E\Internet\Cgi-Web\Basics\test5.cgi
#!/usr/bin/python
#######################################################
# runs on the server, reads form input, prints html;
# url=http://server-name/root-dir/Basics/test5.cgi
#######################################################
 
import cgi, sys, string
form = cgi.FieldStorage(  )            # parse form data
print "Content-type: text/html"      # blank line: html starts with \n
 
html = """
<TITLE>test5.cgi</TITLE>
<H1>Greetings</H1>
<HR>
<H4>Your name is %(name)s</H4>
<H4>You wear rather %(shoesize)s shoes</H4>
<H4>Your current job: %(job)s</H4>
<H4>You program in %(language)s</H4>
<H4>You also said:</H4>
<P>%(comment)s</P>
<HR>"""
 
data = {}
for field in ['name', 'shoesize', 'job', 'language', 'comment']:
    if not form.has_key(field):
        data[field] = '(unknown)'
    else:
        if type(form[field]) != type([]):
            data[field] = form[field].value
        else:
            values = map(lambda x: x.value, form[field])
            data[field] = string.join(values, ' and ')
print html % data

This Python script doesn't do much; it mostly just copies form field information into a dictionary called data, so that it can be easily plugged into the triple-quoted response string. A few of its tricks merit explanation:

Field validation

As usual, we need to check all expected fields to see if they really are present in the input data, using the dictionary has_key method. Any or all of the input fields may be missing if they weren't entered on the form or appended to an explicit URL.

String formatting

We're using dictionary key references in the format string this time -- recall that %(name)s means pull out the value for key name in the data dictionary and perform a to-string conversion on its value.

Multiple-choice fields

We're also testing the type of all the expected fields' values to see if they arrive as a list instead of the usual string. Values of multiple-choice input controls, like the language choice field in this input page, are returned from cgi.FieldStorage as a list of objects with value attributes, rather than a simple single object with a value. This script copies simple field values to the dictionary verbatim, but uses map to collect the value fields of multiple-choice selections, and string.join to construct a single string with an and inserted between each selection value (e.g., Python and Tcl).[7]
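
The three points above can be exercised together without a web server. The sketch below uses Python 3 syntax and mocked-up inputs: Field and collect are hypothetical names, and the form dictionary merely imitates what a parsed form might contain when two checkboxes are selected and the job field is missing.

```python
class Field:
    """Mock of a parsed form field object."""
    def __init__(self, value):
        self.value = value

# one simple field, one multiple-choice field, and no 'job' field at all
form = {'name': Field('Bob'),
        'language': [Field('Python'), Field('Tcl')]}

def collect(form, names):
    """Copy field values into a plain dictionary of strings."""
    data = {}
    for name in names:
        if name not in form:
            data[name] = '(unknown)'                 # missing input
        elif isinstance(form[name], list):
            data[name] = ' and '.join(f.value for f in form[name])
        else:
            data[name] = form[name].value            # simple single value
    return data

data = collect(form, ['name', 'job', 'language'])
print("%(name)s programs in %(language)s; job: %(job)s" % data)
```

The format string pulls each value out of data by key, so the order of the dictionary's entries does not matter.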

When the form page is filled out and submitted, the script creates the response shown in Figure 12-14 -- essentially just a formatted echo of what was sent.

Figure 12-14. Response page created by test5.cgi (1)

figs/ppy2_1214.gif

12.3.6.1 Changing input layouts

Suppose that you've written a system like this, and your users, clients, and significant other start complaining that the input form is difficult to read. Don't worry. Because the CGI model naturally separates the user interface (the HTML page definition) from the processing logic (the CGI script), it's completely painless to change the form's layout. Simply modify the HTML file; there's no need to change the CGI code at all. For instance, Example 12-14 contains a new definition of the input that uses tables a bit differently to provide a nicer layout with borders.

Example 12-14. PP2E\Internet\Cgi-Web\Basics\test5b.html
<HTML><BODY>
<TITLE>CGI 101</TITLE>
<H1>Common input devices: alternative layout</H1>
<P>Use the same test5.cgi server side script, but change the 
layout of the form itself.  Notice the separation of user interface
and processing logic here; the CGI script is independent of the
HTML used to interact with the user/client.</P><HR>
 
<FORM method=POST action="test5.cgi">
  <H3>Please complete the following form and click Submit</H3>
  <P><TABLE border cellpadding=3>
    <TR>
      <TH align=right>Name:
      <TD><input type=text name=name>
    <TR>
      <TH align=right>Shoe size:
      <TD><input type=radio name=shoesize value=small>Small
          <input type=radio name=shoesize value=medium>Medium
          <input type=radio name=shoesize value=large>Large
    <TR>
      <TH align=right>Occupation:
      <TD><select name=job>
        <option>Developer
        <option>Manager
        <option>Student
        <option>Evangelist
        <option>Other
      </select>
    <TR>
      <TH align=right>Political affiliations:
      <TD><P><input type=checkbox name=language value=Python>Pythonista
          <P><input type=checkbox name=language value=Perl>Perlmonger
          <P><input type=checkbox name=language value=Tcl>Tcler 
    <TR>
      <TH align=right>Comments:
      <TD><textarea name=comment cols=30 rows=2>Enter spam here</textarea>
    <TR>
      <TD colspan=2 align=center>
      <input type=submit value="Submit">
      <input type=reset  value="Reset">
  </TABLE>
</FORM>
</BODY></HTML>

When we visit this alternative page with a browser, we get the interface shown in Figure 12-15.

Figure 12-15. Form page created by test5b.html

figs/ppy2_1215.gif

Now, before you go blind trying to detect the differences between this and the prior HTML file, I should note that the HTML differences that produce this page are much less important than the fact that the action fields in these two pages' forms reference identical URLs. Pressing this version's Submit button triggers the exact same and totally unchanged Python CGI script again, test5.cgi (Example 12-13).

That is, scripts are completely independent of the layout of the user-interface used to send them information. Changes in the response page require changing the script, of course; but we can change the input page's HTML as much as we like, without impacting the server-side Python code. Figure 12-16 shows the response page produced by the script this time around.

Figure 12-16. Response page created by test5.cgi (2)

figs/ppy2_1216.gif

12.3.7 Passing Parameters in Hardcoded URLs

Earlier, we passed parameters to CGI scripts by listing them at the end of a URL typed into the browser's address field (after a ?). But there's nothing sacred about the browser's address field. In particular, there's nothing stopping us from using the same URL syntax in hyperlinks that we hardcode in web page definitions. For example, the web page from Example 12-15 defines three hyperlinks (the text between <A> and </A> tags), which all trigger our original test5.cgi script again, but with three different precoded sets of parameters.

Example 12-15. PP2E\Internet\Cgi-Web\Basics\test5c.html
<HTML><BODY>
<TITLE>CGI 101</TITLE>
<H1>Common input devices: URL parameters</H1>
 
<P>This demo invokes the test5.cgi server-side script again,
but hardcodes input data to the end of the script's URL, 
within a simple hyperlink (instead of packaging up a form's
inputs).  Click your browser's "show page source" button 
to view the links associated with each list item below.
 
<P>This is really more about CGI than Python, but notice that 
Python's cgi module handles both this form of input (which is
also produced by GET form actions), as well as POST-ed forms; 
they look the same to the Python CGI script.  In other words, 
cgi module users are independent of the method used to submit 
data.
 
<P>Also notice that URLs with appended input values like this
can be generated as part of the page output by another CGI script, 
to direct a next user click to the right place and context; together 
with type 'hidden' input fields, they provide one way to 
save state between clicks.
</P><HR>
 
<UL>
<LI><A href="test5.cgi?name=Bob&shoesize=small">Send Bob, small</A>
<LI><A href="test5.cgi?name=Tom&language=Python">Send Tom, Python</A>
<LI><A href=
"http://starship.python.net/~lutz/Basics/test5.cgi?job=Evangelist&comment=spam">
Send Evangelist, spam</A>
</UL>
 
<HR></BODY></HTML>

This static HTML file defines three hyperlinks -- the first two are minimal and the third is fully specified, but all work similarly (again, the target script doesn't care). When we visit this file's URL, we see the page shown in Figure 12-17. It's mostly just a page for launching canned calls to the CGI script.

Figure 12-17. Hyperlinks page created by test5c.html

figs/ppy2_1217.gif

Clicking on this page's second link creates the response page in Figure 12-18. This link invokes the CGI script, with the name parameter set to "Tom" and the language parameter set to "Python," simply because those parameters and values are hardcoded in the URL listed in the HTML for the second hyperlink. It's exactly as if we had manually typed the line shown at the top of the browser in Figure 12-18.

Figure 12-18. Response page created by test5.cgi (3)

figs/ppy2_1218.gif

Notice that lots of fields are missing here; the test5.cgi script is smart enough to detect and handle missing fields and generate an unknown message in the reply page. It's also worth pointing out that we're reusing the Python CGI script again here. The script itself is completely independent of both the user-interface format of the submission page and the technique used to invoke it (from a submitted form or a hardcoded URL). By separating user interface from processing logic, CGI scripts become reusable software components, at least within the context of the CGI environment.

12.3.7.1 Saving CGI script state information

But the real reason for showing this technique is that we're going to use it extensively in the larger case studies in the next two chapters to implement lists of dynamically generated selections that "know" what to do when clicked. Precoded parameters in URLs are a way to retain state information between pages -- they can be used to direct the action of the next script to be run. As such, hyperlinks with such parameters are sometimes known as "smart links."

Normally, CGI scripts run autonomously, with no knowledge of any other scripts that may have run before. That hasn't mattered in our examples so far, but larger systems are usually composed of multiple user interaction steps and many scripts, and we need a way to keep track of information gathered along the way. Generating hardcoded URLs with parameters is one way for a CGI script to pass data to the next script in the application. When clicked, such URL parameters send pre-programmed selection information back to another server-side handler script.

For example, a site that lets you read your email may present you with a list of viewable email messages, implemented in HTML as a list of hyperlinks generated by another script. Each hyperlink might include the name of the message viewer script, along with parameters identifying the selected message number, email server name, and so on -- as much data as is needed to fetch the message associated with a particular link. A retail site may instead serve up a generated list of product links, each of which triggers a hardcoded hyperlink containing the product number, its price, and so on.
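
A script that generates such links might look roughly like the following sketch. It is written for Python 3, and every name in it is hypothetical: the viewer.cgi script, the server and msgnum parameters, and the mail server name are invented for illustration and are not the ones used in later chapters.

```python
from html import escape
from urllib.parse import urlencode

def message_links(script, server, count):
    """Return an HTML list of hyperlinks, one per message number."""
    lines = ["<UL>"]
    for number in range(1, count + 1):
        # state for the next script travels in the URL's query parameters
        query = urlencode({'server': server, 'msgnum': number})
        # the URL is embedded in HTML, so its & is written as &amp;
        href = escape("%s?%s" % (script, query), quote=True)
        lines.append('<LI><A href="%s">View message %d</A>' % (href, number))
    lines.append("</UL>")
    return "\n".join(lines)

page = message_links("viewer.cgi", "pop.example.com", 2)
print(page)
```

When one of these links is clicked, the target script receives server and msgnum exactly as if they had been typed into a form.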

In general, there are a variety of ways to pass or retain state information between CGI script executions:

· Hardcoded URL parameters in dynamically generated hyperlinks and embedded in web pages (as discussed here)

· Hidden form input fields that are attached to form data and embedded in web pages, but not displayed on web pages

· HTTP "cookies" that are stored on the client machine and transferred between client and server in HTTP message headers

· General server-side data stores that include databases, persistent object shelves, flat files, and so on

We'll meet most of these mediums in later examples in this chapter and in the two chapters that follow.

12.4 The Hello World Selector

It's now time for something a bit more useful (well, more entertaining, at least). This section presents a program that displays the basic syntax required by various programming languages to print the string "Hello World", the classic language benchmark. To keep this simple, it assumes the string shows up in the standard output stream, not a GUI or web page. It also gives just the output command itself, not the complete programs. The Python version happens to be a complete program, but we won't hold that against its competitors here.

Structurally, the first cut of this example consists of a main page HTML file, along with a Python-coded CGI script that is invoked by a form in the main HTML page. Because no state or database data is stored between user clicks, this is still a fairly simple example. In fact, the main HTML page implemented by Example 12-16 is really just one big pull-down selection list within a form.

Example 12-16. PP2E\Internet\Cgi-Web\Basics\languages.html
<html><body>
<title>Languages</title>
<h1>Hello World selector</h1>
 
<P>This demo shows how to display a "hello world" message in various
programming languages' syntax.  To keep this simple, only the output command
is shown (it takes more code to make a complete program in some of these 
languages), and only text-based solutions are given (no GUI or HTML 
construction logic is included). This page is a simple HTML file; the one 
you see after pressing the button below is generated by a Python CGI script 
which runs on the server. Pointers: 
 
<UL>
<LI>To see this page's HTML, use the 'View Source' command in your browser.
<LI>To view the Python CGI script on the server, 
    <A HREF="languages-src.cgi">click here</A> or
    <A HREF="getfile.cgi?filename=languages.cgi">here</A>. 
<LI>To see an alternative version that generates this page dynamically, 
    <A HREF="languages2.cgi">click here</A>. 
<LI>For more syntax comparisons, visit 
    <A HREF="http://www.ionet.net/~timtroyr/funhouse/beer.html">this site</A>.
</UL></P>
 
<hr>
<form method=POST action="languages.cgi">
    <P><B>Select a programming language:</B>
    <P><select name=language>
        <option>All
        <option>Python
        <option>Perl
        <option>Tcl
        <option>Scheme
        <option>SmallTalk
        <option>Java
        <option>C
        <option>C++
        <option>Basic
        <option>Fortran
        <option>Pascal
        <option>Other
    </select>
    <P><input type=Submit>
</form>
 
</body></html>

For the moment, let's ignore some of the hyperlinks near the middle of this file; they introduce bigger concepts like file transfers and maintainability that we will explore in the next two sections. When visited with a browser, this HTML file is downloaded to the client and rendered into the new browser page shown in Figure 12-19.

Figure 12-19. The "Hello World" main page

figs/ppy2_1219.gif

That widget above the Submit button is a pull-down selection list that lets you choose one of the <option> tag values in the HTML file. As usual, selecting one of these language names and pressing the Submit button at the bottom (or pressing your Enter key) sends the selected language name to an instance of the server-side CGI script program named in the form's action option. Example 12-17 contains the Python script that runs on the server upon submission.

Example 12-17. PP2E\Internet\Cgi-Web\Basics\languages.cgi
#!/usr/bin/python
########################################################
# show hello world syntax for input language name;
# note that it uses r'...' raw strings so that '\n'
# in the table are left intact, and cgi.escape(  ) on 
# the string so that things like '<<' don't confuse 
# browsers--they are translated to valid html code;
# any language name can arrive at this script: e.g.,
# can type "http://starship.python.net/~lutz/Basics
# /languages.cgi?language=Cobol" in any web browser.
# caveats: the languages list appears in both the cgi
# and html files--could import from a single file if
# selection list generated by another cgi script too;
########################################################
 
debugme  = 0                                     # 1=test from cmd line
inputkey = 'language'                            # input parameter name 
 
hellos = {
    'Python':    r" print 'Hello World'               ",
    'Perl':      r' print "Hello World\n";            ',
    'Tcl':       r' puts "Hello World"                ',
    'Scheme':    r' (display "Hello World") (newline) ',
    'SmallTalk': r" 'Hello World' print.              ",
    'Java':      r' System.out.println("Hello World"); ',
    'C':         r' printf("Hello World\n");          ',
    'C++':       r' cout << "Hello World" << endl;    ',
    'Basic':     r' 10 PRINT "Hello World"            ',
    'Fortran':   r" print *, 'Hello World'             ",
    'Pascal':    r" WriteLn('Hello World');            "
}
 
class dummy:                                     # mocked-up input obj  
    def __init__(self, str): self.value = str
 
import cgi, sys
if debugme:
    form = {inputkey: dummy(sys.argv[1])}        # name on cmd line
else:
    form = cgi.FieldStorage(  )                    # parse real inputs
 
print "Content-type: text/html\n"                # adds blank line
print "<TITLE>Languages</TITLE>"
print "<H1>Syntax</H1><HR>" 
 
def showHello(form):                             # html for one language
    choice = form[inputkey].value 
    print "<H3>%s</H3><P><PRE>" % cgi.escape(choice)   # name is user input too
    try:
        print cgi.escape(hellos[choice])
    except KeyError:
        print "Sorry--I don't know that language"
    print "</PRE></P><BR>"
    
if not form.has_key(inputkey) or form[inputkey].value == 'All':
    for lang in hellos.keys(  ):
        mock = {inputkey: dummy(lang)}
        showHello(mock)
else:
    showHello(form)
print '<HR>' 

And as usual, this script prints HTML code to the standard output stream to produce a response page in the client's browser. There's not much new to speak of in this script, but it employs a few techniques that merit special focus:

Raw strings

Notice the use of raw strings (string constants preceded by an "r" character) in the language syntax dictionary. Recall that raw strings retain \ backslash characters in the string literally, rather than interpreting them as string escape-code introductions. Without them, the \n newline character sequences in some of the languages' code snippets would be interpreted by Python as line-feeds, rather than being printed in the HTML reply as \n.

Escaping text embedded in HTML and URLs

This script takes care to format the text of each language's code snippet with the cgi.escape utility function. This standard Python utility automatically translates characters that are special in HTML into HTML escape code sequences, such that they are not treated as HTML operators by browsers. Formally, cgi.escape translates characters to escape code sequences, according to the standard HTML convention: <, >, and & become &lt;, &gt;, and &amp;. If you pass a second true argument, the double-quote character (") is also translated to &quot;.

For example, the << left-shift operator in the C++ entry is translated to &lt;&lt; -- a pair of HTML escape codes. Because printing each code snippet effectively embeds it in the HTML response stream, we must escape any special HTML characters it contains. HTML parsers (including Python's standard htmllib module) translate escape codes back to the original characters when a page is rendered.

More generally, because CGI is based upon the notion of passing formatted strings across the Net, escaping special characters is a ubiquitous operation. CGI scripts almost always need to escape text generated as part of the reply to be safe. For instance, if we send back arbitrary text input from a user or read from a data source on the server, we usually can't be sure if it will contain HTML characters or not, so we must escape it just in case.

In later examples, we'll also find that characters inserted into URL address strings generated by our scripts may need to be escaped as well. A literal & in a URL is special, for example, and must be escaped if it appears embedded in text we insert into a URL. However, URL syntax reserves different special characters than HTML code, and so different escaping conventions and tools must be used. As we'll see later in this chapter, cgi.escape implements escape translations in HTML code, but urllib.quote (and its relatives) escapes characters in URL strings.

Mocking up form inputs

Here again, form inputs are "mocked up" (simulated), both for debugging and for responding to a request for all languages in the table. If the script's global debugme variable is set to a true value, for instance, the script creates a dictionary that is plug-and-play compatible with the result of a cgi.FieldStorage call -- its "language" key references an instance of the dummy mock-up class. This class in turn creates an object that has the same interface as the contents of a cgi.FieldStorage result -- it makes an object with a value attribute set to a passed-in string.

The net effect is that we can test this script by running it from the system command line: the generated dictionary fools the script into thinking it was invoked by a browser over the Net. Similarly, if the requested language name is "All," the script iterates over all entries in the languages table, making a mocked-up form dictionary for each (as though the user had requested each language in turn). This lets us reuse the existing showHello logic to display each language's code in a single page. As always in Python, object interfaces and protocols are what we usually code for, not specific datatypes. The showHello function will happily process any object that responds to the syntax form['language'].value.[8]
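
The raw string and escaping points above are easy to check interactively. This sketch assumes a current Python 3, where html.escape and urllib.parse.quote_plus stand in for the cgi.escape and urllib.quote family named in the text; the sample strings are made up for the demonstration.

```python
from html import escape
from urllib.parse import quote_plus

cooked = ' printf("Hello World\n"); '     # \n becomes one newline character
raw    = r' printf("Hello World\n"); '    # backslash and n are both kept

print(len(raw) - len(cooked))             # 1: the raw form is one character longer

text = 'cout << "a & b"'
print(escape(text, quote=False))          # HTML rules: cout &lt;&lt; "a &amp; b"
print(escape(text, quote=True))           # same, but " also becomes &quot;
print(quote_plus(text))                   # URL rules: cout+%3C%3C+%22a+%26+b%22
```

The same text comes out differently under the two conventions, which is why a string destined for a URL inside an HTML page may need both treatments.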

Now let's get back to interacting with this program. If we select a particular language, our CGI script generates an HTML reply of the following sort (along with the required content-type header and blank line):

<TITLE>Languages</TITLE>
<H1>Syntax</H1><HR>
<H3>Scheme</H3><P><PRE>
 (display "Hello World") (newline) 
</PRE></P><BR>
<HR>

Program code is marked with a <PRE> tag to specify preformatted text (the browser won't reformat it like a normal text paragraph). This reply code shows what we get when we pick "Scheme." Figure 12-20 shows the page served up by the script after selecting "Python" in the pull-down selection list.

Figure 12-20. Response page created by languages.cgi

figs/ppy2_1220.gif

Our script also accepts a language name of "All," and interprets it as a request to display the syntax for every language it knows about. For example, here is the HTML that is generated if we set global variable debugme to 1 and run from the command line with a single argument, "All." This output is the same as what's printed to the client's browser in response to an "All" request:[9]

C:\...\PP2E\Internet\Cgi-Web\Basics>python languages.cgi All
Content-type: text/html
 
<TITLE>Languages</TITLE>
<H1>Syntax</H1><HR>
<H3>Perl</H3><P><PRE>
 print "Hello World\n";            
</PRE></P><BR>
<H3>SmallTalk</H3><P><PRE>
 'Hello World' print.              
</PRE></P><BR>
<H3>Basic</H3><P><PRE>
 10 PRINT "Hello World"            
</PRE></P><BR>
<H3>Scheme</H3><P><PRE>
 (display "Hello World") (newline) 
</PRE></P><BR>
<H3>Python</H3><P><PRE>
 print 'Hello World'               
</PRE></P><BR>
<H3>C++</H3><P><PRE>
 cout &lt;&lt; "Hello World" &lt;&lt; endl;    
</PRE></P><BR>
<H3>Pascal</H3><P><PRE>
 WriteLn('Hello World');            
</PRE></P><BR>
<H3>Java</H3><P><PRE>
 System.out.println("Hello World"); 
</PRE></P><BR>
<H3>C</H3><P><PRE>
 printf("Hello World\n");          
</PRE></P><BR>
<H3>Tcl</H3><P><PRE>
 puts "Hello World"                
</PRE></P><BR>
<H3>Fortran</H3><P><PRE>
 print *, 'Hello World'             
</PRE></P><BR>
<HR>

Each language is represented here with the same code pattern -- the showHello function is called for each table entry, along with a mocked-up form object. Notice the way that C++ code is escaped for embedding inside the HTML stream; this is the cgi.escape call's handiwork. When viewed with a browser, the "All" response page is rendered as shown in Figure 12-21.

Figure 12-21. Response page for "all languages" choice

figs/ppy2_1221.gif

12.4.1 Checking for Missing and Invalid Inputs

So far, we've been triggering the CGI script by selecting a language name from the pull-down list in the main HTML page. In this context, we can be fairly sure that the script will receive valid inputs. Notice, though, that there is nothing to prevent a user from passing the requested language name at the end of the CGI script's URL as an explicit parameter, instead of using the HTML page form. For instance, a URL of the form:

http://starship.python.net/~lutz/Basics/languages.cgi?language=Python

yields the same "Python" response page shown in Figure 12-20.[10] However, because it's always possible for a user to bypass the HTML file and use an explicit URL, it's also possible that a user could invoke our script with an unknown language name that is not in the HTML file's pull-down list (and so not in our script's table). In fact, the script might be triggered with no language input at all, if someone explicitly types its URL with no parameter at the end.

To be robust, the script checks for both cases explicitly, as all CGI scripts generally should. For instance, here is the HTML generated in response to a request for the fictitious language "GuiDO":

<TITLE>Languages</TITLE>
<H1>Syntax</H1><HR>
<H3>GuiDO</H3><P><PRE>
Sorry--I don't know that language
</PRE></P><BR>
<HR>

If the script doesn't receive any language name input, it simply defaults to the "All" case. If we didn't detect these cases, chances are that our script would silently die on a Python exception and leave the user with a mostly useless half-complete page or with a default error page (we didn't assign stderr to stdout here, so no Python error message would be displayed). In pictures, Figure 12-22 shows the page generated if the script is invoked with an explicit URL like this:

http://starship.python.net/~lutz/Basics/languages.cgi?language=COBOL

To test this error case, the pull-down list includes an "Other" name, which produces a similar error page reply. Adding code to the script's table for the COBOL "Hello World" program is left as an exercise for the reader.

Figure 12-22. Response page for unknown language

figs/ppy2_1222.gif
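The two checks described in this section can be summarized in a short sketch. It assumes Python 3 and a simplified input format (a plain dictionary of strings instead of a cgi.FieldStorage result); the pick_reply function and the two-entry table are hypothetical stand-ins for illustration.

```python
hellos = {'Python': "print 'Hello World'",          # tiny stand-in for the real table
          'Tcl':    'puts "Hello World"'}

def pick_reply(fields):
    """fields: plain dict of input name -> string (hypothetical parsed inputs)."""
    choice = fields.get('language', 'All')          # missing input: default to All
    if choice == 'All':
        return sorted(hellos.items())
    try:
        return [(choice, hellos[choice])]
    except KeyError:                                # unknown name: error text, no crash
        return [(choice, "Sorry--I don't know that language")]

print(pick_reply({'language': 'COBOL'}))
print(pick_reply({}))
```

Both the missing-input and unknown-name cases produce an ordinary reply, so the user never sees a half-complete page.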

12.5 Coding for Maintainability

Let's step back from coding details for just a moment to gain some design perspective. As we've seen, Python code, by and large, automatically lends itself to systems that are easy to read and maintain; it has a simple syntax that cuts much of the clutter of other tools. On the other hand, coding styles and program design can often impact maintainability as much as syntax. For example, the "Hello World" selector pages earlier in this chapter work as advertised, and were very easy and fast to throw together. But as currently coded, the languages selector suffers from substantial maintainability flaws.

Imagine, for instance, that you actually take me up on that challenge posed at the end of the last section, and you attempt to add another entry for COBOL. If you add COBOL to the CGI script's table, you're only half done: the list of supported languages lives redundantly in two places -- in the HTML for the main page as well as the script's syntax dictionary. Changing one does not change the other. More generally, there are a handful of ways that this program might fail the scrutiny of a rigorous code review:

Selection list

As just mentioned, the list of languages supported by this program lives in two places: the HTML file and the CGI script's table.

Field name

The field name of the input parameter, "language," is hardcoded into both files, as well. You might remember to change it in the other if you change it in one, but you might not.

Form mock ups

We've redundantly coded classes to mock-up form field inputs twice in this chapter already; the "dummy" class here is clearly a mechanism worth reusing.

HTML Code

HTML embedded in and generated by the script is sprinkled throughout the program in print statements, making it difficult to implement broad web page layout changes.

This is a short example, of course, but issues of redundancy and reuse become more acute as your scripts grow larger. As a rule of thumb, if you find yourself changing multiple source files to modify a single behavior, or if you notice that you've taken to writing programs by cut-and-paste copying of existing code, then it's probably time to think about more rational program structures. To illustrate coding styles and practices that are more friendly to maintainers, let's rewrite this example to fix all of these weaknesses in a single mutation.

12.5.1 Step 1: Sharing Objects Between Pages

We can remove the first two maintenance problems listed above with a simple transformation; the trick is to generate the main page dynamically, from an executable script, rather than from a precoded HTML file. Within a script, we can import the input field name and selection list values from a common Python module file, shared by the main and reply page generation scripts. Changing the selection list or field name in the common module changes both clients automatically. First, we move shared objects to a common module file, as shown in Example 12-18.

Example 12-18. PP2E\Internet\Cgi-Web\Basics\languages2common.py
########################################################
# common objects shared by main and reply page scripts;
# need change only this file to add a new language.
########################################################
 
inputkey = 'language'                            # input parameter name 
 
hellos = {
    'Python':    r" print 'Hello World'               ",
    'Perl':      r' print "Hello World\n";            ',
    'Tcl':       r' puts "Hello World"                ',
    'Scheme':    r' (display "Hello World") (newline) ',
    'SmallTalk': r" 'Hello World' print.              ",
    'Java':      r' System.out.println("Hello World"); ',
    'C':         r' printf("Hello World\n");          ',
    'C++':       r' cout << "Hello World" << endl;    ',
    'Basic':     r' 10 PRINT "Hello World"            ',
    'Fortran':   r" print *, 'Hello World'             ",
    'Pascal':    r" WriteLn('Hello World');            "
}

Module languages2common contains all the data that needs to agree between pages: the field name, as well as the syntax dictionary. The hellos syntax dictionary isn't quite HTML code, but its keys list can be used to generate HTML for the selection list on the main page dynamically. Next, in Example 12-19, we recode the main page as an executable script, and populate the response HTML with values imported from the common module file in the previous example.

Example 12-19. PP2E\Internet\Cgi-Web\Basics\languages2.cgi
#!/usr/bin/python
#################################################################
# generate html for main page dynamically from an executable
# Python script, not a pre-coded HTML file; this lets us 
# import the expected input field name and the selection table 
# values from a common Python module file; changes in either 
# now only have to be made in one place, the Python module file;
#################################################################
 
REPLY = """Content-type: text/html
 
<html><body>
<title>Languages2</title>
<h1>Hello World selector</h1>
<P>Similar to file <a href="languages.html">languages.html</a>, but 
this page is dynamically generated by a Python CGI script, using 
selection list and input field names imported from a common Python 
module on the server. Only the common module must be maintained as 
new languages are added, because it is shared with the reply script.
 
To see the code that generates this page and the reply, click
<a href="getfile.cgi?filename=languages2.cgi">here</a>, 
<a href="getfile.cgi?filename=languages2reply.cgi">here</a>, 
<a href="getfile.cgi?filename=languages2common.py">here</a>, and
<a href="getfile.cgi?filename=formMockup.py">here</a>.</P>
<hr>
<form method=POST action="languages2reply.cgi">
    <P><B>Select a programming language:</B>
    <P><select name=%s>
        <option>All
        %s
        <option>Other
    </select>
    <P><input type=Submit>
</form>
</body></html>
"""
 
import string
from languages2common import hellos, inputkey
 
options = []
for lang in hellos.keys(  ):
    options.append('<option>' + lang)      # wrap table keys in html code
options = string.join(options, '\n\t')
print REPLY % (inputkey, options)          # field name and values from module

Here again, ignore the getfile hyperlinks in this file for now; we'll learn what they mean in the next section. You should notice, though, that the HTML page definition becomes a printed Python string here (named REPLY), with %s format targets where we plug in values imported from the common module.[11] It's otherwise similar to the original HTML file's code; when we visit this script's URL, we get a similar page, shown in Figure 12-23. But this time, the page is generated by running a script on the server that populates the pull-down selection list from the keys list of the common syntax table.

Figure 12-23. Alternative main page made by languages2.cgi

figs/ppy2_1223.gif

12.5.2 Step 2: A Reusable Form Mock-up Utility

Moving the languages table and input field name to a module file solves the first two maintenance problems we noted. But if we want to avoid writing a dummy field mock-up class in every CGI script we write, we need to do something more. Again, it's merely a matter of exploiting the Python module's affinity for code reuse: let's move the dummy class to a utility module, as in Example 12-20.

Example 12-20. PP2E\Internet\Cgi-Web\Basics\formMockup.py
##############################################################
# Tools for simulating the result of a cgi.FieldStorage(  ) 
# call; useful for testing CGI scripts outside the web
##############################################################
 
import types
 
class FieldMockup:                                   # mocked-up input object
    def __init__(self, str): 
        self.value = str
 
def formMockup(**kwargs):                            # pass field=value args
    mockup = {}                                      # multi-choice: [value,...]
    for (key, value) in kwargs.items(  ):
        if type(value) is not types.ListType:        # simple fields have .value
            mockup[key] = FieldMockup(str(value))
        else:                                        # multi-choice have list
            mockup[key] = []                         # to do: file upload fields
            for pick in value:
                mockup[key].append(FieldMockup(pick))
    return mockup
 
def selftest(  ):
    # use this form if fields can be hard-coded
    form = formMockup(name='Bob', job='hacker', food=['Spam', 'eggs', 'ham'])
    print form['name'].value
    print form['job'].value
    for item in form['food']:
        print item.value,
    # use real dict if keys are in variables or computed
    print
    form = {'name':FieldMockup('Brian'), 'age':FieldMockup(38)}
    for key in form.keys(  ):
        print form[key].value
 
if __name__ == '__main__': selftest(  )

By placing our mock-up class in this module, formMockup.py, it automatically becomes a reusable tool, and may be imported by any script we care to write.[12] For readability, the dummy field simulation class has been renamed FieldMockup here. For convenience, we've also added a formMockup utility function that builds up an entire form dictionary from passed-in keyword arguments. Assuming you can hardcode the names of the form fields to be faked, the mock-up can be created in a single call. This module includes a self-test function invoked when the file is run from the command line, which demonstrates how its exports are used. Here is its test output, generated by making and querying two form mock-up objects:

C:\...\PP2E\Internet\Cgi-Web\Basics>python formMockup.py
Bob
hacker
Spam eggs ham
38
Brian
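If you want to try the same idea under a current Python 3, note that the types.ListType name used above no longer exists there. The following is a rough port under that assumption, not a replacement for the module in Example 12-20; it swaps in isinstance for the type test and keeps the book's class and function names.

```python
class FieldMockup:                                  # mocked-up input object
    def __init__(self, text):
        self.value = text

def formMockup(**kwargs):                           # pass field=value args
    mockup = {}
    for key, value in kwargs.items():
        if isinstance(value, list):                 # multi-choice: list of fields
            mockup[key] = [FieldMockup(pick) for pick in value]
        else:                                       # simple field: has .value
            mockup[key] = FieldMockup(str(value))
    return mockup

form = formMockup(name='Bob', food=['Spam', 'eggs', 'ham'])
print(form['name'].value)
print(' '.join(item.value for item in form['food']))
```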

Since the mock-up now lives in a module, we can reuse it any time we want to test a CGI script offline. To illustrate, the script in Example 12-21 is a rewrite of the test5.cgi example we saw earlier, using the form mock-up utility to simulate field inputs. If we had planned ahead, we could have tested this script like this without even needing to connect to the Net.

Example 12-21. PP2E\Internet\Cgi-Web\Basics\test5_mockup.cgi
#!/usr/bin/python
##################################################################
# run test5 logic with formMockup instead of cgi.FieldStorage(  )
# to test: python test5_mockup.cgi > temp.html, and open temp.html
##################################################################
 
from formMockup import formMockup
form = formMockup(name='Bob',
                  shoesize='Small',
                  language=['Python', 'C++', 'HTML'], 
                  comment='ni, Ni, NI')
 
# rest same as original, less form assignment

Running this script from a simple command line shows us what the HTML response stream will look like:

C:\...\PP2E\Internet\Cgi-Web\Basics>python test5_mockup.cgi
Content-type: text/html
 
<TITLE>test5.cgi</TITLE>
<H1>Greetings</H1>
<HR>
<H4>Your name is Bob</H4>
<H4>You wear rather Small shoes</H4>
<H4>Your current job: (unknown)</H4>
<H4>You program in Python and C++ and HTML</H4>
<H4>You also said:</H4>
<P>ni, Ni, NI</P>
<HR>

Running it live yields the page in Figure 12-24. Field inputs here are hardcoded, similar in spirit to the test5 extension that embedded input parameters at the end of hyperlink URLs. Here, they come from form mock-up objects created in the reply script that cannot be changed without editing the script. Because Python code runs immediately, though, modifying a Python script during the debug cycle goes as quickly as you can type.

Figure 12-24. A response page with simulated inputs

figs/ppy2_1224.gif

12.5.3 Step 3: Putting It All Together -- A New Reply Script

There's one last step on our path to software maintenance nirvana: we still must recode the reply page script itself, to import data that was factored out to the common module and import the reusable form mock-up module's tools. While we're at it, we move code into functions (in case we ever put things in this file that we'd like to import in another script), and all HTML code to triple-quoted string blocks (see Example 12-22). Changing HTML is generally easier when it has been isolated in single strings like this, rather than being sprinkled throughout a program.

Example 12-22. PP2E\Internet\Cgi-Web\Basics\languages2reply.cgi
#!/usr/bin/python
#########################################################
# for easier maintenance, use html template strings, get
# the language table and input key from common module file,
# and get reusable form field mockup utilities module.
#########################################################
 
import cgi, sys
from formMockup import FieldMockup                   # input field simulator
from languages2common import hellos, inputkey        # get common table, name
debugme = 0
 
hdrhtml = """Content-type: text/html\n
<TITLE>Languages</TITLE>
<H1>Syntax</H1><HR>"""
 
langhtml = """
<H3>%s</H3><P><PRE>
%s
</PRE></P><BR>"""
 
def showHello(form):                                 # html for one language
    choice = form[inputkey].value                    # escape lang name too
    try:
        print langhtml % (cgi.escape(choice),
                          cgi.escape(hellos[choice]))
    except KeyError:
        print langhtml % (cgi.escape(choice), 
                         "Sorry--I don't know that language")
 
def main(  ):
    if debugme:
        form = {inputkey: FieldMockup(sys.argv[1])}  # name on cmd line
    else:
        form = cgi.FieldStorage(  )                    # parse real inputs
    
    print hdrhtml
    if not form.has_key(inputkey) or form[inputkey].value == 'All':
        for lang in hellos.keys(  ):
            mock = {inputkey: FieldMockup(lang)}
            showHello(mock)
    else:
        showHello(form)
    print '<HR>' 
 
if __name__ == '__main__': main(  )

When global debugme is set to 1, the script can be tested offline from a simple command line as before:

C:\...\PP2E\Internet\Cgi-Web\Basics>python languages2reply.cgi Python
Content-type: text/html
 
<TITLE>Languages</TITLE>
<H1>Syntax</H1><HR>
 
<H3>Python</H3><P><PRE>
 print 'Hello World'
</PRE></P><BR>
<HR>

When run online, we get the same reply pages we saw for the original version of this example (we won't repeat them here again). This transformation changed the program's architecture, not its user interface.

Most of the code changes in this version of the reply script are straightforward. If you test-drive these pages, the only differences you'll find are the URLs at the top of your browser (they're different files, after all), extra blank lines in the generated HTML (ignored by the browser), and a potentially different ordering of language names in the main page's pull-down selection list.

This selection list ordering difference arises because this version relies on the order of the Python dictionary's keys list, not on a hardcoded list in an HTML file. Dictionaries, you'll recall, arbitrarily order entries for fast fetches; if you want the selection list to be more predictable, simply sort the keys list before iterating over it using the list sort method.
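As a small sketch of that fix, the option list can be built from a sorted copy of the keys. This assumes Python 3; the make_options helper and the three-entry table are hypothetical, standing in for the loop in languages2.cgi and the shared hellos dictionary.

```python
hellos = {'Tcl':    'puts "Hello World"',           # stand-in for the shared table
          'Python': "print 'Hello World'",
          'C':      r'printf("Hello World\n");'}

def make_options(table):
    # sorted() returns the keys in alphabetical order, whatever the dict's order
    return '\n\t'.join('<option>' + lang for lang in sorted(table))

print(make_options(hellos))
```

Recent Python versions keep dictionary entries in insertion order, but sorting is still the simplest way to get an alphabetical pull-down list.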

Faking Inputs with Shell Variables

If you know what you're doing, you can sometimes also test CGI scripts from the command line by setting the same environment variables that HTTP servers set, and then launching your script. For example, we can pretend to be a web server by storing input parameters in the QUERY_STRING environment variable, using the same syntax we employ at the end of a URL string after the ?:

$ setenv QUERY_STRING "name=Mel&job=trainer,+writer"
$ python test5.cgi
Content-type: text/html
 
<TITLE>test5.cgi</TITLE>
<H1>Greetings</H1>
<HR>
<H4>Your name is Mel</H4>
<H4>You wear rather (unknown) shoes</H4>
<H4>Your current job: trainer, writer</H4>
<H4>You program in (unknown)</H4>
<H4>You also said:</H4>
<P>(unknown)</P>
<HR>

Here, we mimic the effects of a GET style form submission or explicit URL. HTTP servers place the query string (parameters) in the shell variable QUERY_STRING. Python's cgi module finds them there as though they were sent by a browser. POST-style inputs can be simulated with shell variables, too, but it's more complex -- so much so that you're likely best off not learning how. In fact, it may be more robust in general to mock-up inputs with Python objects (e.g., as in formMockup.py). But some CGI scripts may have additional environment or testing constraints that merit unique treatment.
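The same trick can be tried entirely within Python. The sketch below assumes Python 3 and uses urllib.parse.parse_qs as a stand-in for the query-string parsing that cgi.FieldStorage performs, because the cgi module used in this chapter is no longer part of the standard library in the newest Python 3 releases.

```python
import os
from urllib.parse import parse_qs

# pretend to be the web server: store the parameters where a server would
os.environ['QUERY_STRING'] = 'name=Mel&job=trainer,+writer'

inputs = parse_qs(os.environ['QUERY_STRING'])       # each name maps to a list
print(inputs['name'][0])
print(inputs['job'][0])                             # the + comes back as a space
```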

12.6 More on HTML and URL Escapes

Perhaps the most subtle change in the last section's rewrite is that, for robustness, this version also calls cgi.escape for the language name, not just for the language's code snippet. It's unlikely but not impossible that someone could pass the script a language name with an embedded HTML character. For example, a URL like:

http://starship.python.net/~lutz/Basics/languages2reply.cgi?language=a<b

embeds a < in the language name parameter (the name is a<b). When submitted, this version uses cgi.escape to properly translate the < for use in the reply HTML, according to the standard HTML escape conventions discussed earlier:

<TITLE>Languages</TITLE>
<H1>Syntax</H1><HR>
 
<H3>a&lt;b</H3><P><PRE>
Sorry--I don't know that language
</PRE></P><BR>
<HR>

The original version doesn't escape the language name, such that the embedded <b is interpreted as an HTML tag (which may make the rest of the page render in bold font!). As you can probably tell by now, text escapes are pervasive in CGI scripting -- even text that you may think is safe must generally be escaped before being inserted into the HTML code in the reply stream.

12.6.1 URL Escape Code Conventions

Notice, though, that while it's wrong to embed an unescaped < in the HTML code reply, it's perfectly okay to include it literally in the earlier URL string used to trigger the reply. In fact, HTML and URLs define completely different characters as special. For instance, although & must be escaped as &amp; inside HTML code, we have to use other escaping schemes to code a literal & within a URL string (where it normally separates parameters). To pass a language name like a&b to our script, we have to type the following URL:

http://starship.python.net/~lutz/Basics/languages2reply.cgi?language=a%26b

Here, %26 represents & -- the & is replaced with a % followed by the hexadecimal value (0x26) of its ASCII code value (38). By URL standard, most nonalphanumeric characters are supposed to be translated to such escape sequences, and spaces are replaced by + signs. Technically, this convention is known as the application/x-www-form-urlencoded query string format, and it's part of the magic behind those bizarre URLs you often see at the top of your browser as you surf the Web.
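To see where a code like %26 comes from, here is a minimal sketch of the arithmetic, assuming Python 3 and a single ASCII character as input. The percent_escape function is hypothetical and only illustrates the rule; real scripts should use the library tools described next.

```python
def percent_escape(char):
    # a % sign followed by the two-digit hexadecimal value of the character code
    return '%%%02X' % ord(char)

print(ord('&'), hex(ord('&')))      # the ASCII code in decimal and in hex
print(percent_escape('&'))
print(percent_escape(' '))
```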

12.6.2 Python HTML and URL Escape Tools

If you're like me, you probably don't have the hexadecimal value of the ASCII code for & committed to memory. Luckily, Python provides tools that automatically implement URL escapes, just as cgi.escape does for HTML escapes. The main thing to keep in mind is that HTML code and URL strings are written with entirely different syntax, and so they employ distinct escaping conventions. Web users don't generally care, unless they need to type complex URLs explicitly (browsers handle most escape code details internally). But if you write scripts that must generate HTML or URLs, you need to be careful to escape characters that are reserved in either syntax.

Because HTML and URLs have different syntaxes, Python provides two distinct sets of tools for escaping their text. In the standard Python library:

·         cgi.escape escapes text to be embedded in HTML.

·         urllib.quote and quote_plus escape text to be embedded in URLs.

The urllib module also has tools for undoing URL escapes (unquote, unquote_plus), but HTML escapes are undone during HTML parsing at large (htmllib). To illustrate the two escape conventions and tools, let's apply each toolset to a few simple examples.

12.6.3 Escaping HTML Code

As we saw earlier, cgi.escape translates code for inclusion within HTML. We normally call this utility from a CGI script, but it's just as easy to explore its behavior interactively:

>>> import cgi
>>> cgi.escape('a < b > c & d "spam"', 1)
'a &lt; b &gt; c &amp; d &quot;spam&quot;'
 
>>> s = cgi.escape("1<2 <b>hello</b>")
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'

Python's cgi module automatically converts characters that are special in HTML syntax according to the HTML convention. It translates <, >, &, and with an extra true argument, ", into escape sequences of the form &X;, where the X is a mnemonic that denotes the original character. For instance, &lt; stands for the "less than" operator (<) and &amp; denotes a literal ampersand (&).
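If you are following along with a current Python 3, be aware that cgi.escape is not available there; the closest equivalent is html.escape in the standard html module. The sketch below uses that swapped-in function to reproduce the two results above. Unlike cgi.escape, it translates quotes by default, so we pass quote=False in the second call to leave them alone.

```python
import html

first = html.escape('a < b > c & d "spam"')          # quotes escaped by default
second = html.escape('1<2 <b>hello</b>', quote=False)
print(first)
print(second)
```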

There is no un-escaping tool in the CGI module, because HTML escape code sequences are recognized within the context of an HTML parser, like the one used by your web browser when a page is downloaded. Python comes with a full HTML parser, too, in the form of standard module htmllib, which imports and specializes tools in module sgmllib (HTML is a kind of SGML syntax). We won't go into details on the HTML parsing tools here (see the library manual for details), but to illustrate how escape codes are eventually undone, here is the SGML module at work reading back the last output above:

>>> from sgmllib import TestSGMLParser
>>> p = TestSGMLParser(1)
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
>>> for c in s:
...    p.feed(c)
...
>>> p.close(  )
data: '1<2 <b>hello</b>'
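The sgmllib and htmllib modules shown here are Python 2 tools and are not present in Python 3. As a rough modern counterpart, and assuming Python 3.4 or later, the html module's unescape function undoes HTML escape codes directly, without running a full parser:

```python
import html

s = '1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
original = html.unescape(s)          # translate escape codes back to characters
print(original)
```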

12.6.4 Escaping URLs

By contrast, URLs reserve other characters as special and must adhere to different escape conventions. Because of that, we use different Python library tools to escape URLs for transmission. Python's urllib module provides two tools that do the translation work for us: quote, which implements the standard %XX hexadecimal URL escape code sequences for most nonalphanumeric characters, and quote_plus, which additionally translates spaces to + plus signs. The urllib module also provides functions for unescaping quoted characters in a URL string: unquote undoes %XX escapes, and unquote_plus also changes plus signs back to spaces. Here is the module at work, at the interactive prompt:

>>> import urllib
>>> urllib.quote("a & b #! c")
'a%20%26%20b%20%23%21%20c'
 
>>> urllib.quote_plus("C:\stuff\spam.txt")
'C%3a%5cstuff%5cspam.txt'
 
>>> x = urllib.quote_plus("a & b #! c")
>>> x
'a+%26+b+%23%21+c'
 
>>> urllib.unquote_plus(x)
'a & b #! c'

URL escape sequences embed the hexadecimal values of non-safe characters following a % sign (usually, their ASCII codes). In urllib, non-safe characters are usually taken to include everything except letters, digits, a handful of safe special characters (any of _,.-), and / by default. You can also specify a string of safe characters as an extra argument to the quote calls to customize the translations; the argument defaults to /, but passing an empty string forces / to be escaped:

>>> urllib.quote_plus("uploads/index.txt")
'uploads/index.txt'
 
>>> urllib.quote_plus("uploads/index.txt", '')
'uploads%2findex.txt'
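For readers on Python 3, the same functions live in the urllib.parse module. The following sketch repeats the calls above under that assumption. Two details differ from the session shown here: hexadecimal digits come back in uppercase, and quote_plus escapes / unless you list it in the safe argument, while quote still leaves it alone by default.

```python
from urllib.parse import quote, quote_plus, unquote_plus

q = quote('a & b #! c')
x = quote_plus('a & b #! c')
print(q)
print(x)
print(unquote_plus(x))                              # round trip back to the original

print(quote('uploads/index.txt'))                   # / is safe by default in quote
print(quote('uploads/index.txt', safe=''))          # empty safe string escapes it
```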

Note that Python's cgi module also translates URL escape sequences back to their original characters and changes + signs to spaces during the process of extracting input information. Internally, cgi.FieldStorage automatically calls urllib.unquote if needed to parse and unescape parameters passed at the end of URLs (most of the translation happens in cgi.parse_qs). The upshot is that CGI scripts get back the original, unescaped URL strings, and don't need to unquote values on their own. As we've seen, CGI scripts don't even need to know that inputs came from a URL at all.
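To watch this unescaping happen outside a server, we can parse a URL by hand. This sketch assumes Python 3 and uses urllib.parse.urlsplit and parse_qs to stand in for the work cgi.FieldStorage does on a script's behalf; the variable names are illustrative only.

```python
from urllib.parse import parse_qs, urlsplit

url = 'http://starship.python.net/~lutz/Basics/languages2reply.cgi?language=a%26b'
query = urlsplit(url).query                 # the text after the ?
fields = parse_qs(query)                    # undoes %XX and + escapes
print(query)
print(fields['language'][0])
```

The script sees the name a&b, not a%26b, just as it would when reading form[inputkey].value.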

12.6.5 Escaping URLs Embedded in HTML Code

But what do we do for URLs inside HTML? That is, how do we escape when we generate and embed text inside a URL, which is itself embedded inside generated HTML code? Some of our earlier examples used hardcoded URLs with appended input parameters inside <A HREF> hyperlink tags; file languages2.cgi, for instance, prints HTML that includes a URL:

<a href="getfile.cgi?filename=languages2.cgi">

Because the URL here is embedded in HTML, it must minimally be escaped according to HTML conventions (e.g., any < characters must become &lt;), and any spaces should be translated to + signs. A cgi.escape(url) call, followed by a string.replace(url, " ", "+") would take us this far, and would probably suffice for most cases.

That approach is not quite enough in general, though, because HTML escaping conventions are not the same as URL conventions. To robustly escape URLs embedded in HTML code, you should instead call urllib.quote_plus on the URL string before adding it to the HTML text. The escaped result also satisfies HTML escape conventions, because urllib translates more characters than cgi.escape, and the % in URL escapes is not special to HTML.

But there is one more wrinkle here: you also have to be careful with & characters in URL strings that are embedded in HTML code (e.g., within <A> hyperlink tags). Even if parts of the URL string are URL-escaped, when more than one parameter is separated by a &, the & separator might also have to be escaped as &amp; according to HTML conventions. To see why, consider the following HTML hyperlink tag:

<A HREF="file.cgi?name=a&job=b&amp=c&sect=d&lt=e">hello</a>

When rendered in most browsers I've tested, this URL link winds up looking incorrectly like this (the "S" character is really a non-ASCII section marker):

file.cgi?name=a&job=b&=c&S=d<=e

The first two parameters are retained as expected (name=a, job=b), because name is not preceded with an &, and &job is not recognized as a valid HTML character escape code. However, the &amp, &sect, and &lt parts are interpreted as special characters, because they do name valid HTML escape codes. To make this work as expected, the & separators should be escaped:

<A HREF="file.cgi?name=a&amp;job=b&amp;amp=c&amp;sect=d&amp;lt=e">hello</a>

Browsers render this fully escaped link as expected:

file.cgi?name=a&job=b&amp=c&sect=d&lt=e

The moral of this story is that unless you can be sure that the names of all but the leftmost URL query parameters embedded in HTML are not the same as the name of any HTML character escape code like amp, you should generally run the entire URL through cgi.escape after escaping its parameter names and values with urllib.quote_plus:

>>> import cgi
>>> cgi.escape('file.cgi?name=a&job=b&amp=c&sect=d&lt=e')
'file.cgi?name=a&amp;job=b&amp;amp=c&amp;sect=d&amp;lt=e'

Having said that, I should add that some examples in this book do not escape & URL separators embedded within HTML simply because their URL parameter names are known to not conflict with HTML escapes. This is not, however, the most general solution; when in doubt, escape much and often.
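Putting both steps together, a link generator might look like the sketch below. It assumes Python 3, where html.escape and urllib.parse.quote_plus take the place of cgi.escape and urllib.quote_plus; the make_link helper is hypothetical and is not code from this book's examples.

```python
import html
from urllib.parse import quote_plus

def make_link(script, params, label):
    # step 1: URL-escape each parameter name and value
    query = '&'.join(quote_plus(key) + '=' + quote_plus(value)
                     for (key, value) in params)
    # step 2: HTML-escape the whole URL, so & separators become &amp;
    url = html.escape(script + '?' + query)
    return '<A HREF="%s">%s</A>' % (url, html.escape(label))

link = make_link('file.cgi', [('name', 'a b'), ('amp', 'c&d')], 'hello')
print(link)
```

The order matters: URL escaping is applied to the individual names and values first, and HTML escaping is applied last to the finished URL string.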

"Always Look on the Bright Side of Life"

Lest these formatting rules sound too clumsy (and send you screaming into the night!), note that the HTML and URL escaping conventions are imposed by the Internet itself, not by Python. (As we've seen, Python has a different mechanism for escaping special characters in string constants with backslashes.) These rules stem from the fact that the Web is based upon the notion of shipping formatted strings around the planet, and they were surely influenced by the tendency of different interest groups to develop very different notations.

You can take heart, though, in the fact that you often don't need to think in such cryptic terms; when you do, Python automates the translation process with library tools. Just keep in mind that any script that generates HTML or URLs dynamically probably needs to call Python's escaping tools to be robust. We'll see both the HTML and URL escape tool sets employed frequently in later examples in this chapter and the next two. In Chapter 15, we'll also meet systems such as Zope that aim to get rid of some of the low-level complexities that CGI scripters face. And as usual in programming, there is no substitute for brains; amazing technologies like the Internet come at a cost in complexity.

12.7 Sending Files to Clients and Servers

It's time to explain a bit of HTML code we've been keeping in the shadows. Did you notice those hyperlinks on the language selector example's main page for showing the CGI script's source code? Normally, we can't see such script source code, because accessing a CGI script makes it execute (we can see only its HTML output, generated to make the new page). The script in Example 12-23, referenced by a hyperlink in the main language.html page, works around that by opening the source file and sending its text as part of the HTML response. The text is marked with <PRE> as pre-formatted text, and escaped for transmission inside HTML with cgi.escape.

Example 12-23. PP2E\Internet\Cgi-Web\Basics\languages-src.cgi
#!/usr/bin/python
#################################################################
# Display languages.cgi script code without running it.
#################################################################
 
import cgi
filename = 'languages.cgi'
 
print "Content-type: text/html\n"       # wrap up in html
print "<TITLE>Languages</TITLE>"
print "<H1>Source code: '%s'</H1>" % filename
print '<HR><PRE>' 
print cgi.escape(open(filename).read())
print '</PRE><HR>' 

When we visit this script on the Web via the hyperlink or a manually typed URL, the script delivers a response to the client that includes the text of the CGI script source file. It appears as in Figure 12-25.

Figure 12-25. Source code viewer page

figs/ppy2_1225.gif

Note that here, too, it's crucial to format the text of the file with cgi.escape, because it is embedded in the HTML code of the reply. If we don't, any characters in the text that mean something in HTML code are interpreted as HTML tags. For example, the C++ < operator character within this file's text may yield bizarre results if not properly escaped. The cgi.escape utility converts it to the standard sequence &lt; for safe embedding.
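To see the effect on a line of C++-like text, here is a small sketch. It assumes Python 3, where html.escape stands in for this book's cgi.escape; the source string is made up for the demonstration.

```python
from html import escape   # Python 3 counterpart of cgi.escape

source = 'if (a < b && c > d) cout << "ok";'
# quote=False leaves double quotes alone, as cgi.escape did by default
safe = escape(source, quote=False)
print('<PRE>%s</PRE>' % safe)
```

Each <, >, and & in the text is replaced by its escape code, so a browser displays the characters instead of trying to interpret them as markup.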

12.7.1 Displaying Arbitrary Server Files on the Client

Almost immediately after writing the languages source code viewer script in the previous example, it occurred to me that it wouldn't be much more work, and would be much more useful, to write a generic version -- one that could use a passed-in filename to display any file on the site. It's a straightforward mutation on the server side; we merely need to allow a filename to be passed in as an input. The getfile.cgi Python script in Example 12-24 implements this generalization. It assumes the filename is either typed into a web page form or appended to the end of the URL as a parameter. Remember that Python's cgi module handles both cases transparently, so there is no code in this script that notices any difference.

Example 12-24. PP2E\Internet\Cgi-Web\Basics\getfile.cgi
#!/usr/bin/python
#################################################################
# Display any cgi (or other) server-side file without running it.
# The filename can be passed in a URL param or form field; e.g.,
# http://server/~lutz/Basics/getfile.cgi?filename=somefile.cgi.
# Users can cut-and-paste or "View source" to save file locally.
# On IE, running the text/plain version (formatted=0) sometimes
# pops up Notepad, but end-of-lines are not always in DOS format; 
# Netscape shows the text correctly in the browser page instead.
# Sending the file in text/html mode works on both browsers--text
# is displayed in the browser response page correctly. We also 
# check the filename here to try to avoid showing private files;
# this may or may not prevent access to such files in general.
#################################################################
 
import cgi, os, sys
formatted = 1                                  # 1=wrap text in html
privates  = ['../PyMailCgi/secret.py']         # don't show these
 
html = """
<html><title>Getfile response</title>
<h1>Source code for: '%s'</h1>
<hr>
<pre>%s</pre>
<hr></html>"""
 
def restricted(filename):
    for path in privates:
        if os.path.samefile(path, filename):   # unify all paths by os.stat
            return 1                           # else returns None=false
 
try:
    form = cgi.FieldStorage()
    filename = form['filename'].value          # url param or form field
except:
    filename = 'getfile.cgi'                   # else default filename
 
try:
    assert not restricted(filename)            # load unless private
    filetext = open(filename).read()
except AssertionError:
    filetext = '(File access denied)'
except:
    filetext = '(Error opening file: %s)' % sys.exc_value
 
if not formatted:
    print "Content-type: text/plain\n"         # send plain text
    print filetext                             # works on NS, not IE
else:
    print "Content-type: text/html\n"          # wrap up in html
    print html % (filename, cgi.escape(filetext))

This Python server-side script simply extracts the filename from the parsed CGI inputs object, and reads and prints the text of the file to send it to the client browser. Depending on the formatted global variable setting, it either sends the file in plain text mode (using text/plain in the response header) or wrapped up in an HTML page definition (text/html).
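The choice between the two reply modes can be sketched in isolation as a function that returns the reply text instead of printing it. This is a Python 3 approximation, not the book's script: build_reply is a hypothetical helper, html.escape replaces cgi.escape, and, unlike Example 12-24, the sketch also escapes the filename shown in the page header.

```python
from html import escape   # Python 3 counterpart of cgi.escape

HTML = """
<html><title>Getfile response</title>
<h1>Source code for: '%s'</h1>
<hr>
<pre>%s</pre>
<hr></html>"""

def build_reply(filename, filetext, formatted=True):
    if not formatted:
        # plain text mode: header line, blank line, then the raw text
        return 'Content-type: text/plain\n\n' + filetext
    # html mode: escape anything inserted into the page template
    body = HTML % (escape(filename), escape(filetext, quote=False))
    return 'Content-type: text/html\n\n' + body

html_reply = build_reply('a<b.cgi', 'if (x < y) z = 1;')
plain_reply = build_reply('a.cgi', 'x < y', formatted=False)
print(html_reply)
```

Only the HTML mode needs escaping; in plain text mode the browser is told not to interpret the reply as markup at all.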

Either mode (and others) works in general under most browsers, but Internet Explorer doesn't handle the plain text mode as gracefully as Netscape -- during testing, it popped up the Notepad text editor to view the downloaded text, but end-of-line characters in Unix format made the file appear as one long line. (Netscape instead displays the text correctly in the body of the response web page itself.) HTML display mode works more portably with current browsers. More on this script's restricted file logic in a moment.

Let's launch this script by typing its URL at the top of a browser, along with a desired filename appended after the script's name. Figure 12-26 shows the page we get by visiting this URL:

http://starship.python.net/~lutz/Basics/getfile.cgi?filename=languages-src.cgi

Figure 12-26. Generic source code viewer page

figs/ppy2_1226.gif

The body of this page shows the text of the server-side file whose name we passed at the end of the URL; once it arrives, we can view its text, cut-and-paste to save it in a file on the client, and so on. In fact, now that we have this generalized source code viewer, we could replace the hyperlink to script languages-src.cgi in language.html, with a URL of this form:

http://starship.python.net/~lutz/Basics/getfile.cgi?filename=languages.cgi

For illustration purposes, the main HTML page in Example 12-16 has links both to the original source code display script, as well as to the previous URL (less the server and directory paths, since the HTML file and getfile script live in the same place). Really, URLs like these are direct calls (albeit, across the Web) to our Python script, with filename parameters passed explicitly. As we've seen, parameters passed in URLs are treated the same as field inputs in forms; for convenience, let's also write a simple web page that allows the desired file to be typed directly into a form, as shown in Example 12-25.

Example 12-25. PP2E\Internet\Cgi-Web\Basics\getfile.html
<html><title>Getfile: download page</title>
<body>
<form method=get action="getfile.cgi">
  <h1>Type name of server file to be viewed</h1>
  <p><input type=text size=50 name=filename>
  <p><input type=submit value=Download>
</form>
<hr><a href="getfile.cgi?filename=getfile.cgi">View script code</a>
</body></html>

Figure 12-27 shows the page we receive when we visit this file's URL. We need to type only the filename in this page, not the full CGI script address.

Figure 12-27. Source code viewer selection page

figs/ppy2_1227.gif

When we press this page's Download button to submit the form, the filename is transmitted to the server, and we get back the same page as before, when the filename was appended to the URL (see Figure 12-26). In fact, the filename will be appended to the URL here, too; the get method in the form's HTML instructs the browser to append the filename to the URL, exactly as if we had done so manually. It shows up at the end of the URL in the response page's address field, even though we really typed it into a form.[13]
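The equivalence of a submitted get form and a hand-typed URL can be imitated with Python 3's urllib.parse module. In this sketch, urlencode stands in for the browser and parse_qs for the server-side parsing that the cgi module performs; the file path is just an example, and a real browser's escaping may differ in minor details.

```python
from urllib.parse import urlencode, parse_qs

# what a browser appends to the URL for a get-style form submission
query = urlencode({'filename': '../PyMailCgi/index.html'})
url = 'getfile.cgi?' + query
print(url)

# what a server-side parser recovers from the same query string
fields = parse_qs(query)
print(fields['filename'][0])
```

The script on the server sees the same field name and value either way, which is why getfile.cgi needs no special-case code for forms.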

12.7.1.1 Handling private files and errors

As long as CGI scripts have permission to open the desired server-side file, this script can be used to view and locally save any file on the server. For instance, Figure 12-28 shows the page we're served after asking for file path ../PyMailCgi/index.html -- an HTML text file in another application's subdirectory, nested within the parent directory of this script.[14] Users can specify both relative and absolute paths to reach a file -- any path syntax the server understands will do.

Figure 12-28. Viewing files with relative paths

figs/ppy2_1228.gif

More generally, this script will display any file path for which the user "nobody" (the username under which CGI scripts usually run) has read access. Just about every server-side file used in web applications will, or else they wouldn't be accessible from browsers in the first place. That makes for a flexible tool, but it's also potentially dangerous. What if we don't want users to be able to view some files on the server? For example, in the next chapter, we will implement an encryption module for email account passwords. Allowing users to view that module's source code would make encrypted passwords shipped over the Net much more vulnerable to cracking.

To minimize this potential, the getfile script keeps a list, privates, of restricted filenames, and uses the os.path.samefile built-in to check if a requested filename path points to one of the names on privates. The samefile call checks to see if the os.stat built-in returns the same identifying information for both file paths; because of that, pathnames that look different syntactically but reference the same file are treated as identical. For example, on my server, the following paths to the encryptor module are different strings, but yield a true result from os.path.samefile:

../PyMailCgi/secret.py
/home/crew/lutz/public_html/PyMailCgi/secret.py

Accessing either path form generates an error page like that in Figure 12-29.

Figure 12-29. Accessing private files

figs/ppy2_1229.gif
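The samefile comparison can be tried outside a web server. The sketch below, written for Python 3, creates a temporary directory tree (a hypothetical layout that only mimics the Basics and PyMailCgi directories) and checks two different path strings that lead to the same file. This version of restricted also differs from the one in Example 12-24 in that it treats nonexistent paths as unrestricted instead of letting the exception propagate.

```python
import os, tempfile

def restricted(filename, privates):
    # true if filename refers to the same file as any private path
    for path in privates:
        try:
            if os.path.samefile(path, filename):
                return True
        except OSError:              # one of the paths does not exist
            pass
    return False

with tempfile.TemporaryDirectory() as root:
    # hypothetical layout mirroring the Basics and PyMailCgi directories
    os.mkdir(os.path.join(root, 'Basics'))
    os.mkdir(os.path.join(root, 'PyMailCgi'))
    secret = os.path.join(root, 'PyMailCgi', 'secret.py')
    open(secret, 'w').close()

    roundabout = os.path.join(root, 'Basics', '..', 'PyMailCgi', 'secret.py')
    other = os.path.join(root, 'Basics', 'getfile.cgi')    # never created

    differ_as_strings = (secret != roundabout)
    blocked = restricted(roundabout, [secret])
    missing = restricted(other, [secret])

print(differ_as_strings, blocked, missing)
```

A simple string comparison would have let the roundabout path through; comparing the files themselves does not.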

Notice that bona fide file errors are handled differently. Permission problems and accesses to nonexistent files, for example, are trapped by a different exception handler clause, and display the exception's message to give additional context. Figure 12-30 shows one such error page.

Figure 12-30. File errors display

figs/ppy2_1230.gif

As a general rule of thumb, file-processing exceptions should always be reported in detail, especially during script debugging. If we catch such exceptions in our scripts, it's up to us to display the details (assigning sys.stderr to sys.stdout won't help if Python doesn't print an error message). The current exception's type, data, and traceback objects are always available in the sys module for manual display.
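As a rough sketch of such manual reporting in Python 3, the code below catches a file error and formats the exception type, value, and traceback itself. The load_text function and the bogus path are hypothetical; sys.exc_info and traceback.format_exc are standard library calls.

```python
import sys, traceback

def load_text(filename):
    try:
        with open(filename) as file:
            return file.read()
    except Exception:
        exc_type, exc_value = sys.exc_info()[:2]       # current exception
        details = traceback.format_exc()               # full traceback text
        return '(Error opening file: %s: %s)\n%s' % (
            exc_type.__name__, exc_value, details)

report = load_text('no-such-directory/no-such-file.txt')
print(report)
```

In a CGI script, text like this would be escaped and inserted into the reply page, since nothing else will display it once the exception has been caught.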

The private files list check does prevent the encryption module from being viewed directly with this script, but it still may or may not be vulnerable to attack by malicious users. This book isn't about security, so I won't go into further details here, except to say that on the Internet, a little paranoia goes a long way. Especially for systems installed on the general Internet (instead of closed intranets), you should assume that the worst-case scenario will eventually happen.

12.7.2 Uploading Client Files to the Server

The getfile script lets us view server files on the client, but in some sense, it is a general-purpose file download tool. Although not as direct as fetching a file by FTP or over raw sockets, it serves similar purposes. Users of the script can either cut-and-paste the displayed code right off the web page or use their browser's View Source option to view and cut.

But what about going the other way -- uploading a file from the client machine to the server? As we saw in the last chapter, that is easy enough to accomplish with a client-side script that uses Python's FTP support module. Yet such a solution doesn't really apply in the context of a web browser; we can't usually ask all of our program's clients to start up a Python FTP script in another window to accomplish an upload. Moreover, there is no simple way for the server-side script to request the upload explicitly, unless there happens to be an FTP server running on the client machine (not at all the usual case).

So is there no way to write a web-based program that lets its users upload files to a common server? In fact, there is, though it has more to do with HTML than with Python itself. HTML <input> tags also support a type=file option, which produces an input field, along with a button that pops up a file-selection dialog. The name of the client-side file to be uploaded can either be typed into the control, or selected with the pop-up dialog. The HTML page file in Example 12-26 defines a page that allows any client-side file to be selected and uploaded to the server-side script named in the form's action option.

Example 12-26. PP2E\Internet\Cgi-Web\Basics\putfile.html
<html><title>Putfile: upload page</title>
<body>
<form enctype="multipart/form-data" 
      method=post 
      action="putfile.cgi">
  <h1>Select client file to be uploaded</h1>
  <p><input type=file size=50 name=clientfile>
  <p><input type=submit value=Upload>
</form>
<hr><a href="getfile.cgi?filename=putfile.cgi">View script code</a>
</body></html>

One constraint worth noting: forms that use file type inputs must also specify a multipart/form-data encoding type and the post submission method, as shown in this file; get style URLs don't work for uploading files. When we visit this page, the page shown in Figure 12-31 is delivered. Pressing its Browse button opens a file-selection dialog, while Upload sends the file.

Figure 12-31. File upload selection page

figs/ppy2_1231.gif

On the client side, when we press this page's Upload button, the browser opens and reads the selected file, and packages its contents with the rest of the form's input fields (if any). When this information reaches the server, the Python script named in the form action tag is run as always, as seen in Example 12-27.

Example 12-27. PP2E\Internet\Cgi-Web\Basics\putfile.cgi
#!/usr/bin/python
#######################################################
# extract file uploaded by http from web browser;
# users visit putfile.html to get the upload form 
# page, which then triggers this script on server;
# note: this is very powerful, and very dangerous:
# you will usually want to check the filename, etc.
# this will only work if file or dir is writeable;
# a unix 'chmod 777 uploads' command may suffice;
# file path names arrive in client's path format;
#######################################################
 
import cgi, string, os, sys
import posixpath, dospath, macpath     # for client paths
debugmode    = 0                       # 1=print form info
loadtextauto = 0                       # 1=read file at once
uploaddir    = './uploads'             # dir to store files
 
sys.stderr = sys.stdout                # show error msgs
form = cgi.FieldStorage()              # parse form data
print "Content-type: text/html\n"      # with blank line
if debugmode: cgi.print_form(form)     # print form fields
 
# html templates
 
html = """
<html><title>Putfile response page</title>
<body>
<h1>Putfile response page</h1>
%s
</html>"""
 
goodhtml = html % """
<p>Your file, '%s', has been saved on the server as '%s'. 
<p>An echo of the file's contents received and saved appears below.
</p><hr>
<p><pre>%s</pre>
</p><hr>
"""
 
# process form data
 
def splitpath(origpath):                              # get file at end
    for pathmodule in [posixpath, dospath, macpath]:  # try all clients
        basename = pathmodule.split(origpath)[1]      # may be any server
        if basename != origpath:
            return basename                           # lets spaces pass
    return origpath                                   # failed or no dirs
    
def saveonserver(fileinfo):                           # use file input form data
    basename = splitpath(fileinfo.filename)           # name without dir path
    srvrname = os.path.join(uploaddir, basename)      # store in a dir if set
    if loadtextauto:
        filetext = fileinfo.value                     # reads text into string 
        open(srvrname, 'w').write(filetext)           # save in server file
    else:
        srvrfile = open(srvrname, 'w')                # else read line by line
        numlines, filetext = 0, ''                    # e.g., for huge files
        while 1:
            line = fileinfo.file.readline()
            if not line: break
            srvrfile.write(line)
            filetext = filetext + line
            numlines = numlines + 1
        filetext = ('[Lines=%d]\n' % numlines) + filetext
    os.chmod(srvrname, 0666)   # make writeable: owned by 'nobody'
    return filetext, srvrname
 
def main():
    if not form.has_key('clientfile'): 
        print html % "Error: no file was received"
    elif not form['clientfile'].filename:
        print html % "Error: filename is missing"
    else:
        fileinfo = form['clientfile']
        try: 
            filetext, srvrname = saveonserver(fileinfo)
        except:
            errmsg = '<h2>Error</h2><p>%s<p>%s' % (sys.exc_type, sys.exc_value)
            print html % errmsg
        else:
            print goodhtml % (cgi.escape(fileinfo.filename), 
                              cgi.escape(srvrname), 
                              cgi.escape(filetext))
main()

Within this script, the Python-specific interfaces for handling uploaded files are employed. They aren't much different, really; the file comes into the script as an entry in the parsed form object returned by cgi.FieldStorage as usual; its key is clientfile, the input control's name in the HTML page's code.

This time, though, the entry has additional attributes for the file's name on the client. Moreover, accessing the value attribute of an uploaded file input object will automatically read the file's contents all at once into a string on the server. For very large files, we can instead read line by line (or in chunks of bytes). For illustration purposes, the script implements either scheme: based on the setting of the loadtextauto global variable, it either asks for the file contents as a string, or reads it line by line.[16] In general, the CGI module gives us back objects with the following attributes for file upload controls:

filename

The name of the file as specified on the client

file

A file object from which the uploaded file's contents can be read

value

The contents of the uploaded file (read from file on demand)

There are additional attributes not used by our script. Files represent a third input field object; as we've also seen, the value attribute is a string for simple input fields, and we may receive a list of objects for multiple-selection controls.
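The chunked alternative to line-by-line reading mentioned above might look like the following Python 3 sketch. The copy_in_chunks function is a hypothetical helper, and in-memory io.BytesIO objects stand in for both the file attribute of an upload and the output file on the server.

```python
import io

def copy_in_chunks(infile, outfile, chunksize=64 * 1024):
    # move data a block at a time so memory use stays bounded
    total = 0
    while True:
        chunk = infile.read(chunksize)
        if not chunk:                      # empty result means end of file
            break
        outfile.write(chunk)
        total += len(chunk)
    return total

upload = io.BytesIO(b'spam\n' * 1000)      # stand-in for fileinfo.file
saved = io.BytesIO()                       # stand-in for the server file
count = copy_in_chunks(upload, saved, chunksize=512)
print(count)
```

Unlike a readline loop, fixed-size reads make no assumptions about where line ends fall, so the same loop suits text and binary data alike.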

For uploads to be saved on the server, CGI scripts (run by user "nobody") must have write access to the enclosing directory if the file doesn't yet exist, or to the file itself if it does. To help isolate uploads, the script stores all uploads in whatever server directory is named in the uploaddir global. On my site's Linux server, I had to give this directory a mode of 777 (universal read/write/execute permissions) with chmod to make uploads work in general. Your mileage may vary, but be sure to check permissions if this script fails.

The script also calls os.chmod to set the permission on the server file such that it can be read and written by everyone. If created anew by an upload, the file's owner will be "nobody," which means anyone out in cyberspace can view and upload the file. On my server, though, the file will also be only writable by user "nobody" by default, which might be inconvenient when it comes time to change that file outside the Web (the degree of pain can vary per operation).
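The mode numbers used with chmod here are octal bit masks. As a small illustration, Python 3's stat module can render such masks in the familiar ls style. Note that Python 3 spells octal literals with a 0o prefix (0o666), not the leading zero used in this book's listings.

```python
import stat

for mode in (0o755, 0o666, 0o777):
    # S_IFREG marks a regular file so filemode can render the full string
    print('%o' % mode, stat.filemode(stat.S_IFREG | mode))
```

Mode 666 grants read and write access to owner, group, and others, while 777 adds execute access for all three; on a directory, that is what allows entries to be listed and created.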

Isolating client-side file uploads by placing them in a single directory on the server helps minimize security risks: existing files can't be overwritten arbitrarily. But it may require you to copy files on the server after they are uploaded, and it still doesn't prevent all security risks -- mischievous clients can still upload huge files, which we would need to trap with additional logic not present in this script as is. Such traps may only be needed in scripts open to the Internet at large.
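One possible form of such a trap is sketched below in Python 3. It is not part of putfile.cgi: the save_with_limit function, the UploadTooLarge exception, and the size limits are all hypothetical names and values chosen for illustration, and in-memory files stand in for the upload and the server file.

```python
import io

class UploadTooLarge(Exception):
    pass

def save_with_limit(infile, outfile, maxbytes, chunksize=8192):
    # copy in chunks, giving up as soon as the size cap is exceeded
    total = 0
    while True:
        chunk = infile.read(chunksize)
        if not chunk:
            break
        total += len(chunk)
        if total > maxbytes:
            raise UploadTooLarge('upload exceeds %d bytes' % maxbytes)
        outfile.write(chunk)
    return total

small = save_with_limit(io.BytesIO(b'x' * 100), io.BytesIO(), maxbytes=1000)

try:
    save_with_limit(io.BytesIO(b'x' * 5000), io.BytesIO(),
                    maxbytes=1000, chunksize=512)
    rejected = False
except UploadTooLarge:
    rejected = True

print(small, rejected)
```

A script using this approach would also need to delete the partially written server file when the exception is raised.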

If both client and server do their parts, the CGI script presents us with the response page shown in Figure 12-32, after it has stored the contents of the client file in a new or existing file on the server. For verification, the response gives the client and server file paths, as well as an echo of the uploaded file with a line count (in line-by-line reader mode).

Figure 12-32. Putfile response page

figs/ppy2_1232.gif

Incidentally, we can also verify the upload with the getfile program we wrote in the prior section. Simply access the selection page to type the pathname of the file on the server, as shown in Figure 12-33.

Figure 12-33. Verifying putfile with getfile -- selection

figs/ppy2_1233.gif

Assuming uploading the file was successful, Figure 12-34 shows the resulting viewer page we will obtain. Since user "nobody" (CGI scripts) was able to write the file, "nobody" should be able to view it as well.

Figure 12-34. Verifying putfile with getfile -- response

figs/ppy2_1234.gif

Notice the URL in this page's address field -- the browser translated the / character we typed into the selection page to a %2F hexadecimal escape code before adding it to the end of the URL as a parameter. We met URL escape codes like this earlier in this chapter. In this case, the browser did the translation for us, but the end result is as if we had manually called one of the urllib quoting functions on the file path string.

Technically, the %2F escape code here represents the standard URL translation for the / character, under the default encoding scheme browsers employ for form data; / is an ASCII character, but it has special meaning within URLs. Spaces are usually translated to + characters as well. We can often get away without manually translating most such characters when sending paths explicitly (in typed URLs). But as we saw earlier, we sometimes need to be careful to escape characters (e.g., &) that have special meaning within URL strings with urllib tools.
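The translation the browser performed can be reproduced by hand. This sketch uses Python 3's urllib.parse.quote_plus and unquote_plus (the counterparts of the urllib functions in this book) on a made-up upload path containing both a slash and a space.

```python
from urllib.parse import quote_plus, unquote_plus

path = 'uploads/my file.txt'
quoted = quote_plus(path)
print(quoted)
print(unquote_plus(quoted) == path)
```

The slash becomes %2F and the space becomes +, and unquoting restores the original string exactly.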

12.7.2.1 Handling client path formats

In the end, the putfile.cgi script stores the uploaded file on the server, within a hardcoded uploaddir directory, under the filename at the end of the file's path on the client (i.e., less its client-side directory path). Notice, though, that the splitpath function in this script needs to do extra work to extract the base name of the file on the right. Browsers send up the filename in the directory path format used on the client machine; this path format may not be the same as that used on the server where the CGI script runs.

The standard way to split up paths, os.path.split, knows how to extract the base name, but only recognizes path separator characters used on the platform it is running on. That is, if we run this CGI script on a Unix machine, os.path.split chops up paths around a / separator. If a user uploads from a DOS or Windows machine, however, the separator in the passed filename is \, not /. Browsers running on a Macintosh may send a path that is more different still.

To handle client paths generically, this script imports platform-specific, path-processing modules from the Python library for each client it wishes to support, and tries to split the path with each until a filename on the right is found. For instance, posixpath handles paths sent from Unix-style platforms, and dospath recognizes DOS and Windows client paths. We usually don't import these modules directly since os.path.split is automatically loaded with the correct one for the underlying platform; but in this case, we need to be specific since the path comes from another machine. Note that we could have instead coded the path splitter logic like this to avoid some split calls:

def splitpath(origpath):                                    # get name at end
    basename = os.path.split(origpath)[1]                   # try server paths
    if basename == origpath:                                # didn't change it?
        if '\\' in origpath:
            basename = string.split(origpath, '\\')[-1]     # try dos clients
        elif '/' in origpath:
            basename = string.split(origpath, '/')[-1]      # try unix clients
    return basename

But this alternative version may fail for some path formats (e.g., DOS paths with a drive but no backslashes). As is, both options waste time if the filename is already a base name (i.e., has no directory paths on the left), but we need to allow for the more complex cases generically.
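For readers on Python 3, here is an approximation of the module-based splitter. Neither dospath nor macpath ships with current Python 3 releases, so this sketch assumes that ntpath (the Windows path module) is an acceptable substitute for dospath and leaves classic Macintosh paths out; the sample paths are invented.

```python
import posixpath, ntpath

def splitpath(origpath):
    # try each client path format until one finds a directory prefix
    for pathmodule in (posixpath, ntpath):
        basename = pathmodule.split(origpath)[1]
        if basename != origpath:
            return basename
    return origpath                        # failed or no dirs

for clientpath in ('/home/lutz/spam.txt', r'C:\temp\spam.txt',
                   'C:spam.txt', 'spam.txt'):
    print(clientpath, '=>', splitpath(clientpath))
```

Because ntpath understands drive prefixes, the drive-only form C:spam.txt is reduced to spam.txt, the case that the string-splitting alternative above misses.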

This upload script works as planned, but a few caveats are worth pointing out before we close the book on this example:

·         First, putfile doesn't do anything about cross-platform incompatibilities in filenames themselves. For instance, spaces in a filename shipped from a DOS client are not translated to nonspace characters; they will wind up as spaces in the server-side file's name, which may be legal but which are difficult to process in some scenarios.

·         Second, the script is also biased towards uploading text files; it opens the output file in text mode (which will convert end-of-line marker codes in the file to the end-of-line convention on the web server machine), and reads input line-by-line (which may fail for binary data).

If you run into any of these limitations, you will have crossed over into the domain of suggested exercises.
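As a starting point for those exercises, the Python 3 sketch below addresses both caveats: it replaces awkward filename characters, and it writes the upload in binary mode so that no end-of-line translation occurs. The safe_name helper, its set of permitted characters, and the fallback name are arbitrary choices made for illustration, and a temporary directory stands in for the uploads directory.

```python
import os, re, tempfile

def safe_name(basename):
    # replace anything but letters, digits, dot, underscore, and hyphen
    cleaned = re.sub(r'[^A-Za-z0-9._-]', '_', basename)
    return cleaned or 'upload'             # hypothetical fallback for empty names

print(safe_name('my file (1).txt'))

data = b'line one\r\nline two\n\x00\xff'   # mixed line ends plus binary bytes
with tempfile.TemporaryDirectory() as uploaddir:
    srvrname = os.path.join(uploaddir, safe_name('my data.bin'))
    with open(srvrname, 'wb') as srvrfile:     # binary mode: bytes stored as is
        srvrfile.write(data)
    with open(srvrname, 'rb') as srvrfile:
        unchanged = (srvrfile.read() == data)

print(unchanged)
```

Combined with chunked reads in place of readline calls, binary mode lets the same script store text and binary uploads alike.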

12.7.3 More Than One Way to Push Bits Over the Net

Finally, let's discuss some context. We've seen three getfile scripts at this point in the book. The one in this chapter is different from the other two we wrote in earlier chapters, but it accomplishes a similar goal:

·         This chapter's getfile is a server-side CGI script that displays files over the HTTP protocol (on port 80).

·         In Chapter 10, we built client- and server-side getfile scripts to transfer files over raw sockets (on port 50001), and in Chapter 11, we implemented a client-side getfile to ship files over FTP (on port 21).

The CGI- and HTTP-based putfile script here is also different from the FTP-based putfile in the last chapter, but it can be considered an alternative to both socket and FTP uploads. To help underscore the distinctions, Figure 12-35 and Figure 12-36 show the new putfile uploading the original socket-based getfile.[17]

Figure 12-35. A new putfile with the socket-based getfile uploaded

figs/ppy2_1235.gif

Really, the getfile CGI script in this chapter only displays files, but it can be considered a download tool when augmented with cut-and-paste operations in a web browser. Figure 12-37 and Figure 12-38 show the CGI getfile displaying the uploaded socket-based getfile.

Figure 12-36. A new putfile with the socket-based getfile

figs/ppy2_1236.gif

Figure 12-37. A new getfile with the socket-based getfile

figs/ppy2_1237.gif

Figure 12-38. A new getfile with the socket-based getfile downloaded

figs/ppy2_1238.gif

The point to notice here is that there are a variety of ways to ship files around the Internet -- sockets, FTP, and HTTP (web pages) can all move files between computers. Technically speaking, we can transfer files with other techniques and protocols, too -- POP email, NNTP news, and so on.

Each technique has unique properties but does similar work in the end: moving bits over the Net. All ultimately run over sockets on a particular port, but protocols like FTP add additional structure to the socket layer, and application models like CGI add both structure and programmability.

[1] Given that this edition may not be updated for many years, it's not impossible that the server name in this address starship.python.net might change over time. If this address fails, check the book updates at http://rmi.net/~lutz/about-pp.html to see if a new examples site address has been posted. The rest of the main page's URL will likely be unchanged. Note, though, that some examples hardcode the starship host server name in URLs; these will be fixed on the new server if moved, but not on your book CD. Run script fixsitename.py later in this chapter to change site names automatically. [back]

[2] These are not necessarily magic numbers. On Unix machines, mode 755 is a bit mask. The first 7 simply means that you (the file's owner) can read, write, and execute the file (7 in binary is 111 -- each bit enables an access mode). The two 5s (binary 101) say that everyone else (your group and others) can read and execute (but not write) the file. See your system's manpage on the chmod command for more details. [back]

[3] To make this process easier, the fixsitename.py script presented in the next section largely automates the necessary changes by performing global search-and-replace operations and directory walks. A few book examples do use complete URLs, so be sure to run this script after copying examples to a new site. [back]

[4] Notice that the script does not generate the enclosing <HEAD> and <BODY> tags in the static HTML file of the prior section. Strictly speaking, it should -- HTML without such tags is invalid. But all commonly used browsers simply ignore the omission. [back]

[5] As I mentioned at the start of this chapter, there are often multiple ways to accomplish any given webmaster-y task. For instance, the HTML <BASE> tag may provide an alternative way to map absolute URLs, and FTPing your web site files to your server individually and in text mode might obviate line-end issues. There are undoubtedly other ways to handle such tasks, too. On the other hand, such alternatives wouldn't be all that useful in a book that illustrates Python coding techniques. [back]

[6] This technique isn't unique to CGI scripts, by the way. In Chapter 15, we'll meet systems that embed Python code inside HTML. There is no good way to test such code outside the context of the enclosing system, without extracting the embedded Python code (perhaps by using the htmllib HTML parser that comes with Python) and running it with a passed-in mock-up of the API that it will eventually use. [back]

[7] Two forward references are worth noting here. Besides simple strings and lists, later we'll see a third type of form input object, returned for fields that specify file uploads. The script in this example should really also escape the echoed text inserted into the HTML reply to be robust, lest it contain HTML operators. We will discuss escapes in detail later. [back]

[8] If you are reading closely, you might notice that this is the second time we've used mock-ups in this chapter (see the earlier test4.cgi example). If you find this technique generally useful, it would probably make sense to put the dummy class, along with a function for populating a form dictionary on demand, into a module so it can be reused. In fact, we will do that in the next section. Even for two-line classes like this, typing the same code the third time around will do much to convince you of the power of code reuse. [back]

[9] Interestingly, we also get the "All" reply if debugme is set to when we run the script from the command line. The cgi.FieldStorage call returns an empty dictionary if called outside the CGI environment rather than throwing an exception, so the test for a missing key kicks in. It's likely safer to not rely on this behavior, however. [back]

[10] See the urllib module examples in the prior and following chapters for a way to send this URL from a Python script. urllib lets programs fetch web pages and invoke remote CGI scripts by building and submitting URL strings like this one, with any required parameters filled in at the end of the string. You could use this module, for instance, to automatically send information to order Python books at an online bookstore from within a Python script, without ever starting a web browser. [back]

[11] The HTML code template could be loaded from an external text file, too, but external text files are no more easily changed than Python scripts. In fact, Python scripts are text files, and this is a major feature of the language: it's usually easy to change the Python scripts of an installed system onsite, without re-compile or re-link steps. [back]

[12] This assumes, of course, that this module can be found on the Python module search path when those scripts are run. See the search path discussion earlier in this chapter. Since Python searches the current directory for imported modules by default, this always works without sys.path changes if all of our files are in our main web directory. [back]

[13] You may notice one difference in the response pages produced by the form and an explicitly typed URL: for the form, the value of the "filename" parameter at the end of the URL in the response may contain URL escape codes for some characters in the file path you typed. Browsers automatically translate some special characters, such as colons and backslashes, into URL escapes (just like urllib.quote). URL escapes are discussed earlier in this chapter; we'll see an example of this automatic browser escaping at work in a moment. [back]
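
To see roughly what the browser does to such a path, we can run the escaping tools by hand. This sketch assumes Python 3, where the quoting functions are found in urllib.parse; the file path is a made-up example.

```python
from urllib.parse import quote, quote_plus, unquote

path = r"C:\temp\my file.txt"        # hypothetical path typed into a form

escaped = quote(path)                # space becomes %20
form_style = quote_plus(path)        # space becomes +, as in form submissions
restored = unquote(escaped)          # undo the escapes

print(escaped)
print(form_style)
print(restored)
```

The colon and backslashes come back as %3A and %5C in both escaped forms, and unquote recovers the original path.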

[14] PyMailCgi is described in the next chapter. If you're looking for source files for PyErrata (also in the next chapter), use a path like ../PyErrata/xxx. In general, the top level of the book's web site corresponds to the top level of the Internet/Cgi-Web directory in the examples on the book's CD-ROM (see http://examples.oreilly.com/python2); getfile runs in subdirectory Basics. [back]

[16] Note that reading line by line means that this CGI script is biased towards uploading text files, not binary data files. The fact that it also uses a "w" open mode makes it ill-suited for binary uploads if run on a Windows server -- \r characters might be added to the data when written. See Chapter 2 for details if you've forgotten why. [back]
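
For comparison, a binary-safe save might look something like the following sketch. It is not the upload script from this chapter: the function name save_upload is hypothetical, and the code assumes Python 3 and that the uploaded content is already in memory as a bytes object.

```python
import os
import tempfile

def save_upload(data, path):
    # "wb" writes bytes exactly as given, with no end-of-line translation
    with open(path, "wb") as output:
        output.write(data)

# sample payload mixing both line-end styles and non-text bytes
payload = b"line one\nline two\r\n\x00\xff"
target = os.path.join(tempfile.mkdtemp(), "upload.bin")
save_upload(payload, target)

with open(target, "rb") as saved_file:
    saved = saved_file.read()
print(saved == payload)
```

Reading and writing in binary mode on both ends keeps the stored file byte-for-byte identical to what was sent, on Windows and elsewhere.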

[17] Shown here being loaded from a now-defunct Part2 directory -- replace Part2 with PP2E to find its true location, and don't be surprised if a few differences show up in the contents of transferred files if you run such examples yourself. Like I said, engineers love to change things. [back]
