Chapter 10. Data Persistence

Many basic web applications can be created that output only email and web documents. However, if you begin building larger web applications, you will eventually need to store data and retrieve it later.

This chapter will discuss various ways to do this with different levels of complexity. Text files are the simplest way to maintain data, but they quickly become inefficient when the data becomes complex or grows too large. A DBM file provides much faster access, even for large amounts of data, and DBM files are very easy to use with Perl. However, this solution is also limited when the data grows too complex. Finally, we will investigate relational databases. A relational database management system (RDBMS) provides high performance even with complex queries. However, an RDBMS is more complicated to set up and use than the other solutions.

Applications evolve and grow larger. What may start out as a short, simple CGI script may gain feature upon feature until it has grown to a large, complex application. Thus, when you design web applications, it is a good idea to develop them so that they are easily expandable. One solution is to make your code modular. You should try to abstract the code that reads and writes data so the rest of the code does not know how the data is stored. By reducing the dependency on the data format to a small chunk of code, it becomes easier to change your data format as you need to grow.

10.1. Text Files

One of Perl's greatest strengths is its ability to parse text, and this makes it especially easy to get a web application online quickly using text files as the means of storing data. Although it does not scale to complex queries, this works well for small amounts of data and is very common for Perl CGI applications.

We're not going to discuss how to use text files with Perl, since most Perl programmers are already proficient at that task.
We're also not going to look at strategies like creating random access files to improve performance, since that warrants a lengthy discussion, and a DBM file is generally a better substitute. We'll simply look at the issues that are particular to using text files with CGI scripts.

10.1.1. Locking

If you write to any files from a CGI script, then you must use some form of file locking. Web servers support numerous concurrent connections, and if two users try to write to the same file at the same time, the result is generally corrupted or truncated data.

10.1.1.1. flock

If your system supports it, using the flock command is the easiest way to do this. How do you know if your system supports flock? Try it: flock will die with a fatal error if your system does not support it. However, flock works reliably only on local files; flock does not work across most NFS systems, even if your system otherwise supports it.[19]

flock offers two different modes of locking: exclusive and shared. Many processes can read from a file simultaneously without problems, but only one process should write to the file at a time (and no other process should read from the file while it is being written). Thus, you should obtain an exclusive lock on a file when writing to it and a shared lock when reading from it. The shared lock verifies that no one else has an exclusive lock on the file and delays any exclusive locks until the shared locks have been released.
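As a rough sketch of this discipline (not code from the book's own scripts), the two helpers below take an exclusive lock before appending to a file and a shared lock before reading it. The names append_record and read_records are hypothetical. The sketch assumes a system where flock is supported and a Perl recent enough to offer lexical filehandles and three-argument open; the LOCK_EX and LOCK_SH constants come from the Fcntl module.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical helper: append one line while holding an exclusive lock,
# so that no other process reads or writes the file at the same time.
sub append_record {
    my ( $path, $line ) = @_;

    open my $fh, '>>', $path or die "Cannot append to $path: $!";
    flock( $fh, LOCK_EX )    or die "Cannot lock $path: $!";

    # Another writer may have appended while we waited for the lock,
    # so move to the current end of the file before printing.
    seek $fh, 0, 2;
    print {$fh} "$line\n";

    # Closing the filehandle flushes the data and releases the lock.
    close $fh or die "Cannot close $path: $!";
}

# Hypothetical helper: read every line while holding a shared lock.
# Other readers may hold shared locks at the same time; writers wait.
sub read_records {
    my ($path) = @_;

    open my $fh, '<', $path or die "Cannot read $path: $!";
    flock( $fh, LOCK_SH )   or die "Cannot lock $path: $!";

    chomp( my @lines = <$fh> );
    close $fh;

    return @lines;
}
```

A script would call append_record( $file, $text ) when saving data and read_records( $file ) when displaying it. Keeping the locking inside two small subs also means the rest of the script never has to know which lock mode a given operation needs.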
To use flock, call it with a filehandle to an open file and a number indicating the type of lock you want. These numbers are system-dependent, so the easiest way to get them is to use the Fcntl module. If you supply the :flock argument to Fcntl, it will export LOCK_EX, LOCK_SH, LOCK_UN, and LOCK_NB for you. You can use them as follows:

    use Fcntl ":flock";

    open FILE, "some_file.txt" or die $!;
    flock FILE, LOCK_EX;    # Exclusive lock
    flock FILE, LOCK_SH;    # Shared lock
    flock FILE, LOCK_UN;    # Unlock

Closing a filehandle releases any locks, so there is generally no need to specifically unlock a file. In fact, it can be dangerous to do so if you are locking a filehandle that uses Perl's tie mechanism. See file locking in the DBM section of this chapter for more information.

Some systems do not support shared file locks and use exclusive locks for them instead. You can use the script in Example 10-1 to test what flock supports on your system.

Example 10-1. flock_test.pl

    #!/usr/bin/perl -wT

    use IO::File;
    use Fcntl ":flock";

    *FH1 = new_tmpfile IO::File or die "Cannot open temporary file: $!\n";

    eval { flock FH1, LOCK_SH };
    $@ and die "It does not look like your system supports flock: $@\n";

    open FH2, ">> &FH1" or die "Cannot dup filehandle: $!\n";

    if ( flock FH2, LOCK_SH | LOCK_NB ) {
        print "Your system supports shared file locks\n";
    }
    else {
        print "Your system only supports exclusive file locks\n";
    }

If you need to both read and write to a file, then you have two options: you can open the file exclusively for read/write access, or if you only have to do limited writing and what you're writing does not depend on the contents of the file, you can open and close the file twice: once shared for reading and once exclusive for writing.
This is generally less efficient than opening the file once, but if you have lots of processes needing to access the file that are doing lots of reading and little writing, it may be more efficient to reduce the time that one process is tying up the file while holding an exclusive lock on it.

Typically, when you use flock to lock a file, it halts the execution of your script until it can obtain a lock on your file. The LOCK_NB option tells flock that you do not want it to block execution, but allow your script to continue if it cannot obtain a lock. Here is one way to time out if you cannot obtain a lock on a file:

    my $count = 0;
    my $delay = 1;
    my $max   = 15;

    open FILE, ">> $filename" or
        error( $q, "Cannot open file: your data was not saved" );

    # The file is opened for appending, so request an exclusive lock
    until ( flock FILE, LOCK_EX | LOCK_NB ) {
        error( $q, "Timed out waiting to write to file: " .
                   "your data was not saved" ) if $count >= $max;
        sleep $delay;
        $count += $delay;
    }

In this example, the code tries to get a lock. If it fails, it waits a second and tries again. After fifteen seconds, it gives up and reports an error.

10.1.1.2. Manual lock files

If your system does not support flock, you will need to manually create your own lock files. As the Perl FAQ points out (see perlfaq5), this is not as simple as you might think. The problem is that you must check for the existence of a file and create the file as one operation. If you first check whether a lock file exists, and then try to create one if it does not, another process may have created its own lock file after you checked, and you just overwrote it.

To create your own lock file, use the following command:

    use Fcntl;
    .
    .
    .
    sysopen LOCK_FILE, "$filename.lock", O_WRONLY | O_EXCL | O_CREAT, 0644
        or error( $q, "Unable to lock file: your data was not saved" );

The O_EXCL flag provided by Fcntl tells the system to open the file only if it does not already exist. Note that this will not reliably work on an NFS filesystem.

10.1.2. Write Permissions

In order to create or update a text file, you must have the appropriate permissions. This may sound basic, but it is a common source of errors in CGI scripts, especially on Unix filesystems. Let's review how Unix file permissions work.

Files have both an owner and a group. By default, these match the user and group of the user or process who creates the file. There are three different levels of permissions for a file: the owner's permissions, the group's permissions, and everyone else's permissions. Each of these may have read access, write access, and/or execute access for a file.

Your CGI scripts can only modify a file if nobody (or the user your web server runs as) has write access to the file. This occurs if the file is writable by everyone, if it is writable by members of the file's group and nobody is a member of that group, or if nobody owns the file and the file is writable by its owner.

In order to create or remove a file, nobody must have write permission to the directory containing the file. The same rules about owner, group, and other users apply to directories as they do for files. In addition, the execute bit must be set for the directory. For directories, the execute bit determines scan access, which is the ability to change to the directory.

Even if your CGI script cannot modify a file, it may be able to replace it. If nobody has permission to write to a directory, then it can remove files in the directory in addition to creating new files, even with the same name. Write permissions on the file do not typically affect the ability to remove or replace the file as a whole.

10.1.3. Temporary Files

Your CGI scripts may need to create temporary files for a number of reasons. You can reduce memory consumption by creating files to hold data as you process it; you save memory at the cost of some speed. You may also use external commands that perform their actions on text files.

10.1.3.1. Anonymous temporary files

Typically, temporary files are anonymous; they are created by opening a handle to a new file and then immediately deleting the file. Your CGI script will continue to have a filehandle to access the file, but the data cannot be accessed by other processes, and the data will be reclaimed by the filesystem once your CGI script closes the filehandle. (Not all systems support this feature.)

As with most common tasks, there is a Perl module that makes managing temporary files much simpler. IO::File will create anonymous temporary files for you with the new_tmpfile class method; it takes no arguments. You can use it like this:[20]
    use IO::File;
    .
    .
    .
    my $tmp_fh = new_tmpfile IO::File;

You can then read and write to $tmp_fh just as you would any other filehandle:

    print $tmp_fh "</html>\n";
    seek $tmp_fh, 0, 0;

    while (<$tmp_fh>) {
        print;
    }

10.1.3.2. Named temporary files

Another option is to create a file and delete it when you are finished with it. One advantage is that you have a filename that can be passed to other processes and functions. Also, using the IO::File module is slightly slower than managing the file yourself. However, using named temporary files has two drawbacks. First, greater care must be taken choosing a unique filename so that two scripts will not attempt to use the same temporary file at the same time. Second, the CGI script must delete the file when it is finished, even if it encounters an error and exits prematurely.

The Perl FAQ suggests using the POSIX module to generate a temporary filename and an END block to ensure it will be cleaned up:

    use Fcntl;
    use POSIX qw(tmpnam);
    .
    .
    .
    my $tmp_filename;

    # try new temporary filenames until we get one that doesn't already
    # exist; the check should be unnecessary, but you can't be too careful
    do { $tmp_filename = tmpnam( ) }
        until sysopen( FH, $tmp_filename, O_RDWR|O_CREAT|O_EXCL );

    # install atexit-style handler so that when we exit or die,
    # we automatically delete this temporary file
    END { unlink( $tmp_filename ) or die "Couldn't unlink $tmp_filename: $!" }

If your system doesn't support POSIX, then you will have to create the file in a system-dependent fashion instead.

10.1.4. Delimiters

If you need to include multiple fields of data in each line of your text file, you will likely use delimiters to separate them. Another option is to create fixed-length records, but we won't get into these files here. Common characters to use for delimiting files are commas, tabs, and pipes (|).

Commas are primarily used in CSV files, which we will discuss presently.
CSV files can be difficult to parse accurately because they can include non-delimiting commas as part of a value. When working with CSV files, you may want to consider the DBD::CSV module; this gives you a number of additional benefits, which we will discuss shortly.

Tabs are not generally included within data, so they make convenient delimiters. Even so, you should always check your data and encode or remove any tabs or end-of-line characters before writing to your file. This ensures that your data does not become corrupted if someone happens to pass a newline character in the middle of a field. Remember, even if you are reading data from an HTML form element that would not normally accept a newline character as part of it, you should never trust the user or that user's browser.

Here is an example of functions you can use to encode and decode data:

    sub encode_data {
        my @fields = map {
            s/\\/\\\\/g;
            s/\t/\\t/g;
            s/\n/\\n/g;
            s/\r/\\r/g;
            $_;
        } @_;

        my $line = join "\t", @fields;
        return "$line\n";
    }

    sub decode_data {
        my $line = shift;

        chomp $line;
        my @fields = split /\t/, $line;

        return map {
            s/\\(.)/$1 eq 't' and "\t" or
                    $1 eq 'n' and "\n" or
                    $1 eq 'r' and "\r" or "$1"/eg;
            $_;
        } @fields;
    }

These functions encode tabs and end-of-line characters with the common escape sequences that Perl and other languages use (\t, \r, and \n). Because they introduce the backslash as an escape character, they must also escape any backslash that appears in the data. The encode_data sub takes a list of fields and returns a single encoded scalar that can be written to the file; decode_data takes a line read from the file and returns a list of decoded fields. You can use them as shown in Example 10-2.

Example 10-2. sign_petition.cgi

    #!/usr/bin/perl -wT

    use strict;
    use Fcntl ":flock";
    use CGI;
    use CGIBook::Error;

    my $DATA_FILE = "/usr/local/apache/data/tab_delimited_records.txt";

    my $q       = new CGI;
    my $name    = $q->param( "name" );
    my $comment = substr( $q->param( "comment" ), 0, 80 );

    unless ( $name ) {
        error( $q, "Please enter your name." );
    }

    open DATA_FILE, ">> $DATA_FILE" or die "Cannot append to $DATA_FILE: $!";
    flock DATA_FILE, LOCK_EX;
    seek DATA_FILE, 0, 2;

    print DATA_FILE encode_data( $name, $comment );
    close DATA_FILE;

    print $q->header( "text/html" ),
          $q->start_html( "Our Petition" ),
          $q->h2( "Thank You!" ),
          $q->p( "Thank you for signing our petition. ",
                 "Your name has been added below:" ),
          $q->hr,
          $q->start_table,
          $q->tr( $q->th( "Name", "Comment" ) );

    open DATA_FILE, $DATA_FILE or die "Cannot read $DATA_FILE: $!";
    flock DATA_FILE, LOCK_SH;

    while (<DATA_FILE>) {
        my @data = decode_data( $_ );
        print $q->tr( $q->td( @data ) );
    }
    close DATA_FILE;

    print $q->end_table,
          $q->end_html;


    sub encode_data {
        my @fields = map {
            s/\\/\\\\/g;
            s/\t/\\t/g;
            s/\n/\\n/g;
            s/\r/\\r/g;
            $_;
        } @_;

        my $line = join "\t", @fields;
        return $line . "\n";
    }

    sub decode_data {
        my $line = shift;

        chomp $line;
        my @fields = split /\t/, $line;

        return map {
            s/\\(.)/$1 eq 't' and "\t" or
                    $1 eq 'n' and "\n" or
                    $1 eq 'r' and "\r" or "$1"/eg;
            $_;
        } @fields;
    }

Note that organizing your code this way gives you another benefit. If you later decide you want to change the format of your data, you do not need to change your entire CGI script, just the encode_data and decode_data functions.

10.1.5. DBD::CSV

As we mentioned at the beginning of this chapter, it's great to modularize your code so that changing the data format affects only a small chunk of your application. However, it's even better if you don't have to change that chunk either. If you are creating a simple application that you expect to grow, you may want to consider developing your application using CSV files.
CSV (comma separated values) files are text files formatted such that each line is a record, and fields are delimited by commas. The advantage to using CSV files is that you can use Perl's DBI and DBD::CSV modules, which allow you to access the data via basic SQL queries just as you would for an RDBMS. Another benefit of CSV format is that it is quite common, so you can easily import and export it from other applications, including spreadsheets like Microsoft Excel.

There are drawbacks to developing with CSV files. DBI adds a layer of complexity to your application that you would not otherwise need if you accessed the data directly. DBI and DBD::CSV also allow you to create only simple SQL queries, and they are certainly not as fast as a true relational database system, especially for large amounts of data.

However, if you need to get a project going, knowing that you will move to an RDBMS, and if DBD::CSV meets your immediate requirements, then this strategy is certainly a good choice. We will look at an example that uses DBD::CSV later in this chapter.

Copyright © 2001 O'Reilly & Associates. All rights reserved.