Fixed-length Random-access Databases (Learning Perl, 3rd Edition)

16.3. Fixed-length Random-access Databases

Another form of persistent data is the fixed-length, record-oriented disk file.[350] In this scheme, the data consists of a number of records of identical length. The numbering of the records is either not important or determined by some indexing scheme.

[350]By "fixed-length," we don't mean that the file itself is of a fixed length; it's each individual record that is of a fixed length. In this section, we'll use an example file in which every record is 55 bytes long.

For example, we might want to store some information about each bowler at Bedrock Lanes. Let's say we decide to have a series of records, one per bowler, in which the data holds the player's name, age, last five bowling scores, and the time and date of his last game.

We need to decide upon a suitable format for this data. Let's say that after studying the available formats in the documentation for pack, we decide to use 40 characters for the player's name, a one-byte integer for his age,[351] five two-byte integers for his last five scores,[352] and a four-byte integer for the timestamp of his most-recent game,[353] giving a format string of "a40 C S5 L". (We use S, which is always 16 bits, rather than I, whose size depends on the platform's C compiler and is at least 32 bits.) Each record is thus 40 + 1 + 10 + 4 = 55 bytes long. If we were reading all of the data in the database, we'd read chunks of 55 bytes until we got to the end. If we wanted to go to the fifth record, we'd skip ahead 4 x 55 bytes (220 bytes) and read the fifth record directly.

[351]Since one byte may have 256 different values, this will hold ages from 0 to 255 with ease. If Methuselah comes to bowl in Bedrock, we'll have to redesign the database.

[352]We can't use one-byte integers for the scores, because a bowling score can be as high as 300. Two-byte integers can hold values from 0 to 65535 (if unsigned) or -32768 to 32767 (if signed). We can use some of these extra values as special codes; for example, if a player has only three games on record, the other scores could be set to 9999 to indicate this.

[353]The standard Unix timestamp format (and the time value used by many other systems) has traditionally been a 32-bit integer, which fits into four bytes. You'll probably find it handy to use a module to manipulate time and date formats.
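To check the arithmetic for the record layout, here is a minimal sketch. It assumes the scores are packed with S (an unsigned 16-bit integer), so that the record really does come to 55 bytes. The bowler's data and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Assumed layout: 40-byte name, 1-byte age, five 16-bit scores, 32-bit timestamp.
my $format = "a40 C S5 L";

# Pack one made-up record (the values are illustrative only).
my $record = pack($format, "Fred Flintstone", 30,
                  205, 190, 178, 211, 164, 1_000_000_000);

my $record_length = length $record;       # 40 + 1 + 5*2 + 4
my $fifth_offset  = 4 * $record_length;   # skip four records to reach the fifth

print "record length: $record_length\n";
print "fifth record starts at byte $fifth_offset\n";
```

Because pack pads the name out to 40 bytes, every record comes out the same length no matter how short the name is.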

Perl supports programs that use such a disk file. In order to do so, however, you need to learn a few more things, including how to:

Open a disk file for both reading and writing
Move around in this file to an arbitrary position
Fetch data by a length rather than up to the next newline
Write data down in fixed-length blocks

The open function has an additional mode we haven't shown yet. If you use "+<" at the front of the filename parameter's string, that is similar to using "<" to open the existing file for reading, except that it also asks for write permission on the file. Thus you can have read/write access to the file:

open(FRED, "<fred");  # open file fred for reading (error if file absent)
open(FRED, "+<fred"); # open file fred read/write (error if file absent)

Similarly, "+>" says to create a new file (as ">" would), but to have read access to it as well, thus also giving read/write access:

open(WILMA, ">wilma");  # make new file wilma (wiping out existing file)
open(WILMA, "+>wilma"); # make new file wilma, but also with read access

Do you see the important difference between the two new modes? Both give read/write access to a file. But "+<" lets you work with an existing file; it doesn't create it. The second mode, "+>", isn't often useful, because it gives read/write access to a new, empty file that it has just created. That's mostly used for temporary (scratch) files.
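The difference between the two modes can be seen in a small experiment. This is only a sketch: it uses the three-argument form of open with lexical filehandles rather than the style shown above, and the file name bowlers.db is hypothetical. It works in a scratch directory so that nothing of yours gets overwritten.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);   # scratch directory, removed at exit
my $path = "$dir/bowlers.db";       # hypothetical file name

# "+<" refuses to create the file, so this first attempt fails.
my $opened_missing = open(my $probe, "+<", $path) ? 1 : 0;

# "+>" creates (or truncates) the file and also allows reading it back.
open(my $new, "+>", $path) or die "Cannot create $path: $!";
binmode($new);
print $new "x" x 55;                # one record's worth of filler
close($new) or die "Cannot close $path: $!";

# Now "+<" succeeds, and the existing 55 bytes are left intact.
open(my $db, "+<", $path) or die "Cannot open $path: $!";
binmode($db);
my $size = -s $db;
close($db);

print "open on missing file succeeded? $opened_missing\n";
print "size after reopening with +<: $size\n";
```

Opening the same file again with "+>" would have wiped out those 55 bytes, which is why "+<" is the mode to use for an existing database.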

Once we've got the file open, we need to move around in it. You do this with the seek function:

seek(FRED, 55 * $n, 0);  # seek to start of record $n

The first parameter to seek is a filehandle, the second parameter gives the offset in bytes from the start of the file, and the third parameter is zero.[354] To get to a certain record in our file of bowling data, you'll need to skip over some other records. Since each record is 55 bytes long, we'll multiply $n by 55 to find out which byte position we want. (Note that the record numbers are thus zero-based; record zero is at the beginning of the file.)

[354]Actually, the third parameter is the "whence" parameter. You can use a different value than zero if you want to seek to a position relative to the current position, or relative to the end of the file; see the perlfunc manpage for more information. Most people will simply want to use zero here.

Once the file pointer has been positioned with seek, the next input or output operation will start at that position.
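As a rough illustration of the three "whence" values, the following sketch seeks around a scratch file of ten empty records and uses tell to report where it landed. The named constants come from the standard Fcntl module; the scratch file and variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END);
use File::Temp qw(tempfile);

my $record_length = 55;

my ($fh, $path) = tempfile(UNLINK => 1);   # scratch file, opened read/write
binmode($fh);
print $fh "\0" x ($record_length * 10);    # ten empty records

# Relative to the start of the file (the same as using 0).
my $n = 3;
seek($fh, $record_length * $n, SEEK_SET) or die "seek failed: $!";
my $pos_record_3 = tell($fh);

# Relative to the current position: skip ahead one more record.
seek($fh, $record_length, SEEK_CUR) or die "seek failed: $!";
my $pos_record_4 = tell($fh);

# Relative to the end of the file: back up to the last record.
seek($fh, -$record_length, SEEK_END) or die "seek failed: $!";
my $pos_last = tell($fh);

print "record 3 at $pos_record_3, record 4 at $pos_record_4, ",
      "last record at $pos_last\n";
```

SEEK_SET, SEEK_CUR, and SEEK_END stand for 0, 1, and 2, so seek($fh, 55 * $n, SEEK_SET) does the same thing as the call with a literal zero.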

When we're ready to read from the file, we can't use the ordinary line-input operator because that's made to read lines, not 55-byte records. There may not be a newline character in this entire file, or it may appear in packed data in the middle of a record. Instead, we'll use the read function:

my $buf;  # The input buffer variable
my $number_read = read(FRED, $buf, 55);

As you can see, the first parameter to read is the filehandle. The second parameter is a buffer variable; the data read will be placed into this variable. (Yes, this is an odd way to get the result.) The third parameter is the number of bytes to read; here we've asked for 55 bytes, since that's the size of our record. Normally, the length of $buf will be the number of bytes you asked for, and the return value (in $number_read) will be the same. But if your current position in the file is only five bytes from the end when you request 55 bytes, you'll get only five. At end-of-file, read returns zero, and on error it returns undef.
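Here is one way the return value of read might be checked when walking through a whole file. It is a sketch only: the scratch file deliberately holds two full records plus five stray bytes, so that the short-read case shows up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $record_length = 55;

# Scratch file: two full records followed by five stray bytes.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh);
print $fh "A" x $record_length, "B" x $record_length, "C" x 5;
seek($fh, 0, 0) or die "seek failed: $!";

my ($full_records, $short_bytes) = (0, 0);
while (1) {
    my $buf;
    my $number_read = read($fh, $buf, $record_length);
    die "read failed: $!" unless defined $number_read;
    last if $number_read == 0;               # end of file
    if ($number_read == $record_length) {
        $full_records++;                     # a complete record is in $buf
    } else {
        $short_bytes = $number_read;         # truncated trailing record
        last;
    }
}

print "full records: $full_records, leftover bytes: $short_bytes\n";
```

A real program would probably treat the leftover bytes as a sign of a damaged file rather than just counting them.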

Once you've got those 55 bytes, what can you do with them? You can unpack them (using the format we previously designed) to get the bowler's name and other information, of course:

my($name, $age, $score_1, $score_2, $score_3, $score_4, $score_5, $when)
  = unpack "a40 C S5 L", $buf;  # S5: five unsigned 16-bit scores
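One detail worth knowing is that the "a" format pads a short name with null bytes when packing, and unpack hands that padding back unchanged. The sketch below (with made-up data, and assuming S for the 16-bit scores) shows one way to strip it.

```perl
use strict;
use warnings;

my $format = "a40 C S5 L";   # assumed layout; S = unsigned 16-bit integer

# Build a record in memory rather than reading one from a file.
my $buf = pack($format, "Wilma Flintstone", 28,
               220, 215, 198, 230, 201, 1_000_000_000);

my ($name, $age, @rest) = unpack($format, $buf);
my $when   = pop @rest;      # the last field is the timestamp
my @scores = @rest;          # what remains is the five scores

my $raw_length = length $name;   # still 40: "a" keeps the null padding
$name =~ s/\0+\z//;              # strip the trailing null bytes

print "name: '$name' (was $raw_length bytes), age $age\n";
print "scores: @scores\n";
```

If you write the record back with the same format, pack restores the padding, so stripping it in memory does no harm.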

Since we can read the information from the file with read, can you guess how we can write it back into the file? Sorry, it's not write; that was a trick question.[355] You already know the correct function, which is print. But you have to be sure that the data string is exactly the right size; if it's too large, you'll overwrite the next record's data, but if it's too small, leftover data in the current record may be mixed with the new data. To ensure that the length is correct, we'll use pack. Let's say that Wilma has just bowled a game and her new score is in $new_score. That will be the first of the five most-recent scores we keep for her ($score_5, as the oldest one, will be discarded), and in place of $when (the timestamp of her previous game), we'll store the current time from the time function:

[355]Perl actually does have a write function, but that is used with formats, which are beyond the scope of this book. See the perlform manpage.

print FRED pack("a40 C S5 L",  # S5: five unsigned 16-bit scores
  $name, $age,
  $new_score, $score_1, $score_2, $score_3, $score_4,
  time);

On some systems, you'll have to use seek whenever you switch from reading to writing, even if the current position in the file is already correct. It's not a bad idea, then, to always use seek right before reading or printing.
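Putting the pieces together, a complete read-modify-write cycle for one record might look something like this. It is a sketch under several assumptions: the add_score helper is hypothetical, the scores are packed with S so that each record is 55 bytes, and the database is a scratch file holding two made-up records.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $pack_format = "a40 C S5 L";   # assumed layout; S = unsigned 16-bit integer
my $pack_length = length pack($pack_format, "", 0, 0, 0, 0, 0, 0, 0);

# Build a scratch database holding two made-up records.
my ($db, $path) = tempfile(UNLINK => 1);
binmode($db);
print $db pack($pack_format, "Fred",  30, 200, 190, 180, 170, 160, 1_000_000_000);
print $db pack($pack_format, "Wilma", 28, 220, 215, 198, 230, 201, 1_000_000_000);

# Hypothetical helper: put a new score first in record $n, dropping the oldest.
sub add_score {
    my ($fh, $n, $new_score) = @_;

    seek($fh, $pack_length * $n, 0) or die "seek failed: $!";
    my $buf;
    my $got = read($fh, $buf, $pack_length);
    die "short read on record $n\n"
        unless defined $got && $got == $pack_length;

    my ($name, $age, $s1, $s2, $s3, $s4, $s5, $when)
        = unpack($pack_format, $buf);

    # Seek again before switching from reading to writing.
    seek($fh, $pack_length * $n, 0) or die "seek failed: $!";
    print $fh pack($pack_format, $name, $age,
                   $new_score, $s1, $s2, $s3, $s4, time)
        or die "print failed: $!";
}

add_score($db, 1, 245);   # Wilma is record 1 (records are zero-based)

# Read record 1 back to check the result.
seek($db, $pack_length * 1, 0) or die "seek failed: $!";
my $check;
read($db, $check, $pack_length);
my ($name, $age, @fields) = unpack($pack_format, $check);
my $stamp  = pop @fields;
my @scores = @fields;
print "scores now: @scores\n";
```

Note that the helper calls seek twice: once to find the record before reading, and once more to return to the start of the record before printing.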

Rather than use the two constant values "a40 C S5 L" and 55 throughout the program, as we've done here, it would generally be better to define them just once near the top of the code. That way, if we ever need to change the database format, we don't have to go searching through our code for places where the number 55 appears. Here's one way you might define both of those values, using the length function to determine the length of a string so you won't have to count bytes:

my $pack_format = "a40 C S5 L";  # S5: five unsigned 16-bit scores
my $pack_length = length pack($pack_format, "dummy data",
  0, 1, 2, 3, 4, 5, 6);