Flat-File Databases (Programming the Perl DBI)

2.4. Flat-File Databases

The simplest type of database that we can create and manipulate is the old standby, the flat-file database. This database is essentially a file, or group of files, that contains data in a known and standard format that a program scans for the requested information. Modifications to the data are usually done by updating an in-memory copy of the data held in the file, or files, then writing the entire set of data back out to disk. Flat-file databases are typically ASCII text files containing one record of information per line. The line termination serves as the record delimiter.

In this section we'll be examining the two main types of flat-file database: files that separate fields with a delimiter character, and files that allocate a fixed length to each field. We'll discuss the pros and cons of each type of data file and give you some example code for manipulating them.

The most common format used for flat-file databases is probably the delimited file in which each field is separated by a delimiting character. And possibly the most common of these delimited formats is the comma-separated values (CSV) file, in which fields are separated from one another by commas. This format is understood by many common programs, such as Microsoft Access and spreadsheet programs. As such, it is an excellent base-level and portable format useful for sharing data between applications.[8]

[8]More excitingly, a DBI driver called DBD::CSV exists that allows you to write SQL code to manipulate a flat file containing CSV data.

Figure 2-1. The /etc/passwd file record format

2.4.1. Querying Data

Since delimited files are a very low-level form of storage manager, any manipulations that we wish to perform on the data must be done using operating system functions and low-level query logic, such as basic string comparisons. The following program illustrates how we can open a data file containing colon-separated records of megalith data, search for a given site, and return the data if found:

#!/usr/bin/perl -w
#
# ch02/scanmegadata/scanmegadata: Scans the given megalith data file for
#                                 a given site. Uses colon-separated data.
#

### Check the user has supplied an argument for
###     1) The name of the file containing the data
###     2) The name of the site to search for
die "Usage: scanmegadata <data file> <site name>\n" 
    unless @ARGV == 2;

my $megalithFile = $ARGV[0];
my $siteName     = $ARGV[1];

### Open the data file for reading, and die upon failure
open MEGADATA, "<$megalithFile"
    or die "Can't open $megalithFile: $!\n";

### Declare our row field variables
my ( $name, $location, $mapref, $type, $description );

### Declare our 'record found' flag
my $found;

### Scan through all the entries for the desired site
while ( <MEGADATA> ) {

    ### Remove the newline that acts as a record delimiter
    chop;
    
    ### Break up the record data into separate fields
    ( $name, $location, $mapref, $type, $description ) =
         split( /:/, $_ );
    
    ### Test the sitename against the record's name
    if ( $name eq $siteName ) {
         $found = $.;  # $. holds current line number in file
         last;
    }
}

### If we did find the site we wanted, print it out
if ( $found ) {
     print "Located site: $name on line $found\n\n";
     print "Information on $name ( $type )\n";
     print "===============",
         ( "=" x ( length($name) + length($type) + 5 ) ), "\n";
     print "Location:      $location\n";
     print "Map Reference: $mapref\n";
     print "Description:   $description\n";
}

### Close the megalith data file
close MEGADATA;

exit;

For example, running that program with a file containing a record in the following format:

Stonehenge:Wiltshire:SU 123 400:Stone Circle and Henge:The most famous stone circle

and a search term of Stonehenge would return the following information:

Located site: Stonehenge on line 1

Information on Stonehenge ( Stone Circle and Henge )
====================================================
Location:      Wiltshire
Map Reference: SU 123 400
Description:   The most famous stone circle

indicating that our brute-force scan and test for the correct site has worked. As you can clearly see from the example program, we have used Perl's own native file I/O functions for reading in the data file, and Perl's own string handling functions to break up the delimited data and test it for the correct record.

The downside to delimited file formats is that if any piece of data contains the delimiting character, you need to be especially careful not to break up the records in the wrong place. Using the Perl split() function with a simple regular expression, as used above, does not take this into account and could produce wrong results. For example, a record containing the following information would cause the split() to happen in the wrong place:

Stonehenge:Wiltshire:SU 123 400:Stone Circle and Henge:Stonehenge: The most famous stone circle

The easiest quick-fix technique is to translate any delimiter characters in the string into some other character that you're sure won't appear in your data. Don't forget to do the reverse translation when you fetch the records back.

Another common way of storing data within flat files is to use fixed-length records in which to store the data. That is, each piece of data fits into an exactly sized space in the data file. In this form of database, no delimiting character is needed between the fields. There's also no need to delimit each record, but we'll continue to use ASCII line termination as a record delimiter in our examples because Perl makes it very easy to work with files line by line.

Using fixed-width fields is similar to the way in which data is organized in more powerful database systems such as an RDBMS. The pre-allocation of space for record data allows the storage manager to make assumptions about the layout of the data on disk and to optimize accordingly. For our megalithic data purposes, we could settle on the data sizes of:[10]

[10]The fact that these data sizes are all powers of two has no significance other than to indicate that the authors are old enough to remember when powers of two were significant and useful sometimes. They generally aren't anymore.

Field           Required Bytes
-----           --------------
Name            64
Location        64
Map Reference   16
Type            32
Description     256

Storing the data in this format requires slightly different storage manager logic to be used, although the standard Perl file I/O functions are still applicable. To test this data for the correct record, we need to implement a different way of extracting the fields from within each record. For a fixed-length data file, the Perl function unpack() is perfect. The following code shows how the unpack() function replaces the split() used above:

### Break up the record data into separate fields
### using the data sizes listed above
( $name, $location, $mapref, $type, $description ) =
    unpack( "A64 A64 A16 A32 A256", $_ );

Although fixed-length fields are always the same length, the data that is being put into a particular field may not be as long as the field. In this case, the extra space will be filled with a character not normally encountered in the data or one that can be ignored. Usually, this is a space character (ASCII 32) or a nul (ASCII 0).

In the code above, we know that the data is space-packed, and so we remove any trailing space from the name record so as not to confuse the search. This can be simply done by using the uppercase A format with unpack().

If you need to choose between delimited fields and fixed-length fields, here are a few guidelines:

The main limitations

The main limitation with delimited fields is the need to add special handling to ensure that neither the field delimiter or the record delimiter characters get added into a field value.

The main limitation with fixed-length fields is simply the fixed length. You need to check for field values being too long to fit (or just let them be silently truncated). If you need to increase a field width, then you'll have to write a special utility to rewrite your file in the new format and remember to track down and update every script that manipulates the file directly.

Space

A delimited-field file often uses less space than a fixed-length record file to store the same data, sometimes very much less space. It depends on the number and size of any empty or partially filled fields. For example, some field values, like web URLs, are potentially very long but typically very short. Storing them in a long fixed-length field would waste a lot of space.

While delimited-field files often use less space, they do "waste" space due to all the field delimiter characters. If you're storing a large number of very small fields then that might tip the balance in favor of fixed-length records.

Speed

These days, computing power is rising faster than hard disk data transfer rates. In other words, it's often worth using more space-efficient storage even if that means spending more processor time to use it.

Generally, delimited-field files are better for sequential access than fixed-length record files because the reduced size more than makes up for the increase in processing to extract the fields and handle any escaped or translated delimiter characters.

However, fixed-length record files do have a trick up their sleeve: direct access. If you want to fetch record 42,927 of a delimited-field file, you have to read the whole file and count records until you get to the one you want. With a fixed-length record file, you can just multiply 42,927 by the total record width and jump directly to the record using seek().

Furthermore, once it's located, the record can be updated in-place by overwriting it with new data. Because the new record is the same length as the old, there's no danger of corrupting the following record.

2.4.2. Inserting Data

Inserting data into a flat-file database is very straightforward and usually amounts to simply tacking the new data onto the end of the data file. For example, inserting a new megalith record into a colon-delimited file can be expressed as simply as:

#!/usr/bin/perl -w
#
# ch02/insertmegadata/insertmegadata: Inserts a new record into the 
#                                     given megalith data file as
#                                     colon-separated data
#

### Check the user has supplied an argument to scan for
###     1) The name of the file containing the data
###     2) The name of the site to insert the data for
###     3) The location of the site
###     4) The map reference of the site
###     5) The type of site
###     6) The description of the site
die "Usage: insertmegadata"
    ." <data file> <site name> <location> <map reference> <type> <description>\n"
    unless @ARGV == 6;

my $megalithFile    = $ARGV[0];
my $siteName        = $ARGV[1];
my $siteLocation    = $ARGV[2];
my $siteMapRef      = $ARGV[3];
my $siteType        = $ARGV[4];
my $siteDescription = $ARGV[5];

### Open the data file for concatenation, and die upon failure
open MEGADATA, ">>$megalithFile"
    or die "Can't open $megalithFile for appending: $!\n";

### Create a new record
my $record = join( ":", $siteName, $siteLocation, $siteMapRef,
                        $siteType, $siteDescription );

### Insert the new record into the file
print MEGADATA "$record\n"
    or die "Error writing to $megalithFile: $!\n";

### Close the megalith data file
close MEGADATA
    or die "Error closing $megalithFile: $!";

print "Inserted record for $siteName\n";

exit;

This example simply opens the data file in append mode and writes the new record to the open file. Simple as this process is, there is a potential drawback. This flat-file database does not detect the insertion of multiple items of data with the same search key. That is, if we wanted to insert a new record about Stonehenge into our megalith database, then the software would happily do so, even though a record for Stonehenge already exists.

This may be a problem from a data integrity point of view. A more sophisticated test prior to appending the data might be worth implementing to ensure that duplicate records do not exist. Combining the insert program with the query program above is a straightforward approach.

Another potential (and more important) drawback is that this system will not safely handle occasions in which more than one user attempts to add new data into the database. Since this subject also affects updating and deleting data from the database, we'll cover it more thoroughly in a later section of this chapter.

Inserting new records into a fixed-length data file is also simple. Instead of printing each field to the Perl filehandle separated by the delimiting character, we can use the pack() function to create a fixed-length record out of the data.

2.4.3. Updating Data

Updating data within a flat-file database is where things begin to get a little more tricky. When querying records from the database, we simply scanned sequentially through the database until we found the correct record. Similarly, when inserting data, we simply attached the new data without really knowing what was already stored within the database.

The main problem with updating data is that we need to be able to read in data from the data file, temporarily mess about with it, and write the database back out to the file without losing any records.

One approach is to slurp the entire database into memory, make any updates to the in-memory copy, and dump it all back out again. A second approach is to read the database in record by record, make any alterations to each individual record, and write the record immediately back out to a temporary file. Once all the records have been processed, the temporary file can replace the original data file. Both techniques are viable, but we prefer the latter for performance reasons. Slurping entire large databases into memory can be very resource-hungry.

The following short program implements the latter of these strategies to update the map reference in the database of delimited records:

#!/usr/bin/perl -w
#
# ch02/updatemegadata/updatemegadata: Updates the given megalith data file
#                                     for a given site. Uses colon-separated 
#                                     data and updates the map reference field.
#

### Check the user has supplied an argument to scan for
###     1) The name of the file containing the data
###     2) The name of the site to search for
###     3) The new map reference
die "Usage: updatemegadata <data file> <site name> <new map reference>\n"
    unless @ARGV == 3;

my $megalithFile = $ARGV[0];
my $siteName     = $ARGV[1];
my $siteMapRef   = $ARGV[2];
my $tempFile     = "tmp.$$";

### Open the data file for reading, and die upon failure
open MEGADATA, "<$megalithFile"
    or die "Can't open $megalithFile: $!\n";

### Open the temporary megalith data file for writing
open TMPMEGADATA, ">$tempFile"
    or die "Can't open temporary file $tempFile: $!\n";

### Scan through all the records looking for the desired site
while ( <MEGADATA> ) {

    ### Quick pre-check for maximum performance:
    ### Skip the record if the site name doesn't appear as a field
    next unless m/^\Q$siteName:/;
    
    ### Break up the record data into separate fields
    ### (we let $description carry the newline for us)
    my ( $name, $location, $mapref, $type, $description ) =
         split( /:/, $_ );
    
    ### Skip the record if the site name doesn't match. (Redundant after the
    ### reliable pre-check above but kept for consistency with other examples.)
    next unless $siteName eq $name;
    
    ### We've found the record to update, so update the map ref value
    $mapref = $siteMapRef;
    
    ### Construct an updated record
    $_ = join( ":", $name, $location, $mapref, $type, $description );
    
}
continue {

   ### Write the record out to the temporary file
   print TMPMEGADATA $_
      or die "Error writing $tempFile: $!\n";
}

### Close the megalith input data file
close MEGADATA;

### Close the temporary megalith output data file
close TMPMEGADATA
   or die "Error closing $tempFile: $!\n";

### We now "commit" the changes by deleting the old file...
unlink $megalithFile
   or die "Can't delete old $megalithFile: $!\n";

### and renaming the new file to replace the old one.
rename $tempFile, $megalithFile
   or die "Can't rename '$tempFile' to '$megalithFile': $!\n";

exit 0;

You can see we've flexed our Perl muscles on this example, using a while ... continue loop to simplify the logic and adding a pretest for increased speed.

An equivalent program that can be applied to a fixed-length file is very similar, except that we use a faster in-place update to change the contents of the field. This principle is similar to the in-place query described previously: we don't need to unpack and repack all the fields stored within each record, but can simply update the appropriate chunk of each record. For example:

### Scan through all the records looking for the desired site
while ( <MEGADATA> ) {

    ### Quick pre-check for maximum performance:
    ### Skip the record if the site name doesn't appear at the start
    next unless m/^\Q$siteName/;
    
    ### Skip the record if the extracted site name field doesn't match
    next unless unpack( "A64", $_ ) eq $siteName;
    
    ### Perform in-place substitution to upate map reference field
    substr( $_, 64+64, 16) =  pack( "A16", $siteMapRef ) );
    
}

This technique is faster than packing and unpacking each record stored within the file, since it carries out the minimum amount of work needed to change the appropriate field values.

You may notice that the pretest in this example isn't 100% reliable, but it doesn't have to be. It just needs to catch most of the cases that won't match in order to pay its way by reducing the number of times the more expensive unpack and field test gets executed. Okay, this might not be a very convincing application of the idea, but we'll revisit it more seriously later in this chapter.

2.4.4. Deleting Data

The final form of data manipulation that you can apply to flat-file databases is the removal, or deletion, of records from the database. We shall process the file a record at a time by passing the data through a temporary file, just as we did for updating, rather than slurping all the data into memory and dumping it at the end.

With this technique, the action of removing a record from the database is more an act of omission than any actual deletion. Each record is read in from the file, tested, and written out to the file. When the record to be deleted is encountered, it is simply not written to the temporary file. This effectively removes all trace of it from the database, albeit in a rather unsophisticated way.

The following program can be used to remove the relevant record from the delimited megalithic database when given an argument of the name of the site to delete:

#!/usr/bin/perl -w
#
# ch02/deletemegadata/deletemegadata: Deletes the record for the given
#                                     megalithic site. Uses
#                                     colon-separated data
#

### Check the user has supplied an argument to scan for
###     1) The name of the file containing the data
###     2) The name of the site to delete
die "Usage: deletemegadata <data file> <site name>\n"
    unless @ARGV == 2;

my $megalithFile  = $ARGV[0];
my $siteName      = $ARGV[1];
my $tempFile      = "tmp.$$";

### Open the data file for reading, and die upon failure
open MEGADATA, "<$megalithFile"
    or die "Can't open $megalithFile: $!\n";

### Open the temporary megalith data file for writing
open TMPMEGADATA, ">$tempFile"
    or die "Can't open temporary file $tempFile: $!\n";

### Scan through all the entries for the desired site
while ( <MEGADATA> ) {

    ### Extract the site name (the first field) from the record
    my ( $name ) = split( /:/, $_ );
    
    ### Test the sitename against the record's name
    if ( $siteName eq $name ) {
    
        ### We've found the record to delete, so skip it and move to next record
        next;
    }
    
    ### Write the original record out to the temporary file
    print TMPMEGADATA $_
        or die "Error writing $tempFile: $!\n";
    }

### Close the megalith input data file
close MEGADATA;

### Close the temporary megalith output data file
close TMPMEGADATA
    or die "Error closing $tempFile: $!\n";

### We now "commit" the changes by deleting the old file ...
unlink $megalithFile
    or die "Can't delete old $megalithFile: $!\n";

### and renaming the new file to replace the old one.
rename $tempFile, $megalithFile
    or die "Can't rename '$tempFile' to '$megalithFile': $!\n";

exit 0;

The code to remove records from a fixed-length data file is almost identical. The only change is in the code to extract the field value, as you'd expect:

### Extract the site name (the first field) from the record
my ( $name ) = unpack( "A64", $_ );

Like updating, deleting data may cause problems if multiple users are attempting to make simultaneous changes to the data. We'll look at how to deal with this problem a little later in this chapter.