DBM Files and the BerkeleyDatabase Manager (Programming the Perl DBI)

In addition to the standard DBM file features, Berkeley DB and the DB_File module also provide support for several different storage and retrieval algorithms that can be used in subtly different situations. In newer versions of the software, concurrent access to databases and locking are also supported.

2.7.3. Inserting and Retrieving Values

Inserting data into a Berkeley DB using the Perl DB_File module is extremely simple as a result of using a tied hash or tied array . The association of a DBM file and a Perl data structure is created when the database is opened. This allows us to manipulate the contents of the database simply by altering the contents of the Perl data structures.

This system makes it very easy to store data within a DBM file and also abstracts the actual file-related operations for data manipulation away from our scripts. Thus, the Berkeley DB is a higher-level storage manager than the simple flat-file databases discussed earlier in this chapter.

The following script demonstrates the insertion and retrieval of data from a DBM file using a tied hash. This hash has the Perl characteristic of being a key/value pair. That is, values are stored within the hash table against a unique key. This affords extremely fast retrieval and an element of indexed data access as opposed to sequential access. For example:

#!/usr/bin/perl -w
#
# ch02/DBM/simpleinsert: Creates a Berkeley DB, inserts some test data
#                        and dumps it out again

use DB_File;
use Fcntl ':flock';

### Initialize the Berkeley DB
my %database;
my $db = tie %database, 'DB_File', "simpleinsert.dat",
         O_CREAT | O_RDWR, 0666
    or die "Can't initialize database: $!\n";

my $fd = $db->fd();
open DATAFILE, "+<&=$fd"
    or die "Can't safely open file: $!\n";
print "Acquiring exclusive lock...";
flock( DATAFILE, LOCK_EX )
    or die "Unable to acquire lock: $!. Aborting";
print "Acquired lock. Ready to update database!\n\n";


### Insert some data rows
$database{'Callanish I'} = 
    "This site, commonly known as the "Stonehenge of the North" is in the
form of a buckled Celtic cross.";

$database{'Avebury'} =
    "Avebury is a vast, sprawling site that features, amongst other marvels,
the largest stone circle in Britain. The henge itself is so large,
it almost completely surrounds the village of Avebury.";
$database{'Lundin Links'} =
    "Lundin Links is a megalithic curiosity, featuring 3 gnarled and
immensely tall monoliths arranged possibly in a 4-poster design.
Each monolith is over 5m tall.";

### Untie the database
undef $db;
untie %database;

### Close the file descriptor to release the lock
close DATAFILE;


### Retie the database to ensure we're reading the stored data
$db = tie %database, 'DB_File', "simpleinsert.dat", O_RDWR, 0444
    or die "Can't initialize database: $!\n";

### Only need to lock in shared mode this time because we're not updating ...
$fd = $db->fd();
open DATAFILE, "+<&=$fd" or die "Can't safely open file: $!\n";
print "Acquiring shared lock...";
flock( DATAFILE, LOCK_SH )
    or die "Unable to acquire lock: $!. Aborting";
print "Acquired lock. Ready to read database!\n\n";

### Dump the database
foreach my $key ( keys %database ) {
    print "$key\n", ( "=" x ( length( $key ) + 1 ) ), "\n\n";
    print "$database{$key}\n\n";
}

### Close the Berkeley DB
undef $db;
untie %database;

### Close the file descriptor to release the lock
close DATAFILE;

exit;

When run, this script will generate the following output, indicating that it is indeed retrieving values from a database:

Acquiring exclusive lock...Acquired lock. Ready to update database!

Acquiring shared lock...Acquired lock. Ready to read database!

Callanish I
============

This site, commonly known as the "Stonehenge of the North" is in the
form of a buckled Celtic cross.

Avebury
========

Avebury is a vast, sprawling site that features, amongst other marvels,
the largest stone circle in Britain. The henge itself is so large,
it almost completely surrounds the village of Avebury.

Lundin Links
=============

Lundin Links is a megalithic curiosity, featuring 3 gnarled and
immensely tall monoliths arranged possibly in a 4-poster design.
Each monolith is over 5m tall.

You may have noticed that we cheated a little bit in the previous example. We stored only the descriptions of the sites instead of all the information such as the map reference and location. This is the inherent problem with key/value pair databases: you can store only a single value against a given key. You can circumvent this by simply concatenating values into a string and storing that string instead, just like we did using join(), pack(), Data::Dumper, and Storable earlier in this chapter.

This particular form of storage jiggery-pokery can be accomplished in at least two ways.[17] One is to hand-concatenate the data into a string and hand-split it when required. The other is slightly more sophisticated and uses a Perl object encapsulating a megalith to handle, and hide, the packing and unpacking.

[17]As with all Perl things, There's More Than One Way To Do It (a phrase so common with Perl you'll often see it written as TMTOWTDI). We're outlining these ideas here because they dawned on us first. You might come up with something far more outlandish and obscure, or painfully obvious. Such is Perl.

2.7.3.1. Localized storage and retrieval

The first technique -- application handling of string joins and splits -- is certainly the most self-contained. This leads us into a small digression.

Self-containment can be beneficial, as it tends to concentrate the logic of a script internally, making things slightly more simple to understand. Unfortunately, this localization can also be a real pain. Take our megalithic database as a good example. In the previous section, we wrote four different Perl scripts to handle the four main data manipulation operations. With localized logic, you're essentially implementing the same storing and extraction code in four different places.

Furthermore, if you decide to change the format of the data, you need to keep four different scripts in sync. Given that it's also likely that you'll add more scripts to perform more specific functions (such as generating web pages) with the appropriate megalithic data from the database, that gives your database more points of potential failure and inevitable corruption.

Getting back to the point, we can fairly simply store complex data in a DBM file by using either join( ) , to create a delimited string, or pack( ) , to make a fixed-length record. join( ) can be used in the following way to produce the desired effect:

### Insert some data rows
$database{'Callanish I'} =
    join( ':', 'Callanish I', 'Callanish, Western Isles', 'NB 213 330',
               'Stone Circle', 'Description of Callanish I' );
$database{'Avebury'} =
    join( ':', 'Avebury', 'Wiltshire', 'SU 103 700', 
               'Stone Circle and Henge',
               'Description of Avebury' );
$database{'Lundin Links'} =
    join( ':', 'Lundin Links', 'Fife', 'NO 404 027', 'Standing Stones',
               'Description of Lundin Links' );
### Dump the database
foreach my $key ( keys %database ) {
    my ( $name, $location, $mapref, $type, $description ) =
         split( /:/, $database{$key} );
    print "$name\n", ( "=" x length( $name ) ), "\n\n";
    print "Location:      $location\n";
    print "Map Reference: $mapref\n";
    print "Description:   $description\n\n";
}

The storage of fixed-length records is equally straightforward, but does gobble up space within the database rather quickly. Furthermore, the main rationale for using fixed-length records is often access speed, but when stored within a DBM file, in-place queries and updates simply do not provide any major speed increase.

The code to insert and dump megalithic data using fixed-length records is shown in the following code segment:

### The pack and unpack template.
$PACKFORMAT = 'A64 A64 A16 A32 A256';

### Insert some data rows
$database{'Callanish I'} =
    pack( $PACKFORMAT, 'Callanish I', 'Callanish, Western Isles', 
                     'NB 213 330', 'Stone Circle', 
                     'Description of Callanish I' );

$database{'Avebury'} =
    pack( $PACKFORMAT, 'Avebury', 'Wiltshire', 'SU 103 700', 
                     'Stone Circle and Henge', 'Description of Avebury' );

$database{'Lundin Links'} =
    pack( $PACKFORMAT, 'Lundin Links', 'Fife', 'NO 404 027', 
                     'Standing Stones', 
                     'Description of Lundin Links' );

### Dump the database
foreach my $key ( keys %database ) {
    my ( $name, $location, $mapref, $type, $description ) =
        unpack( $PACKFORMAT, $database{$key} );
    print "$name\n", ( "=" x length( $name ) ), "\n\n";
    print "Location:      $location\n";
    print "Map Reference: $mapref\n";
    print "Description:   $description\n\n";
}

The actual code to express the storage and retrieval mechanism isn't really much more horrible than the delimited record version, but it does introduce a lot of gibberish in the form of the pack( ) template, which could easily be miskeyed or forgotten about. This also doesn't really solve the problem of localized program logic, and turns maintenance into the aforementioned nightmare.

How can we improve on this?

2.7.3.2. Packing in Perl objects

One solution to both the localized code problem and the problem of storing multiple data values within a single hash key/value pair is to use a Perl object to encapsulate and hide some of the nasty bits.[18]

[18]This is where people tend to get a little confused about Perl. The use of objects, accessor methods, and data hiding are all very object-oriented. By this design, we get to mix the convenience of non-OO programming with the neat bits of OO programming. Traditional OO programmers have been known to make spluttering noises when Perl programmers discuss this sort of thing in public.

The following Perl code defines an object of class Megalith. We can then reuse this packaged object module in all of our programs without having to rewrite any of them, if we change the way the module works:

#!/usr/bin/perl -w
  #
  # ch02/DBM/Megalith.pm: A perl class encapsulating a megalith
  
  package Megalith;
  
  use strict;
  use Carp;
  
  ### Creates a new megalith object and initializes the member fields.
  sub new {
      my $class = shift;
      my ( $name, $location, $mapref, $type, $description ) = @_;
      my $self = {};
      bless $self => $class;
      
      ### If we only have one argument, assume we have a string
      ### containing all the field values in $name and unpack it
      if ( @_ == 1 ) {
          $self->unpack( $name );
      }
      else {
          $self->{name} = $name;
          $self->{location} = $location;
          $self->{mapref} = $mapref;
          $self->{type} = $type;
          $self->{description} = $description;
      }
      return $self;
  }
  
  ### Packs the current field values into a colon delimited record
  ### and returns it
  sub pack {
      my ( $self ) = @_;
      
      my $record = join( ':', $self->{name}, $self->{location},
                              $self->{mapref}, $self->{type},
                              $self->{description} );
      
      ### Simple check that fields don't contain any colons
      croak "Record field contains ':' delimiter character"
          if $record =~ tr/:/:/ != 4;
      
      return $record;
  }
  
  ### Unpacks the given string into the member fields
  sub unpack {
      my ( $self, $packedString ) = @_;
      
      ### Naive split...Assumes no inter-field delimiters
      my ( $name, $location, $mapref, $type, $description ) =
           split( ':', $packedString, 5 );
      
      $self->{name} = $name;
      $self->{location} = $location;
      $self->{mapref} = $mapref;
      $self->{type} = $type;
      $self->{description} = $description;
  }
  
  ### Displays the megalith data
  sub dump {
      my ( $self ) = @_;
      
      print "$self->{name} ( $self->{type} )\n", 
            "=" x ( length( $self->{name} ) + 
                    length( $self->{type} ) + 5 ), "\n";
      print "Location:      $self->{location}\n"; 
      print "Map Reference: $self->{mapref}\n";
      print "Description:   $self->{description}\n\n";
  }
  
  1;

The record format defined by the module contains the items of data pertaining to each megalithic site that can be queried and manipulated by programs. A new Megalith object can be created from Perl via the new operator, for example:

### Create a new object encapsulating Stonehenge
$stonehenge = 
    new Megalith( 'Stonehenge', 'Description of Stonehenge',
                  'Wiltshire', 'SU 123 400' );

### Display the name of the site stored within the object ...
print "Name: $stonehenge->{name}\n";

It would be extremely nice if these Megalith objects could be stored directly into a DBM file. Let's try a simple piece of code that simply stuffs the object into the hash:

### Create a new object encapsulating Stonehenge
$stonehenge = 
    new Megalith( 'Stonehenge', 'Description of Stonehenge',
                  'Wiltshire', 'SU 123 400' );

### Store the object within the database hash
$database{'Stonehenge'} = $stonehenge;

### Have a look at the entry within the database
print "Key: $database{'Stonehenge'}\n";

This generates some slightly odd results, to say the least:

Key: Megalith=HASH(0x80e9aec)

What appears to have happened is that the string describing the reference to the Perl object has been inserted in the Berkeley DB instead of the object itself!

This result is perhaps not surprising, given that the DBM systems are really designed for storing single string values, and there is no innate understanding of how to compact complex objects into a single value. It simply converts all keys and values into strings.

Fortunately, the problem of storing a Perl object can be routed around by packing , or marshalling, all the values of all the Megalith object's fields into a single string, and then inserting that string into the database. Similarly, upon extracting the string from the database, a new Megalith can be allocated and populated by unpacking the string into the appropriate fields.

By using our conveniently defined Megalith class, we can write the following code to do this (note the calling of the pack() method):

$database{'Callanish I'} =
    new Megalith( 'Callanish I', 
                  'Western Isles',
                  'NB 213 330',
                  'Stone Circle',
                  'Description of Callanish I' )->pack(  );

### Dump the database
foreach $key ( keys %database ) {

    ### Unpack the record into a new megalith object
    my $megalith = new Megalith( $database{$key} );

    ### And display the record
    $megalith->dump(  );
}

The Megalith object has two methods declared within it called pack( ) and unpack( ) . These simply pack all the fields into a single delimited string, and unpack a single string into the appropriate fields of the object as needed. If a Megalith object is created with one of these strings as the sole argument, unpack( ) is called internally, shielding the programmer from the internal details of storage management.

Similarly, the actual way in which the data is packed and unpacked is hidden from the module user. This means that if any database structural changes need to be made, they can be made internally without any maintenance on the database manipulation scripts themselves.

If you read the section on putting complex data into flat files earlier in the chapter, then you'll know that there's more than one way to do it.

So although it's a little more work at the outset, it is actually quite straightforward to store Perl objects (and other complex forms of data) within DBM files.

2.7.3.3. Object accessor methods

A final gloss on the Megalith class would be to add accessor methods to allow controlled access to the values stored within each object. That is, the example code listed above contains code that explicitly accesses member variables within the object:

print "Megalith Name: $megalith->{name}\n";

This may cause problems if the internal structure of the Megalith object alters in some way. Also, if you write $megalith->{nme} by mistake, no errors or warnings will be generated. Defining an accessor method called getName( ), such as:

### Returns the name of the megalith
sub getName {
    my ( $self ) = @_;
    return $self->{name};
}

makes the code arguably more readable:

print "Megalith Name: " . $megalith->getName(  ) . "\n";

and also ensures the correctness of the application code, since the actual logic is migrated, once again, into the object.

2.7.3.4. Querying limitations of DBM files and hashtables

Even with the functionality of being able to insert complex data into the Berkeley DB file (albeit in a slightly roundabout way), there is still a fundamental limitation of this database software: you can retrieve values via only one key. That is, if you wanted to search our megalithic database, the name, not the map reference or the location, must be used as the search term.

This might be a pretty big problem, given that you might wish to issue a query such as, ``tell me about all the sites in Wiltshire,'' without specifying an exact name. In this case, every record would be tested to see if any fit the bill. This would use a sequential search instead of the indexed access you have when querying against the key.

A solution to this problem is to create secondary referential hashes that have key values for the different fields you might wish to query on. The value stored for each key is actually a reference to the original hash and not to a separate value. This allows you to update the value in the original hash, and the new value is automatically mirrored within the reference hashes. The following snippet shows some code that could be used to create and dump out a referential hash keyed on the location of a megalithic site:

### Build a referential hash based on the location of each monument
$locationDatabase{'Wiltshire'}     = \$database{'Avebury'};
$locationDatabase{'Western Isles'} = \$database{'Callanish I'};
$locationDatabase{'Fife'}          = \$database{'Lundin Links'};

### Dump the location database
foreach $key ( keys %locationDatabase ) {

    ### Unpack the record into a new megalith object
    my $megalith = new Megalith( ${ $locationDatabase{$key} } );

    ### And display the record
    $megalith->dump(  );
}

There are, of course, a few drawbacks to this particular solution. The most apparent is that any data deletion or insertion would require a mirror operation to be performed on each secondary reference hash.

The biggest problem with this approach is that your data might not have unique keys. If we wished to store records for Stonehenge and Avebury, both of those sites have a location of Wiltshire. In this case, the latest inserted record would always overwrite the earlier records inserted into the hash. To solve this general problem, we can use a feature of Berkeley DB files that allows value chaining.

2.7.3.5. Chaining multiple values into a hash

One of the bigger problems when using a DBM file with the storage mechanism of DB_HASH is that the keys against which the data is stored must be unique. For example, if we stored two different values with the key of ``Wiltshire,'' say for Stonehenge and Avebury, generally the last value inserted into the hash would get stored in the database. This is a bit problematic, to say the least.

In a good database design, the primary key of any data structure generally should be unique in order to speed up searches. But quick and dirty databases, badly designed ones, or databases with a suboptimal data quality may not be able to enforce this uniqueness. Similarly, using referential hashtables to provide nonprimary key searching of the database also triggers this problem.

A Perl solution to this problem is to push the multiple values onto an array that is stored within the hash element. This technique works fine while the program is running, because the array references are still valid, but when the database is written out and reloaded, the data is invalid.

Therefore, to solve this problem, we need to look at using the different Berkeley DB storage management method of DB_BTREE , which orders its keys prior to insertion. With this mechanism, it is possible to have duplicate keys, because the underlying DBM file is in the form of an array rather than a hashtable. Fortunately, you still reference the DBM file via a Perl hashtable, so DB_BTREE is not any harder to use. The main downside to DB_BTREE storage is a penalty in performance, since a B-Tree is generally slightly slower than a hashtable for data retrieval.

The following short program creates a Berkeley DB using the DB_BTREE storage mechanism and also specifies a flag to indicate that duplicate keys are allowed. A number of rows are inserted with duplicate keys, and finally the database is dumped to show that the keys have been stored:

#!/usr/bin/perl -w
  #
  # ch02/DBM/dupkey1: Creates a Berkeley DB with the DB_BTREE mechanism and 
  #                   allows for duplicate keys. We then insert some test 
  #                   object data with duplicate keys and dump the final
  #                   database.
  
  use DB_File;
  use Fcntl ':flock';
  use Megalith;
  
  ### Set Berkeley DB BTree mode to handle duplicate keys
  $DB_BTREE->{'flags'} = R_DUP;
  
  ### Remove any existing database files
  unlink 'dupkey2.dat';
  
  ### Open the database up
  my %database;
  my $db = tie %database, 'DB_File', "dupkey2.dat", 
                 O_CREAT | O_RDWR, 0666, $DB_BTREE
      or die "Can't initialize database: $!\n";
  
  ### Exclusively lock the database to ensure no one accesses it
  my $fd = $db->fd(  );
  open DATAFILE, "+<&=$fd"
      or die "Can't safely open file: $!\n";
  print "Acquiring exclusive lock...";
  flock( DATAFILE, LOCK_EX )
      or die "Unable to acquire lock: $!. Aborting";
  print "Acquired lock. Ready to update database!\n\n";
  
  ### Create, pack and insert some rows with duplicate keys
  $database{'Wiltshire'} = 
    new Megalith( 'Avebury',
                  'Wiltshire',
                  'SU 103 700',
                  'Stone Circle and Henge',
                  'Largest stone circle in Britain' )->pack(  );
  
  $database{'Wiltshire'} =
    new Megalith( 'Stonehenge',
                  'Wiltshire',
                  'SU 123 400',
                  'Stone Circle and Henge',
                  'The most popularly known stone circle in the world' )->pack(  );
  
  $database{'Wiltshire'} =
    new Megalith( 'The Sanctuary',
                  'Wiltshire',
                  'SU 118 680',
                  'Stone Circle ( destroyed )',
                  'No description available' )->pack(  );
  
  ### Dump the database
  foreach my $key ( keys %database ) {
  
      ### Unpack the record into a new megalith object
      my $megalith = new Megalith( $database{$key} );
      
      ### And display the record
      $megalith->dump(  );
  }
  
  ### Close the database
  undef $db;
  untie %database;
  
  ### Close the filehandle to release the lock
  close DATAFILE;
  
  exit;

The output you get from running this program is not exactly what we'd hoped for:

Acquiring exclusive lock...Acquired lock. Ready to update database!
  
  The Sanctuary ( Stone Circle ( destroyed ) )
  ============================================
  Location:      Wiltshire
  Map Reference: SU 118 680
  Description:   No description available
  
  The Sanctuary ( Stone Circle ( destroyed ) )
  ============================================
  Location:      Wiltshire
  Map Reference: SU 118 680
  Description:   No description available
  
  The Sanctuary ( Stone Circle ( destroyed ) )
  ============================================
  Location:      Wiltshire
  Map Reference: SU 118 680
  Description:   No description available

It seems that we've managed to successfully store three copies of the same record instead of three different records!

Fortunately, this isn't actually the case. We have correctly stored the three different records with the same key in the DBM file. The problem lies in the way we've tried to read these records back out of the DBM file. A basic dereference using the hash key obviously doesn't work, since Perl stores only a single value for each key, as we already know.

To get around this limitation, we can use the seq( ) method declared within the DB_File module, which is used to traverse chained records stored within a single hash element. Figure 2-1 illustrates the principle of chained record traversal within a hash element.

Figure 2-1. Chained record traversal

The corrected record dumping chunk is rewritten to use seq() in this way:

### Dump the database
my ($key, $value, $status) = ('', '', 0);
for ( $status = $db->seq( $key, $value, R_FIRST ) ;
      $status == 0 ;
      $status = $db->seq( $key, $value, R_NEXT ) ) {

    ### Unpack the record into a new megalith object
    my $megalith = new Megalith( $value );

    ### And display the record
    $megalith->dump();
}

Running this corrected version produces the output we expected, i.e., records for three different megalithic sites.

The seq() method is quite simple to use and understand, and it works well when used in conjunction with a for loop, as shown above. The method takes three arguments: the hash key, the hash value, and a flag signifying which element within the chain should be returned. The first two arguments are actually populated with the hash key and the correct hash value, respectively, when seq() is called. Exactly which hash value is returned depends on the value of the third argument:

R_FIRST returns the first record within the chain of records.
R_LAST returns the last record stored within the chain of records.
R_NEXT returns the next record within the chain of records. This is used for forward sequential traversals of the chain.
R_PREV returns the previous record within the chain of records. This is used for backward sequential traversals of the chain.
R_CURSOR returns a record in which a partial match for the key has been located. This allows a certain element of "fuzzy matching" of keys. This feature is not necessarily accurate and may return the closest match to the desired key rather than an exact match. For example, if you searched for all sites within Wiltshire and asked for a partial match against ``wilt'', but no exact records matched, you may be returned the entries for the ``Western Isles,'' as these are the closest to the search term.

In the database dumping example shown above, we are simply starting at the beginning of the record chain and traversing through it in a forward direction. We could have performed a backward search by writing:

for ( $status = $db->seq( $key, $value, R_LAST ) ;
      $status == 0 ;
      $status = $db->seq( $key, $value, R_PREV ) ) {
    ...
}

A quicker and easier utility method for querying duplicate values also exists: get_dup() . This method returns either the number of records with the given key or an array or hash containing the appropriate records. For example, given that we have three records in our database with the key of Wiltshire, we could verify that fact by writing:

### Displays the number of records inserted against 
### the "Wiltshire" key
my $numRecords = $db->get_dup( 'Wiltshire' );
print "Number of Wiltshire records: $numRecords\n";

Table 2-1. The Default Argument Values of DB_File

2.7.3.1. Localized storage and retrieval

2.7.3.2. Packing in Perl objects

2.7.3.3. Object accessor methods

2.7.3.4. Querying limitations of DBM files and hashtables

2.7.3.5. Chaining multiple values into a hash

Figure 2-1. Chained record traversal

2.7. DBM Files and the BerkeleyDatabase Manager

2.7.1. Creating a New Database

2.7.2. Locking Strategies

2.7.3. Inserting and Retrieving Values

2.7.4. Deleting Values