2.5. Putting Complex Data into Flat Files

In our discussions of so-called "flat files" we've so far been storing, retrieving, and manipulating only that most basic of datatypes: the humble string. What can you do if you want to store more complex data, such as lists, hashes, or deeply nested data structures using references? The answer is to convert whatever it is you want to store into a string. Technically, that's known as marshalling or serializing the data. The Perl Module List[11] has a section that lists several Perl modules that implement data marshalling.
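To see why a simple join() isn't enough once a field holds a reference, consider this tiny sketch (the colon delimiter is arbitrary and not the chapter's exact file format):

    #!/usr/bin/perl -w
    #
    # A tiny illustration: joining fields into a delimited string works for
    # plain scalars, but a nested structure is merely stringified into
    # something like "ARRAY(0x8a2be8)", which is useless for rebuilding
    # the data later.

    my $name      = 'Stonehenge';
    my $districts = [ 'Wiltshire', 'Orkney', 'Dorset' ];    # nested data

    my $record = join ':', $name, $districts;
    print "$record\n";    # e.g. Stonehenge:ARRAY(0x8a2be8)

Serialization solves this by turning the entire structure, references and all, into a string that can be written out and later rebuilt.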
We're going to take a look at two of the most popular modules, Data::Dumper and Storable, and see how we can use them to put some fizz into our flat files. These techniques are also applicable to storing complex Perl data structures in relational databases using the DBI, so pay attention.

2.5.1. The Perl Data::Dumper Module

The Data::Dumper module takes a list of Perl variables and writes their values out in the form of Perl code, which will recreate the original values, no matter how complex, when executed.

This module allows you to dump the state of a Perl program in a readable form quickly and easily. It also allows you to restore the program state by simply executing the dumped code using eval() or do(). The easiest way to describe what happens is to show you a quick example:

    #!/usr/bin/perl -w
    #
    # ch02/marshal/datadumpertest: Creates some Perl variables and dumps them out.
    #                              Then, we reset the values of the variables and
    #                              eval the dumped ones ...

    use Data::Dumper;

    ### Customise Data::Dumper's output style
    ### Refer to Data::Dumper documentation for full details
    if (@ARGV && $ARGV[0] eq 'flat') {
        $Data::Dumper::Indent = 0;
        $Data::Dumper::Useqq  = 1;
    }
    $Data::Dumper::Purity = 1;

    ### Create some Perl variables
    my $megalith  = 'Stonehenge';
    my $districts = [ 'Wiltshire', 'Orkney', 'Dorset' ];

    ### Print them out
    print "Initial Values: \$megalith  = " . $megalith . "\n" .
          "                \$districts = [ " . join(", ", @$districts) . " ]\n\n";

    ### Create a new Data::Dumper object holding those values
    my $dumper = Data::Dumper->new( [ $megalith, $districts ],
                                    [ qw( megalith districts ) ] );

    ### Dump the Perl values out into a variable
    my $dumpedValues = $dumper->Dump();

    ### Show what Data::Dumper has made of the variables!
    print "Perl code produced by Data::Dumper:\n";
    print $dumpedValues . "\n";

    ### Reset the variables to rubbish values
    $megalith  = 'Blah! Blah!';
    $districts = [ 'Alderaan', 'Mordor', 'The Moon' ];

    ### Print out the rubbish values
    print "Rubbish Values: \$megalith  = " . $megalith . "\n" .
          "                \$districts = [ " . join(", ", @$districts) . " ]\n\n";

    ### Eval the dumped code to load up the Perl variables
    eval $dumpedValues;
    die if $@;

    ### Display the re-loaded values
    print "Re-loaded Values: \$megalith  = " . $megalith . "\n" .
          "                  \$districts = [ " . join(", ", @$districts) . " ]\n\n";

    exit;

This example simply initializes two Perl variables and prints their values. It then creates a Data::Dumper object with those values, changes the original values, and prints the new ones just to prove we aren't cheating. Finally, it evals the results of $dumper->Dump(), which stuffs the original stored values back into the variables. Again, we print it all out just to doubly convince you there's no sleight-of-hand going on:

    Initial Values: $megalith  = Stonehenge
                    $districts = [ Wiltshire, Orkney, Dorset ]

    Perl code produced by Data::Dumper:
    $megalith = 'Stonehenge';
    $districts = [ 'Wiltshire', 'Orkney', 'Dorset' ];

    Rubbish Values: $megalith  = Blah! Blah!
                    $districts = [ Alderaan, Mordor, The Moon ]

    Re-loaded Values: $megalith  = Stonehenge
                      $districts = [ Wiltshire, Orkney, Dorset ]

So how do we use Data::Dumper to add fizz to our flat files? Well, first of all we have to ask Data::Dumper to produce flat output, that is, output with no newlines. We do that by setting two package global variables:

    $Data::Dumper::Indent = 0;    # don't use newlines to lay out the output
    $Data::Dumper::Useqq  = 1;    # use double-quoted strings with "\n" escapes

In our test program, we can do that by running the program with flat as an argument. Here's the relevant part of the output when we do that:

    $megalith = "Stonehenge";$districts = ["Wiltshire","Orkney","Dorset"];
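Records in this one-line-of-Perl format are easy to create as well as to read back. Here's a minimal sketch (not the book's example code; the file name megaliths.dat and the field values are made up for illustration) showing a new record being appended and the file being read back with eval:

    #!/usr/bin/perl -w
    #
    # A sketch: append one record to a flat file as a single line of Perl
    # code produced by Data::Dumper, then read each record back with eval.
    # The file name and field values are purely illustrative.

    use Data::Dumper;

    $Data::Dumper::Indent = 0;    # keep each record on a single line
    $Data::Dumper::Useqq  = 1;    # escape any embedded newlines as "\n"
    $Data::Dumper::Purity = 1;

    ### Append a new record as one line of Perl code
    my $newFields = [ "Avebury", "Wiltshire", "SU 103 700",
                      "Stone Circle and Henge",
                      "The largest stone circle in Britain." ];

    open MEGADATA, ">>megaliths.dat"
        or die "Can't append to megaliths.dat: $!\n";
    print MEGADATA Data::Dumper->new( [ $newFields ], [ 'fields' ] )->Dump(), "\n";
    close MEGADATA
        or die "Error closing megaliths.dat: $!\n";

    ### Read every record back by eval'ing each line
    open MEGADATA, "<megaliths.dat"
        or die "Can't open megaliths.dat: $!\n";
    while ( <MEGADATA> ) {
        my $fields;     # the eval'd line assigns an array reference to $fields
        eval $_;
        die if $@;
        print "Read record for $fields->[0]\n";
    }
    close MEGADATA;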
Now we can modify our previous scan (select), insert, update, and delete scripts to use Data::Dumper to format the records instead of the join() or pack() functions we used before. Instead of split() or unpack(), we now use eval to unpack the records.

Here's just the main loop of the update script we used earlier (the rest of the script is unchanged, except for the addition of a use Data::Dumper; line at the top and setting the Data::Dumper variables as described above):

    ### Scan through all the entries for the desired site
    while ( <MEGADATA> ) {

        ### Quick pre-check for maximum performance:
        ### Skip the record if the site name doesn't appear
        next unless m/\Q$siteName/;

        ### Evaluate perl record string to set $fields array reference
        my $fields;
        eval $_;
        die if $@;

        ### Break up the record data into separate fields
        my ( $name, $location, $mapref, $type, $description ) = @$fields;

        ### Skip the record if the extracted site name field doesn't match
        next unless $siteName eq $name;

        ### We've found the record to update

        ### Create a new fields array with new map ref value
        $fields = [ $name, $location, $siteMapRef, $type, $description ];

        ### Convert it into a line of perl code encoding our record string
        $_ = Data::Dumper->new( [ $fields ], [ 'fields' ] )->Dump();
        $_ .= "\n";
    }

So, what have we gained by doing this? We avoid the tedious need to explicitly escape field delimiter characters. Data::Dumper does that for us, and there are no fixed-width field length restrictions either. The big win, though, is the ability to store practically any complex data structure, even object references. There are also some smaller benefits that may be of use to you: undefined (null) field values can be saved and restored, and there's no need for every record to have every field defined (variant records).

The downside? There's always a downside. In this case, it's mainly the extra processing time required both to dump the record data into the strings and for Perl to eval them back again. There is a version of the Data::Dumper module written in C that's much faster, but sadly it doesn't support the $Useqq variable yet. To save time processing each record, the example code has a quick pre-check that skips any rows that don't at least have the desired site name somewhere in them.

There's also the question of security. Because we're using eval to evaluate the Perl code embedded in our data file, it's possible that someone could edit the data file and add code that does something else, possibly harmful. Fortunately, there's a simple fix for this. The Perl ops pragma can be used to restrict the eval to compiling code that contains only simple declarations. For more information on this, see the ops documentation installed with Perl:

    perldoc ops
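If restructuring the whole program around the ops pragma is inconvenient, the Safe module (which is built on the same Opcode machinery) offers a way to confine just the eval of the record strings. The following is only a sketch of that alternative approach, with illustrative record values; it is not the book's example code:

    #!/usr/bin/perl -w
    #
    # A sketch of an alternative to the ops pragma: evaluate record strings
    # inside a Safe compartment rather than with a bare eval. By default the
    # compartment's opcode mask denies operations such as opening files,
    # running external programs, and loading modules (see the Safe and
    # Opcode documentation for the exact default opcode set).

    use Safe;

    my $compartment = Safe->new();

    ### A record as it might appear in the data file (illustrative values)
    my $record = '$fields = ["Avebury","Wiltshire","SU 103 700","Stone Circle"];';

    ### reval() works like eval() but compiles the code under the compartment's
    ### restricted opcode mask; it returns the value of the last expression,
    ### here the array reference assigned to $fields
    my $fields = $compartment->reval( $record );
    die "Bad record: $@" if $@;
    print "Site name: $fields->[0]\n";

    ### Something nasty, on the other hand, is trapped by the opcode mask
    $compartment->reval( 'system "echo this never runs"' );
    print "Blocked: $@" if $@;

The default compartment is already far stricter than a bare eval; if you need finer control, the compartment's permit() and deny() methods let you adjust which opcodes are allowed.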
2.5.2. The Storable Module

In addition to Data::Dumper, there are other data marshalling modules available that you might wish to investigate, including the fast and efficient Storable. The following code takes the same approach as the example we listed for Data::Dumper to show the basic store and retrieve cycle:

    #!/usr/bin/perl -w
    #
    # ch02/marshal/storabletest: Create a Perl hash and store it externally.
    #                            Then, we reset the hash and reload the saved one.

    use Storable qw( freeze thaw );

    ### Create some values in a hash
    my $megalith = {
        'name'     => 'Stonehenge',
        'mapref'   => 'SU 123 400',
        'location' => 'Wiltshire',
    };

    ### Print them out
    print "Initial Values: megalith = $megalith->{name}\n" .
          "                mapref   = $megalith->{mapref}\n" .
          "                location = $megalith->{location}\n\n";

    ### Store the values to a string
    my $storedValues = freeze( $megalith );

    ### Reset the variables to rubbish values
    $megalith = {
        'name'     => 'Flibble Flabble',
        'mapref'   => 'XX 000 000',
        'location' => 'Saturn',
    };

    ### Print out the rubbish values
    print "Rubbish Values: megalith = $megalith->{name}\n" .
          "                mapref   = $megalith->{mapref}\n" .
          "                location = $megalith->{location}\n\n";

    ### Retrieve the values from the string
    $megalith = thaw( $storedValues );

    ### Display the re-loaded values
    print "Re-loaded Values: megalith = $megalith->{name}\n" .
          "                  mapref   = $megalith->{mapref}\n" .
          "                  location = $megalith->{location}\n\n";

    exit;

This program generates the following output, which illustrates that we can store the data away and then retrieve it again intact:

    Initial Values: megalith = Stonehenge
                    mapref   = SU 123 400
                    location = Wiltshire

    Rubbish Values: megalith = Flibble Flabble
                    mapref   = XX 000 000
                    location = Saturn

    Re-loaded Values: megalith = Stonehenge
                      mapref   = SU 123 400
                      location = Wiltshire

Storable also has functions to write and read your data structures directly to and from disk files. It can also be used to write to a file cumulatively instead of writing all records in one atomic operation.
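The file-based functions are straightforward to use. Here's a minimal sketch, assuming a scratch file called megalith.dat (the file name is just for illustration), using nstore() to write a structure straight to disk and retrieve() to read it back:

    #!/usr/bin/perl -w
    #
    # A minimal sketch of Storable's file functions: nstore() writes a data
    # structure to a disk file (in the portable number format) and
    # retrieve() reads it back. The file name is purely illustrative.

    use Storable qw( nstore retrieve );

    my $megalith = {
        'name'     => 'Stonehenge',
        'mapref'   => 'SU 123 400',
        'location' => 'Wiltshire',
    };

    ### Write the structure to disk in one go
    nstore( $megalith, 'megalith.dat' )
        or die "Can't store data in megalith.dat\n";

    ### Read it back into a fresh reference
    my $copy = retrieve( 'megalith.dat' );

    print "Re-loaded: $copy->{name}, $copy->{mapref}, $copy->{location}\n";

For the cumulative style of writing mentioned above, Storable also provides store_fd(), nstore_fd(), and fd_retrieve(), which work on an already open filehandle rather than a file name.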
So far, all this sounds very similar to Data::Dumper, so what's the difference? In a word, speed. Storable is fast, very fast -- both for saving data and for getting it back again. It achieves its speed partly by being implemented in C and hooked directly into the Perl internals, and partly by writing the data in its own very compact binary format.

Here's our update program reimplemented yet again, this time to use Storable:

    #!/usr/bin/perl -w
    #
    # ch02/marshal/update_storable: Updates the given megalith data file
    #                               for a given site. Uses Storable data
    #                               and updates the map reference field.

    use Storable qw( nfreeze thaw );

    ### Check the user has supplied an argument to scan for
    ###   1) The name of the file containing the data
    ###   2) The name of the site to search for
    ###   3) The new map reference
    die "Usage: updatemegadata <data file> <site name> <new map reference>\n"
        unless @ARGV == 3;

    my $megalithFile = $ARGV[0];
    my $siteName     = $ARGV[1];
    my $siteMapRef   = $ARGV[2];
    my $tempFile     = "tmp.$$";

    ### Open the data file for reading, and die upon failure
    open MEGADATA, "<$megalithFile"
        or die "Can't open $megalithFile: $!\n";

    ### Open the temporary megalith data file for writing
    open TMPMEGADATA, ">$tempFile"
        or die "Can't open temporary file $tempFile: $!\n";

    ### Scan through all the entries for the desired site
    while ( <MEGADATA> ) {

        ### Convert the ASCII encoded string back to binary
        ### (pack ignores the trailing newline record delimiter)
        my $frozen = pack "H*", $_;

        ### Thaw the frozen data structure
        my $fields = thaw( $frozen );

        ### Break up the record data into separate fields
        my ( $name, $location, $mapref, $type, $description ) = @$fields;

        ### Skip the record if the extracted site name field doesn't match
        next unless $siteName eq $name;

        ### We've found the record to update

        ### Create a new fields array with new map ref value
        $fields = [ $name, $location, $siteMapRef, $type, $description ];

        ### Freeze the data structure into a binary string
        $frozen = nfreeze( $fields );

        ### Encode the binary string as readable ASCII and append a newline
        $_ = unpack( "H*", $frozen ) . "\n";
    }
    continue {

        ### Write the record out to the temporary file
        print TMPMEGADATA $_
            or die "Error writing $tempFile: $!\n";
    }

    ### Close the megalith input data file
    close MEGADATA;

    ### Close the temporary megalith output data file
    close TMPMEGADATA
        or die "Error closing $tempFile: $!\n";

    ### We now "commit" the changes by deleting the old file ...
    unlink $megalithFile
        or die "Can't delete old $megalithFile: $!\n";

    ### and renaming the new file to replace the old one.
    rename $tempFile, $megalithFile
        or die "Can't rename '$tempFile' to '$megalithFile': $!\n";

    exit 0;

Since the Storable format is binary, we couldn't simply write it directly to our flat file. It would be possible for our record-delimiter character ("\n") to appear within the binary data, thus corrupting the file. We get around this by encoding the binary data as a string of pairs of hexadecimal digits.

You may have noticed that we've used nfreeze() instead of freeze(). By default, Storable writes numeric data in the fastest, simplest native format. The problem is that some computer systems store numbers in a different way from others. Using nfreeze() instead of freeze() ensures that numbers are written in a form that's portable to all systems.

You may also be wondering what one of these records looks like. Well, here's the record for the Castlerigg megalithic site:

    0302000000050a0a436173746c6572696767580a0743756d62726961580a0a4e59203239312032
    3336580a0c53746f6e6520436972636c65580aa34f6e65206f6620746865206c6f76656c696573
    742073746f6e6520636972636c65732072656d61696e696e6720746f6461792e20546869732073
    69746520697320636f6d707269736564206f66206c6172676520726f756e64656420626f756c64
    657273207365742077697468696e2061206e61747572616c20616d706869746865617472652066
    6f726d656420627920737572726f756e64696e672068696c6c732e5858

That's all on one line in the data file; we've just split it up here to fit on the page. It doesn't make for thrilling reading.
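If you ever do need to inspect one of these records by eye, it's easy to decode it back into something readable. Here's a small sketch (the script name and usage line are hypothetical, not from the book) that thaws each record and pretty-prints it with Data::Dumper:

    #!/usr/bin/perl -w
    #
    # A sketch: decode each hex-encoded Storable record in a data file and
    # pretty-print it with Data::Dumper, which is handy when debugging.

    use Storable qw( thaw );
    use Data::Dumper;

    my $megalithFile = shift
        or die "Usage: dumpmegadata <data file>\n";

    open MEGADATA, "<$megalithFile"
        or die "Can't open $megalithFile: $!\n";

    while ( <MEGADATA> ) {
        chomp;                          # remove the record delimiter
        my $frozen = pack "H*", $_;     # hex digits back to binary
        my $fields = thaw( $frozen );   # binary back to an array reference
        print Data::Dumper->new( [ $fields ], [ 'fields' ] )->Dump();
    }

    close MEGADATA;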
This hex-encoded format also doesn't let us use the kind of quick pre-check shortcut that we used with Data::Dumper and the previous flat-file update examples. We could apply the pre-check after converting the hex string back to binary, but there's no guarantee that strings appear literally in the Storable output. They happen to now, but there's always a risk that this will change.

Although we've been talking about Storable in the context of flat files, this technique is also very useful for storing arbitrary chunks of Perl data into a relational database, or any other kind of database for that matter. Storable and Data::Dumper are great tools to carry in your mental toolkit.

2.5.3. Summary of Flat-File Databases

The main benefit of using flat-file databases for data storage is that they can be fast to implement and fast to use on small and straightforward datasets, such as our megalithic database or a Unix password file. The code to query, insert, delete, and update information in the database is also extremely simple, with the parsing code potentially shared among the operations. You have total control over the data file formats, so there are no situations outside your control in which the file format or access API changes. The files are also easy to read in standard text editors (although in the case of the Storable example, they won't make very interesting reading).

The downsides of these databases are quite apparent. As we've mentioned already, the lack of concurrent access limits the power of such systems in a multi-user environment. They also suffer from scalability problems due to the sequential nature of the search mechanism. These limitations can be coded around (the concurrent access problem especially so), but there comes a point where you should seriously consider the use of a higher-level storage manager such as DBM files. DBM files also give you access to indexed data and allow nonsequential querying.

Before we discuss DBM files in detail, the following sections give you examples of more sophisticated management tools and techniques, as well as a method of handling concurrent users.