2.4. Flat-File DatabasesThe simplest type of database that we can create and manipulate is the old standby, the flat-file database. This database is essentially a file, or group of files, that contains data in a known and standard format that a program scans for the requested information. Modifications to the data are usually done by updating an in-memory copy of the data held in the file, or files, then writing the entire set of data back out to disk. Flat-file databases are typically ASCII text files containing one record of information per line. The line termination serves as the record delimiter. In this section we'll be examining the two main types of flat-file database: files that separate fields with a delimiter character, and files that allocate a fixed length to each field. We'll discuss the pros and cons of each type of data file and give you some example code for manipulating them. The most common format used for flat-file databases is probably the delimited file in which each field is separated by a delimiting character. And possibly the most common of these delimited formats is the comma-separated values (CSV) file, in which fields are separated from one another by commas. This format is understood by many common programs, such as Microsoft Access and spreadsheet programs. As such, it is an excellent base-level and portable format useful for sharing data between applications.[8]
Other popular delimiting characters are the colon ( : ), the tab, and the pipe symbol ( | ). The Unix /etc/passwd file is a good example of a delimited file with each record being separated by a colon. Figure 2-1 shows a single record from an /etc/passwd file. ![]() Figure 2-1. The /etc/passwd file record format2.4.1. Querying DataSince delimited files are a very low-level form of storage manager, any manipulations that we wish to perform on the data must be done using operating system functions and low-level query logic, such as basic string comparisons. The following program illustrates how we can open a data file containing colon-separated records of megalith data, search for a given site, and return the data if found:
For example, running that program with a file containing a record in the following format: Stonehenge:Wiltshire:SU 123 400:Stone Circle and Henge:The most famous stone circle and a search term of Stonehenge would return the following information: Located site: Stonehenge on line 1 Information on Stonehenge ( Stone Circle and Henge ) ==================================================== Location: Wiltshire Map Reference: SU 123 400 Description: The most famous stone circle indicating that our brute-force scan and test for the correct site has worked. As you can clearly see from the example program, we have used Perl's own native file I/O functions for reading in the data file, and Perl's own string handling functions to break up the delimited data and test it for the correct record. The downside to delimited file formats is that if any piece of data contains the delimiting character, you need to be especially careful not to break up the records in the wrong place. Using the Perl split() function with a simple regular expression, as used above, does not take this into account and could produce wrong results. For example, a record containing the following information would cause the split() to happen in the wrong place: Stonehenge:Wiltshire:SU 123 400:Stone Circle and Henge:Stonehenge: The most famous stone circle The easiest quick-fix technique is to translate any delimiter characters in the string into some other character that you're sure won't appear in your data. Don't forget to do the reverse translation when you fetch the records back. Another common way of storing data within flat files is to use fixed-length records in which to store the data. That is, each piece of data fits into an exactly sized space in the data file. In this form of database, no delimiting character is needed between the fields. There's also no need to delimit each record, but we'll continue to use ASCII line termination as a record delimiter in our examples because Perl makes it very easy to work with files line by line. Using fixed-width fields is similar to the way in which data is organized in more powerful database systems such as an RDBMS. The pre-allocation of space for record data allows the storage manager to make assumptions about the layout of the data on disk and to optimize accordingly. For our megalithic data purposes, we could settle on the data sizes of:[10]
Field Required Bytes ----- -------------- Name 64 Location 64 Map Reference 16 Type 32 Description 256 Storing the data in this format requires slightly different storage manager logic to be used, although the standard Perl file I/O functions are still applicable. To test this data for the correct record, we need to implement a different way of extracting the fields from within each record. For a fixed-length data file, the Perl function unpack() is perfect. The following code shows how the unpack() function replaces the split() used above:
Although fixed-length fields are always the same length, the data that is being put into a particular field may not be as long as the field. In this case, the extra space will be filled with a character not normally encountered in the data or one that can be ignored. Usually, this is a space character (ASCII 32) or a nul (ASCII 0). In the code above, we know that the data is space-packed, and so we remove any trailing space from the name record so as not to confuse the search. This can be simply done by using the uppercase A format with unpack(). If you need to choose between delimited fields and fixed-length fields, here are a few guidelines:
2.4.2. Inserting DataInserting data into a flat-file database is very straightforward and usually amounts to simply tacking the new data onto the end of the data file. For example, inserting a new megalith record into a colon-delimited file can be expressed as simply as:
This example simply opens the data file in append mode and writes the new record to the open file. Simple as this process is, there is a potential drawback. This flat-file database does not detect the insertion of multiple items of data with the same search key. That is, if we wanted to insert a new record about Stonehenge into our megalith database, then the software would happily do so, even though a record for Stonehenge already exists. This may be a problem from a data integrity point of view. A more sophisticated test prior to appending the data might be worth implementing to ensure that duplicate records do not exist. Combining the insert program with the query program above is a straightforward approach. Another potential (and more important) drawback is that this system will not safely handle occasions in which more than one user attempts to add new data into the database. Since this subject also affects updating and deleting data from the database, we'll cover it more thoroughly in a later section of this chapter. Inserting new records into a fixed-length data file is also simple. Instead of printing each field to the Perl filehandle separated by the delimiting character, we can use the pack() function to create a fixed-length record out of the data. 2.4.3. Updating DataUpdating data within a flat-file database is where things begin to get a little more tricky. When querying records from the database, we simply scanned sequentially through the database until we found the correct record. Similarly, when inserting data, we simply attached the new data without really knowing what was already stored within the database. The main problem with updating data is that we need to be able to read in data from the data file, temporarily mess about with it, and write the database back out to the file without losing any records. One approach is to slurp the entire database into memory, make any updates to the in-memory copy, and dump it all back out again. A second approach is to read the database in record by record, make any alterations to each individual record, and write the record immediately back out to a temporary file. Once all the records have been processed, the temporary file can replace the original data file. Both techniques are viable, but we prefer the latter for performance reasons. Slurping entire large databases into memory can be very resource-hungry. The following short program implements the latter of these strategies to update the map reference in the database of delimited records:
You can see we've flexed our Perl muscles on this example, using a while ... continue loop to simplify the logic and adding a pretest for increased speed. An equivalent program that can be applied to a fixed-length file is very similar, except that we use a faster in-place update to change the contents of the field. This principle is similar to the in-place query described previously: we don't need to unpack and repack all the fields stored within each record, but can simply update the appropriate chunk of each record. For example:
This technique is faster than packing and unpacking each record stored within the file, since it carries out the minimum amount of work needed to change the appropriate field values. You may notice that the pretest in this example isn't 100% reliable, but it doesn't have to be. It just needs to catch most of the cases that won't match in order to pay its way by reducing the number of times the more expensive unpack and field test gets executed. Okay, this might not be a very convincing application of the idea, but we'll revisit it more seriously later in this chapter. 2.4.4. Deleting DataThe final form of data manipulation that you can apply to flat-file databases is the removal, or deletion, of records from the database. We shall process the file a record at a time by passing the data through a temporary file, just as we did for updating, rather than slurping all the data into memory and dumping it at the end. With this technique, the action of removing a record from the database is more an act of omission than any actual deletion. Each record is read in from the file, tested, and written out to the file. When the record to be deleted is encountered, it is simply not written to the temporary file. This effectively removes all trace of it from the database, albeit in a rather unsophisticated way. The following program can be used to remove the relevant record from the delimited megalithic database when given an argument of the name of the site to delete:
The code to remove records from a fixed-length data file is almost identical. The only change is in the code to extract the field value, as you'd expect: ### Extract the site name (the first field) from the record my ( $name ) = unpack( "A64", $_ ); Like updating, deleting data may cause problems if multiple users are attempting to make simultaneous changes to the data. We'll look at how to deal with this problem a little later in this chapter.
Copyright © 2001 O'Reilly & Associates. All rights reserved. |
|
|