
17.2. Non-Volatile Databases

Some information is so important that you cannot afford to lose it. Consider the usernames and passwords used to authenticate users. If a person registers at a site that charges a subscription fee, it would be unfortunate if their subscription details were lost the next time the web server was restarted. In this case, the information must be stored in a non-volatile way, and that usually means on disk. Several options are available, ranging from flat files to DBM files to fully-fledged relational databases. Which one you choose will depend on a number of factors, including:

The size of each record and the volume of the data to be stored

The number of concurrent accesses (to the server or even to the same data)

Data complexity (do all the records fit into one row, or are there relations between different kinds of record?)

Budget (some database implementations are great but very expensive)

Failover and backup strategies (how important it is to avoid downtime, how soon the data must be restored in the case of a system failure)

17.2.2. Filesystem Databases

Many people don't realize that in some cases, the filesystem can serve perfectly well as a database. In fact, you are probably using this kind of database every day on your PC—for example, if you store your MP3 files categorized by genres, artists, and albums. If we run:

panic% cd /data/mp3
panic% find .

We can see all the MP3 files that we have under /data/mp3:

./Rock/Bjork/MTV Unplugged/01 - Human Behaviour.mp3
./Rock/Bjork/MTV Unplugged/02 - One Day.mp3
./Rock/Bjork/MTV Unplugged/03 - Come To Me.mp3
...
./Rock/Bjork/Europa/01 - Prologue.mp3
./Rock/Bjork/Europa/02 - Hunter.mp3
...
./Rock/Nirvana/MTV Unplugged/01 - About A Girl.mp3
./Rock/Nirvana/MTV Unplugged/02 - Come As You Are.mp3
...
./Jazz/Herbie Hancock/Head Hunters/01 - Chameleon.mp3
./Jazz/Herbie Hancock/Head Hunters/02 - Watermelon Man.mp3

Now if we want to query which artists we have in the Rock genre, we just need to list the contents of the Rock/ directory. Once we find out that Bjork is one of the artists in the Rock category, we can run another query to find out which Bjork albums we have bought by listing the contents of the Rock/Bjork/ directory. If we want to see the actual MP3 files from a particular album (e.g., MTV Unplugged), we list the files under that album's directory.

What if we want to find all the albums that have MTV in their names? We can use ls to give us all the albums and MP3 files:

panic% ls -l ./*/*/*MTV*

Of course, filesystem manipulation can be done from your Perl program.
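As a minimal sketch of what such queries might look like in Perl, the following uses opendir and readdir to list artists and to find albums by name. The helper names list_subdirs and find_albums are hypothetical, and the genre/artist/album layout is assumed from the listing above:

```perl
use strict;
use warnings;

# Hypothetical helper: return the sorted names of the immediate
# subdirectories of $dir (e.g., the artists within a genre).
sub list_subdirs {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @names = grep { !/^\.\.?$/ && -d "$dir/$_" } readdir($dh);
    closedir($dh);
    return sort @names;
}

# Hypothetical helper: return genre/artist/album paths (relative to
# $root) for every album whose name contains the string $pattern.
sub find_albums {
    my ($root, $pattern) = @_;
    my @albums;
    for my $genre (list_subdirs($root)) {
        for my $artist (list_subdirs("$root/$genre")) {
            push @albums,
                map  { "$genre/$artist/$_" }
                grep { index($_, $pattern) >= 0 }
                list_subdirs("$root/$genre/$artist");
        }
    }
    return @albums;
}

# Example usage; runs only if the directory from the text exists.
my $root = "/data/mp3";
if (-d "$root/Rock") {
    print "Rock artists: ", join(", ", list_subdirs("$root/Rock")), "\n";
    print "$_\n" for find_albums($root, "MTV");
}
```

The sketch reads directories explicitly instead of calling glob, because names such as MTV Unplugged contain spaces, and Perl's glob treats whitespace in its argument as a pattern separator.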

Let's look at another example. If you run a site about rock groups, you might want to store images relating to different groups. Using the filesystem as a database is a perfect match. Chances are these images will be served to users via <img> tags, so it makes perfect sense to use the real path (DocumentRoot considerations aside) to the image. For example:

<img src="/images/rock/ACDC/cover-front.gif" alt="ACDC" ...>
<img src="/images/rock/ACDC/cover-back.gif"  alt="ACDC" ...>

In this example we treat ACDC as a record and cover-front.gif and cover-back.gif as fields. This database implementation, just like the flat-file database, has no special benefits under mod_perl, so we aren't going to expand on the idea, but it's worth keeping in mind.

Too Many Files

There is one thing to beware of: in some operating systems, when too many files (or directories) are stored in a single directory, access can be sluggish. It depends on the filesystem you are using. Some filesystems scan a directory's entries linearly, which is good enough if you have only a few files. Many others index the entries of a directory (files or subdirectories) using hashing algorithms, so lookups stay fast as the directory grows. You should check your filesystem documentation to see how it will behave under load.

If you find that you have lots of files to store and the filesystem implementation won't work too well for you, you can implement your own scheme by spreading the files into an extra layer or two of subdirectories. For example, if your filenames are numbers, you can use something like the following expression:

my $dir = join "/", (split '', sprintf "%02d", $id)[0..1], $id;

So if you want to create a directory 12345, it will be converted into 1/2/12345. The first two digits of the ID name the intermediate directories, so with this expression the directory 12 becomes 1/2/12 and 124 becomes 1/2/124. A single-digit ID such as 7 is first zero-padded to 07, giving 0/7/7. If your IDs have a reasonable distribution, which is often true with numerical data, you end up with two-level hashing. So if you have 10,000 directories to create, each end-level directory will have about 100 subdirectories, which is probably good enough for a fast lookup. If you are going to have many more files, you may need to think about adding more levels.

Also remember that the more levels you add, the more overhead you are adding, since the OS has to search through all the intermediate directories that you have added. Only do that if you really need to. If you aren't sure, and you start with a small number of directories, abstract the resolution of the directories so that in the future you can switch to a hashed implementation or add more levels to the existing one.
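One way to abstract the directory resolution is to hide the scheme behind a single function, so that callers never build paths themselves. The following is only a sketch: the function name id_to_dir and its $levels parameter are hypothetical, and it assumes non-negative integer IDs:

```perl
use strict;
use warnings;

# Hypothetical helper: map a numeric ID to its storage directory.
# $levels is the number of intermediate one-digit directories;
# 0 means a flat layout with no intermediate directories at all.
sub id_to_dir {
    my ($id, $levels) = @_;
    $levels = 2 unless defined $levels;
    return "$id" if $levels == 0;

    # Zero-pad so that short IDs still yield $levels leading digits.
    my @digits = split(//, sprintf("%0${levels}d", $id));
    return join("/", @digits[0 .. $levels - 1], $id);
}

print id_to_dir(12345), "\n";      # two levels (the default)
print id_to_dir(12345, 0), "\n";   # flat layout
```

With the default of two levels, 12345 maps to 1/2/12345 and 7 maps to 0/7/7. Because every caller goes through the one function, moving from a flat layout to a hashed one later means changing the default and relocating the existing directories, not editing every place that constructs a path.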
