SAGE - Perl Practicum - DB or Not DB?

Perl Practicum: DB or Not DB?

by Hal Pomeranz

While Perl5 is largely backwards compatible with Perl4, certain Perl4 functions have been deprecated in favor of more generic Perl5 functionality. A prime example of this is the binding of associative arrays to DBM or NDBM databases. The new Perl5 OO syntax provides a clean mechanism for binding Perl data objects to a variety of UNIX database formats.

The Brave New World

In the Perl4 world, you would normally:

        # deprecated Perl4 syntax
        dbmopen(%hash, "/home/hal/mydata",0644) ||
             die "Can't open database!\n";

This would associate %hash with the contents of the named NDBM (or DBM on older UNIXes) database. If your system did not have support from either database format, then this call generated a fatal error - not exactly "graceful degradation."

The Perl5 library now includes (again, assuming your system supports these formats).pm files for NDBM and SDBM databases. It is easy enough to add support for Berkeley DB and GNU GDBM file formats (documentation is included in the Perl5 source tree), though you will have to end up remaking Perl.

Here is a simple Perl5 function for interfacing with an NDBM file:

        use NDBM_File;
        $DBM_INSERT = 0;     # system dependent &
        $DBM_REPLACE = 1;    # not in NDBM_File.pm

        sub open_db {
             my($filename) = @_;
             tie(%HASH, NDBM_File, $filename, $DBM_REPLACE, 0644);
        }

The first line loads the NDBM library NDBM_File.pm (note that the file name is not quoted and does not include the.pm extension.) The two constants defined next should really appear in NDBM_File.pm; as a workaround, we define them near the top of our program for ease of maintenance.

The interesting piece, however, is this new tie() function in the open_db() subroutine. In Perl5, tie() is a generic function for associating a data object with a package of methods for operating on that object. If that last sentence is in no way meaningful to you, just take it on faith that this is the way we deal with NDBM style files nowadays.

In a database context, the first argument to tie() is the associative array we are going to link to the database. The next argument should be the name of the package which contains the functions for operating on the database - invariably this is the same name that we put in the use statement. The rest of the arguments are whatever is expected to be handed to the C function call which opens your database (check the manual page if you are not sure what the appropriate arguments are).

In the case of NDBM files, the arguments are a filename, a special flag value ($DBM_INSERT means you can insert new records but not overwrite existing records, $DBM_REPLACE means you can do both), and a mode value for when the database files need to be created. When an NDBM database is created, two files are actually generated: one called filename.dir and one called filename.pag.

The corresponding close_db() function is trivial:

        sub close_db {
             untie(%HASH);
        }

untie() simply breaks the link between the associative array and the database file. As a side effect, the file is closed after any changes have been flushed out to disk.

Manipulating Data

It is almost never a good idea to use keys() or values() on arrays associated with NDBM files. If the file is large, you will get back a huge list and wait a long time. If you need to iterate over an entire database, use each():

        open_db("/home/hal/mydata");
        while ((key, value) = each(%HASH)) {
             # do work here
        }
        close_db();

Note that you must never add elements to the database while you are in the middle of looping with each(). If you do, you may end up looping forever or generating a fatal error.

One of the (many) problems with NDBM style databases is that you can only associate a single data item with each key. Suppose, however, that we wanted to associate many pieces of information with a particular key. For example, suppose my database contained information about hosts at my site: for each host I would want to include its IP address, room location, owner, type of machine, and OS. One simple method would be to place a delimiter between each of these fields and use join() and split() to insert and extract information (pack() and unpack() work well if your data is of fixed size):

        # insert
        $HASH{"myserver"} = join("::", $addr, $rm, $who, $type, $os);

        # extract @info = split(/::/, $HASH{"myserver"});

If you want to be really tricky, you can marshal your data into a self-extracting format that you can eval() when you want to extract it. Here is a simple case:

        # original information
        %record = ("host" => "myserver",
             "address" => "172.16.15.1",
             "room" => "130",
             "owner" => "Bob Smith",
             "type" => "Sparc 10",
             "os" => "Solaris 2.4");

        # marshal and insert $data = "\%record=(";
        foreach $key (keys(%record)) { $data .= "'$key'=>'$record{$key}',";
        }
        $data .= ");";
        $HASH{$record{"host"}} = $data;

        # extract and print
        eval("$HASH{'myserver'}");
        foreach $key (keys(%record)) {
             print "$key\t$record{$key}\n";
        }

Note that marshalling arbitrary data, e.g., hashes with non-scalar data values, generally requires a library of recursive functions.

Locking

File locking is an issue for any serious database application. Here are basic file locking and unlocking routines that you can use:

        use Fcntl;

        sub lock_file {
             my($file, $type) = @_;
             my($flag, $struct);

             open($file, ">> $file") || return(undef);
             $flag = ($type eq "r") ? F_RDLCK : F_WRLCK;
             $struct = pack("ssx32", $flag, 0);
             return(fcntl($file, F_SETLKW, $struct));
        }

        sub unlock_file {
             my($file) = @_;
             my($struct, $status);

             $struct = pack("ssx32", F_UNLCK, 0);
             $status = fcntl($file,F_SETLKW, $struct);
             close($file);
             return($status);
        }

The careful reader will note a dirty trick happening with indirect file handles (for more information on indirect file handles consult the Perl documentation or my earlier column " Know All the Angles"). The functions themselves are straightforward: lock_file() expects a filename and a string which indicates the type of lock (r for a read lock and anything else for a write lock), while unlock_file() just takes a file name. We use fcntl() style locking since this type of locking works across networked file systems.

I recommend using a file other than the database files themselves for locking purposes. This way you avoid problems when first instantiating the database and when other processes pull the database out from under you while you are blocking for a lock. Note also that NDBM style databases have no concept row or table locking: you end up having to lock the entire NDBM file. This is another area in which these databases are inferior to modern relational database systems.

The Right Tool For the Job

For "quick and dirty" applications or for small prototypes, NDBM style databases may be the way to go. Certainly, Perl makes it natural to interface with these databases. Relational database technology will almost certainly be required for mission critical applications, but the huge connection overhead makes relational databases unsuitable for short-lived applications (CGI scripts are a good example of this class of application). You must analyze the requirements for your application, but you should try the new Perl5 syntax. It is now extremely easy to work with these database objects.

Reproduced from ;login: Vol. 20 No. 6, December 1995.

Need help? Use our Contacts page.

Last changed: May 24, 1997 pc

Perl index

Publications index

USENIX home