NAME

FsDB - Use the filesystem as a DB

SYNOPSIS

use FsDB;

my %hash;
tie %hash, 'FsDB', "mydb";
$hash{ $key } = $value;

# If you are creating multiple thousands of entries:
tie %hash, 'FsDB', { dir=>"mydb", depth=>1 };

DESCRIPTION

FsDB uses the filesystem as a DBM or more correctly, a persistent key-value store. FsDB will create a file for each value stored in the DBM. The name of the file is a hash of the key.

FsDB uses a directory per database instead of a file. Each value is stored in a unique file. The unique filename is created by hashing the key with "murmur32" in Digest::MurmurHash3. The value is stringified and stored in the file. The opposite operations are done to retrieve a value: the unique filename is created with the hash, the file is read and the value is returned.
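The store and retrieve steps described above can be sketched in a few lines of core Perl. This is only an illustration of the idea, not FsDB's actual code: the helper names key_to_file, store_value and fetch_value are hypothetical, and md5_hex from Digest::MD5 stands in for murmur32 so that the sketch runs without non-core modules.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: map a key to a unique file inside $dir.
# FsDB hashes with murmur32; md5_hex is a stand-in (assumption).
sub key_to_file {
    my( $dir, $key ) = @_;
    return File::Spec->catfile( $dir, md5_hex( $key ) );
}

# Stringify the value and write it to the file named after the key's hash.
sub store_value {
    my( $dir, $key, $value ) = @_;
    make_path( $dir ) unless -d $dir;
    my $file = key_to_file( $dir, $key );
    open my $fh, '>', $file or die "Can't write $file: $!";
    binmode $fh;
    print {$fh} "$value";
    close $fh or die "Can't close $file: $!";
    return;
}

# Rebuild the filename from the key, read the file, return its content.
# Returns nothing if the key was never stored.
sub fetch_value {
    my( $dir, $key ) = @_;
    my $file = key_to_file( $dir, $key );
    return unless -f $file;
    open my $fh, '<', $file or die "Can't read $file: $!";
    binmode $fh;
    local $/;                 # slurp the whole file
    my $value = <$fh>;
    close $fh;
    return $value;
}

my $dir = tempdir( CLEANUP => 1 );   # throwaway "database" directory
store_value( $dir, 'user', 'alice' );
my $got = fetch_value( $dir, 'user' );
```

After this runs, $got holds the string that was stored, and the directory contains exactly one file, named after the hash of 'user'.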

FsDB is not intended to be portable nor distributed.

FsDB's defaults (depth=0) are intended for situations where you have few keys, fewer than a few thousand. An example application would be persistent session information in a web application. Each session would be its own database (aka directory), and you will rarely need to write thousands of distinct keys.

The limit is what your filesystem can handle easily. For example, ext4 is fine up to a few thousand entries. Other filesystems will have different performance limits.

FsDB is surprisingly fast:

Benchmark: timing 5000 iterations of BerkeleyDB, DB_File, FsDB,
                FsDB;depth=1, FsDB;depth=1;primed, GDBM_File, QDBM_File, SQLite_File...
         BerkeleyDB: 62 wallclock secs ( 0.93 usr +  1.23 sys =  2.16 CPU) @ 2314.81/s (n=5000)
            DB_File: 61 wallclock secs ( 0.77 usr +  1.17 sys =  1.94 CPU) @ 2577.32/s (n=5000)
               FsDB:  3 wallclock secs ( 1.47 usr +  1.11 sys =  2.58 CPU) @ 1937.98/s (n=5000)
       FsDB;depth=1:  8 wallclock secs ( 1.80 usr +  1.07 sys =  2.87 CPU) @ 1742.16/s (n=5000)
FsDB;depth=1;primed:  3 wallclock secs ( 1.74 usr +  1.06 sys =  2.80 CPU) @ 1785.71/s (n=5000)
          GDBM_File: 44 wallclock secs ( 0.42 usr +  1.76 sys =  2.18 CPU) @ 2293.58/s (n=5000)
          QDBM_File:  1 wallclock secs ( 0.11 usr +  0.16 sys =  0.27 CPU) @ 18518.52/s (n=5000)
            (warning: too few iterations for a reliable count)
       SQLite_File: 125 wallclock secs ( 5.76 usr +  2.76 sys =  8.52 CPU) @ 586.85/s (n=5000)

The above benchmarks were run on a VM with an ext4 filesystem on a qcow2 disk image. The host uses ext4 on NVMe. None of this really matters, as the operations are small enough to stay in memory buffers/cache.

Small rant

The use of DB or DBM in the name of this module and others like it (DB_File, BerkeleyDB, DBM_File) is misleading. They are in fact persistent key-value stores.

METHODS

TIEHASH

my %hash;
tie %hash, 'FsDB', \%params;
tie %hash, 'FsDB', $dir, [IGNORED]; # compatible with DB_File et al

The first form is preferred. The second form makes FsDB a drop-in replacement for DB_File; any arguments after $dir are ignored.

dir

Directory where the database will be stored. This directory is created if it doesn't exist.

depth

How many levels of subdirectories should be created. depth=0 means that all the files are created in the top directory. depth=1 means that the top directory will contain one level of subdirectories that will themselves contain the files.

The names of subdirectories are created by using 2 characters from the end of the hashed key. For example:

Hash is 02f45789.
depth=0, $dir/02f45789
depth=1, $dir/89/02f45789
depth=2, $dir/89/57/02f45789
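The rule above (take 2 characters from the end of the hash for each level) can be sketched as a small function. The name path_for is hypothetical and not part of FsDB's documented API, and the hash value 'a1b2c3d4' is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build the file path for a hashed key.
# Level 1 uses the last 2 characters of the hash, level 2 the
# 2 characters before those, and so on.
sub path_for {
    my( $dir, $hash, $depth ) = @_;
    die "hash too short for depth $depth"
        if length( $hash ) < 2 * $depth;
    my @subdirs;
    for my $level ( 1 .. $depth ) {
        push @subdirs, substr( $hash, -2 * $level, 2 );
    }
    return join '/', $dir, @subdirs, $hash;
}

my $flat = path_for( 'mydb', 'a1b2c3d4', 0 );   # mydb/a1b2c3d4
my $one  = path_for( 'mydb', 'a1b2c3d4', 1 );   # mydb/d4/a1b2c3d4
my $two  = path_for( 'mydb', 'a1b2c3d4', 2 );   # mydb/d4/c3/a1b2c3d4
```

Each level consumes 2 characters, so an 8-character (32-bit) hash supports at most 4 levels in this sketch.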

OVERLOADING

The following 4 methods are useful if you want to create a subclass and modify the behaviour of FsDB.

__hash

sub __hash 
{
    my( $self, $key ) = @_;
    # ...
}

Allows you to change the hashing algorithm. Please return a string whose length is at least twice "depth", because each level of subdirectory uses 2 characters of it.
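As an illustration, a subclass could swap in SHA-1 from the core Digest::SHA module. This is a sketch under assumptions: the package name My::FsDB is hypothetical, and it relies only on the method signature shown above.

```perl
package My::FsDB;
use strict;
use warnings;
# -norequire lets this sketch compile on its own; in real code
# you would load the module with: use parent 'FsDB';
use parent -norequire, 'FsDB';
use Digest::SHA qw(sha1_hex);

# Return a 40-character hex string instead of the default hash.
sub __hash {
    my( $self, $key ) = @_;
    return sha1_hex( $key );
}

package main;

# Calling the method directly, just to show what it returns.
my $hash = My::FsDB->__hash( 'some key' );
```

A 40-character string is long enough for any depth up to 20 under the 2-characters-per-level rule. Note that sha1_hex expects bytes, so keys containing wide characters would need encoding first.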

set_depth

sub set_depth
{
    my( $self, $depth ) = @_;
    # ...
}

Allows you to change how the depth is set. If you change the hashing algorithm to one that returns more than 32 bits, you might want a depth of more than 4.

__freeze

sub __freeze
{
    my( $self, $data ) = @_;
    # ...
}

Allows you to change the serialization method. Note that $data will be an arrayref: the first element is the key, the second element is the value.

__thaw

sub __thaw
{
    my( $self, $data ) = @_;
    # ...
}

Allows you to change the deserialization method. You should return an arrayref: the first element is the key, the second element is the value.
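A matching __freeze / __thaw pair might look like the following sketch, which stores entries as JSON using the core JSON::PP module. The package name My::JsonFsDB is hypothetical, and the sketch assumes that __thaw receives exactly the string that __freeze returned.

```perl
package My::JsonFsDB;
use strict;
use warnings;
# -norequire lets this sketch compile on its own; in real code
# you would load the module with: use parent 'FsDB';
use parent -norequire, 'FsDB';
use JSON::PP ();

my $json = JSON::PP->new->canonical;

# $data is [ $key, $value ]; return it as a JSON string.
sub __freeze {
    my( $self, $data ) = @_;
    return $json->encode( $data );
}

# Turn the stored JSON string back into [ $key, $value ].
sub __thaw {
    my( $self, $data ) = @_;
    return $json->decode( $data );
}

package main;

# Round trip, calling the methods directly for illustration.
my $frozen = My::JsonFsDB->__freeze( [ 'colour', 'blue' ] );
my $thawed = My::JsonFsDB->__thaw( $frozen );
```

JSON keeps the files human-readable, at the cost of only supporting plain data structures (no objects or code references) as values.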

SEE ALSO

perltie, BerkeleyDB, DB_File, GDBM_File, QDBM_File

AUTHOR

Philip Gwyn, <gwyn -AT- cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2023 by Philip Gwyn

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.26.3 or, at your option, any later version of Perl 5 you may have available.