NAME

DataStore::CAS::FS::DirCodec - Abstract base class for directory encoder/decoders

VERSION

version 0.010100_02

SYNOPSIS

my $file= $cas->get($digest_hash);
my $dir= DataStore::CAS::FS::DirCodec->load($file);

DESCRIPTION

DataStore::CAS::FS stores directories as files, so they need to be serialized and deserialized. I wanted better efficiency than a plain key/value serialization, but I also wanted the format to be flexible, future-proof, and easily cross-platform. In the end I decided on a pluggable implementation, where a "Universal" plugin uses something plain like JSON, and more specialized plugins do things like storing an array of UNIX 'stat' fields. Users can also write their own specialized codecs and get the features and performance they need while still using the rest of the code unaltered. This also provides an easy path for people to contribute new codecs to the project.

These are the current implementations:

Universal

DataStore::CAS::FS::DirCodec::Universal stores all metadata of each DirEnt using JSON. If you use this codec, you are guaranteed that anything your CAS::FS::Scanner picked up was saved into the CAS.

Minimal

DataStore::CAS::FS::DirCodec::Minimal stores only the type, filename, and content reference, and results in a very compact serialization. Use this one if you don't care about permissions and just want enough information for a quick content backup. It is ideal for making micro-backups between large comprehensive backups.

Unix

DataStore::CAS::FS::DirCodec::Unix stores bare 'stat' entries for each file. It isn't so rigid as to use fixed-width fields, so it should serve any unix-like architecture with similar stat() fields.

Planned...

Eventually there will also be a DirCodec::UnixAttr if you want to store ACLs and Extended Attributes, a DirCodec::DosFat for FAT16/32, and a DirCodec::Windows for ACL-based Windows permissions. Patches welcome.

Your Own

It is very easy to write your own directory serializer! See the section on "EXTENDING".

For large directories, it is possible with this API to write an indexed directory format, where you encode your own b-tree or something in each directory, and then read it on demand as the user requests entries by name.

DIR AND DIRENT OBJECTS

(mentioned here for emphasis)

All Dir objects are intended to be immutable, as are the DirEnt objects they index. They are also cached by DataStore::CAS::FS, so modifying them could cause problems. Don't do that.

If you want to make changes to a DirEnt, use

$entry= $entry->clone( %overrides );
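
As an illustration of the clone-with-overrides pattern, here is a minimal sketch. DirEntSketch is a hypothetical stand-in class written for this example, not the real DirEnt implementation; it only shows that the original object is left untouched while the copy carries the changed fields.

```perl
package DirEntSketch;    # hypothetical stand-in, NOT the real DirEnt class
use strict;
use warnings;

sub new {
    my ($class, %fields) = @_;
    return bless { %fields }, $class;
}

sub name { $_[0]{name} }
sub type { $_[0]{type} }

# Build a new object from the old fields plus the overrides.
# The original object is never modified.
sub clone {
    my ($self, %overrides) = @_;
    return bless { %$self, %overrides }, ref $self;
}

package main;

my $entry   = DirEntSketch->new(name => 'a.txt', type => 'file');
my $renamed = $entry->clone(name => 'b.txt');
```

After this runs, $entry still reports its original name, and $renamed is a separate object with the new one.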

METHODS

load

$dir= $class->load( $file | \%params )

This factory method reads the first few bytes of $file (which must be an instance of DataStore::CAS::File) to determine which codec to use. (But see the 'data' parameter, below.)

The appropriate codec's ->decode method will then be invoked, if available.

The method can be called with just the file, or with a hashref of parameters.

Parameters:

file

Passing a single $file is equivalent to passing { file => $file }. It specifies the CAS item to read the serialized directory from.

format

If you know the format ahead of time, you may specify it to prevent load() from needing to read the $file. Most directory codecs will read it immediately anyway, though.

format must be one of the registered formats. See "register_format".

handle

If you already opened the file for some reason, you can let the directory re-use your handle. Be warned that the directory will seek to the start of the file first. Also beware that some directory implementations might hold onto the handle and seek around on it as the user iterates the directory.

data

If you already have the full data of the $file, you can supply it to the codec to prevent any I/O activity. You might choose this if you were trying to use the library in a non-blocking or event driven application.
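
The two calling conventions described above (a lone $file, or a hashref of parameters) can be reduced to one form before any real work happens. The following is only a sketch of that idea; normalize_load_params is a hypothetical helper name invented for this example and is not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "a lone file object" into { file => $file },
# and pass a plain hashref of parameters through unchanged.
# A blessed file object is never mistaken for a parameter hash, because
# ref() returns its class name rather than 'HASH'.
sub normalize_load_params {
    my ($arg) = @_;
    return ref $arg eq 'HASH' ? $arg : { file => $arg };
}
```

Returning the caller's own hashref (rather than a copy) is a deliberate choice in this sketch, so that later steps could store things like an opened handle back into it.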

put

$class->put( $cas, $format, \@entries, \%metadata )

Store an array of directory entries, and optionally some directory metadata, into the $cas, encoded in $format.

Returns the digest_hash of the new item.

decode

$dir= $self->decode( \%params )

Same parameters as "load", except they are guaranteed to be a hashref, and it should be assumed that this codec is the correct one to decode the directory.

encode

$self->encode( \@entries, \%metadata )

Encode an array of directory entries, and attach the optional metadata to the encoded directory. Each item of @entries may be either a ::DirEnt object or a hashref of fields.

Codecs should assert that each item has a 'type' and 'name' attribute.

Codecs should check whether 'name' and 'ref' contain InvalidUTF8 objects when encoding, and restore those objects during decode.

Should return a scalar of the serialized directory.

register_format

$class->register_format( $format_id => $codec )

Registers a directory codec to be available to the factory method 'load'.

$format_id is a scalar. Lowercase strings are reserved for the DataStore::CAS distribution, and custom modules are encouraged to use their full package name as the $format_id.

$codec is any object implementing encode and decode. It should probably be a subclass of DirCodec to take advantage of helper methods.

While the system could have been designed to auto-load classes on demand, that seemed like a bad idea because it would allow the contents of the CAS to load perl modules. With this design, codecs must be manually registered (usually during 'require' or 'use') before you will be able to decode or encode with them. All the directory codecs in the standard distribution of DataStore::CAS are enabled by default.
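
To make the "explicit registration, no auto-loading" design concrete, here is a minimal sketch of such a registry. The function names register_format_sketch and codec_for_format, and the My::Codec class, are hypothetical names made up for this example; the real module's internals may look quite different.

```perl
use strict;
use warnings;
use Carp 'croak';
use Scalar::Util 'blessed';

my %codec_for;    # format id => codec object

sub register_format_sketch {
    my ($format_id, $codec) = @_;
    croak "codec for '$format_id' must be an object implementing encode and decode"
        unless blessed($codec) && $codec->can('encode') && $codec->can('decode');
    $codec_for{$format_id} = $codec;
    return 1;
}

sub codec_for_format {
    my ($format_id) = @_;
    # No auto-loading: an unknown id is an error, never a 'require'.
    my $codec = $codec_for{$format_id}
        or croak "no codec registered for format '$format_id'";
    return $codec;
}

# A do-nothing codec, registered under its full package name.
{
    package My::Codec;
    sub new    { return bless {}, shift }
    sub encode { return '' }
    sub decode { return }
}

register_format_sketch('My::Codec' => My::Codec->new);
```

Because lookups only ever consult the hash, data read from the CAS can name a format but can never cause a module to be loaded.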

EXTENDING

In order to write your own directory codec, all you need to do is implement 'encode' and 'decode'.

encode API

An encoder receives an array of directory entries, and an optional hashref of metadata. The metadata should be stored as-is. The directory entries can be stored however you like, and you may choose to store only a subset of fields. Be sure to warn users in your documentation if you ignore fields.

The entries can be a mix of DirEnt objects and plain hashrefs. You should ensure that each one has a name and a type, that the type is valid, and that the names aren't duplicated.

Your encoder should attempt to provide a "stable" encoding, so that if it is called with the same parameters twice, it will return the exact same bytes. This likely means you need to sort the directory entries, and that you need to serialize hashrefs by iterating their keys in a fixed (for example, sorted) order, because perl does not guarantee any particular key order.

Your encoded string must be octets (not unicode).
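
The requirements above can be sketched with the core JSON::PP module. This is not how any of the bundled codecs work; encode_entries_sketch and the { metadata, entries } layout are assumptions made for this example. It accepts plain hashrefs only, and it does not check that the type is valid, which a real codec should do.

```perl
use strict;
use warnings;
use Carp 'croak';
use JSON::PP ();

# Hypothetical encoder: plain hashref entries only (no DirEnt objects),
# and no validation of the 'type' value.
sub encode_entries_sketch {
    my ($entries, $metadata) = @_;
    my (%seen, @rows);
    for my $e (@$entries) {
        my %fields = %$e;
        defined $fields{$_} or croak "entry is missing '$_'" for qw(name type);
        croak "duplicate name '$fields{name}'" if $seen{ $fields{name} }++;
        push @rows, \%fields;
    }
    # Sort entries so the input order does not affect the output.
    @rows = sort { $a->{name} cmp $b->{name} } @rows;
    # canonical() emits hash keys in sorted order; utf8() returns octets.
    return JSON::PP->new->canonical->utf8->encode(
        { metadata => $metadata || {}, entries => \@rows }
    );
}

my $bytes = encode_entries_sketch(
    [ { name => 'b.txt', type => 'file' }, { name => 'a', type => 'dir' } ],
    { comment => 'example' },
);
```

Calling it again with the same entries in a different order should produce the same bytes, which is the property a content-addressable store cares about.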

decode API

A decoder takes a file (or a handle, or a scalar with all the data in it) and attempts to build a DataStore::CAS::FS::Dir object which views the directory entries.

See the Universal codec for an example of how to decode from a plain scalar, and the Unix codec for an example of how to read through the stream record by record.

You can use the default directory class DataStore::CAS::FS::Dir, or write your own. The default one requires the list of DirEnt objects to be built first, but you could theoretically write an implementation that decodes the entries on demand.
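
As a rough sketch of the "decode on demand" idea, the class below indexes raw records by name up front but only parses a record the first time it is requested. Everything here is hypothetical: the LazyDirSketch class, its get_entry method, and the record layout (one "name TAB json" record per line, so names containing tabs or newlines are not supported) were invented for this example.

```perl
package LazyDirSketch;    # hypothetical, NOT DataStore::CAS::FS::Dir
use strict;
use warnings;
use JSON::PP ();

sub new {
    my ($class, $data) = @_;
    my %raw;
    for my $line (split /\n/, $data) {
        my ($name, $json) = split /\t/, $line, 2;
        next unless defined $json;    # skip malformed records
        $raw{$name} = $json;          # keep the undecoded text for now
    }
    return bless { raw => \%raw, decoded => {} }, $class;
}

# Decode an entry the first time it is asked for, then cache the result.
sub get_entry {
    my ($self, $name) = @_;
    return undef unless exists $self->{raw}{$name};
    return $self->{decoded}{$name}
        ||= JSON::PP->new->utf8->decode($self->{raw}{$name});
}

package main;

my $dir = LazyDirSketch->new(
    qq{a.txt\t{"type":"file","ref":"1234"}\nsub\t{"type":"dir"}\n}
);
my $ent = $dir->get_entry('a.txt');    # only this record is parsed
```

A real indexed format would avoid even the initial pass over the data, but the caching shape would be similar.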

UTILITY METHODS

_magic_number

$str= $class->_magic_number()

Returns a string that all serialized directories start with. This is a constant and should never change.

_calc_header_length

$len= $class->_calc_header_length( $format )

The header length is directly determined by the format string. This method returns the header length in bytes. A directory's encoded data begins at this offset.

_read_format

$fmt_string= $class->_read_format( \%params )

This method inspects the first few bytes of $params{file} to read the format string, which it returns. It uses $params{data} if available, otherwise $params{handle}; if neither is available, it opens a new handle to the file and stores that handle back into $params.
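
To show how a header length can follow directly from the format string, here is a sketch using a made-up header layout. The magic string 'EXAMPLE_DIR', the layout (magic, space, two hex digits giving the format length, space, format, newline), and all three function names are assumptions for illustration only; consult the module source for the real header.

```perl
use strict;
use warnings;
use Carp 'croak';

my $MAGIC = 'EXAMPLE_DIR';    # placeholder, NOT the module's real magic number

# Hypothetical layout: "<magic> <2 hex digits: format length> <format>\n"
sub make_header {
    my ($format) = @_;
    croak "format too long for this layout" if length($format) > 255;
    return sprintf("%s %02X %s\n", $MAGIC, length($format), $format);
}

# magic + space + 2 hex digits + space + format + newline
sub calc_header_length {
    my ($format) = @_;
    return length($MAGIC) + 1 + 2 + 1 + length($format) + 1;
}

# Recover the format string from the first bytes of the serialized data.
sub read_format {
    my ($data) = @_;
    my $prefix_len = length($MAGIC) + 4;
    my $prefix     = substr($data, 0, $prefix_len);
    $prefix =~ /\A\Q$MAGIC\E ([0-9A-F]{2}) \z/ or croak "bad magic number";
    my $fmt_len = hex $1;
    length($data) >= $prefix_len + $fmt_len + 1 or croak "truncated header";
    return substr($data, $prefix_len, $fmt_len);
}

my $header = make_header('My::Format');
my $format = read_format($header . 'encoded entries follow here');
```

With a layout like this, a decoder knows where the payload starts as soon as it knows the format string.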

_readall

$class->_readall( $handle, $buf, $count, $offset )

A small wrapper around 'read()' which croaks if it can't read the full requested number of bytes, and properly handles EINTR and EAGAIN and partial reads.

Always returns true.
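
For readers curious what such a wrapper involves, here is a minimal sketch of a read-everything loop. It is not the module's implementation; readall_sketch is a hypothetical name, and retrying immediately on EAGAIN is a simplification (a non-blocking application would want to wait for readability instead of spinning).

```perl
use strict;
use warnings;
use Carp 'croak';

# Read exactly $count bytes into the caller's buffer ($_[1]) at $offset,
# looping over partial reads. Croaks on EOF or on a real error.
sub readall_sketch {
    my ($fh, undef, $count, $offset) = @_;
    $offset = 0 unless defined $offset;
    my $total = 0;
    while ($total < $count) {
        my $got = read($fh, $_[1], $count - $total, $offset + $total);
        if (!defined $got) {
            next if $!{EINTR} || $!{EAGAIN};    # interrupted: just retry
            croak "read failed: $!";
        }
        croak "unexpected end of file after $total of $count bytes"
            if $got == 0;
        $total += $got;
    }
    return 1;
}

# Usage with an in-memory filehandle
my $source = 'abcdef';
open(my $fh, '<', \$source) or die "open failed: $!";
my $buf = '';
readall_sketch($fh, $buf, 4, 0);
```

Passing the buffer through $_[1] lets the function fill the caller's variable in place, the same way the built-in read() does.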

AUTHOR

Michael Conrad <mconrad@intellitree.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by Michael Conrad, and IntelliTree Solutions llc.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.