NAME
DataStore::CAS - Abstract base class for Content Addressable Storage
VERSION
version 0.0200
SYNOPSIS
# Create a new CAS which stores everything in plain files.
my $cas= DataStore::CAS::Simple->new(
path => './foo/bar',
create => 1,
digest => 'SHA-256',
);
# Store content, and get its hash code
my $hash= $cas->put_scalar("Blah");
# Retrieve a reference to that content
my $file= $cas->get($hash);
# Inspect the file's attributes
$file->size < 1024*1024 or die "Use a smaller file";
# Open a handle to that file (possibly returning a virtual file handle)
my $handle= $file->open;
my @lines= <$handle>;
DESCRIPTION
This module lays out a very straightforward API for Content Addressable Storage.
Content Addressable Storage is a concept where a file is identified by a one-way message digest checksum of its content (usually called a "hash"). With a good message digest algorithm, one checksum will statistically only ever refer to one file, even though the number of possible checksums is tiny compared to the number of byte sequences they can represent.
Perl uses the term 'hash' to refer to a mapping of key/value pairs, which creates a little confusion. The documentation of this and related modules tries to use the phrase "digest hash" to clarify when we are referring to the output of a digest function vs. a Perl key/value mapping.
In short, a CAS is a key/value mapping where small-ish keys are determined from large-ish data but no two pieces of data will ever end up with the same key, thanks to astronomical probabilities. You can then use the small-ish key as a reference to the large chunk of data, as a sort of compression technique.
PURPOSE
One great use for CAS is finding and merging duplicated content. If you take two identical files (which you didn't know were identical) and put them both into a CAS, you will get back the same hash, telling you that they are the same. Also, the file will only be stored once, saving disk space.
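The deduplication effect can be illustrated without any storage engine at all. The following is a minimal sketch, not this module's implementation: toy_put and the in-memory %store are hypothetical stand-ins, and SHA-256 via the core Digest::SHA module is assumed as the digest algorithm.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Toy in-memory CAS: a plain Perl hash keyed by the digest hash of the content.
my %store;

# Hypothetical helper: store $content under its digest hash and return the key.
sub toy_put {
    my ($content) = @_;
    my $digest_hash = sha256_hex($content);
    # Identical content maps to the same key, so it is stored only once.
    $store{$digest_hash} //= $content;
    return $digest_hash;
}

my $hash_a = toy_put("same bytes\n");
my $hash_b = toy_put("same bytes\n");        # a duplicate we did not know about
my $hash_c = toy_put("different bytes\n");

print "duplicate detected\n" if $hash_a eq $hash_b;
printf "%d entries stored\n", scalar keys %store;
```

Three puts leave only two entries in the toy store, because the two identical inputs share one key.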
Another great use for CAS is the ability for remote systems to compare an inventory of files and see which ones are absent on the other system. This has applications in backups and content distribution.
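Comparing inventories reduces to a set difference over digest hashes. A minimal sketch, assuming each system can produce a list of the hashes it holds; the hash values and the missing_from helper below are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical inventories: the digest hashes each system reports holding.
my @local_hashes  = qw(aa11 bb22 cc33 dd44);
my @remote_hashes = qw(bb22 dd44 ee55);

# Return the entries of @$have that do not appear in @$other.
sub missing_from {
    my ($have, $other) = @_;
    my %seen = map { $_ => 1 } @$other;
    return grep { !$seen{$_} } @$have;
}

my @to_send  = missing_from(\@local_hashes,  \@remote_hashes);
my @to_fetch = missing_from(\@remote_hashes, \@local_hashes);

print "send:  @to_send\n";
print "fetch: @to_fetch\n";
```

Only the hashes travel between systems during the comparison; file content is transferred afterwards, and only for the entries that are actually absent.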
ATTRIBUTES
digest
Read-only. The name of the digest algorithm being used.
Subclasses must set this during their constructor.
hash_of_null
The digest hash of the empty string. The cached result of
$cas->put('', { dry_run => 1 })
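For reference, the digest of the empty string can be computed directly with the core Digest::SHA module. This sketch assumes SHA-256 and a hex-encoded result; how a particular CAS subclass encodes its digest hashes is not stated here, so treat the encoding as an assumption.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# SHA-256 of zero bytes, hex-encoded (the encoding is an assumption).
my $hash_of_null = sha256_hex('');
print "$hash_of_null\n";
```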
METHODS
get
$cas->get( $digest_hash )
Returns a DataStore::CAS::File object for the given hash, if the hash exists in storage. Else, returns undef.
This method is pure-virtual and must be implemented in the subclass.
put
$cas->put( $thing, \%optional_flags )
Convenience method. Inspects $thing and passes it off to a more specific method. If you want more control over which method is called, call it directly.
Scalars are passed to "put_scalar".
Instances of DataStore::CAS::File or Path::Class::File are passed to "put_file".
Globrefs or instances of IO::Handle are passed to "put_handle".
Dies if it encounters anything else.
The return value is the digest hash of the stored data.
See "new_write_handle" for the discussion of flags.
Example:
my $stats= {};
$cas->put("abcdef", { stats => $stats });
# Perl does not expand '~' in file names, so spell out the home directory
$cas->put(IO::File->new("$ENV{HOME}/file",'r'), { stats => $stats });
$cas->put(\*STDIN, { stats => $stats });
$cas->put(Path::Class::file("$ENV{HOME}/file"), { stats => $stats });
use Data::Printer;
p $stats;
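The dispatch rules listed for "put" can be sketched as a small classifier. This is an illustration of the rules as documented, not the module's actual code: classify_thing is a hypothetical helper that only reports which method would be chosen.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed reftype);
use IO::Handle;

# Hypothetical helper: name the put_* method that "put" would pick for $thing.
sub classify_thing {
    my ($thing) = @_;
    # Plain (non-reference) scalars are stored as literal strings.
    return 'put_scalar' unless ref $thing;
    if (blessed $thing) {
        return 'put_file'
            if $thing->isa('DataStore::CAS::File')
            || $thing->isa('Path::Class::File');
        return 'put_handle' if $thing->isa('IO::Handle');
    }
    # Unblessed globrefs such as \*STDIN are treated as handles.
    return 'put_handle' if (reftype($thing) // '') eq 'GLOB';
    die "Don't know how to store a " . ref($thing) . "\n";
}

print classify_thing("abcdef"), "\n";
print classify_thing(\*STDIN), "\n";
print classify_thing(IO::Handle->new), "\n";
```

Anything else, such as an array reference, falls through to the die, matching the "Dies if it encounters anything else" rule.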
put_scalar
$cas->put_scalar( $scalar, \%optional_flags )
Puts the literal string "$scalar" into the CAS. If $scalar is a Unicode string, it is first encoded as UTF-8 bytes. Beware that when you later call "get", reading from the filehandle will give you those bytes and not the original Unicode scalar; you must decode them yourself.
Returns the digest hash of the array of bytes.
See "new_write_handle" for the discussion of flags.
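The character/byte distinction can be seen with the core Encode module alone. A minimal sketch of the round trip a caller has to perform; no CAS is involved, and $bytes simply stands for what would be stored and read back.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = "caf\x{e9} \x{263a}";         # a Unicode string of 6 characters
my $bytes = encode('UTF-8', $text);       # the byte string that would be stored

# Reading the content back yields bytes; decoding restores the characters.
my $round_trip = decode('UTF-8', $bytes);

printf "%d characters, %d bytes\n", length($text), length($bytes);
print "round trip ok\n" if $round_trip eq $text;
```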
put_file
$digest_hash= $cas->put_file( $filename, \%optional_flags );
$digest_hash= $cas->put_file( $Path_Class_File, \%optional_flags );
$digest_hash= $cas->put_file( $DataStore_CAS_File, \%optional_flags );
Insert a file from the filesystem, or from another CAS instance. Default implementation simply opens the named file, and passes it to put_handle.
Returns the digest hash of the data stored.
See "new_write_handle" for the discussion of flags.
Additional flags:
- hardlink => $bool
If hardlink is true, and the CAS is backed by plain files, it will hardlink the file directly into the CAS.
This reduces the integrity of your CAS; use with care. You can use the "validate" method later to check for corruption.
- known_hashes => \%algorithm_digests
If you already know the hash of your file, and don't want to re-calculate it, pass a hashref like
{ $algorithm_name => $digest_hash }
for this flag, and if this CAS is using one of those algorithms, it will use the hash you specified instead of re-calculating it. This reduces the integrity of your CAS; use with care.
- reuse_hash
This is a shortcut for known_hashes if you specify an instance of DataStore::CAS::File. It builds a known_hashes of one item using the source CAS's digest algorithm.
Note: A good use of these flags is to transfer files from one instance of DataStore::CAS::Simple to another.
my $file= $cas1->get($hash);
$cas2->put($file, { hardlink => 1, reuse_hash => 1 });
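A known_hashes hashref could be prepared as sketched below. The assumptions are loud ones: the algorithm name 'SHA-256' is taken from the SYNOPSIS, a hex-encoded digest is assumed, and the temporary file merely stands in for a file whose digest you already computed. The actual put_file call is left as a comment because it needs a real $cas.

```perl
use strict;
use warnings;
use Digest::SHA;
use File::Temp qw(tempfile);

# Stand-in for a file whose digest was computed earlier.
my ($fh, $filename) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} "example content\n";
close $fh or die "close: $!";

# Digest the file in binary mode; hex encoding is an assumption.
my $digest_hash = Digest::SHA->new(256)->addfile($filename, 'b')->hexdigest;

# Flags in the documented { $algorithm_name => $digest_hash } shape.
my %flags = ( known_hashes => { 'SHA-256' => $digest_hash } );

# $cas->put_file($filename, \%flags);   # would go here, given a real $cas
print "known hash: $flags{known_hashes}{'SHA-256'}\n";
```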
put_handle
$digest_hash= $cas->put_handle( \*HANDLE | IO::Handle, \%optional_flags );
Reads from $io_handle and stores into the CAS. Calculates the digest hash of the data as it goes. Dies on any I/O errors.
Returns the calculated hash when complete.
See "new_write_handle" for the discussion of flags.
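Calculating the digest "as it goes" means feeding each chunk to the digest object as it is read, rather than loading the whole input first. A minimal sketch using the core Digest::SHA module and an in-memory handle; the 4096-byte chunk size is arbitrary, and a real store would also write each chunk to its backing storage.

```perl
use strict;
use warnings;
use Digest::SHA;

my $data = "line one\nline two\n" x 1000;
open(my $in, '<', \$data) or die "open: $!";
binmode $in;

my $digest = Digest::SHA->new(256);
my $total  = 0;
while (1) {
    my $got = read($in, my $buf, 4096);
    die "read: $!" unless defined $got;    # die on I/O errors
    last if $got == 0;                     # end of input
    $digest->add($buf);                    # update the running digest
    $total += $got;
}
my $digest_hash = $digest->hexdigest;

print "$total bytes hashed\n";
```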
new_write_handle
$handle= $cas->new_write_handle( %flags )
Get a new handle for writing to the Store. The data written to this handle will be saved to a temporary file as the digest hash is calculated.
When done writing, call either $cas->commit_write_handle( $handle ) (or the alias $handle->commit()), which returns the hash of all data written. The handle will no longer be valid.
If you free the handle without committing it, the data will not be added to the CAS.
The optional 'flags' hashref can contain a wide variety of parameters, but these are supported by all CAS subclasses:
- dry_run => $bool
Setting "dry_run" to true will calculate the hash of the $thing, but not store it.
- stats => \%stats_out
Setting "stats" to a hashref will instruct the CAS implementation to return information about the operation, such as number of bytes written, compression strategies used, etc. The statistics are returned within that supplied hashref. Values in the hashref are amended or added to, so you may use the same stats hashref for multiple calls and then see the summary for all operations when you are done.
commit_write_handle
my $handle= $cas->new_write_handle();
print $handle $data;
$cas->commit_write_handle($handle);
This closes the given write-handle, finishes calculating its digest hash, and stores the data into the CAS (unless the handle was created with the dry_run flag). It returns the digest hash of the data.
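The write-then-commit life cycle can be sketched with a temporary file and a running digest. This is an illustration of the idea only: toy_new_write, toy_write and toy_commit are hypothetical helpers, and the layout (one file per digest hash in a single directory) is an assumption, not how any particular storage engine works.

```perl
use strict;
use warnings;
use Digest::SHA;
use File::Temp qw(tempdir tempfile);

my $store_dir = tempdir(CLEANUP => 1);

# Hypothetical stand-in for new_write_handle: a temp file plus a running digest.
sub toy_new_write {
    my ($fh, $tmp_name) = tempfile(DIR => $store_dir);
    binmode $fh;
    return { fh => $fh, tmp => $tmp_name, digest => Digest::SHA->new(256) };
}

# Write data to the temp file and feed the same bytes to the digest.
sub toy_write {
    my ($w, $data) = @_;
    print { $w->{fh} } $data or die "write: $!";
    $w->{digest}->add($data);
}

# Hypothetical stand-in for commit_write_handle: close, finish the digest,
# then move the temp file to a name derived from the digest hash.
sub toy_commit {
    my ($w) = @_;
    close $w->{fh} or die "close: $!";
    my $digest_hash = $w->{digest}->hexdigest;
    rename $w->{tmp}, "$store_dir/$digest_hash" or die "rename: $!";
    return $digest_hash;
}

my $w = toy_new_write();
toy_write($w, "hello ");
toy_write($w, "world\n");
my $digest_hash = toy_commit($w);
print "stored as $digest_hash\n";
```

Until the commit step runs, the data exists only under a temporary name, which is why an uncommitted handle never becomes part of the store.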
validate
$bool_valid= $cas->validate( $digest_hash, \%optional_flags )
Validate an entry of the CAS. This is used to detect whether the storage has become corrupt. Returns 1 if the hash checks out, 0 if it fails, and undef if the hash doesn't exist.
Like the "put" method, you can pass a hashref in $flags{stats} which will receive information about the file. This can be used to implement mark/sweep algorithms for cleaning out the CAS by asking the CAS for all other digest_hashes referenced by $digest_hash.
The default implementation simply reads the file and re-calculates its hash, which should be optimized by subclasses if possible.
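The "re-read and re-hash" strategy, with its three possible results, can be sketched against a toy in-memory store. toy_validate and %store are hypothetical; SHA-256 via the core Digest::SHA module is assumed.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Toy store: digest hash => content.
my %store;
$store{ sha256_hex($_) } = $_ for "alpha\n", "beta\n";

# Hypothetical stand-in for validate: 1 if intact, 0 if corrupt, undef if absent.
sub toy_validate {
    my ($digest_hash) = @_;
    return undef unless exists $store{$digest_hash};   # deliberate tri-state
    return sha256_hex($store{$digest_hash}) eq $digest_hash ? 1 : 0;
}

my $good = sha256_hex("alpha\n");
my $bad  = sha256_hex("beta\n");
$store{$bad} = "bet4\n";                   # simulate on-disk corruption

for my $digest_hash ($good, $bad, '0' x 64) {
    my $result = toy_validate($digest_hash);
    printf "%s... => %s\n", substr($digest_hash, 0, 8),
        defined $result ? $result : 'undef';
}
```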
delete
$bool_happened= $cas->delete( $digest_hash, %optional_flags )
DO NOT USE THIS METHOD UNLESS YOU UNDERSTAND THE CONSEQUENCES
This method is supplied for completeness... however it is not appropriate to use in many scenarios. Some storage engines may use referencing, where one file is stored as a diff against another file, or one file is composed of references to others. It can be difficult to determine whether a given digest_hash is truly no longer used.
The safest way to clean up a CAS is to create a second CAS and migrate the items you want to keep from the first to the second; then delete the original CAS. See the documentation on the storage engine you are using to see if it supports an efficient way to do this. For instance, DataStore::CAS::Simple can use hard-links on supporting filesystems, resulting in a very efficient copy operation.
If no efficient mechanisms are available, then you might need to write a mark/sweep algorithm and then make use of 'delete'.
Returns true if the item was actually deleted.
The optional 'flags' hashref can contain a wide variety of parameters, but these are supported by all CAS subclasses:
- dry_run => $bool
Setting "dry_run" to true will run a simulation of the delete operation, without actually deleting anything.
- stats => \%stats_out
Setting "stats" to a hashref will instruct the CAS implementation to return information about the operation within that supplied hashref. Values in the hashref are amended or added to, so you may use the same stats hashref for multiple calls and then see the summary for all operations when you are done.
Keys reported in the stats hashref:
- delete_count
The number of official entries deleted.
- delete_missing
The number of entries that didn't exist.
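A mark/sweep pass built on a delete operation might look like the sketch below. Everything here is a toy: %store, toy_delete and the shortened hash strings are hypothetical, and whether a real engine counts dry-run deletions in delete_count is an assumption made for the example.

```perl
use strict;
use warnings;

# Toy store: digest hash => content (keys shortened for readability).
my %store = (aa11 => 'kept', bb22 => 'orphan', cc33 => 'kept too');

# Hypothetical stand-in for $cas->delete, honoring dry_run and stats.
sub toy_delete {
    my ($digest_hash, $flags) = @_;
    my $stats = $flags->{stats} || {};
    unless (exists $store{$digest_hash}) {
        $stats->{delete_missing}++;
        return 0;
    }
    delete $store{$digest_hash} unless $flags->{dry_run};
    $stats->{delete_count}++;
    return 1;
}

# Mark: everything the application still references.
my %marked = map { $_ => 1 } qw(aa11 cc33);

# Sweep: delete whatever was not marked, accumulating into one stats hashref.
my %stats;
for my $digest_hash (sort keys %store) {
    toy_delete($digest_hash, { stats => \%stats }) unless $marked{$digest_hash};
}
toy_delete('ee55', { stats => \%stats });   # an entry that never existed

print "deleted: $stats{delete_count}, missing: $stats{delete_missing}\n";
```

The hard part in practice is the mark phase: as noted above, a storage engine may keep internal references between entries that the application cannot see.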
iterator
$iter= $cas->iterator( \%optional_flags )
while (defined ($digest_hash= $iter->())) { ... }
Iterate the contents of the CAS. Returns a perl-style coderef iterator which returns the next digest_hash string each time you call it. Returns undef at end of the list.
Flags:
- prefix
Specify a prefix for all the returned digest hashes. This acts as a filter. You can use this to imitate Git's feature of identifying an object by a portion of its hash instead of having to type the whole thing. You will probably need more digits though, because you're searching the whole CAS, and not just commit entries.
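The coderef-iterator convention and the prefix filter can be sketched with a closure over a sorted list. toy_iterator and the shortened hash strings are hypothetical; a real engine would walk its storage lazily instead of filtering a list up front.

```perl
use strict;
use warnings;

# Toy inventory of digest hashes (shortened for readability).
my @digest_hashes = sort qw(a1b2c3 a1ff00 b7e411 a1b2d9);

# Hypothetical stand-in for $cas->iterator: returns a coderef that yields the
# next matching digest hash on each call, then undef at the end of the list.
sub toy_iterator {
    my $flags  = shift || {};
    my $prefix = defined $flags->{prefix} ? $flags->{prefix} : '';
    my @matches = grep { index($_, $prefix) == 0 } @digest_hashes;
    my $i = 0;
    return sub { $i < @matches ? $matches[$i++] : undef };
}

my $iter = toy_iterator({ prefix => 'a1b2' });
my @found;
while (defined(my $digest_hash = $iter->())) {
    push @found, $digest_hash;
}
print "@found\n";
```

If the prefix matches exactly one entry, it identifies that entry uniquely, which is the Git-style abbreviation described above.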
open_file
$handle= $cas->open_file( $fileObject, \%optional_flags )
Open the File object (returned by "get") and return a readable and seekable filehandle to it. The filehandle might be a perl filehandle, or might be a tied object implementing the filehandle operations.
Flags:
- layer (TODO)
When implemented, this will allow you to specify a Perl I/O layer, like 'raw' or 'utf8'. This is equivalent to calling 'binmode' with that argument on the filehandle. Note that returned handles are 'raw' by default.
HANDLE OBJECTS
The handles returned by open_file and new_write_handle are compatible with both the old GLOBREF style functions and the new IO::Handle API. In other words, you can use either
$handle->read($buffer, 100)
or
read($handle, $buffer, 100)
So they are nicely compatible with other libraries you might use. It is unlikely that they are actually real handles though, so you probably can't sysread/syswrite on them. You can find out by checking "fileno($handle)". One notable exception is DataStore::CAS::Simple->open_file, which always returns a direct filehandle to the underlying file.
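Both calling styles, and the fileno check, can be tried on an ordinary in-memory handle, which resembles a virtual handle in that it has no file descriptor behind it. This sketch uses only core Perl; the handle here is not one returned by open_file.

```perl
use strict;
use warnings;
use IO::Handle;

my $content = "0123456789";
open(my $handle, '<', \$content) or die "open: $!";

my ($first, $second);
$handle->read($first, 4);      # IO::Handle method style
read($handle, $second, 4);     # built-in function style

# In-memory handles report a fileno of -1: no real descriptor, so no sysread.
my $fd      = fileno($handle);
my $is_real = defined $fd && $fd >= 0;

print "first=$first second=$second\n";
print $is_real ? "real file descriptor\n" : "virtual handle\n";
```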
AUTHOR
Michael Conrad <mconrad@intellitree.com>
COPYRIGHT AND LICENSE
This software is copyright (c) 2013 by Michael Conrad, and IntelliTree Solutions llc.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.