NAME

DataStore::CAS - Abstract base class for Content Addressable Storage

VERSION

version 0.03

SYNOPSIS

# Create a new CAS which stores everything in plain files.
my $cas= DataStore::CAS::Simple->new(
  path   => './foo/bar',
  create => 1,
  digest => 'SHA-256',
);

# Store content, and get its hash code
my $hash= $cas->put_scalar("Blah");

# Retrieve a reference to that content
my $file= $cas->get($hash);

# Inspect the file's attributes
$file->size < 1024*1024 or die "Use a smaller file";

# Open a handle to that file (possibly returning a virtual file handle)
my $handle= $file->open;
my @lines= <$handle>;

DESCRIPTION

This module lays out a very straightforward API for Content Addressable Storage.

Content Addressable Storage is a concept where a file is identified by a one-way message digest checksum of its content (usually called a "hash"). With a good message digest algorithm, one checksum will statistically only ever refer to one file, even though the number of possible checksums is tiny compared to the number of possible byte sequences they can represent.

Perl uses the term 'hash' to refer to a mapping of key/value pairs, which creates a little confusion. The documentation of this and related modules tries to use the phrase "digest hash" to clarify when we are referring to the output of a digest function vs. a perl key-value mapping.

In short, a CAS is a key/value mapping where small-ish keys are determined from large-ish data, and where the odds of two different pieces of data ending up with the same key are astronomically small. You can then use the small-ish key as a reference to the large chunk of data, as a sort of compression technique.
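The key/value idea can be sketched with nothing more than a perl hash and the core Digest::SHA module. This is only an illustration of the concept, not how any DataStore::CAS backend is implemented; %store, toy_put and toy_get are hypothetical names.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Hypothetical in-memory store: digest hash => content
my %store;

# Store content under the hex digest of its bytes and return that digest.
sub toy_put {
    my ($bytes) = @_;
    my $digest_hash = sha256_hex($bytes);
    $store{$digest_hash} //= $bytes;   # identical content is kept only once
    return $digest_hash;
}

# Look content up by digest hash; undef if it was never stored.
sub toy_get {
    my ($digest_hash) = @_;
    return $store{$digest_hash};
}

my $key_a = toy_put("Blah");
my $key_b = toy_put("Blah");            # same bytes, so the same key
my $key_c = toy_put("Something else");  # different bytes, different key
```

Storing "Blah" twice yields one key and one stored copy, which is the de-duplication property described under PURPOSE.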

PURPOSE

One great use for CAS is finding and merging duplicated content. If you take two identical files (which you didn't know were identical) and put them both into a CAS, you will get back the same hash, telling you that they are the same. Also, the file will only be stored once, saving disk space.

Another great use for CAS is the ability for remote systems to compare an inventory of files and see which ones are absent on the other system. This has applications in backups and content distribution.
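As a rough sketch of the inventory comparison, assume each system can produce a list of the digest hashes it holds. The lists and the missing_from helper below are hypothetical, and the short strings stand in for real digest hashes.

```perl
use strict;
use warnings;

# Hypothetical inventories: digest hashes present on each system.
my @local  = qw( aaa111 bbb222 ccc333 );
my @remote = qw( bbb222 ddd444 );

# Return the entries of @$have that do not appear in @$other.
sub missing_from {
    my ($have, $other) = @_;
    my %seen = map { $_ => 1 } @$other;
    return grep { !$seen{$_} } @$have;
}

my @to_send  = missing_from(\@local,  \@remote);  # absent on the remote side
my @to_fetch = missing_from(\@remote, \@local);   # absent on the local side
```

Only the digest hashes need to cross the network to decide what to transfer; the content itself is sent afterwards, and only for the entries that are missing.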

ATTRIBUTES

digest

Read-only. The name of the digest algorithm being used.

Subclasses must set this during their constructor.

The algorithm should be available from the Digest module, or else the subclass will need to provide a few additional methods like "calculate_hash".

hash_of_null

The digest hash of the empty string.

calculate_hash

Return the hash of a scalar (or scalar ref) in memory.

calculate_file_hash

Return the hash of a file on disk.
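For orientation, here is roughly what these three values look like when computed directly with the core Digest module. It is a sketch that assumes the "digest" attribute is 'SHA-256'; scalar_digest and file_digest are hypothetical helpers, not the module's own implementation.

```perl
use strict;
use warnings;
use Digest;
use File::Temp qw(tempfile);

my $algorithm = 'SHA-256';   # assumed value of the "digest" attribute

# Digest of a scalar held in memory.
sub scalar_digest {
    my ($bytes) = @_;
    return Digest->new($algorithm)->add($bytes)->hexdigest;
}

# Digest of a file on disk, read in binary mode.
sub file_digest {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open($path): $!";
    my $hex = Digest->new($algorithm)->addfile($fh)->hexdigest;
    close $fh;
    return $hex;
}

# The digest of the empty string, i.e. what "hash_of_null" would hold.
my $empty = scalar_digest('');

# Write the same bytes to a temporary file and hash both forms.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "Blah";
close $tmp_fh or die "close: $!";

my $from_memory = scalar_digest("Blah");
my $from_disk   = file_digest($tmp_path);
```

The same bytes give the same digest hash whether they are hashed from memory or from disk.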

METHODS

get

$cas->get( $digest_hash )

Returns a DataStore::CAS::File object for the given hash, if the hash exists in storage. Else, returns undef.

This method is pure-virtual and must be implemented in the subclass.

put

$cas->put( $thing, \%optional_flags )

Convenience method. Inspects $thing and passes it off to a more specific method. If you want more control over which method is called, call it directly.

The %optional_flags can contain a wide variety of parameters, but these are supported by all CAS subclasses:

dry_run => $bool

Setting "dry_run" to true will calculate the hash of the $thing, and go through the motions of writing it, but not store it.

known_hashes => \%digest_hashes
{ known_hashes => { SHA1 => '0123456789...' } }

Use this to skip calculation of the hash. The hashes are keyed by Digest name, so it is safe to use even when the store being written to might not use the same digest that was already calculated.

Of course, using this feature can corrupt your CAS if you don't ensure that the hash is correct.

stats => \%stats_out

Setting "stats" to a hashref will instruct the CAS implementation to return information about the operation, such as number of bytes written, compression strategies used, etc. The statistics are returned within that supplied hashref. Values in the hashref are amended or added to, so you may use the same stats hashref for multiple calls and then see the summary for all operations when you are done.

The return value is the digest hash of the stored data, regardless of whether it was already present in the CAS.

Example:

use IO::File;
use Path::Class;
use Data::Printer;

my $stats= {};
$cas->put("abcdef", { stats => $stats });
$cas->put(\$large_buffer, { stats => $stats });
# Perl does not expand "~" in paths, so build the path from $ENV{HOME}
$cas->put(IO::File->new("$ENV{HOME}/file",'r'), { stats => $stats });
$cas->put(\*STDIN, { stats => $stats });
$cas->put(Path::Class::file("$ENV{HOME}/file"), { stats => $stats });
p $stats;

put_scalar

$cas->put_scalar( $scalar, \%optional_flags )
$cas->put_scalar( \$scalar, \%optional_flags )

Puts the literal string "$scalar" into the CAS, or the scalar pointed to by a scalar-ref. (A scalar-ref can help by avoiding a copy of a large scalar.) The scalar must be a string of bytes; you get an exception if any character has a codepoint above 255.

Returns the digest hash of the string of bytes.

See "put" for the discussion of %flags.
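The byte-string requirement on put_scalar means that text should be encoded explicitly before it is stored. The sketch below shows one way to check a scalar and to encode it with the core Encode module; is_byte_string is a hypothetical helper, and UTF-8 is just one possible choice of encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{E9} \x{263A}";   # contains U+263A, a codepoint above 255

# True if every character fits in one byte. utf8::downgrade with a true
# second argument returns false instead of dying when it cannot convert.
sub is_byte_string {
    my ($copy) = @_;               # work on a copy; downgrade modifies in place
    return utf8::downgrade($copy, 1) ? 1 : 0;
}

# Encoding explicitly yields a string of bytes that is safe to store.
my $bytes = encode('UTF-8', $text);
```

Remember to decode with the same encoding after reading the content back.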

put_file

$digest_hash= $cas->put_file( $filename, \%optional_flags );
$digest_hash= $cas->put_file( $Path_Class_File, \%optional_flags );
$digest_hash= $cas->put_file( $DataStore_CAS_File, \%optional_flags );

Insert a file from the filesystem, or from another CAS instance. Default implementation simply opens the named file, and passes it to put_handle.

Returns the digest hash of the data stored.

See "put" for the discussion of standard %flags.

Additional flags:

move => $bool

If move is true, and the CAS is backed by plain files on the same filesystem, it will move the file into the CAS, possibly changing its owner and permissions. Even if the file can't be moved, put_file will attempt to unlink it, and die on failure.

hardlink => $bool

If hardlink is true, and the CAS is backed by plain files on the same filesystem, and the file has the same owner and permissions as the destination CAS, it will hardlink the file directly into the CAS.

This reduces the integrity of your CAS; use with care. You can use the "validate" method later to check for corruption.

reuse_hash => $bool

This is a shortcut for known_hashes if you specify an instance of DataStore::CAS::File. It builds a known_hashes of one item using the source CAS's digest algorithm.

Note: A good use of these flags is to transfer files from one instance of DataStore::CAS::Simple to another.

my $file= $cas1->get($hash);
$cas2->put($file, { hardlink => 1, reuse_hash => 1 });

put_handle

$digest_hash= $cas->put_handle( \*HANDLE | IO::Handle, \%optional_flags );

Reads from the given handle and stores the data into the CAS, calculating the digest hash as it goes. Does not seek on the handle, so if you supply a handle that is not at the start of the file, only the remainder of the file will be added and hashed. Dies on any I/O errors.

Returns the calculated hash when complete.

See "put" for the discussion of flags.
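The "does not seek" behaviour of put_handle can be illustrated with a plain read loop over an in-memory handle. This is a sketch using the core Digest::SHA module, not the module's actual code; digest_remaining is a hypothetical helper.

```perl
use strict;
use warnings;
use Digest::SHA;

# Hash whatever remains on $fh, starting at its current position.
sub digest_remaining {
    my ($fh) = @_;
    my $sha = Digest::SHA->new(256);
    my $buf;
    while (1) {
        my $n = read($fh, $buf, 4096);
        die "read failed: $!" unless defined $n;   # die on I/O errors
        last if $n == 0;                           # end of file
        $sha->add($buf);
    }
    return $sha->hexdigest;
}

my $data = "header\nbody\n";
open my $fh, '<', \$data or die "open: $!";
my $first_line  = <$fh>;                   # consume "header\n"
my $rest_digest = digest_remaining($fh);   # covers only "body\n"
```

If the whole file is wanted, seek the handle back to the start yourself before passing it in.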

new_write_handle

$handle= $cas->new_write_handle( %flags )

Get a new handle for writing to the Store. The data written to this handle will be saved to a temporary file as the digest hash is calculated.

When done writing, call either $cas->commit_write_handle( $handle ) (or the alias $handle->commit()), which returns the hash of all data written. The handle will no longer be valid.

If you free the handle without committing it, the data will not be added to the CAS.

The optional 'flags' hashref can contain a wide variety of parameters, but these are supported by all CAS subclasses:

dry_run => $bool

Setting "dry_run" to true will calculate the hash of the data written to the handle, but not store it.

stats => \%stats_out

Setting "stats" to a hashref will instruct the CAS implementation to return information about the operation, such as number of bytes written, compression strategies used, etc. The statistics are returned within that supplied hashref. Values in the hashref are amended or added to, so you may use the same stats hashref for multiple calls and then see the summary for all operations when you are done.

Write handles will probably be an instance of FileCreatorHandle.

commit_write_handle

my $handle= $cas->new_write_handle();
print $handle $data;
$cas->commit_write_handle($handle);

This closes the given write-handle, and then finishes calculating its digest hash, and then stores it into the CAS (unless the handle was created with the dry_run flag). It returns the digest_hash of the data.

validate

$bool_valid= $cas->validate( $digest_hash, \%optional_flags )

Validate an entry of the CAS. This is used to detect whether the storage has become corrupt. Returns 1 if the hash checks out ok, 0 if it fails, and undef if the hash doesn't exist.

Like the "put" method, you can pass a hashref in $flags{stats} which will receive information about the file. This can be used to implement mark/sweep algorithms for cleaning out the CAS by asking the CAS for all other digest_hashes referenced by $digest_hash.

The default implementation simply reads the file and re-calculates its hash, which should be optimized by subclasses if possible.
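The read-and-rehash approach, with its three possible return values, can be sketched against a toy in-memory store. The %store hash and toy_validate are hypothetical and only mirror the behaviour described above.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Hypothetical in-memory store: digest hash => content
my %store = map { sha256_hex($_) => $_ } ("alpha", "beta");

# 1 if the content still matches its key, 0 if not, undef if absent.
sub toy_validate {
    my ($digest_hash) = @_;
    return undef unless exists $store{$digest_hash};
    return sha256_hex($store{$digest_hash}) eq $digest_hash ? 1 : 0;
}

my $good    = sha256_hex("alpha");
my $damaged = sha256_hex("beta");
$store{$damaged} = "bet\0";   # simulate corruption of the stored content
```

Callers need to distinguish 0 from undef with defined(), since both are false.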

delete

$bool_happened= $cas->delete( $digest_hash, %optional_flags )

DO NOT USE THIS METHOD UNLESS YOU UNDERSTAND THE CONSEQUENCES

This method is supplied for completeness... however it is not appropriate to use in many scenarios. Some storage engines may use referencing, where one file is stored as a diff against another file, or one file is composed of references to others. It can be difficult to determine whether a given digest_hash is truly no longer used.

The safest way to clean up a CAS is to create a second CAS and migrate the items you want to keep from the first to the second; then delete the original CAS. See the documentation on the storage engine you are using to see if it supports an efficient way to do this. For instance, DataStore::CAS::Simple can use hard-links on supporting filesystems, resulting in a very efficient copy operation.

If no efficient mechanisms are available, then you might need to write a mark/sweep algorithm and then make use of 'delete'.

Returns true if the item was actually deleted.

The optional 'flags' hashref can contain a wide variety of parameters, but these are supported by all CAS subclasses:

dry_run => $bool

Setting "dry_run" to true will run a simulation of the delete operation, without actually deleting anything.

stats => \%stats_out

Setting "stats" to a hashref will instruct the CAS implementation to return information about the operation within that supplied hashref. Values in the hashref are amended or added to, so you may use the same stats hashref for multiple calls and then see the summary for all operations when you are done.

delete_count

The number of official entries deleted.

delete_missing

The number of entries that didn't exist.
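For readers who do need a mark/sweep cleanup, the general shape is sketched below over a toy reference graph. Everything here is hypothetical: %refs stands in for whatever your application knows about which entries reference which, and the sweep only mimics the "delete_count" statistic rather than calling a real CAS.

```perl
use strict;
use warnings;

# Hypothetical store: digest hash => list of digest hashes it references.
my %refs = (
    root  => [ 'tree' ],
    tree  => [ 'blob1' ],
    blob1 => [],
    blob2 => [],          # unreachable from any root
);

# Mark: collect everything reachable from the given roots.
sub mark {
    my (@pending) = @_;
    my %live;
    while (defined(my $id = shift @pending)) {
        next if $live{$id}++;                  # already visited
        push @pending, @{ $refs{$id} || [] };
    }
    return \%live;
}

# Sweep: delete entries that were not marked, accumulating into $stats.
sub sweep {
    my ($live, $stats) = @_;
    for my $id (sort keys %refs) {
        next if $live->{$id};
        delete $refs{$id};
        $stats->{delete_count}++;
    }
    return;
}

my %stats;
sweep(mark('root'), \%stats);
```

With a real CAS, consider running the sweep with "dry_run" first and inspecting the stats before deleting anything.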

iterator

$iter= $cas->iterator( \%optional_flags )
while (defined ($digest_hash= $iter->())) { ... }

Iterate the contents of the CAS. Returns a perl-style coderef iterator which returns the next digest_hash string each time you call it. Returns undef at end of the list.

Flags:

prefix

Specify a prefix for all the returned digest hashes. This acts as a filter. You can use this to imitate Git's feature of identifying an object by a portion of its hash instead of having to type the whole thing. You will probably need more digits though, because you're searching the whole CAS, and not just commit entries.
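The coderef iterator convention and the effect of "prefix" can be sketched with a closure over a plain list. make_iterator and the short strings standing in for digest hashes are hypothetical; a real backend would walk its own storage instead.

```perl
use strict;
use warnings;

# Build a coderef iterator over a list of digest hashes, optionally
# restricted to those starting with $flags->{prefix}.
sub make_iterator {
    my ($hashes, $flags) = @_;
    $flags ||= {};
    my $prefix = defined $flags->{prefix} ? $flags->{prefix} : '';
    my @queue  = grep { index($_, $prefix) == 0 } sort @$hashes;
    return sub { return shift @queue };   # undef once the list is exhausted
}

my @all  = qw( 3f9a01 3f9b22 a0c4ee );
my $iter = make_iterator(\@all, { prefix => '3f9' });

my @found;
while (defined(my $digest_hash = $iter->())) {
    push @found, $digest_hash;
}
```

If a prefix matches more than one entry, as it does here, the caller has to ask for more digits to identify a single object.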

open_file

$handle= $cas->open_file( $fileObject, \%optional_flags )

Open the File object (returned by "get") and return a readable and seekable filehandle to it. The filehandle might be a perl filehandle, or might be a tied object implementing the filehandle operations.

Flags:

layer (TODO)

When implemented, this will allow you to specify a Perl I/O layer, like 'raw' or 'utf8'. This is equivalent to calling 'binmode' with that argument on the filehandle. Note that returned handles are 'raw' by default.

AUTHOR

Michael Conrad <mconrad@intellitree.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2022 by Michael Conrad, and IntelliTree Solutions llc.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.