NAME
File::ContentStore - A store for file content built with hard links
VERSION
version 1.001
SYNOPSIS
use File::ContentStore;
# the 'path' argument is expected to exist
my $store = File::ContentStore->new( path => "$ENV{HOME}/.photo_content" );
$store->link_dir( @collection_of_photo_directories );
DESCRIPTION
This module manages a content store as a collection of hard links to a set of files. The files in the content store are named after the digest of the content in the file.
When linking a new file to the content store, a hard link is created to the file, named after the digest of the content. When a file whose content is already in the store is linked in, the file is replaced by a hard link to the content file in the store.
Example and detailed operation
For a more complete definition of a hard link, see https://en.wikipedia.org/wiki/Hard_link.
Assume we have a directory containing the following files: file1 (inode 123456), file2 (inode 456789) and file3 (inode 789012, content identical to file1). In the examples below, files are sorted by inode.
After linking file1 into the content store, we have the following:
Directory        Content store
---------        -------------
[123456] file1   [123456] d4/1d/8cd98f00b279d1c00998ecf8427e
[456789] file2
[789012] file3
After linking file2:
Directory        Content store
---------        -------------
[123456] file1   [123456] d4/1d/8cd98f00b279d1c00998ecf8427e
[456789] file2   [456789] 8a/80/52e7a4f99c54b966a74144fe5761
[789012] file3
And finally, after linking file3, we have this:
Directory        Content store
---------        -------------
[123456] file1   [123456] d4/1d/8cd98f00b279d1c00998ecf8427e
[123456] file3
[456789] file2   [456789] 8a/80/52e7a4f99c54b966a74144fe5761
i.e. the inode that was holding the content of file3 is lost, and the name now points to the same inode as file1 and its content file.
file1 and file3 are now hard linked (or aliased) together, so any change made to one of them is in fact made to both. Note also that the disk space taken by duplicate files is reclaimed when they are linked through the content store.
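The walkthrough above can be reproduced with core Perl modules only. The following is a minimal sketch, not the module's actual implementation: the helper name sketch_link_file, the use of MD5, the single two-character sub-directory and the temporary ".tmp" name are all assumptions made for illustration.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Basename qw( dirname );
use File::Path qw( make_path );
use File::Spec;
use File::Temp qw( tempdir );

# Hypothetical helper (not the module's code): link $file into $store_dir.
sub sketch_link_file {
    my ( $store_dir, $file ) = @_;

    # Name the content file after the digest of the file content.
    open my $fh, '<:raw', $file or die "Can't open $file: $!";
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    my $content = File::Spec->catfile( $store_dir, substr( $hex, 0, 2 ),
        substr( $hex, 2 ) );

    if ( !-e $content ) {
        # New content: the content file becomes a second name for $file.
        make_path( dirname($content) );
        link $file, $content or die "Can't link $file to $content: $!";
    }
    elsif ( ( stat $file )[1] != ( stat $content )[1] ) {
        # Known content: replace $file with a hard link to the content file.
        my $tmp = "$file.tmp";
        link $content, $tmp or die "Can't link $content to $tmp: $!";
        rename $tmp, $file or die "Can't rename $tmp to $file: $!";
    }
    return $content;
}

# Three files in a scratch directory; file3 duplicates file1.
my $dir   = tempdir( CLEANUP => 1 );
my $store = File::Spec->catdir( $dir, 'store' );
my %data  = ( file1 => "same\n", file2 => "other\n", file3 => "same\n" );
for my $name ( sort keys %data ) {
    my $path = File::Spec->catfile( $dir, $name );
    open my $out, '>', $path or die "Can't write $path: $!";
    print {$out} $data{$name};
    close $out or die "Can't close $path: $!";
}

# Link them one by one and record the inode behind each name.
my %inode;
for my $name ( sort keys %data ) {
    my $path = File::Spec->catfile( $dir, $name );
    sketch_link_file( $store, $path );
    $inode{$name} = ( stat $path )[1];
}
print "file1 and file3 share an inode\n" if $inode{file1} == $inode{file3};
print "file2 has its own inode\n"        if $inode{file2} != $inode{file1};
```

Linking to a temporary name and renaming it over the original means the file name never disappears, even if the process is interrupted between the two steps.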
If the goal is deduplication and hard-linking of identical files, once all the files have been linked through the content store, the content store is not needed any more, and can be deleted.
Note that permissions are attached to the inode, not to the individual file names. A file linked into the content store therefore sets the initial permissions of the content file if that content file does not exist yet, and otherwise inherits the permissions of the existing content file.
ATTRIBUTES
path
The location of the directory where the content files are stored. (Required.)
digest
The algorithm used to compute the content digest. (Default: SHA-1.)
Any string that is suitable for passing to the Digest module constructor is valid. The choice of a digest is a compromise between speed and risk of collisions.
parts
This internal attribute describes how many leading parts (i.e. sub-directories) the content filename is split into. It is computed automatically from digest.
For example, the empty file would be linked to:
# digest = MD4, parts = 1
31/d6cfe0d16ae931b73c59d7e0c089c0
# digest = MD5, parts = 1
d4/1d8cd98f00b204e9800998ecf8427e
# digest = SHA-1, parts = 1
da/39a3ee5e6b4b0d3255bfef95601890afd80709
# digest = SHA-256, parts = 2
e3/b0/c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
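The splitting shown above can be sketched in a few lines. The helper name content_path is hypothetical, and the two-character width of each sub-directory is inferred from the examples rather than taken from the module's code; the values of parts are passed in by hand, not computed.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );
use Digest::SHA qw( sha1_hex sha256_hex );

# Hypothetical helper: split a hex digest into $parts two-character
# sub-directories, followed by the remainder of the digest.
sub content_path {
    my ( $hex, $parts ) = @_;
    my @dirs = map { substr( $hex, 2 * $_, 2 ) } 0 .. $parts - 1;
    return join '/', @dirs, substr( $hex, 2 * $parts );
}

# Digests of the empty string, with the parts values listed above.
print content_path( md5_hex(''),    1 ), "\n";
print content_path( sha1_hex(''),   1 ), "\n";
print content_path( sha256_hex(''), 2 ), "\n";
```

This prints the MD5, SHA-1 and SHA-256 paths listed above for the empty file.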
check_for_collisions
When this boolean attribute is set to true, any time the content file for a file linked into the store already exists, the files will be compared for equality before linking them. This prevents data loss in case of collisions.
The default is true to avoid data loss.
If a collision is detected, the solution is to upgrade the digest to a stronger one.
# create an MD5 store
my $md5_store = File::ContentStore->new( path => $old, digest => 'MD5' );
# expose a collision
$md5_store->link_file($file); # dies
# create a new SHA-1 store
my $sha1_store = File::ContentStore->new( path => $new, digest => 'SHA-1' );
# link the old content into the new store
# the files that were linked to the old store will be linked to the new one
$sha1_store->link_dir( $md5_store->path );
$sha1_store->link_file($file); # success!
$md5_store->path->remove_tree; # delete the old content store
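The equality test performed when check_for_collisions is true could look like the sketch below. It is an assumption that the comparison is a plain byte-by-byte one; the helper name same_content is hypothetical, and the core File::Compare module stands in for whatever the module really uses.

```perl
use strict;
use warnings;
use File::Compare qw( compare );
use File::Spec;
use File::Temp qw( tempdir );

# Hypothetical helper: true only if both files hold the same bytes.
sub same_content {
    my ( $file, $content ) = @_;
    my $result = compare( $file, $content );    # 0 equal, 1 different, -1 error
    die "Can't compare $file and $content: $!" if $result < 0;
    return $result == 0;
}

# Scratch files: 'a' and 'b' are identical, 'c' differs.
my $dir = tempdir( CLEANUP => 1 );
my %path;
for my $spec ( [ a => "photo\n" ], [ b => "photo\n" ], [ c => "other\n" ] ) {
    my ( $name, $data ) = @$spec;
    $path{$name} = File::Spec->catfile( $dir, $name );
    open my $out, '>', $path{$name} or die "Can't write $path{$name}: $!";
    print {$out} $data;
    close $out or die "Can't close $path{$name}: $!";
}

print "a and b: ", ( same_content( @path{qw( a b )} ) ? "same" : "collision" ), "\n";
print "a and c: ", ( same_content( @path{qw( a c )} ) ? "same" : "collision" ), "\n";
```

A file that maps to an existing content file but fails this test is a digest collision, and linking it would silently replace its content with someone else's.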
make_read_only
When this attribute is set to a true value, a chmod to remove the write permissions is performed on the content files (and therefore the linked files, since permissions are an attribute of the inode).
The default is true, to avoid unwittingly modifying linked files that were identical unbeknownst to the user.
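The effect on linked files can be checked with core Perl. This is a sketch under assumptions: the file names are made up, and clearing the 0222 bits is one way to "remove the write permissions", not necessarily the exact mask the module applies.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw( tempdir );

my $dir     = tempdir( CLEANUP => 1 );
my $file    = File::Spec->catfile( $dir, 'file1' );
my $content = File::Spec->catfile( $dir, 'content' );

# Create a writable file and give it a second name.
open my $out, '>', $file or die "Can't write $file: $!";
print {$out} "photo\n";
close $out or die "Can't close $file: $!";
chmod 0644, $file    or die "Can't chmod $file: $!";
link $file, $content or die "Can't link $file to $content: $!";

# Clear the write bits through the second name only.
my $mode = ( stat $content )[2] & 07777;
chmod $mode & ~0222, $content or die "Can't chmod $content: $!";

# The first name sees the new mode, because both names share one inode.
my $file_mode = ( stat $file )[2] & 07777;
printf "%s is now %04o\n", $file, $file_mode;
```

The mode printed for file1 is 0444, although chmod was only ever called on the other name after linking.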
file_callback
This optional coderef is called by "link_file" when linking a file into the store. This is useful for providing user feedback when processing large directories. The callback receives three arguments: the file, its digest and the content file (files are passed as Path::Tiny objects). It is run right after obtaining the file digest, before doing anything else.
Usage example:
File::ContentStore->new(
    path          => $dir,
    file_callback => sub {
        my ( $file, $digest, $content ) = @_;
        print STDERR "Linking $file ($digest) to $content\n";
    }
);
METHODS
new
Constructor. See "ATTRIBUTES" for valid attributes.
link_file
$store->link_file($file);
Link a single file into the content store.
link_dir
$store->link_dir(@dirs);
Recursively link all the files under the given directories.
fsck
Runs a consistency check on the content store (i.e. the files under path), and returns a hash reference containing all the errors found. If no error is found, the hash reference is empty.
The types of errors found are:
- empty
An array reference containing all the empty directories under path.
- orphan
An array reference containing Path::Tiny objects pointing to the content files with no alias (i.e. not linked to any file outside of the content store).
- corrupted
An array reference of all content files whose name does not match the digest of their content.
- symlink
An array reference of all symbolic links under path.
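One way to consume the returned structure is to flatten it into report lines. The helper fsck_report and the sample paths below are hypothetical; the sample hash only mimics the documented shape (error type mapped to an array reference), and with a real store it would come from $store->fsck.

```perl
use strict;
use warnings;

# Hypothetical helper: one "type: path" line per problem found.
sub fsck_report {
    my ($errors) = @_;
    my @lines;
    for my $type ( sort keys %$errors ) {
        push @lines, map { "$type: $_" } @{ $errors->{$type} };
    }
    return @lines;
}

# Hand-made sample data; with a real store: my $errors = $store->fsck;
my $errors = {
    empty   => ['store/ab'],
    symlink => ['store/cd/link'],
};

my @report = fsck_report($errors);
print "$_\n" for @report;
print "store is consistent\n" if !@report;
```

Because an error-free check returns an empty hash reference, an empty report is enough to conclude that nothing was found.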
SEE ALSO
Other modules suitable for finding duplicated files: File::Find::Duplicates, File::Same.
AUTHOR
Philippe Bruhat (BooK) <book@cpan.org>.
COPYRIGHT
Copyright 2018 Philippe Bruhat (BooK), all rights reserved.
LICENSE
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.