NAME

File::Rsync::Mirror::Recent - mirroring via rsync made efficient

SYNOPSIS

!!!! PRE-ALPHA ALERT !!!!

Nothing in here is believed to be stable, and nothing is yet intended for public consumption. The plan is to provide scripts that act as frontends for all the backend functionality. Option and method names may still change.

For the rationale see the section BACKGROUND.

The documentation in here is normally not needed because the code is meant to be run from several standalone programs. For a quick overview, see the file README.mirrorcpan and the bin/ directory of the distribution. For the architectural ideas see the section THE ARCHITECTURE OF A COLLECTION OF RECENTFILES below.

File::Rsync::Mirror::Recent establishes a view on a collection of File::Rsync::Mirror::Recentfile objects and provides abstractions spanning multiple intervals associated with those.

EXPORT

No exports.

CONSTRUCTORS

my $obj = CLASS->new(%hash)

Constructor. On every argument pair the key is a method name and the value is an argument to that method name.
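
The following is a minimal sketch of this constructor convention, not the module's actual code. The class Demo::Mirror and its two accessors are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;

package Demo::Mirror;    # hypothetical class, for illustration only

# Two simple getter/setter accessors.
sub localroot { my $self = shift; $self->{localroot} = shift if @_; $self->{localroot} }
sub ttl       { my $self = shift; $self->{ttl}       = shift if @_; $self->{ttl} }

# Each key of %hash names a method; the value is passed to that method.
sub new {
    my ($class, %hash) = @_;
    my $self = bless {}, $class;
    for my $method (sort keys %hash) {
        die "unknown method '$method'" unless $self->can($method);
        $self->$method($hash{$method});
    }
    return $self;
}

package main;

my $obj = Demo::Mirror->new(localroot => "/tmp/mirror", ttl => 5);
print $obj->localroot, " ", $obj->ttl, "\n";
```

A consequence of this design is that a misspelled argument name fails early, because no method of that name exists.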

ACCESSORS

as in F:R:M:Recentfile

local

Option to specify the local principal file for operations with a local collection of recentfiles.

localroot

as in F:R:M:Recentfile

max_files_per_connection

as in F:R:M:Recentfile

remote

TBD

remoteroot

XXX: this is (ATM) different from Recentfile!!!

remote_recentfile

Rsync address of the remote RECENT.recent symlink or whichever name the principal remote recentfile has.

rsync_options

Things like compress, links, times or checksums. Passed in to the File::Rsync object used to run the mirror.

ttl

Minimum time before fetching the principal recentfile again.
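
As a rough sketch of what such a minimum wait means, the hypothetical helper below decides whether the principal recentfile is due for another fetch. The function name and its arguments are assumptions and not part of this module.

```perl
use strict;
use warnings;

# Return true when the principal recentfile should be fetched again.
# $last_fetch and $now are epoch seconds; $ttl is the minimum wait.
sub principal_is_stale {
    my ($last_fetch, $ttl, $now) = @_;
    return 1 unless defined $last_fetch;    # never fetched yet
    return ($now - $last_fetch) >= $ttl ? 1 : 0;
}

my $now = time;
print principal_is_stale($now - 3, 5, $now) ? "fetch\n" : "wait\n";
print principal_is_stale($now - 9, 5, $now) ? "fetch\n" : "wait\n";
```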

_verbose

Boolean to turn on a bit of verbosity. Use the method verbose to also set the verbosity of associated Recentfile objects.

METHODS

$arrayref = $obj->news ( %options )

Test this with:

perl -Ilib bin/rrr-news \
     -after 1217200539 \
     -max 12 \
     -local /home/ftp/pub/PAUSE/authors/RECENT.recent

perl -Ilib bin/rrr-news \
     -after 1217200539 \
     -rsync=compress=1 \
     -rsync=links=1 \
     -localroot /home/ftp/pub/PAUSE/authors/ \
     -remote pause.perl.org::authors/RECENT.recent \
     -verbose

Note: all parameters that can be passed to recent_events can also be specified here.

Note: all data are kept in memory.
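
To illustrate how the after and max options narrow down a list of events, here is a sketch with a hypothetical helper filter_news. It assumes that events arrive newest first, that after is an exclusive lower bound on the epoch, and that max caps the number of results. The paths and the type value are invented sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: filter a list of events (newest first).
sub filter_news {
    my ($events, %options) = @_;
    my @result;
    for my $event (@$events) {
        next if defined $options{after} && $event->{epoch} <= $options{after};
        push @result, $event;
        last if defined $options{max} && @result >= $options{max};
    }
    return \@result;
}

my $events = [
    { epoch => 1217200600.5, path => "id/A/AA/AAA/Foo-1.0.tar.gz", type => "new" },
    { epoch => 1217200550.2, path => "id/B/BB/BBB/Bar-0.2.tar.gz", type => "new" },
    { epoch => 1217200500.1, path => "id/C/CC/CCC/Baz-0.1.tar.gz", type => "new" },
];
my $news = filter_news($events, after => 1217200539, max => 12);
print scalar(@$news), " events\n";    # the two entries newer than 'after'
```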

overview ( %options )

returns a small table that summarizes the state of all recentfiles collected in this Recent object.

$options{verbose}=1 increases the number of columns displayed.

Here is an example output:

Ival   Cnt           Max           Min       Span   Util          Cloud
  1h    47 1225053014.38 1225049650.91    3363.47  93.4% ^  ^
  6h   324 1225052939.66 1225033394.84   19544.82  90.5%  ^   ^
  1d   437 1225049651.53 1224966402.53   83248.99  96.4%   ^    ^
  1W  1585 1225039015.75 1224435339.46  603676.29  99.8%     ^    ^
  1M  5855 1225017376.65 1222428503.57 2588873.08  99.9%       ^    ^
  1Q 17066 1224578930.40 1216803512.90 7775417.50 100.0%         ^   ^
  1Y 15901 1223966162.56 1216766820.67 7199341.89  22.8%           ^  ^
   Z  9909 1223966162.56 1216766820.67 7199341.89      -           ^  ^

Ival is the name of the interval.

Cnt is the number of entries in this recentfile.

Max is the highest (first) epoch in this recentfile, rounded.

Min is the lowest (last) epoch in this recentfile, rounded.

Span is the timespan currently covered, rounded.

Util is Span divided by the designated timespan of this recentfile.

Cloud is ascii art illustrating the sequence of the Max and Min timestamps.
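
The Span and Util columns can be recomputed from Max, Min and the interval name. The sketch below does this for one row of the sample output. The unit lengths in %seconds are assumptions (for instance M as 30 days); they happen to reproduce the figures in the table. The helper names are hypothetical.

```perl
use strict;
use warnings;

# Assumed unit lengths in seconds; they reproduce the sample table above.
my %seconds = (
    h => 3600,
    d => 86400,
    W => 7 * 86400,
    M => 30 * 86400,
    Q => 90 * 86400,
    Y => 365.25 * 86400,
);

# Convert an interval name such as "6h" into seconds; "Z" has no length.
sub interval_secs {
    my ($ival) = @_;
    my ($n, $unit) = $ival =~ /^(\d+)([hdWMQY])$/ or return;
    return $n * $seconds{$unit};
}

# Return (Span, Util) as formatted strings for one row.
sub span_and_util {
    my ($ival, $max, $min) = @_;
    my $span = $max - $min;
    my $secs = interval_secs($ival);
    my $util = defined $secs ? sprintf("%.1f%%", 100 * $span / $secs) : "-";
    return (sprintf("%.2f", $span), $util);
}

my ($span, $util) = span_and_util("1h", 1225053014.38, 1225049650.91);
print "$span $util\n";    # 3363.47 93.4%
```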

_pathdb

Keeping track of already handled files. Currently it is a hash, will probably become a database with its own accessors.

$recentfile = $obj->principal_recentfile ()

returns the principal recentfile of this tree.

$recentfiles_arrayref = $obj->recentfiles ()

returns a reference to the complete list of recentfile objects that describe this tree. No guarantee is given that the represented recentfiles exist or have been read. They are just bare objects.

$success = $obj->rmirror ( %options )

Mirrors all recentfiles of the remote address, working through each of them and mirroring the files they list.

Test this with:

use File::Rsync::Mirror::Recent;
my $rrr = File::Rsync::Mirror::Recent->new(
       ignore_link_stat_errors => 1,
       localroot => "/home/ftp/pub/PAUSE/authors",
       remote => "pause.perl.org::authors/RECENT.recent",
       max_files_per_connection => 5000,
       rsync_options => {
                         compress => 1,
                         links => 1,
                         times => 1,
                         checksum => 0,
                        },
       verbose => 1,
       _runstatusfile => "recent-rmirror-state.yml",
       _logfilefordone => "recent-rmirror-donelog.log",
);
$rrr->rmirror ( "skip-deletes" => 1, loop => 1 );

Or try without the loop parameter and write the loop yourself:

use File::Rsync::Mirror::Recent;
my @rrr;
for my $t ("authors","modules"){
    my $rrr = File::Rsync::Mirror::Recent->new(
       ignore_link_stat_errors => 1,
       localroot => "/home/ftp/pub/PAUSE/$t",
       remote => "pause.perl.org::$t/RECENT.recent",
       max_files_per_connection => 512,
       rsync_options => {
                         compress => 1,
                         links => 1,
                         times => 1,
                         checksum => 0,
                        },
       verbose => 1,
       _runstatusfile => "recent-rmirror-state-$t.yml",
       _logfilefordone => "recent-rmirror-donelog-$t.log",
       ttl => 5,
    );
    push @rrr, $rrr;
}
while (){
  for my $rrr (@rrr){
    $rrr->rmirror ( "skip-deletes" => 1 );
  }
  warn "sleeping 23\n"; sleep 23;
}

$verbose = $obj->verbose ( $set )

Getter/setter method to set verbosity for this object and all associated Recentfile objects.

THE ARCHITECTURE OF A COLLECTION OF RECENTFILES

The idea is that we want a short file that records really recent changes, so that a fresh mirror can be kept fresh as long as connectivity is given. Then we want longer files that record the history before that. When the mirror falls behind the update period reflected in the shortest file, it can complement the list of recent file events with the next one. If that is not long enough, we want another one, again a bit longer. And we want one that completes the history back to the oldest file. Taken together, the index files contain the complete list of current files. The further back the period covered by an index file lies, the less often that index file is updated. For practical reasons adjacent files will often overlap a bit, but this is neither necessary nor enforced. That's the basic idea. The following example represents a tree that has a few updates every day:

RECENT.recent -> RECENT-1h.yaml
RECENT-6h.yaml
RECENT-1d.yaml
RECENT-1W.yaml
RECENT-1M.yaml
RECENT-1Q.yaml
RECENT-1Y.yaml
RECENT-Z.yaml

The first file is the principal file, insofar as it is the one that is written first after a filesystem change. Usually a symlink links to it with a filename that has the same filenameroot and the suffix .recent. On systems that do not support symlinks, a plain copy is maintained instead.

The last file, the Z file, contains the complementary files that are in none of the other files. It never contains deletes. Besides this, it serves as a recovery mechanism or spill-over pond. When things go wrong, it is a valuable controlling instance that holds the differences between the collection of limited interval files and the actual filesystem.
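
The way a mirror that has fallen behind works through successively longer files can be sketched as follows. This is a simplified illustration, not the module's algorithm: the helper files_needed is hypothetical, the interval lengths are assumptions, and it assumes that each file really covers its designated interval.

```perl
use strict;
use warnings;

# Recentfiles ordered from shortest to longest interval, with assumed
# interval lengths in seconds. Z has no limit and completes the history.
my @files = (
    [ "RECENT-1h.yaml", 3600 ],
    [ "RECENT-6h.yaml", 6 * 3600 ],
    [ "RECENT-1d.yaml", 86400 ],
    [ "RECENT-1W.yaml", 7 * 86400 ],
    [ "RECENT-Z.yaml",  undef ],
);

# Return the names of the files a mirror has to read when its last
# successful sync was $behind seconds ago.
sub files_needed {
    my ($behind) = @_;
    my @needed;
    for my $file (@files) {
        my ($name, $secs) = @$file;
        push @needed, $name;
        last if defined $secs && $secs >= $behind;
    }
    return @needed;
}

print join(", ", files_needed(2 * 3600)), "\n";
```

A mirror that is two hours behind reads the 1h and the 6h file; one that is a month behind ends up reading everything including the Z file.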

THE INDIVIDUAL RECENTFILE

A recentfile consists of a hash that has two keys: meta and recent. The meta part has metadata and the recent part has a list of fileobjects.
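
Loaded into Perl, such a structure might look like the sketch below. All values are invented for illustration; the meta key interval and the type strings "new" and "delete" are assumptions.

```perl
use strict;
use warnings;

# Hypothetical content of a small recentfile after it has been loaded
# into a Perl data structure. All values are invented for illustration.
my $recentfile = {
    meta => {
        interval => "1h",    # assumed example of a lowercase attribute
    },
    recent => [
        { epoch => 1225053014.38, path => "id/A/AA/AAA/Foo-1.0.tar.gz", type => "new" },
        { epoch => 1225049650.91, path => "id/B/BB/BBB/Bar-0.2.tar.gz", type => "delete" },
    ],
};

# Walk the list of file objects in the recent part.
for my $event (@{ $recentfile->{recent} }) {
    printf "%-6s %s\n", $event->{type}, $event->{path};
}
```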

THE META PART

Here we find things that are pretty much self-explanatory: all lowercase attributes are accessors and as such are explained somewhere above in this manpage. The uppercase attribute Producers contains version information about the software components involved. Nothing to worry about, I believe.

THE RECENT PART

This is the interesting part. Every entry refers to some filesystem change (with path, epoch, type).

The epoch value is the point in time when some change was registered, but it can be set to arbitrary values. Do not be tempted to believe that the entry has a direct relation to something like modification time or change time on the filesystem level. Epoch values do not reflect release dates either. (If you want exact release dates: Barbie is providing a database of them. See http://use.perl.org/~barbie/journal/37907).

All these entries can be divided into two types (denoted by the type attribute): news and deletes. Changes and creations are news. Deletes are deletes.

Besides an epoch and a type attribute we find a third one: path. This path is relative to the directory we find the recentfile in.

The entries in the recentfile are ordered by decreasing epoch attribute. The epochs are unique floating point numbers. When the server has ntp running correctly, the timestamps usually reflect a real epoch. If time is running backwards, we trump the system epoch with strictly monotonically increasing floating point timestamps and guarantee they are unique.
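
One way to obtain timestamps with these properties is sketched below. It is a simplification that uses plain Perl floating point numbers and a fixed step of 0.0001 seconds; both are assumptions, and the helper next_epoch is hypothetical.

```perl
use strict;
use warnings;
use Time::HiRes ();

my $last_epoch = 0;

# Return a timestamp that is strictly greater than any returned before,
# even if the system clock has jumped backwards. $now is injectable.
sub next_epoch {
    my ($now) = @_;
    $now = Time::HiRes::time() unless defined $now;
    # If the clock did not advance, step a little past the last value.
    $now = $last_epoch + 0.0001 if $now <= $last_epoch;
    return $last_epoch = $now;
}

# The second and third input simulate a clock that runs backwards or stalls.
my @stamps = map { next_epoch($_) } (1225053014.38, 1225053010.00, 1225053014.38);
print join(" ", map { sprintf "%.4f", $_ } @stamps), "\n";
```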

CORRUPTION AND RECOVERY

If the origin host breaks the promise to deliver consistent and complete recentfiles, then the way back to sanity is traditional rsyncing between the hosts. But don't forget to report it as a bug :)

BACKGROUND

This is about speeding up rsync operations on large trees. It uses a small metadata cocktail and pull technology.

NON-COMPETITORS

File::Mirror        JWU/File-Mirror/File-Mirror-0.10.tar.gz only local trees
Mirror::YAML        ADAMK/Mirror-YAML-0.03.tar.gz           some sort of inner circle
Net::DownloadMirror KNORR/Net-DownloadMirror-0.04.tar.gz    FTP sites and stuff
Net::MirrorDir      KNORR/Net-MirrorDir-0.05.tar.gz         ditto
Net::UploadMirror   KNORR/Net-UploadMirror-0.06.tar.gz      ditto
Pushmi::Mirror      CLKAO/Pushmi-v1.0.0.tar.gz              something SVK

rsnapshot           www.rsnapshot.org                       focus on backup
csync               www.csync.org                           more like unison
multi-rsync         sourceforge 167893                      lan push to many

COMPETITORS

The problem to solve is one that clusters, ftp mirrors, and other replicated datasets like CPAN share: how to transfer only a minimum amount of data to determine the diff between two hosts.

Normally it takes a long time to determine the diff itself before it can be transferred. Known solutions at the time of this writing are csync2 and rsync 3 batch mode.

For many years the best solution was csync2 which solves the problem by maintaining a sqlite database on both ends and talking a highly sophisticated protocol to quickly determine which files to send and which to delete at any given point in time. Csync2 is often inconvenient because it is push technology and the act of syncing demands quite an intimate relationship between the sender and the receiver. This is hard to achieve in an environment of loosely coupled sites where the number of sites is large or connections are unreliable or network topology is changing.

Rsync 3 batch mode works around these problems by providing rsync-able batch files which allow receiving nodes to replay the history of the other nodes. This reduces the need to have an incestuous relation but it has the disadvantage that these batch files replicate the contents of the involved files. This seems inappropriate when the nodes already have a means of communicating over rsync.

rersyncrecent solves this problem with a couple of (usually 2-10) index files which cover different overlapping time intervals. The master writes these files and the clients/slaves can construct the full tree from the information contained in them. The most recent index file usually covers the last seconds or minutes or hours of the tree and depending on the needs, slaves can rsync every few seconds or minutes and then bring their trees in full sync.
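
How a client can derive the current tree from several overlapping index files is sketched below. This is an illustration under stated assumptions, not the module's implementation: events carry the types "new" and "delete", and for each path the event with the highest epoch wins. The helper current_tree and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Replay events from several index files into a list of existing paths.
# Assumes event types "new" and "delete"; for each path the event with
# the highest epoch wins.
sub current_tree {
    my (@recentfiles) = @_;
    my %latest;
    for my $events (@recentfiles) {
        for my $event (@$events) {
            my $seen = $latest{ $event->{path} };
            next if $seen && $seen->{epoch} >= $event->{epoch};
            $latest{ $event->{path} } = $event;
        }
    }
    my @paths = grep { $latest{$_}{type} ne "delete" } keys %latest;
    return sort @paths;
}

# A short and a longer index file that overlap on c.txt.
my @short = ( { epoch => 30, path => "b.txt", type => "delete" },
              { epoch => 20, path => "c.txt", type => "new" } );
my @long  = ( { epoch => 20, path => "c.txt", type => "new" },
              { epoch => 10, path => "b.txt", type => "new" },
              { epoch =>  5, path => "a.txt", type => "new" } );
print join(" ", current_tree(\@short, \@long)), "\n";    # a.txt c.txt
```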

The rersyncrecent mode was developed for CPAN but I hope it is a convenient and economical general purpose solution. I'm looking forward to seeing a CPAN backbone that is only a few seconds behind PAUSE. And then ... the first FUSE based CPAN filesystem anyone?

FUTURE DIRECTIONS

Currently the origin server must keep track of injected and removed files. This should be supported by an inotify-based assistant.

SEE ALSO

File::Rsync::Mirror::Recentfile, File::Rsync::Mirror::Recentfile::Done, File::Rsync::Mirror::Recentfile::FakeBigFloat

BUGS

Please report any bugs or feature requests through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=File-Rsync-Mirror-Recent. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc File::Rsync::Mirror::Recent

ACKNOWLEDGEMENTS

Thanks to RJBS for module-starter.

AUTHOR

Andreas König

COPYRIGHT & LICENSE

Copyright 2008, 2009 Andreas König.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.