NAME
File::Rsync::Mirror::Recent - mirroring via rsync made efficient
SYNOPSIS
!!!! PRE-ALPHA ALERT !!!!
Nothing in here is believed to be stable, nothing yet intended for public consumption. The plan is to provide scripts that act as frontends for all the backend functionality. Option and method names may still change.
For the rationale see the section BACKGROUND.
The documentation in here is normally not needed because the code is meant to be run from several standalone programs. For a quick overview, see the file README.mirrorcpan and the bin/ directory of the distribution. For the architectural ideas see the section THE ARCHITECTURE OF A COLLECTION OF RECENTFILES below.
File::Rsync::Mirror::Recent establishes a view on a collection of File::Rsync::Mirror::Recentfile objects and provides abstractions spanning multiple intervals associated with those.
EXPORT
No exports.
CONSTRUCTORS
my $obj = CLASS->new(%hash)
Constructor. For every argument pair, the key is a method name and the value is the argument passed to that method.
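The following is a minimal sketch of that constructor convention, not the module's actual code. The class My::Demo and its two accessors are hypothetical and exist only to show how each key/value pair becomes a method call.

```perl
package My::Demo;    # hypothetical class, not part of the distribution
use strict;
use warnings;

sub new {
    my ($class, %hash) = @_;
    my $self = bless {}, $class;
    # Each key names a method; the value is handed to that method.
    while (my ($method, $arg) = each %hash) {
        $self->$method($arg);
    }
    return $self;
}

# Two simple getter/setter accessors for the demonstration.
sub localroot { my $self = shift; $self->{localroot} = shift if @_; $self->{localroot} }
sub ttl       { my $self = shift; $self->{ttl}       = shift if @_; $self->{ttl} }

package main;

my $obj = My::Demo->new(localroot => "/tmp/mirror", ttl => 5);
print $obj->localroot, " ", $obj->ttl, "\n";
```

With this convention a misspelled key fails loudly, because Perl dies when asked to call a method that does not exist.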
ACCESSORS
- ignore_link_stat_errors
as in F:R:M:Recentfile
- local
Option to specify the local principal file for operations with a local collection of recentfiles.
- localroot
as in F:R:M:Recentfile
- max_files_per_connection
as in F:R:M:Recentfile
- remote
TBD
- remoteroot
XXX: this is (ATM) different from Recentfile!!!
- remote_recentfile
Rsync address of the remote RECENT.recent symlink or whichever name the principal remote recentfile has.
- rsync_options
Things like compress, links, times or checksums. Passed in to the File::Rsync object used to run the mirror.
- ttl
Minimum time before fetching the principal recentfile again.
- _verbose
Boolean to turn on a bit of verbosity. Use the method verbose to also set the verbosity of associated Recentfile objects.
METHODS
$arrayref = $obj->news ( %options )
Test this with:
perl -Ilib bin/rrr-news \
-after 1217200539 \
-max 12 \
-local /home/ftp/pub/PAUSE/authors/RECENT.recent
perl -Ilib bin/rrr-news \
-after 1217200539 \
-rsync=compress=1 \
-rsync=links=1 \
-localroot /home/ftp/pub/PAUSE/authors/ \
-remote pause.perl.org::authors/RECENT.recent \
-verbose
Note: all parameters that can be passed to recent_events can also be specified here.
Note: all data are kept in memory.
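To illustrate what the after and max options in the commands above plausibly mean, here is a small sketch that filters an in-memory event list. The function filter_news, the event data, and the exact comparison (strictly greater than the after epoch) are assumptions made for illustration; the module's own semantics may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical event list, newest first; paths and epochs are made up.
my @events = (
    { epoch => 1217200600.50, path => "id/X/XY/XYZ/Foo-1.2.tar.gz", type => "new" },
    { epoch => 1217200580.25, path => "id/X/XY/XYZ/Bar-0.3.tar.gz", type => "new" },
    { epoch => 1217200550.00, path => "id/X/XY/XYZ/Foo-1.1.tar.gz", type => "new" },
    { epoch => 1217200500.75, path => "id/X/XY/XYZ/Foo-1.0.tar.gz", type => "new" },
);

sub filter_news {
    my ($events, %options) = @_;
    my @hits = @$events;
    # Keep only events registered after the given epoch.
    @hits = grep { $_->{epoch} > $options{after} } @hits
        if defined $options{after};
    # Truncate to at most "max" events, keeping the newest ones.
    splice @hits, $options{max}
        if defined $options{max} && @hits > $options{max};
    return \@hits;
}

my $news = filter_news(\@events, after => 1217200539, max => 2);
print "$_->{path}\n" for @$news;
```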
overview ( %options )
returns a small table that summarizes the state of all recentfiles collected in this Recent object.
$options{verbose}=1 increases the number of columns displayed.
Here is an example output:
Ival    Cnt            Max            Min        Span    Util  Cloud
  1h     47  1225053014.38  1225049650.91     3363.47   93.4%  ^ ^
  6h    324  1225052939.66  1225033394.84    19544.82   90.5%  ^ ^
  1d    437  1225049651.53  1224966402.53    83248.99   96.4%  ^ ^
  1W   1585  1225039015.75  1224435339.46   603676.29   99.8%  ^ ^
  1M   5855  1225017376.65  1222428503.57  2588873.08   99.9%  ^ ^
  1Q  17066  1224578930.40  1216803512.90  7775417.50  100.0%  ^ ^
  1Y  15901  1223966162.56  1216766820.67  7199341.89   22.8%  ^ ^
   Z   9909  1223966162.56  1216766820.67  7199341.89       -  ^ ^
Ival is the name of the interval.
Cnt is the number of entries in this recentfile.
Max is the highest (first) epoch in this recentfile, rounded.
Min is the lowest (last) epoch in this recentfile, rounded.
Span is the timespan currently covered, rounded.
Util is Span divided by the designated timespan of this recentfile.
Cloud is ascii art illustrating the sequence of the Max and Min timestamps.
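As a worked example of the Span and Util columns, the sketch below recomputes both values for the 1h row of the example output. The helper names and the table of seconds per unit are assumptions for illustration; only the units h, d and W are covered here.

```perl
use strict;
use warnings;

# Assumed lengths in seconds for a few interval units (illustration only).
my %seconds_per_unit = (h => 3600, d => 86400, W => 604800);

# Turn an interval name such as "6h" into its designated timespan in seconds.
sub interval_seconds {
    my ($ival) = @_;
    my ($n, $unit) = $ival =~ /^(\d+)([hdW])$/
        or die "unsupported interval '$ival'";
    return $n * $seconds_per_unit{$unit};
}

# Span is Max minus Min; Util is Span as a percentage of the designated timespan.
sub span_and_util {
    my ($ival, $max, $min) = @_;
    my $span = $max - $min;
    my $util = 100 * $span / interval_seconds($ival);
    return ($span, $util);
}

my ($span, $util) = span_and_util("1h", 1225053014.38, 1225049650.91);
printf "Span %.2f Util %.1f%%\n", $span, $util;
```

For the 1h row this gives a span of 3363.47 seconds out of 3600, which matches the 93.4% shown in the table.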
_pathdb
Keeping track of already handled files. Currently it is a hash, will probably become a database with its own accessors.
$recentfile = $obj->principal_recentfile ()
returns the principal recentfile of this tree.
$recentfiles_arrayref = $obj->recentfiles ()
returns a reference to the complete list of recentfile objects that describe this tree. No guarantee is given that the represented recentfiles exist or have been read. They are just bare objects.
$success = $obj->rmirror ( %options )
Mirrors all recentfiles of the remote address, working through each of them and mirroring the files they list.
Test this with:
use File::Rsync::Mirror::Recent;
my $rrr = File::Rsync::Mirror::Recent->new(
  ignore_link_stat_errors => 1,
  localroot => "/home/ftp/pub/PAUSE/authors",
  remote => "pause.perl.org::authors/RECENT.recent",
  max_files_per_connection => 5000,
  rsync_options => {
    compress => 1,
    links => 1,
    times => 1,
    checksum => 0,
  },
  verbose => 1,
  _runstatusfile => "recent-rmirror-state.yml",
  _logfilefordone => "recent-rmirror-donelog.log",
);
$rrr->rmirror ( "skip-deletes" => 1, loop => 1 );
Or try without the loop parameter and write the loop yourself:
use File::Rsync::Mirror::Recent;
my @rrr;
for my $t ("authors","modules"){
  my $rrr = File::Rsync::Mirror::Recent->new(
    ignore_link_stat_errors => 1,
    localroot => "/home/ftp/pub/PAUSE/$t",
    remote => "pause.perl.org::$t/RECENT.recent",
    max_files_per_connection => 512,
    rsync_options => {
      compress => 1,
      links => 1,
      times => 1,
      checksum => 0,
    },
    verbose => 1,
    _runstatusfile => "recent-rmirror-state-$t.yml",
    _logfilefordone => "recent-rmirror-donelog-$t.log",
    ttl => 5,
  );
  push @rrr, $rrr;
}
while (1) {
  for my $rrr (@rrr){
    $rrr->rmirror ( "skip-deletes" => 1 );
  }
  warn "sleeping 23\n"; sleep 23;
}
$verbose = $obj->verbose ( $set )
Getter/setter method to set verbosity for this object and all associated Recentfile objects.
THE ARCHITECTURE OF A COLLECTION OF RECENTFILES
The idea is to have one short file that records very recent changes, so that a fresh mirror can stay fresh as long as connectivity holds. Longer files record the history before that: when a mirror falls behind the period covered by the shortest file, it can complement the list of recent file events with the next longer one. If that is still not long enough, there is another one, again a bit longer, and finally one that completes the history back to the oldest file. Taken together, the index files contain the complete list of current files. The further back the period covered by an index file lies, the less often that file is updated. For practical reasons adjacent files will often overlap a bit, but this is neither necessary nor enforced. That's the basic idea. The following example represents a tree that has a few updates every day:
RECENT.recent -> RECENT-1h.yaml
RECENT-6h.yaml
RECENT-1d.yaml
RECENT-1W.yaml
RECENT-1M.yaml
RECENT-1Q.yaml
RECENT-1Y.yaml
RECENT-Z.yaml
The first file is the principal file, insofar as it is the one that is written first after a filesystem change. Usually a symlink points to it, with a filename that has the same filenameroot and the suffix .recent. On systems that do not support symlinks a plain copy is maintained instead.
The last file, the Z file, contains the complementary files that are in none of the other files. It never contains deletes. Besides this, it serves as a recovery mechanism or spill-over pond. When things go wrong, it is a valuable controlling instance that holds the differences between the collection of limited interval files and the actual filesystem.
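The way a lagging mirror works through progressively longer files can be sketched as follows. This is an illustration under stated assumptions, not the module's algorithm: the function recentfiles_needed is hypothetical, the interval lengths in seconds are assumed, and every file is taken to cover its full designated timespan, which a real client would have to verify from the timestamps.

```perl
use strict;
use warnings;

# Intervals from the example tree with assumed lengths in seconds;
# Z has no limit and reaches back to the oldest file.
my @intervals = (
    [ "1h" => 3600 ],
    [ "6h" => 6 * 3600 ],
    [ "1d" => 86400 ],
    [ "1W" => 7 * 86400 ],
    [ "1M" => 30 * 86400 ],
    [ "1Q" => 90 * 86400 ],
    [ "1Y" => 365 * 86400 ],
    [ "Z"  => undef ],
);

# Return the recentfile names a mirror must read when it is $lag seconds behind.
sub recentfiles_needed {
    my ($lag) = @_;
    my @needed;
    for my $iv (@intervals) {
        my ($name, $seconds) = @$iv;
        push @needed, "RECENT-$name.yaml";
        # Stop as soon as one file reaches back far enough.
        last if defined $seconds && $seconds >= $lag;
    }
    return @needed;
}

# A mirror that is two hours behind needs the 1h and the 6h file.
print join(" ", recentfiles_needed(2 * 3600)), "\n";
```

A mirror that is further behind than the longest limited interval ends up reading every file, including the Z file.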
THE INDIVIDUAL RECENTFILE
A recentfile consists of a hash that has two keys: meta and recent. The meta part has metadata and the recent part has a list of fileobjects.
THE META PART
Here we find things that are pretty much self-explanatory: all lowercase attributes are accessors and as such explained somewhere above in this manpage. The uppercase attribute Producers contains version information about involved software components. Nothing to worry about, I believe.
THE RECENT PART
This is the interesting part. Every entry refers to some filesystem change (with path, epoch, type).
The epoch value is the point in time when some change was registered, but it can be set to arbitrary values. Do not be tempted to believe that an entry has a direct relation to something like modification time or change time on the filesystem level. Epoch values do not reflect release dates either. (If you want exact release dates: Barbie is providing a database of them. See http://use.perl.org/~barbie/journal/37907.)
All these entries can be divided into two types (denoted by the type attribute): news and deletes. Changes and creations are news. Deletes are deletes.
Besides an epoch and a type attribute we find a third one: path. This path is relative to the directory we find the recentfile in.
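To make the layout concrete, here is a sketch of what a recentfile might look like once loaded into a Perl data structure, followed by a split into news and deletes. The meta key, the paths, the epochs, and the literal type values "new" and "delete" are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical recentfile content; the meta key and all paths are made up.
my $recentfile = {
    meta => {
        interval => "1h",
    },
    recent => [
        { epoch => 1225053014.38, path => "id/X/XY/XYZ/Foo-1.1.tar.gz", type => "new" },
        { epoch => 1225052900.12, path => "id/X/XY/XYZ/Foo-1.0.tar.gz", type => "delete" },
        { epoch => 1225049650.91, path => "id/X/XY/XYZ/Bar-0.3.tar.gz", type => "new" },
    ],
};

# Separate the paths to fetch from the paths to remove.
my (@news, @deletes);
for my $entry (@{ $recentfile->{recent} }) {
    if ($entry->{type} eq "delete") {
        push @deletes, $entry->{path};
    }
    else {
        push @news, $entry->{path};
    }
}
printf "%d news, %d deletes\n", scalar @news, scalar @deletes;
```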
The entries in the recentfile are ordered by decreasing epoch attribute. The epochs are unique floating point numbers. When the server has ntp running correctly, the timestamps usually reflect a real epoch. If time is running backwards, we trump the system epoch with strictly monotonically increasing floating point timestamps and guarantee they are unique.
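One simple way to obtain strictly increasing, unique timestamps even when the clock stands still or runs backwards is sketched below. The function next_epoch and the fixed increment of 0.001 are assumptions, and plain Perl floats are used for brevity; this is a simplification, not the module's implementation.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Return a timestamp strictly greater than $last, preferring the system clock.
# $now may be passed in explicitly, which makes the function easy to test.
sub next_epoch {
    my ($last, $now) = @_;
    $now = Time::HiRes::time() unless defined $now;
    return $now if !defined $last || $now > $last;
    # Clock stood still or ran backwards: nudge past the last value instead.
    return $last + 0.001;
}

my $e1 = next_epoch(undef, 1000.5);   # first event, clock is trusted
my $e2 = next_epoch($e1,   999.0);    # clock ran backwards
my $e3 = next_epoch($e2,   1000.5);   # clock still behind the last timestamp
print "timestamps stay ordered\n" if $e1 < $e2 && $e2 < $e3;
```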
CORRUPTION AND RECOVERY
If the origin host breaks the promise to deliver consistent and complete recentfiles then the way back to sanity shall be achieved through traditional rsyncing between the hosts. But don't forget to report it as a bug:)
BACKGROUND
This is about speeding up rsync operation on large trees. Uses a small metadata cocktail and pull technology.
NON-COMPETITORS
File::Mirror         JWU/File-Mirror/File-Mirror-0.10.tar.gz  only local trees
Mirror::YAML         ADAMK/Mirror-YAML-0.03.tar.gz            some sort of inner circle
Net::DownloadMirror  KNORR/Net-DownloadMirror-0.04.tar.gz     FTP sites and stuff
Net::MirrorDir       KNORR/Net-MirrorDir-0.05.tar.gz          ditto
Net::UploadMirror    KNORR/Net-UploadMirror-0.06.tar.gz       ditto
Pushmi::Mirror       CLKAO/Pushmi-v1.0.0.tar.gz               something SVK
rsnapshot            www.rsnapshot.org                        focus on backup
csync                www.csync.org                            more like unison
multi-rsync          sourceforge 167893                       lan push to many
COMPETITORS
The problem to solve is one that clusters, ftp mirrors, and other replicated datasets like CPAN share: how to transfer only a minimum amount of data to determine the diff between two hosts.
Normally it takes a long time to determine the diff itself before it can be transferred. Known solutions at the time of this writing are csync2 and rsync 3 batch mode.
For many years the best solution was csync2 which solves the problem by maintaining a sqlite database on both ends and talking a highly sophisticated protocol to quickly determine which files to send and which to delete at any given point in time. Csync2 is often inconvenient because it is push technology and the act of syncing demands quite an intimate relationship between the sender and the receiver. This is hard to achieve in an environment of loosely coupled sites where the number of sites is large or connections are unreliable or network topology is changing.
Rsync 3 batch mode works around these problems by providing rsync-able batch files which allow receiving nodes to replay the history of the other nodes. This reduces the need to have an incestuous relation but it has the disadvantage that these batch files replicate the contents of the involved files. This seems inappropriate when the nodes already have a means of communicating over rsync.
rersyncrecent solves this problem with a couple of (usually 2-10) index files which cover different overlapping time intervals. The master writes these files and the clients/slaves can construct the full tree from the information contained in them. The most recent index file usually covers the last seconds or minutes or hours of the tree and depending on the needs, slaves can rsync every few seconds or minutes and then bring their trees in full sync.
The rersyncrecent mode was developed for CPAN but I hope it is a convenient and economic general purpose solution. I'm looking forward to seeing a CPAN backbone that is only a few seconds behind PAUSE. And then ... the first FUSE based CPAN filesystem anyone?
FUTURE DIRECTIONS
Currently the origin server must keep track of injected and removed files. This should be supported by an inotify-based assistant.
SEE ALSO
File::Rsync::Mirror::Recentfile, File::Rsync::Mirror::Recentfile::Done, File::Rsync::Mirror::Recentfile::FakeBigFloat
BUGS
Please report any bugs or feature requests through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=File-Rsync-Mirror-Recent. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc File::Rsync::Mirror::Recent
You can also look for information at:
RT: CPAN's request tracker
http://rt.cpan.org/NoAuth/Bugs.html?Dist=File-Rsync-Mirror-Recent
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
ACKNOWLEDGEMENTS
Thanks to RJBS for module-starter.
AUTHOR
Andreas König
COPYRIGHT & LICENSE
Copyright 2008, 2009 Andreas König.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.