NAME

RDF::Scutter - Perl extension for harvesting distributed RDF resources

SYNOPSIS

use RDF::Scutter;
use RDF::Redland;
my $scutter = RDF::Scutter->new(
  scutterplan => ['http://www.kjetil.kjernsmo.net/foaf.rdf',
                  'http://my.opera.com/kjetilk/xml/foaf/'],
  from => 'scutterer@example.invalid');

my $storage=new RDF::Redland::Storage("hashes", "rdfscutter", "new='yes',hash-type='bdb',dir='/tmp/',contexts='yes'");
my $model = $scutter->scutter($storage, 30);
my $serializer=new RDF::Redland::Serializer("ntriples");
print $serializer->serialize_model_to_string(undef,$model);

DESCRIPTION

As the name implies, this is an RDF Scutter. A scutter is a web robot that follows seeAlso-links, retrieves the content it finds at those URLs, and adds the RDF statements it finds there to its own store of RDF statements.
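The traversal described above can be pictured with a minimal sketch. It is not the module's actual implementation: the %web hash, the scutter_sketch function and the example URLs are all hypothetical, and the "web" is an in-memory hash so that nothing is fetched over the network. It only illustrates the idea of following seeAlso links from a plan, skipping URLs already seen, and stopping after a maximum number of successful retrievals.

```perl
use strict;
use warnings;

# Hypothetical in-memory "web": each URL maps to the number of statements
# found there and the seeAlso links it points to. No network is involved.
my %web = (
    'http://a.example/foaf.rdf' => { statements => 3,
        see_also => ['http://b.example/foaf.rdf', 'http://c.example/foaf.rdf'] },
    'http://b.example/foaf.rdf' => { statements => 2,
        see_also => ['http://a.example/foaf.rdf', 'http://d.example/foaf.rdf'] },
    'http://c.example/foaf.rdf' => { statements => 4,
        see_also => ['http://d.example/foaf.rdf', 'http://missing.example/foaf.rdf'] },
    'http://d.example/foaf.rdf' => { statements => 1, see_also => [] },
);

# Crawl breadth-first from the plan. If $max is given, stop once that many
# URLs have been retrieved successfully.
sub scutter_sketch {
    my ($plan, $max) = @_;
    my @queue = @$plan;
    my (%seen, @visited);
    my $total = 0;
    while (@queue) {
        last if defined $max && @visited >= $max;
        my $url = shift @queue;
        next if $seen{$url}++;            # never retrieve the same URL twice
        my $doc = $web{$url} or next;     # unknown URL: treat as a failed fetch
        push @visited, $url;
        $total += $doc->{statements};     # "add the statements to the store"
        push @queue, @{ $doc->{see_also} };
    }
    return { visited => \@visited, statements => $total };
}

my $result = scutter_sketch(['http://a.example/foaf.rdf'], 3);
print "Visited: @{ $result->{visited} }\n";
print "Statements collected: $result->{statements}\n";
```

The cap on successful retrievals plays the same role as the optional second argument to scutter() described under METHODS.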

This module is an alpha release of such a Scutter. It builds an RDF::Redland::Model and can add statements to any RDF::Redland::Storage that supports contexts. Redland storages include file, memory, Berkeley DB and MySQL, among others, but it is not clear to the author which of these support contexts; indeed, the synopsis example does not seem to work.

This class inherits from LWP::RobotUA, which in turn is an LWP::UserAgent, so all methods of both classes are available.

Inheriting from LWP::RobotUA means that it is a robot that behaves nicely by default: it checks robots.txt and sleeps between connections to make sure it doesn't overload remote servers.
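Because the inherited methods are available, the politeness settings can be adjusted on the scutter object itself. The following is a small sketch rather than a recommendation: tune_scutter is a hypothetical helper name and the values are arbitrary. The methods delay and use_sleep come from LWP::RobotUA, and timeout comes from LWP::UserAgent.

```perl
use strict;
use warnings;

# Hypothetical helper: adjust inherited settings on an already constructed
# scutter, e.g. tune_scutter(RDF::Scutter->new(scutterplan => [...], from => '...')).
sub tune_scutter {
    my ($scutter) = @_;
    $scutter->delay(2);      # LWP::RobotUA: minimum delay between requests to a server, in minutes
    $scutter->use_sleep(1);  # LWP::RobotUA: sleep until the delay has passed instead of returning an error response
    $scutter->timeout(30);   # LWP::UserAgent: give up on a request after 30 seconds
    return $scutter;
}
```

Note that the unit of delay is minutes, not seconds, so small values already make for a slow but considerate robot.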

It implements most of the ScutterVocab at http://rdfweb.org/topic/ScutterVocab

CAUTION

This is an alpha release. I haven't tested what it can do if left unsupervised, and you might want to be careful about finding out... The example in the Synopsis is a complete scutter, but one that will retrieve only 30 URLs before returning. You can test it by entering your own URLs (optional) and a valid email address (mandatory). It will count and report what it is doing.

METHODS

new(scutterplan => ARRAYREF, from => EMAILADDRESS [, any LWP::RobotUA parameters])

This is the constructor of the Scutter. You must initialise it with a scutterplan argument, which is an ARRAYREF containing URLs pointing to RDF resources. The Scutter will start its traversal of the web there. You must also set a valid email address in the from argument, so that if your scutter runs amok, your victims will know who to blame.

Finally, you may supply any arguments that LWP::RobotUA and LWP::UserAgent accept.

scutter(RDF::Redland::Storage [, MAXURLS]);

This method will launch the Scutter. As its first argument, it takes an RDF::Redland::Storage object. This allows you to store your model in any way Redland supports, which is very flexible; see the Redland documentation for details. Optionally, it takes an integer as its second argument, giving the maximum number of URLs to retrieve successfully. This provides some security against a runaway robot.

It will return an RDF::Redland::Model containing all statements retrieved from all visited resources.
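As an alternative to serialising the whole model as in the Synopsis, the returned model can be walked statement by statement. This is a sketch only: dump_model is a hypothetical helper name, and it relies on the as_stream method of RDF::Redland::Model, the end, current and next methods of RDF::Redland::Stream, and the as_string method of RDF::Redland::Statement.

```perl
use strict;
use warnings;

# Hypothetical helper: print each statement of a model on its own line
# and return the number of statements printed.
sub dump_model {
    my ($model) = @_;
    my $stream = $model->as_stream;
    my $count  = 0;
    while ($stream && !$stream->end) {
        print $stream->current->as_string, "\n";
        $count++;
        $stream->next;
    }
    return $count;
}
```

It would be called as dump_model($model), with $model being the value returned by scutter().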

BUGS/TODO

There are no known real bugs at the time of this writing, keeping in mind it is an alpha. If you find any, please use the CPAN Request Tracker to report them.

I'm slightly in over my head when I try to add the ScutterVocab statements. Time will show if I have understood it correctly.

Although it uses LWP::Debug for debugging, the author feels it is somewhat problematic to find the right amount of output from the module. Subsequent releases are likely to be quieter than the present release.

For an initial release, heeding robots.txt is actually pretty groundbreaking. However, a good robot should also make use of HTTP caching; the keywords are ETag, Last-Modified and Expires. This will be a focus of upcoming development, and many of these things are now being stated about the context in the RDF. We should also find a way to detect what is being skipped due to robots.txt.

It is not clear how long it would run, or how it would perform, if set to retrieve as much as it could. Currently, it is a serial robot, but there exist Perl modules for building parallel robots. If a serial robot turns out to be too limited, this will need attention.

SEE ALSO

RDF::Redland, LWP.

SUBVERSION REPOSITORY

This code is maintained in a Subversion repository. You may check out the trunk using e.g.

svn checkout http://svn.kjernsmo.net/RDF-Scutter/trunk/ RDF-Scutter

AUTHOR

Kjetil Kjernsmo, <kjetilk@cpan.org>

ACKNOWLEDGEMENTS

Many thanks to Dave Beckett for writing the Redland framework and for helping when the author was confused, and to Dan Brickley for interesting discussions. Also thanks to the LWP authors for their excellent library.

COPYRIGHT AND LICENSE

Copyright (C) 2005 by Kjetil Kjernsmo

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.