NAME
SWISH::Prog::Aggregator::Spider - web aggregator
SYNOPSIS
use SWISH::Prog::Indexer;
use SWISH::Prog::Aggregator::Spider;
my $spider = SWISH::Prog::Aggregator::Spider->new(
    indexer => SWISH::Prog::Indexer->new
);
$spider->indexer->start;
$spider->crawl( 'http://swish-e.org/' );
$spider->indexer->finish;
DESCRIPTION
SWISH::Prog::Aggregator::Spider is a web crawler similar to the spider.pl script in the Swish-e 2.4 distribution. Internally, SWISH::Prog::Aggregator::Spider uses WWW::Mechanize to do the hard work. See SWISH::Prog::Aggregator::Spider::UA.
METHODS
See SWISH::Prog::Aggregator
new( params )
All params have their own get/set methods too. They include:
- use_md5
Flag indicating whether each URI's content should be fingerprinted and compared. Useful if the same content is available under multiple URIs and you only want to index it once.
- uri_cache
Get/set the SWISH::Prog::Cache-derived object used to track which URIs have been fetched already.
- md5_cache
If use_md5() is true, this SWISH::Prog::Cache-derived object tracks the URI fingerprints.
- queue
Get/set the SWISH::Prog::Queue-derived object for tracking which URIs still need to be fetched.
- ua
Get/set the SWISH::Prog::Aggregator::Spider::UA object.
- max_depth
How many levels of links to follow. NOTE: This value is counted in links from the first argument passed to crawl().
- delay
Get/set the number of seconds to wait between making requests. Default is 5 seconds (a very friendly delay).
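The idea behind use_md5 and md5_cache can be illustrated with a minimal, self-contained sketch. It does not use the spider's own code: the %seen_fingerprint hash is a hypothetical in-memory stand-in for the md5_cache object, is_new_content() is an illustrative helper, and the page data is made up. Only the core Digest::MD5 module is assumed.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical stand-in for md5_cache: fingerprint => first URI seen with it.
my %seen_fingerprint;

# Returns true if this content has not been seen under any earlier URI.
sub is_new_content {
    my ( $uri, $content ) = @_;
    my $fingerprint = md5_hex($content);
    return 0 if exists $seen_fingerprint{$fingerprint};
    $seen_fingerprint{$fingerprint} = $uri;
    return 1;
}

# Made-up pages: the first two URIs serve identical content.
my @pages = (
    [ 'http://example.com/',           '<html>home</html>' ],
    [ 'http://example.com/index.html', '<html>home</html>' ],
    [ 'http://example.com/about',      '<html>about</html>' ],
);

# Keep only the URIs whose content is new.
my @indexed = map { $_->[0] } grep { is_new_content(@$_) } @pages;
print "$_\n" for @indexed;
```

In this sketch the second URI is skipped because its content fingerprint matches the first, which is the situation use_md5 is meant to handle.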
init
Initializes a new spider object. Called by new().
uri_ok( uri )
Returns true if uri is acceptable for including in an index. The 'ok-ness' of the uri is based on its base, robot rules, and the spider configuration.
get_doc
Returns the next URI from the queue() as a SWISH::Prog::Doc object, or the error message if there was one.
Returns undef if the queue is empty or max_depth() has been reached.
crawl( uri )
Implements the required crawl() method. Recursively fetches uri and its child links to the depth set in max_depth().
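As a rough illustration of a depth-limited crawl, the sketch below walks a made-up link graph with a queue instead of real HTTP requests. It is not the module's implementation: crawl_sketch(), %links_on, and the exact depth semantics (the start URI is depth 0) are assumptions for illustration, and it uses an iterative queue rather than recursion. The %seen hash and @queue array loosely play the roles of uri_cache and queue.

```perl
use strict;
use warnings;

# Hypothetical link graph standing in for fetched pages and their links.
my %links_on = (
    '/'  => [ '/a', '/b' ],
    '/a' => [ '/c' ],
    '/b' => [ '/a' ],
    '/c' => [ '/d' ],
    '/d' => [],
);

# Visit URIs breadth-first, following links at most $max_depth hops from $start.
sub crawl_sketch {
    my ( $start, $max_depth ) = @_;
    my %seen  = ( $start => 1 );      # URIs already queued or visited
    my @queue = ( [ $start, 0 ] );    # pending items: [ uri, depth ]
    my @visited;
    while ( my $item = shift @queue ) {
        my ( $uri, $depth ) = @$item;
        push @visited, $uri;
        next if $depth >= $max_depth;    # do not queue links past the limit
        for my $child ( @{ $links_on{$uri} || [] } ) {
            next if $seen{$child}++;     # skip URIs seen before
            push @queue, [ $child, $depth + 1 ];
        }
    }
    return @visited;
}

my @order = crawl_sketch( '/', 2 );
print join( ' ', @order ), "\n";
```

With a limit of 2, the sketch visits /, /a, /b and /c but never reaches /d, which is three links away from the start.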
AUTHOR
Peter Karman, <perl@peknet.com>
COPYRIGHT AND LICENSE
Copyright 2008 by Peter Karman
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.