NAME
Gungho - Yet Another High Performance Web Crawler Framework
SYNOPSIS
    use Gungho;
    Gungho->run($config);
DESCRIPTION
Gungho is Yet Another Web Crawler Framework, aimed to be extensible and fast. It is meant to be a culmination of lessons learned while building Xango: Xango was *fast*, but it was horribly hard to debug. Gungho tries to build from clean structures, based upon principles from the likes of Catalyst and Plagger.
All components (engine, provider, handler) are overridable and switchable. A plugin mechanism is available to add hooks to be executed during the run.
WARNING: *ALL* APIs are still subject to change.
STRUCTURE
Gungho consists of three parts: a Provider, which provides Gungho with requests to process; a Handler, which handles the fetched page; and an Engine, which controls the entire process.
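The division of labor between the three parts can be sketched with plain closures. This is a toy illustration only, not Gungho's implementation: the hashes, the run_engine() function, the argument lists, and the fake "fetch" (no network access is performed) are all assumptions made for the example. Only the method names has_requests, get_requests, send_request and handle_response are taken from the METHODS section.

```perl
use strict;
use warnings;

# Hypothetical request queue for the toy provider.
my @queue = ('http://example.com/a', 'http://example.com/b');
my @handled;

# Provider: supplies requests to process.
my $provider = {
    has_requests => sub { return scalar @queue },
    get_requests => sub { return splice(@queue, 0, 1) },
};

# Handler: receives each fetched page.
my $handler = {
    handle_response => sub {
        my ($req, $res) = @_;
        push @handled, "$req => $res";
    },
};

# Engine: controls the whole process. The "fetch" here is faked.
sub run_engine {
    my ($provider, $handler) = @_;
    my $send_request = sub { return "fake body for $_[0]" };

    while ($provider->{has_requests}->()) {
        for my $req ($provider->{get_requests}->()) {
            my $res = $send_request->($req);
            $handler->{handle_response}->($req, $res);
        }
    }
    return;
}

run_engine($provider, $handler);
print "$_\n" for @handled;
```

Because each part only talks to the others through a small set of calls, any one of them can be swapped out without touching the rest, which is the point of keeping them separate.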
There are also "hooks". These hooks can be registered from anywhere by invoking the register_hook() method. They are run at particular points, which are specified when you call register_hook().
COMPONENTS
Components add new functionality to Gungho. Components are loaded at startup time from the config file / hash given to Gungho.
    Gungho->run({
        components => [
            'Throttle::Simple'
        ],
        throttle => {
            max_interval => ...,
        }
    });
Components modify Gungho's inheritance structure at run time to add extra functionality to Gungho, and therefore should only be loaded before starting the engine.
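To illustrate what modifying an inheritance structure at run time means in Perl, here is a minimal sketch using @ISA and the core mro module. It shows one common technique, not necessarily the one Gungho uses internally. All package names (My::Crawler, My::Component::Throttle, My::App) are hypothetical.

```perl
use strict;
use warnings;

package My::Crawler;                # hypothetical base class
sub send_request {
    my ($class, $url) = @_;
    return "GET $url";
}

package My::Component::Throttle;    # hypothetical component
use mro;
sub send_request {
    my ($class, $url) = @_;
    # Add behaviour, then continue to the next class in the chain.
    return "throttled " . $class->next::method($url);
}

package My::App;
use mro 'c3';
our @ISA = ('My::Crawler');

package main;

my $before = My::App->send_request('http://example.com/');

# Loading the "component": put it in front of the existing parents.
unshift @My::App::ISA, 'My::Component::Throttle';

my $after = My::App->send_request('http://example.com/');

print "$before\n";
print "$after\n";
```

The same call behaves differently before and after the @ISA change, which is why this kind of modification is best done once, before any processing starts.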
INLINE
If you only need a simple crawler, you may want to look at Gungho::Inline:
    Gungho::Inline->run({
        provider => sub { ... },
        handler  => sub { ... }
    });
See the manual for Gungho::Inline for details.
HOOKS
Currently available hooks are:
engine.send_request
engine.handle_response
METHODS
new($config)
This method has been deprecated. Use run() instead.
run
Starts the Gungho process. It requires either the name of a config file or a hashref.
has_feature($name)
Returns true if Gungho supports the feature named $name.
setup()
Sets up the Gungho environment, including calling the various setup_* methods to configure the provider, engine, handler, etc.
setup_components()
setup_engine()
setup_handler()
setup_log()
setup_provider()
setup_plugins()
Sets up the various components.
register_hook($hook_name => $coderef[, $hook_name => $coderef])
Registers a hook to be run under the specified $hook_name
run_hook($hook_name)
Runs all the hooks under the hook $hook_name
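The register/run pattern described above can be sketched with a small stand-alone registry. This is an illustrative stand-in, not Gungho's code: the HookRegistry package is hypothetical, and passing extra arguments through run_hook() to the hooks is an assumption of the sketch (the documented signature only names $hook_name).

```perl
use strict;
use warnings;

# Hypothetical registry mapping hook names to lists of code references.
package HookRegistry;

sub new { return bless { hooks => {} }, shift }

# Accepts one or more ($hook_name => $coderef) pairs.
sub register_hook {
    my ($self, @pairs) = @_;
    while (my ($name, $code) = splice(@pairs, 0, 2)) {
        push @{ $self->{hooks}{$name} }, $code;
    }
    return $self;
}

# Runs every hook registered under $name, in registration order.
sub run_hook {
    my ($self, $name, @args) = @_;
    $_->(@args) for @{ $self->{hooks}{$name} || [] };
    return;
}

package main;

my @seen;
my $registry = HookRegistry->new;
$registry->register_hook(
    'engine.send_request'    => sub { push @seen, "send: $_[0]" },
    'engine.handle_response' => sub { push @seen, "response: $_[0]" },
);

# Only the hooks registered under this name are run.
$registry->run_hook('engine.send_request', 'http://example.com/');
print "$_\n" for @seen;
```

Running a hook name that has nothing registered under it is a no-op in this sketch.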
has_requests
Delegates to the provider's has_requests.
get_requests
Delegates to the provider's get_requests.
handle_response
Delegates to the handler's handle_response.
dispatch_requests
Calls provider->dispatch.
prepare_request($req)
Given a request, prepares it before sending it to the engine.
send_request
Delegates to the engine's send_request.
load_config($config)
Loads the config from $config via Config::Any.
load_gungho_module($name, $prefix)
Loads a Gungho component. Prepends 'Gungho::$prefix::' to the module name, unless the name is prefixed with a '+'. In that case, no transformation is performed, and the module name is used as-is.
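The naming rule can be sketched as a small stand-alone function. resolve_module_name() is a hypothetical helper written for this example, not part of Gungho. The sketch assumes that the leading '+' is stripped before the name is used, and the 'Component' prefix and 'MyApp::Throttle' name are illustrative values.

```perl
use strict;
use warnings;

# Hypothetical helper: turns a component name into a full package name
# following the rule described for load_gungho_module().
sub resolve_module_name {
    my ($name, $prefix) = @_;

    # A leading '+' means "use the rest of the name unchanged"
    # (assumption: the '+' itself is removed).
    return $name if $name =~ s/^\+//;

    # Otherwise the module lives under Gungho::$prefix::
    return "Gungho::${prefix}::${name}";
}

print resolve_module_name('Throttle::Simple', 'Component'), "\n";
print resolve_module_name('+MyApp::Throttle', 'Component'), "\n";
```

This mirrors the convention found in other Perl frameworks, where short names are resolved inside the framework's namespace and '+' marks a fully qualified name.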
CODE
You can obtain the current code base from
http://gungho-crawler.googlecode.com/svn/trunk
AUTHOR
Copyright (c) 2007 Daisuke Maki <daisuke@endeworks.jp> All rights reserved.
CONTRIBUTORS
LICENSE
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
See http://www.perl.com/perl/misc/Artistic.html