NAME

Gungho - Yet Another High Performance Web Crawler Framework

SYNOPSIS

use Gungho;
Gungho->run($config);

DESCRIPTION

Gungho is Yet Another Web Crawler Framework, designed to be extensible and fast. It is meant to be the culmination of lessons learned while building Xango. Xango was *fast*, but it was horribly hard to debug. Gungho tries to build on clean structures, based upon principles from the likes of Catalyst and Plagger.

All components (engine, provider, handler) are overridable and switchable. A plugin mechanism is available to add hooks that are executed during the run.

WARNING: *ALL* APIs are still subject to change.

STRUCTURE

Gungho is composed of three parts: a Provider, which provides Gungho with requests to process; a Handler, which handles the fetched pages; and an Engine, which controls the entire process.
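To make the division of labour concrete, here is a minimal, self-contained sketch of how the three parts could interact. It does not use Gungho at all: the %provider and %handler hashes, the run_engine() function and the faked responses are hypothetical stand-ins, and only the method names has_requests, get_requests and handle_response are taken from this manual.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the three parts; none of this is Gungho's real code.
my @queue = ('http://example.com/a', 'http://example.com/b');
my @handled;

# The "provider" knows whether work is pending and hands it over.
my %provider = (
    has_requests => sub { scalar @queue },
    get_requests => sub { splice(@queue, 0) },    # return everything queued
);

# The "handler" receives each fetched response.
my %handler = (
    handle_response => sub {
        my ($req, $res) = @_;
        push @handled, "$req => $res";
    },
);

# The "engine" drives the loop: ask the provider for work, fetch it,
# and pass the result to the handler. Fetching is faked here so that
# the sketch runs without network access.
sub run_engine {
    while ($provider{has_requests}->()) {
        for my $req ($provider{get_requests}->()) {
            my $res = "fake response for $req";
            $handler{handle_response}->($req, $res);
        }
    }
}

run_engine();
print "$_\n" for @handled;
```

A real engine would fetch pages asynchronously rather than in a blocking loop; the sketch only illustrates which part is responsible for what.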

There are also "hooks". These hooks can be registered from anywhere by invoking the register_hook() method. They are run at particular points, which are specified when you call register_hook().
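The registration and dispatch pattern described above can be sketched as follows. This is a standalone illustration, not Gungho's implementation: the HookSketch package is hypothetical, and passing extra arguments through run_hook() to the callbacks is an assumption made for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a hook registry; not Gungho's real code.
package HookSketch;

my %hooks;    # hook name => array of coderefs, in registration order

# Accepts one or more ($hook_name => $coderef) pairs.
sub register_hook {
    my ($class, @pairs) = @_;
    while (my ($name, $code) = splice(@pairs, 0, 2)) {
        push @{ $hooks{$name} }, $code;
    }
}

# Calls every coderef registered under $name.
# Assumption: extra arguments are passed through to each callback.
sub run_hook {
    my ($class, $name, @args) = @_;
    $_->(@args) for @{ $hooks{$name} || [] };
}

package main;

my @seen;
HookSketch->register_hook(
    'engine.send_request'    => sub { push @seen, "send: $_[0]" },
    'engine.handle_response' => sub { push @seen, "recv: $_[0]" },
);

# Only the callbacks registered under this hook name are run.
HookSketch->run_hook('engine.send_request', 'http://example.com/');
print "$_\n" for @seen;
```

Keeping the callbacks in an array per hook name means several plugins can attach to the same point, and they run in the order in which they were registered.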

COMPONENTS

Components add new functionality to Gungho. Components are loaded at startup time from the config file or hash given to Gungho.

Gungho->run({
  components => [
    'Throttle::Simple'
  ],
  throttle => {
    max_interval => ...,
  }
});

Components modify Gungho's inheritance structure at run time to add extra functionality to Gungho, and therefore should only be loaded before starting the engine.

INLINE

If you only need a simple crawler, you may want to look at Gungho::Inline:

Gungho::Inline->run({
  provider => sub { ... },
  handler  => sub { ... }
});

See the manual for Gungho::Inline for details.

HOOKS

Currently available hooks are:

engine.send_request

engine.handle_response

METHODS

new($config)

This method has been deprecated. Use run() instead.

run

Starts the Gungho process. It requires either the name of a config file or a hashref.

has_feature($name)

Returns true if Gungho supports the feature named $name.

setup()

Sets up the Gungho environment, including calling the various setup_* methods to configure the provider, engine, handler, etc.

setup_components()

setup_engine()

setup_handler()

setup_log()

setup_provider()

setup_plugins()

Sets up the various components.

register_hook($hook_name => $coderef[, $hook_name => $coderef])

Registers a hook to be run under the specified $hook_name. Several hooks can be registered in one call by passing more than one $hook_name => $coderef pair.

run_hook($hook_name)

Runs all the hooks registered under $hook_name.

has_requests

Delegates to the provider's has_requests.

get_requests

Delegates to the provider's get_requests.

handle_response

Delegates to the handler's handle_response.

dispatch_requests

Calls provider->dispatch.

prepare_request($req)

Given a request, prepares it before it is sent to the engine.

send_request

Delegates to the engine's send_request.

load_config($config)

Loads the config from $config via Config::Any.

load_gungho_module($name, $prefix)

Loads a Gungho component. Prepends 'Gungho::$prefix::' to the module name, unless the name is prefixed with a '+'. In that case, no transformation is performed, and the module name is used as-is.
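The naming rule above can be illustrated with a small standalone function. resolve_gungho_module() is a hypothetical helper written for this example; it only computes the package name and loads nothing. Dropping the leading '+' from the result and the 'Component' prefix used in the calls are assumptions, as is the My::Crawler::Throttle package name.

```perl
use strict;
use warnings;

# Hypothetical helper: mirrors the naming rule described above,
# without actually loading any module.
sub resolve_gungho_module {
    my ($name, $prefix) = @_;

    # Assumption: a leading '+' marks a full package name; the '+' itself
    # is dropped and the rest is returned untouched.
    return $name if $name =~ s/^\+//;

    # Otherwise the name is placed under the Gungho::$prefix:: namespace.
    return "Gungho::${prefix}::${name}";
}

print resolve_gungho_module('Throttle::Simple', 'Component'), "\n";
# prints "Gungho::Component::Throttle::Simple"

print resolve_gungho_module('+My::Crawler::Throttle', 'Component'), "\n";
# prints "My::Crawler::Throttle"
```

The '+' escape lets a module that lives outside the Gungho:: namespace be named in a configuration without being renamed.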

CODE

You can obtain the current code base from

http://gungho-crawler.googlecode.com/svn/trunk

AUTHOR

Copyright (c) 2007 Daisuke Maki <daisuke@endeworks.jp> All rights reserved.

CONTRIBUTORS

Kazuho Oku
Keiichi Okabe

LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html