NAME

Gungho - Yet Another High Performance Web Crawler Framework

SYNOPSIS

use Gungho;
Gungho->run($config);

DESCRIPTION

Gungho is Yet Another Web Crawler Framework, designed to be extensible and fast. It's meant to be a culmination of lessons learned while building Xango -- Xango was *fast*, but it was horribly hard to debug or to extend (Gungho even works right out of the box ;)

Therefore, Gungho's main aim is to make it as easy as possible to write complex crawlers, while still keeping crawling *fast*. You can simply specify the URLs to fetch and some code to handle the responses -- we do the rest.

Gungho tries to build from clean structures, based upon principles from the likes of Catalyst and Plagger, so that you can easily extend it to your liking.

Features such as robot rules handling (robots.txt) and request throttling can be removed/added on the fly, just by specifying the components that you want to load. You can easily create additional functionality by writing your own component.

Gungho is still very fast -- it uses event-driven frameworks such as POE, Danga::Socket, and IO::Async as the main engine to drive requests. Choose the best engine for your needs: for example, if you plan on creating a POE-based handler to process the response, you might choose the POE engine, since it will fit nicely into the request cycle. However, do note that the most heavily exercised engine is POE. Danga::Socket and IO::Async work, but haven't been tested as thoroughly. Please send in requests and bug reports if you encounter any problems.

WARNING: *ALL* APIs are still subject to change.

STRUCTURE

Gungho is composed of three parts: a Provider, which provides Gungho with requests to process; a Handler, which handles the fetched page; and an Engine, which controls the entire process.
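The division of labour between the three parts can be pictured with a toy, synchronous loop. This is only a sketch and does not use Gungho's classes: the $provider, $handler, $engine and $fetch closures are hypothetical stand-ins, and a real engine is event driven rather than a plain while loop.

```perl
use strict;
use warnings;

# Toy stand-ins for the three parts (hypothetical, not Gungho's API).
my @queue = ('http://example.com/a', 'http://example.com/b');

# Provider: hands out the next request, or undef when none are left.
my $provider = sub { shift @queue };

# Handler: receives the fetched result for a request.
my @handled;
my $handler = sub {
    my ($url, $body) = @_;
    push @handled, "$url => $body";
};

# Engine: drives the cycle of asking for a request, fetching it,
# and passing the result to the handler.
my $engine = sub {
    my ($prov, $hand, $fetch) = @_;
    while (defined(my $url = $prov->())) {
        $hand->($url, $fetch->($url));
    }
};

# The "fetch" here is faked so that the sketch needs no network access.
$engine->($provider, $handler, sub { "fake body for $_[0]" });
```

After the loop finishes, @handled holds one entry per URL, in the order the provider returned them.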

There are also "hooks". These hooks can be registered from anywhere by invoking the register_hook() method. They are run at particular points, which are specified when you call register_hook().

All components (engine, provider, handler) are overridable and switchable. However, if you plan on customizing Gungho, you should be aware that it uses Class::C3 extensively, and hence you may see warnings about the code you use.

CONFIGURATION OPTIONS

debug
---
debug: 1

Setting debug to a non-zero value will trigger debug messages to be displayed.

block_private_ip_address
---
block_private_ip_address: 1

Setting this to a non-zero value causes addresses resolved via DNS lookups to be blocked if they resolve to a private IP address such as 192.168.1.1. Note that 127.0.0.1 is also considered a private IP.
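To make "private IP address" concrete, here is a minimal sketch of such a check. It is not Gungho's code: is_private_ipv4() is a hypothetical helper, and the ranges it tests (the RFC 1918 blocks plus the 127.0.0.0/8 loopback block) are an assumption, since this document only names 192.168.1.1 and 127.0.0.1 as examples.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Gungho): true if a dotted-quad IPv4
# address falls in a private or loopback range.
sub is_private_ipv4 {
    my ($ip) = @_;

    # Split into four octets; anything that is not a dotted quad is rejected.
    my @octets = $ip =~ /\A(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/
        or return 0;
    return 0 if grep { $_ > 255 } @octets;

    my ($first, $second) = @octets;
    return 1 if $first == 10;                                     # 10.0.0.0/8
    return 1 if $first == 172 && $second >= 16 && $second <= 31;  # 172.16.0.0/12
    return 1 if $first == 192 && $second == 168;                  # 192.168.0.0/16
    return 1 if $first == 127;                                    # loopback
    return 0;
}
```

With this helper, is_private_ipv4('192.168.1.1') and is_private_ipv4('127.0.0.1') are both true, while a public address such as 8.8.8.8 is not.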

COMPONENTS

Components add new functionality to Gungho. Components are loaded at startup time from the config file or hash given to the Gungho constructor.

Gungho->run({
  components => [
    'Throttle::Simple'
  ],
  throttle => {
    max_interval => ...,
  }
});

Components modify Gungho's inheritance structure at run time to add extra functionality to Gungho, and therefore should only be loaded before starting the engine.

Here are some available components. Check out the distribution for a current, complete list:

RobotRules

Handles collecting and parsing robots.txt, as well as rejecting requests based on the rules it provides.

Authentication::Basic

Handles basic auth automatically.

Throttle::Domain

Throttles requests based on the number of requests sent to a domain.

INLINE

If you only need a simple crawler, you may want to look at Gungho::Inline:

Gungho::Inline->run({
  provider => sub { ... },
  handler  => sub { ... }
});

See the manual for Gungho::Inline for details.

HOOKS

Currently available hooks are:

engine.send_request

engine.handle_response

METHODS

new($config)

This method has been deprecated. Use run() instead.

run

Starts the Gungho process. It requires either the name of a config file or a hashref.

has_feature($name)

Returns true if Gungho supports the feature named $name.

setup()

Sets up the Gungho environment, including calling the various setup_* methods to configure the provider, engine, handler, etc.

setup_components()

setup_engine()

setup_handler()

setup_log()

setup_provider()

setup_plugins()

Sets up the various components.

register_hook($hook_name => $coderef[, $hook_name => $coderef])

Registers a hook to be run under the specified $hook_name. Several name/coderef pairs may be passed in one call.

run_hook($hook_name)

Runs all the hooks registered under $hook_name.
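The register/run pattern can be sketched with a toy registry. This is not Gungho's implementation: the functions below are plain subs rather than methods, the %hooks storage is hypothetical, and passing extra arguments through to the hooks is an assumption, since the arguments a real hook receives are not documented here.

```perl
use strict;
use warnings;

# Toy registry (NOT Gungho's implementation): hook name => list of coderefs.
my %hooks;

# Accepts one or more name => coderef pairs, like the signature above.
sub register_hook {
    my (@pairs) = @_;
    while (my ($name, $code) = splice(@pairs, 0, 2)) {
        push @{ $hooks{$name} }, $code;
    }
}

# Runs every coderef registered under $name, in registration order.
sub run_hook {
    my ($name, @args) = @_;
    $_->(@args) for @{ $hooks{$name} || [] };
}

my @seen;
register_hook(
    'engine.send_request'    => sub { push @seen, "send: $_[0]" },
    'engine.handle_response' => sub { push @seen, "response: $_[0]" },
);

# Only the hooks registered under this name are run.
run_hook('engine.send_request', 'http://example.com/');
```

Running a hook name that has no registered coderefs is a no-op in this sketch.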

has_requests

Delegates to provider's has_requests

get_requests

Delegates to provider's get_requests

handle_response

Delegates to handler's handle_response

dispatch_requests

Calls provider->dispatch

prepare_request($req)

Given a request, preps it before sending it to the engine

send_request

Delegates to engine's send_request

load_config($config)

Loads the config from $config via Config::Any.

load_gungho_module($name, $prefix)

Loads a Gungho component. Prefixes the module name with 'Gungho::$prefix::', unless the name starts with a '+'. In that case, no transformation is performed, and the module name is used as-is.
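The naming rule can be illustrated with a small sketch. resolve_module_name() is a hypothetical helper, not Gungho's code, and it covers only the name resolution, not the loading. Stripping the leading '+' from the result is an assumption based on the same convention in other frameworks.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the naming rule; not Gungho's actual code.
sub resolve_module_name {
    my ($name, $prefix) = @_;

    # A leading '+' means "use the rest of the name verbatim"
    # (assumption: the '+' itself is dropped).
    return $name if $name =~ s/\A\+//;

    # Otherwise the name is placed under the Gungho::$prefix:: namespace.
    return "Gungho::${prefix}::${name}";
}
```

Under these assumptions, resolve_module_name('RobotRules', 'Component') gives 'Gungho::Component::RobotRules', and resolve_module_name('+My::Component', 'Component') gives 'My::Component'.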

HOW *NOT* TO USE Gungho

One last note about Gungho: don't use it if you are planning on accessing a single URL. It's usually not worth it, so you might as well use LWP::UserAgent or an equivalent module.

Gungho's event-driven engine works best when you are accessing hundreds, if not thousands, of URLs. It may in fact be slower than using LWP::UserAgent if you are accessing just a single URL.

Of course, you may wish to utilize features other than speed that Gungho provides, so at that point, it's simply up to you.

CODE

You can obtain the current code base from

http://gungho-crawler.googlecode.com/svn/trunk

AUTHOR

Copyright (c) 2007 Daisuke Maki <daisuke@endeworks.jp>

CONTRIBUTORS

Kazuho Oku
Keiichi Okabe

SEE ALSO

Gungho::Inline, Gungho::Component::RobotRules

LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html