NAME
Gungho - Yet Another High Performance Web Crawler Framework
SYNOPSIS
    use Gungho;
    Gungho->run($config);
DESCRIPTION
Gungho is Yet Another Web Crawler Framework, designed to be extensible and fast. It's meant to be a culmination of lessons learned while building Xango -- Xango was *fast*, but it was horribly hard to debug or to extend (Gungho even works right out of the box ;)
Therefore, Gungho's main aim is to make it as easy as possible to write complex crawlers, while still keeping crawling *fast*. You can simply specify the URLs to fetch and some code to handle the responses -- we do the rest.
Gungho tries to build from clean structures, based upon principles from the likes of Catalyst and Plagger, so that you can easily extend it to your liking.
Features such as robot rules handling (robots.txt) and request throttling can be removed/added on the fly, just by specifying the components that you want to load. You can easily create additional functionality by writing your own component.
Gungho is still very fast -- it uses event-driven frameworks such as POE, Danga::Socket, and IO::Async as the main engine to drive requests. Choose the best engine for your needs: for example, if you plan on creating a POE-based handler to process the response, you might choose the POE engine, since it will fit nicely into the request cycle. However, do note that the most heavily exercised engine is POE. Danga::Socket and IO::Async work, but haven't been tested as rigorously. Please send in requests and bug reports if you encounter any problems.
WARNING: *ALL* APIs are still subject to change.
STRUCTURE
Gungho consists of three parts: a Provider, which provides Gungho with requests to process; a Handler, which handles the fetched page; and an Engine, which controls the entire process.
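To make the division of labour concrete, here is a toy sketch of how the three parts could cooperate. The Toy::* class names, the synchronous loop, and the fake response string are all assumptions made for illustration; only the method names has_requests, get_requests, and handle_response are taken from the METHODS section, and real engines are event driven rather than a simple loop.

```perl
use strict;
use warnings;

# Toy stand-ins only; none of these classes are part of Gungho.
package Toy::Provider;
sub new          { my ($class, @urls) = @_; return bless { queue => [@urls] }, $class }
sub has_requests { return scalar @{ $_[0]{queue} } }
# Hands over every queued request and empties the queue.
sub get_requests { my $self = shift; return splice @{ $self->{queue} } }

package Toy::Handler;
sub new { return bless { handled => [] }, shift }
sub handle_response {
    my ($self, $req, $res) = @_;
    push @{ $self->{handled} }, "$req => $res";
    return;
}

package Toy::Engine;
sub new { return bless {}, shift }
# Pulls requests from the provider and passes each "response" to the handler.
sub run {
    my ($self, $provider, $handler) = @_;
    while ($provider->has_requests) {
        for my $req ($provider->get_requests) {
            my $res = 'fake response';    # a real engine would fetch the URL here
            $handler->handle_response($req, $res);
        }
    }
    return;
}

package main;

my $provider = Toy::Provider->new('http://example.com/a', 'http://example.com/b');
my $handler  = Toy::Handler->new;
Toy::Engine->new->run($provider, $handler);
print "$_\n" for @{ $handler->{handled} };
```

The point of the split is that each part can be swapped independently: a different provider changes where requests come from without touching how responses are handled.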
There are also "hooks". These hooks can be registered from anywhere by invoking the register_hook() method. They are run at particular points, which are specified when you call register_hook().
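The hook mechanism described above can be pictured as a registry that maps hook names to lists of callbacks. The following is a minimal sketch under stated assumptions: My::HookRegistry is a hypothetical class, not Gungho's implementation, and the arguments passed to each callback are invented for the example. Only the method names and the two hook names come from this document.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a hook registry; NOT Gungho's actual code.
package My::HookRegistry;

sub new { return bless { hooks => {} }, shift }

# Accepts one or more ($hook_name => $coderef) pairs.
sub register_hook {
    my ($self, @pairs) = @_;
    while (my ($name, $code) = splice(@pairs, 0, 2)) {
        push @{ $self->{hooks}{$name} }, $code;
    }
    return $self;
}

# Runs every callback registered under $name, in registration order.
# Passing ($self, @args) to the callbacks is an assumption of this sketch.
sub run_hook {
    my ($self, $name, @args) = @_;
    $_->($self, @args) for @{ $self->{hooks}{$name} || [] };
    return;
}

package main;

my @seen;
my $registry = My::HookRegistry->new;
$registry->register_hook(
    'engine.send_request'    => sub { push @seen, "send: $_[1]" },
    'engine.handle_response' => sub { push @seen, "recv: $_[1]" },
);
$registry->run_hook('engine.send_request',    'http://example.com/');
$registry->run_hook('engine.handle_response', 'http://example.com/');
print "$_\n" for @seen;
```

Running a hook name that has no registered callbacks is a no-op in this sketch, which keeps callers from having to check first.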
All components (engine, provider, handler) are overridable and switchable. However, if you plan on customizing them, be aware that Gungho uses Class::C3 extensively, and hence you may see warnings about the code you use.
CONFIGURATION OPTIONS
- debug
    ---
    debug: 1
Setting debug to a non-zero value will trigger debug messages to be displayed.
- block_private_ip_address
    ---
    block_private_ip_address: 1
Setting this to a non-zero value blocks addresses resolved via DNS lookups if they resolve to a private IP address such as 192.168.1.1. Note that 127.0.0.1 is also considered a private IP.
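As an illustration of what a private address check involves, here is a small sketch. The helper is_private_ipv4 is hypothetical and not part of Gungho, and the exact set of ranges Gungho tests is an assumption: this version covers the RFC 1918 ranges plus loopback, matching the note above that 127.0.0.1 counts as private.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Gungho: returns 1 if a dotted-quad IPv4
# address falls in a private (RFC 1918) or loopback range, 0 otherwise.
sub is_private_ipv4 {
    my ($ip) = @_;
    my @octets = $ip =~ /\A(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/
        or return 0;                       # not a dotted quad
    return 0 if grep { $_ > 255 } @octets; # octet out of range

    my ($first, $second) = @octets;
    return 1 if $first == 10;                                     # 10.0.0.0/8
    return 1 if $first == 172 && $second >= 16 && $second <= 31;  # 172.16.0.0/12
    return 1 if $first == 192 && $second == 168;                  # 192.168.0.0/16
    return 1 if $first == 127;                                    # 127.0.0.0/8 loopback
    return 0;
}

for my $ip (qw(192.168.1.1 127.0.0.1 10.20.30.40 172.32.0.1 93.184.216.34)) {
    printf "%-15s %s\n", $ip, is_private_ipv4($ip) ? 'blocked' : 'allowed';
}
```

The check has to run on the resolved address rather than the hostname, since a public-looking name can still resolve to an internal host.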
COMPONENTS
Components add new functionality to Gungho. Components are loaded at startup time from the config file / hash given to the Gungho constructor.
    Gungho->run({
        components => [
            'Throttle::Simple'
        ],
        throttle => {
            max_interval => ...,
        }
    });
Components modify Gungho's inheritance structure at run time to add extra functionality to Gungho, and therefore should only be loaded before starting the engine.
Here are some available components. Check the distribution for a current, complete list:
- RobotRules
Handles collecting and parsing robots.txt, as well as rejecting requests based on the rules it provides.
- Authentication::Basic
Handles basic auth automatically.
- Throttle::Domain
Throttles requests based on the number of requests sent to a domain.
INLINE
If you only need a simple crawler, you may want to look at Gungho::Inline:
    Gungho::Inline->run({
        provider => sub { ... },
        handler => sub { ... }
    });
See the manual for Gungho::Inline for details.
HOOKS
Currently available hooks are:
engine.send_request
engine.handle_response
METHODS
new($config)
This method has been deprecated. Use run() instead.
run
Starts the Gungho process. It requires either the name of a config file or a hashref.
has_feature($name)
Returns true if Gungho supports some feature $name
setup()
Sets up the Gungho environment, including calling the various setup_* methods to configure the provider, engine, handler, etc.
setup_components()
setup_engine()
setup_handler()
setup_log()
setup_provider()
setup_plugins()
Sets up the various components.
register_hook($hook_name => $coderef[, $hook_name => $coderef])
Registers a hook to be run under the specified $hook_name
run_hook($hook_name)
Runs all the hooks under the hook $hook_name
has_requests
Delegates to provider's has_requests
get_requests
Delegates to provider's get_requests
handle_response
Delegates to handler's handle_response
dispatch_requests
Calls provider->dispatch
prepare_request($req)
Given a request, preps it before sending it to the engine
send_request
Delegates to engine's send_request
load_config($config)
Loads the config from $config via Config::Any.
load_gungho_module($name, $prefix)
Loads a Gungho component. Prepends 'Gungho::$prefix::' to the module name, unless the name is prefixed with a '+'. In that case, no transformation is performed, and the module name is used as-is.
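The naming rule can be sketched as a small function. Note that resolve_gungho_module is a hypothetical helper written for this example: it only computes the name and loads nothing. It also assumes that the leading '+' is a marker that gets stripped, since a Perl package name cannot begin with '+'. The 'Engine' prefix and the custom class name are illustrative values.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the naming rule only; it does not load
# any module and is not Gungho's implementation.
sub resolve_gungho_module {
    my ($name, $prefix) = @_;
    # Assumption: a leading '+' marks a full class name and is stripped.
    return $name if $name =~ s/\A\+//;
    return "Gungho::${prefix}::${name}";
}

print resolve_gungho_module('POE', 'Engine'), "\n";                  # Gungho::Engine::POE
print resolve_gungho_module('+My::Custom::Engine', 'Engine'), "\n";  # My::Custom::Engine
```

This convention lets short names refer to bundled modules while still allowing a class from any namespace to be plugged in.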
HOW *NOT* TO USE Gungho
One last note about Gungho: don't use it if you are planning on accessing a single URL. It's usually not worth it, so you might as well use LWP::UserAgent or an equivalent module.
Gungho's event-driven engine works best when you are accessing hundreds, if not thousands, of URLs. It may in fact be slower than using LWP::UserAgent if you are accessing just a single URL.
Of course, you may wish to utilize features other than speed that Gungho provides, so at that point, it's simply up to you.
CODE
You can obtain the current code base from
http://gungho-crawler.googlecode.com/svn/trunk
AUTHOR
Copyright (c) 2007 Daisuke Maki <daisuke@endeworks.jp>
CONTRIBUTORS
SEE ALSO
Gungho::Inline, Gungho::Component::RobotRules
LICENSE
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
See http://www.perl.com/perl/misc/Artistic.html