NAME
Xango::Broker - Broker HTTP Requests
SYNOPSIS
use Xango::Broker;
use MyHandler;
MyHandler->spawn();
Xango::Broker->spawn();
POE::Kernel->run();
# or,
xango -h MyHandler
DESCRIPTION
Xango is a generic web crawler framework written using POE (http://poe.perl.org), a cooperative multitasking framework.
Xango::Broker is Xango's main POE component, but it doesn't do much by itself. Instead, you need to write a handler that does the application-specific work, which is where most of the interesting bits happen.
Xango::Broker is mainly responsible for three things: (1) setting up the general environment, (2) providing the processing pipeline for the most common crawler behavior, and (3) handling the HTTP fetches as well as their states. Your handler is part of (2) above, as the component responsible for the following:
- Provide the data to fetch
-
You need to tell Xango::Broker what to fetch :)
- Handle the HTTP response.
-
...And you need to process the response that you get after Xango::Broker fetches the requested URI.
Please see the section HANDLER API for more details.
CONFIGURATION VARIABLES
Configuration variables are written in YAML format. Please see the documentation for YAML for more information on how to write the configuration file.
If your custom web crawler requires more configuration parameters, you can safely add them to the same config file, as long as they do not clash with a parameter name that is required by Xango::Broker.
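One way to guard against such a clash is to check your own parameter names against the ones documented in this section before adding them to the config file. The following is a minimal sketch and not part of Xango: find_clashes() is a hypothetical helper, and the reserved list simply repeats the parameter names described below.

```perl
use strict;
use warnings;

# Parameter names documented in this section as used by Xango::Broker.
my %RESERVED = map { $_ => 1 } qw(
    HttpComponentClass HttpComponentArgs
    DnsCacheClass      DnsCacheArgs
    MaxHttpAgents      MaxSilenceInterval
    JobRetrievalDelay  ReloadConfig
);

# Hypothetical helper (not part of Xango): returns the application-specific
# names that collide with a documented Xango::Broker parameter.
sub find_clashes {
    my @names   = @_;
    my @clashes = sort grep { $RESERVED{$_} } @names;
    return @clashes;
}

# Example parameter names are made up for illustration.
my @clashes = find_clashes(qw(MyCrawlDepth MaxHttpAgents MyUserAgent));
print "Clashing names: @clashes\n" if @clashes;
```

Prefixing your own parameters (for example with "My") is a simple convention that makes accidental clashes unlikely.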
To use these configuration variables, you need to use Xango::Config:
use Xango::Config qw(filename.conf);
# or
Xango::Config->init('filename.conf');
or you can pass it to Xango::Broker's spawn() method:
Xango::Broker->spawn(conf => 'filename.conf');
Once initialized, you may refer to the same Xango::Config instance from anywhere in your code. Please see Xango::Config for more details.
HttpComponentClass (string)
Class name of the POE component that handles HTTP communication. You may specify any class, as long as its interface matches that of POE::Component::Client::HTTP.
Defaults to 'POE::Component::Client::HTTP'.
HttpComponentArgs (list or hash)
Arguments that are passed to the spawn() method of the HTTP component class. You almost always want to specify the 'Timeout' parameter if you're using POE::Component::Client::HTTP (or the like).
Note that you may not specify the 'Alias' parameter, which is used internally by Xango::Broker. If you specify it, it will be silently ignored.
DnsCacheClass (string)
Class name of the cache object to hold DNS query results. Defaults to Cache::FileCache.
DnsCacheArgs (hash)
Arguments to pass to the cache constructor. You must provide this if you are using anything other than Cache::FileCache as your cache class.
MaxHttpAgents (integer)
The number of concurrent HTTP agents (i.e. the number of POE::Component::Client::HTTP sessions) that are allowed. The default is 10, but for anything other than a toy application, a value on the order of 50 to 100 is recommended.
Unless this number is less than 10, the broker starts with 10 sessions and grows the pool of agents whenever there are not enough agents to handle the currently available jobs, until the maximum is reached.
If the maximum is less than 10, the starting number is equal to the maximum.
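The sizing rule described above can be sketched as follows. This is an illustration only, not the broker's actual code: initial_agents() and should_grow() are hypothetical names, and treating "not enough agents" as "more pending jobs than agents" is an assumption.

```perl
use strict;
use warnings;

use constant DEFAULT_START => 10;   # starting pool size described above

# Hypothetical helper: number of agents to start with for a given
# MaxHttpAgents value.
sub initial_agents {
    my ($max) = @_;
    return $max < DEFAULT_START ? $max : DEFAULT_START;
}

# Hypothetical helper: true when the pool may grow, i.e. pending jobs
# outnumber agents (an assumption) and the maximum has not been reached.
sub should_grow {
    my ($agents, $pending_jobs, $max) = @_;
    return ($pending_jobs > $agents && $agents < $max) ? 1 : 0;
}

printf "MaxHttpAgents=%d starts with %d agents\n", $_, initial_agents($_)
    for (5, 10, 50);
```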
MaxSilenceInterval (integer)
The number of seconds that an agent is allowed to be inactive. Once a fetcher session has been inactive for this long, the session is stopped via detach_child(). The default is 300 seconds.
JobRetrievalDelay (integer)
The number of seconds to wait between calls to the 'retrieve_jobs' state of the handler session. The default is 15 seconds.
ReloadConfig (integer)
The number of seconds to wait before reloading configuration parameters from the config file. If set to 0, reload is disabled.
HANDLER API
The handler, which is where your application-specific logic goes, must implement the events listed below.
Note that the handler must be aliased as 'handler'. Don't forget to put something like this in your handler session's _start() method so that the alias is set properly:
sub _start
{
    my $kernel = $_[KERNEL];
    $kernel->alias_set('handler');
}

sub _stop
{
    my $kernel = $_[KERNEL];
    $kernel->alias_remove('handler');
}
- retrieve_jobs
-
This state is responsible for retrieving the jobs to be processed by Xango from wherever you decide to store your original data (RDBMS, file system, manual user input, etc).
It should return a list of hashrefs, each of which must contain at least one element named 'uri'. You may add any other elements, except 'id', 'fetcher', 'host_ip', and 'host_name', which are used internally by Xango. (However, you are welcome to use these values as read-only variables.)
sub retrieve_jobs
{
    my @jobs_to_be_processed;
    # get_next_uri(), $my_var and $my_other_var stand in for your own code
    while (my $uri = get_next_uri()) {
        push @jobs_to_be_processed, {
            uri          => $uri,
            my_var       => $my_var,
            my_other_var => $my_other_var,
        };
    }
    return @jobs_to_be_processed;
}
This state is called as a synchronous call via POE::Kernel->call(), so don't take forever to get the jobs to be processed!
- apply_policy
-
This receives a job hash and is supposed to decide whether that particular job should be processed at all. Use this to apply blacklist-style policy rules at the broker level (NOTE: if at all possible, do this at the storage level, such as in an RDBMS server's stored procedure, as complicated policies will probably slow the broker down significantly).
At the very least, if you are not applying any policies, write a stub pass-through state like the one below that simply calls the next state in the processing chain:
sub apply_policy
{
    my($kernel, $job) = @_[KERNEL, ARG0];
    $kernel->post('broker', 'send_fetcher', $job);
}
Note that you *have* to post to 'send_fetcher' in order for the job to be processed at all. If you do not wish to process this job, post to the broker session's 'finalize_job' state instead:
sub apply_policy
{
    my($kernel, $job) = @_[KERNEL, ARG0];
    if ( $DONT_PROCESS ) {
        $kernel->post('broker', 'finalize_job', $job);
    } else {
        $kernel->post('broker', 'send_fetcher', $job);
    }
}
The job hash will be available in ARG0.
- handle_response
-
As the name states, this state should handle the job, after the job's URI has been fetched. The HTTP::Response object is stored under the 'http_response' slot in the job, and you are free to do whatever you want with it -- because Xango doesn't do anything else with that job after this state.
It is up to you to cook this piece of data, and store the results somewhere (or, discard them).
The job hash will be available in ARG0.
- finalize_job
-
This is sort of like a destructor for the job. The broker does its own cleanup, and then sends the job to the handler's 'finalize_job' state so that application-specific cleanup can be performed.
The job hash will be available in ARG0.
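The constraints on job hashes described under retrieve_jobs can be checked before the jobs are returned to the broker. The sketch below is illustrative only: job_problems() is a hypothetical helper, not part of Xango, and the sample jobs are made up.

```perl
use strict;
use warnings;

# Keys that the retrieve_jobs description says are used internally by Xango.
my @INTERNAL_KEYS = qw(id fetcher host_ip host_name);

# Hypothetical helper (not part of Xango): returns a list of problems found
# in a job hashref; an empty list means the job looks acceptable.
sub job_problems {
    my ($job) = @_;
    return ('job is not a hash reference') unless ref($job) eq 'HASH';

    my @problems;
    push @problems, "missing required 'uri' element"
        unless defined($job->{uri}) && length($job->{uri});
    push @problems, map  { "'$_' is reserved for internal use" }
                    grep { exists $job->{$_} } @INTERNAL_KEYS;
    return @problems;
}

# Made-up sample jobs: the first is acceptable, the second is not.
my @jobs = (
    { uri => 'http://www.example.com/', my_var => 1 },
    { id  => 42 },
);
for my $job (@jobs) {
    my @problems = job_problems($job);
    print @problems ? "rejected: @problems\n" : "ok: $job->{uri}\n";
}
```

A check like this belongs in your own retrieve_jobs code, since the reserved keys are filled in by Xango later and may then be read by your handler.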
TODO
BUGS
Plenty, I'm sure. Please report bugs via RT http://rt.cpan.org/NoAuth/Bugs.html?Dist=Xango
SEE ALSO
AUTHOR
Copyright 2005 Daisuke Maki <dmaki@cpan.org>. All rights reserved. Development funded by Brazil, Ltd. <http://b.razil.jp>