NAME
WWW::Robot - configurable web traversal engine (for web robots & agents)
SYNOPSIS
use WWW::Robot;
$robot = new WWW::Robot('NAME'    => 'MyRobot',
                        'VERSION' => '1.000',
                        'EMAIL'   => 'fred@foobar.com');
# ... configure the robot's operation ...
$robot->run('http://www.foobar.com/');
DESCRIPTION
This module implements a configurable web traversal engine, for a robot or other web agent. Given an initial web page (URL), the Robot will get the contents of that page, and extract all links on the page, adding them to a list of URLs to visit.
Features of the Robot module include:
Follows the Robot Exclusion Protocol.
Supports the META element proposed extensions to the Protocol.
Implements many of the Guidelines for Robot Writers.
Configurable.
Builds on standard Perl 5 modules for WWW, HTTP, HTML, etc.
A particular application (robot instance) has to configure the engine using hooks, which are Perl functions invoked by the Robot engine at specific points in the control loop.
The robot engine obeys the Robot Exclusion protocol, as well as a proposed addition. See "SEE ALSO" for references to documents describing the Robot Exclusion protocol and web robots.
QUESTIONS
This section contains a number of questions. I'm interested in hearing what people think, and what you've done faced with similar questions.
What style of API is preferable for setting attributes? Maybe something like the following:
$robot->verbose(1);
$traversal = $robot->traversal();
I.e. a method for setting and getting each attribute, depending on whether you passed an argument?
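The accessor style asked about above can be pictured as one method per attribute that sets the value when given an argument and always returns the current value. The sketch below is illustrative only: the package name Sketch::Robot and its internals are hypothetical, and WWW::Robot itself documents setAttribute() and getAttribute() rather than these methods.

```perl
use strict;
use warnings;

package Sketch::Robot;    # hypothetical stand-in, not WWW::Robot

sub new {
    my ($class, %attr) = @_;
    # Defaults mirror those documented under ROBOT ATTRIBUTES.
    my $self = { VERBOSE => 0, TRAVERSAL => 'depth', %attr };
    return bless $self, $class;
}

# Generate one combined getter/setter per attribute.
for my $name (qw(VERBOSE TRAVERSAL)) {
    my $method = lc $name;
    no strict 'refs';
    *{"Sketch::Robot::$method"} = sub {
        my $self = shift;
        $self->{$name} = shift if @_;    # set only when an argument is passed
        return $self->{$name};           # always return the current value
    };
}

package main;

my $robot = Sketch::Robot->new;
$robot->verbose(1);
my $traversal = $robot->traversal();
print "verbose=", $robot->verbose, " traversal=$traversal\n";
```

A single method per attribute keeps call sites short, at the cost of one generated method for every attribute the robot supports.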
Should the robot module support a standard logging mechanism? For example, a LOGFILE attribute, which is set to either a filename or a filehandle reference. This would need a useful file format.
Should the AGENT be an attribute, so you can set this to whatever UserAgent object you want to use? Then if the attribute is not set by the first time the run() method is invoked, we'd fall back on the default.
Should TMPDIR and WORKFILE be attributes? I don't see any big reason why they should, but someone else's application might benefit.
Should the module also support an ERRLOG attribute, with all warnings and error messages sent there?
At the moment the robot will print warnings and error messages to stderr, as well as returning error status. Should this behaviour be configurable? I.e. the ability to turn off warnings.
The basic architecture of the Robot is as follows:
Hook: restore-state
Get Next URL
    Hook: invoke-on-all-url
    Hook: follow-url-test
    Hook: invoke-on-followed-url
    Get contents of URL
    Hook: invoke-on-contents
    Skip if not HTML
    Foreach link on page:
        Hook: invoke-on-link
        Add link to robot's queue
Continue? Hook: continue-test
Hook: save-state
Hook: generate-report
Each of the hook procedures and functions is described below. A robot must provide a follow-url-test hook, and at least one of the following:
invoke-on-all-url
invoke-on-followed-url
invoke-on-contents
invoke-on-link
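To make the control flow above concrete, here is a minimal, self-contained simulation of the loop. It is a sketch and not the module's implementation: the functions add_hook, run_hooks and run, the %web table standing in for real GET requests, and the placeholder $robot are all hypothetical, and details such as when the continue test is skipped are assumptions.

```perl
use strict;
use warnings;

# Toy "web": page => links found on it (hypothetical data, no network).
my %web = (
    'a.html' => [ 'b.html', 'c.gif' ],
    'b.html' => [ 'a.html' ],
);

my $robot = {};    # placeholder for the robot object passed to hooks
my %hooks;         # hook name => code refs, in registration order
my @log;

sub add_hook {
    my ($name, $code) = @_;
    push @{ $hooks{$name} }, $code;
}

sub run_hooks {
    my ($name, @args) = @_;
    return map { $_->($robot, $name, @args) } @{ $hooks{$name} || [] };
}

add_hook('follow-url-test',        sub { $_[2] =~ /\.html$/ ? 1 : 0 });
add_hook('invoke-on-followed-url', sub { push @log, "follow $_[2]" });
add_hook('invoke-on-link',         sub { push @log, "link $_[2] -> $_[3]" });

sub run {
    my @queue = @_;
    my %seen;
    run_hooks('restore-state');
    while (defined(my $url = shift @queue)) {
        next if $seen{$url}++;                        # handle each URL once
        run_hooks('invoke-on-all-url', $url);
        next unless grep { $_ } run_hooks('follow-url-test', $url);
        run_hooks('invoke-on-followed-url', $url);
        my $links = $web{$url} or next;               # stands in for the GET
        run_hooks('invoke-on-contents', $url, $links);
        for my $link (@$links) {
            run_hooks('invoke-on-link', $url, $link);
            push @queue, $link;
        }
        my @verdict = run_hooks('continue-test');
        last if @verdict && !(grep { $_ } @verdict);  # stop only if a hook says so
    }
    run_hooks('save-state');
    run_hooks('generate-report');
}

run('a.html');
print "$_\n" for @log;
```

Hooks that are never registered simply do nothing here, which is why the sketch runs with only three of them supplied.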
CONSTRUCTOR
$robot = new WWW::Robot( <attribute-value-pairs> );
Create a new robot engine instance. If the constructor fails for any reason, a warning message will be printed, and undef will be returned.
Having created a new robot, it should be configured using the methods described below. Certain attributes of the Robot can be set during creation; they can be (re)set after creation, using the setAttribute() method.
The attributes of the Robot are described below, in the ROBOT ATTRIBUTES section.
METHODS
run
$robot->run( LIST );
Invokes the robot, initially traversing the root URLs provided in LIST, and any which have been provided with the addUrl() method before invoking run(). If you have not correctly configured the robot, the method will return undef.
The initial set of URLs can either be passed as arguments to the run() method, or with the addUrl() method before you invoke run(). Each URL can be specified either as a string, or as a URI::URL object.
Before invoking this method, you should have provided at least some of the hook functions. See the example given in the EXAMPLES section below.
By default the run() method will iterate until there are no more URLs in the queue. You can override this behavior by providing a continue-test hook function, which checks for the termination conditions. This particular hook function, and use of hook functions in general, are described below.
setAttribute
$robot->setAttribute( ... attribute-value-pairs ... );
Change the value of one or more robot attributes. Attributes are identified using a string, and take scalar values. For example, to specify the name of your robot, you set the NAME attribute:
$robot->setAttribute('NAME' => 'WebStud');
The supported attributes for the Robot module are listed below, in the ROBOT ATTRIBUTES section.
getAttribute
$value = $robot->getAttribute('attribute-name');
Queries a Robot for the value of an attribute. For example, to query the version number of your robot, you would get the VERSION attribute:
$version = $robot->getAttribute('VERSION');
The supported attributes for the Robot module are listed below, in the ROBOT ATTRIBUTES section.
addUrl
$robot->addUrl( $url1, ..., $urlN );
Used to add one or more URLs to the queue for the robot. Each URL can be passed as a simple string, or as a URI::URL object.
Returns True (non-zero) if all URLs were successfully added, False (zero) if at least one of the URLs could not be added.
addHook
$robot->addHook($hook_name, \&hook_function);
sub hook_function { ... }
Register a hook function which should be invoked by the robot at a specific point in the control flow. There are a number of hook points in the robot, which are identified by a string. For a list of hook points, see the SUPPORTED HOOKS section below.
If you provide more than one function for a particular hook, then the hook functions will be invoked in the order they were added. I.e. the first hook function called will be the first hook function you added.
proxy, no_proxy, env_proxy
These are convenience functions for setting proxy information on the user agent being used to make the requests.
$robot->proxy( protocol, proxy );
Used to specify a proxy for the given scheme. The protocol argument can be a reference to a list of protocols.
$robot->no_proxy(domain1, ... domainN);
Specifies that proxies should not be used for the specified domains or hosts.
$robot->env_proxy();
Load proxy settings from protocol_proxy environment variables: ftp_proxy, http_proxy, no_proxy, etc.
ROBOT ATTRIBUTES
This section lists the attributes used to configure a Robot object. Attributes are set using the setAttribute() method, and queried using the getAttribute() method.
Some of the attributes must be set before you start the Robot (with the run() method). These are marked as mandatory in the list below.
- NAME
-
The name of the Robot. This should be a sequence of alphanumeric characters, and is used to identify your Robot. This is used to set the User-Agent field of HTTP requests, and so will appear in server logs.
mandatory
- VERSION
-
The version number of your Robot. This should be a floating point number, in the format N.NNN.
mandatory
- EMAIL
-
A valid email address which can be used to contact the Robot's owner, for example by someone who wishes to complain about the behavior of your robot.
mandatory
- VERBOSE
-
A boolean flag which specifies whether the Robot should display verbose status information as it runs.
Default: 0 (false)
- TRAVERSAL
-
Specifies what traversal style should be adopted by the Robot. Valid values are depth and breadth.
Default: depth
- REQUEST_DELAY
-
Specifies the delay (in minutes) between successive GETs from the same server.
Default: 1
- IGNORE_TEXT
-
Specifies whether the HTML structure passed to the invoke-on-contents hook function should include the textual content of the page, or just the HTML elements.
Default: 1 (true)
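The difference between the two TRAVERSAL values can be illustrated with a small queue experiment. This is a sketch under assumptions: the %links table and the visit_order function are hypothetical, and it assumes that depth-first means newly found links go to the front of the queue while breadth-first means they go to the back. The module's own queue handling may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical link structure: page => pages it links to.
my %links = (
    root => [ 'a', 'b' ],
    a    => [ 'a1', 'a2' ],
    b    => [ 'b1' ],
);

sub visit_order {
    my ($style, $start) = @_;
    my @queue = ($start);
    my (%seen, @order);
    while (@queue) {
        my $page = shift @queue;
        next if $seen{$page}++;
        push @order, $page;
        my @found = @{ $links{$page} || [] };
        if ($style eq 'depth') {
            unshift @queue, @found;    # newest links are visited first
        }
        else {
            push @queue, @found;       # links wait their turn (breadth)
        }
    }
    return @order;
}

print "depth:   @{[ visit_order('depth',   'root') ]}\n";
print "breadth: @{[ visit_order('breadth', 'root') ]}\n";
```

With this data the depth order is root a a1 a2 b b1, and the breadth order is root a b a1 a2 b1.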
SUPPORTED HOOKS
This section lists the hooks which are supported by the WWW::Robot module. The first two arguments passed to a hook function are always the Robot object followed by the name of the hook being invoked. I.e. the start of a hook function should look something like:
sub my_hook_function
{
    my $robot = shift;
    my $hook = shift;
    # ... other, hook-specific, arguments
}
Wherever a hook function is passed a $url argument, this will be a URI::URL object, with the URL fully specified. I.e. even if the URL was seen in a relative link, it will be passed as an absolute URL.
restore-state
sub hook { my($robot, $hook_name) = @_; }
This hook is invoked just before entering the main iterative loop of the robot. The intention is that the hook will be used to restore state, if such an operation is required.
This can be helpful if the robot is running in an incremental mode, where state is saved between each run of the robot.
invoke-on-all-url
sub hook { my($robot, $hook_name, $url) = @_; }
This hook is invoked on all URLs seen by the robot, regardless of whether the URL is actually traversed. In addition to the standard $robot and $hook arguments, the third argument is $url, which is the URL seen by the robot.
For a given URL, the hook function will be invoked at most once, regardless of how many times the URL is seen by the Robot. If you are interested in seeing the URL every time, you can use the invoke-on-link hook.
follow-url-test
sub hook { my($robot, $hook_name, $url) = @_; return $boolean; }
This hook is invoked to determine whether the robot should traverse the given URL. If the hook function returns 0 (zero), then the robot will do nothing further with the URL. If the hook function returns non-zero, then the robot will get the contents of the URL, invoke further hooks, and extract links if the contents are HTML.
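A follow-url-test hook along these lines might look as follows. It is only a sketch: the $site value is borrowed from the SYNOPSIS, the function name follow_test is arbitrary, and the hook relies on the URL object stringifying to the absolute URL (as the Validating Robot example also does), so plain strings can stand in for URI::URL objects when trying it out.

```perl
use strict;
use warnings;

my $site = 'http://www.foobar.com/';    # site root, taken from the SYNOPSIS

sub follow_test {
    my ($robot, $hook, $url) = @_;
    my $text = "$url";                                         # stringify the URL
    return 0 unless $text =~ m{^http://}i;                     # HTTP only
    return 0 if $text =~ /\.(gif|jpg|png|xbm|au|wav|mpg)$/i;   # skip media files
    return 0 unless index($text, $site) == 0;                  # stay on our site
    return 1;
}

# With the module installed, the hook would be registered like this:
#   $robot->addHook('follow-url-test', \&follow_test);

for my $url ('http://www.foobar.com/docs/index.html',
             'http://www.foobar.com/logo.gif',
             'http://elsewhere.example/index.html') {
    my $verdict = follow_test(undef, 'follow-url-test', $url) ? 'follow' : 'skip';
    printf "%-40s %s\n", $url, $verdict;
}
```

Using index() for the site check avoids treating characters in the site URL as regular expression metacharacters.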
invoke-on-followed-url
sub hook { my($robot, $hook_name, $url) = @_; }
This hook is invoked on URLs which are about to be traversed by the robot; i.e. URLs which have passed the follow-url-test hook.
invoke-on-get-error
sub hook { my($robot, $hook_name, $url, $response) = @_; }
This hook is invoked if the Robot ever fails to get the contents of a URL. The $response argument is an object of type HTTP::Response.
invoke-on-contents
sub hook { my($robot, $hook, $url, $response, $structure, $filename) = @_; }
This hook function is invoked for all URLs for which the contents are successfully retrieved.
The $url argument is a URI::URL object for the URL currently being processed by the Robot engine.
The $response argument is an HTTP::Response object, the result of the GET request on the URL.
The $structure argument is an HTML::Element object which is the root of a tree structure constructed from the contents of the URL. You can set the IGNORE_TEXT attribute to specify whether the structure passed includes the textual content of the page, or just the HTML elements.
The $filename argument is the path to a local temporary file which contains a local copy of the URL contents. You cannot assume that the file will exist after control has returned from your hook function.
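Because the temporary file may disappear once the hook returns, a hook that needs the contents later has to take its own copy. The following sketch shows one way to do that with core modules only. The names keep_contents, $archive_dir and %saved are hypothetical, and the demonstration at the end calls the hook by hand with a throwaway file instead of running a robot.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy qw(copy);
use File::Spec;
use File::Temp qw(tempdir tempfile);

my $archive_dir = tempdir(CLEANUP => 1);    # hypothetical archive location
my %saved;                                  # URL => path of our own copy

sub keep_contents {
    my ($robot, $hook, $url, $response, $structure, $filename) = @_;
    return unless defined $filename && -f $filename;
    my $target = File::Spec->catfile($archive_dir, basename($filename));
    copy($filename, $target)
        or do { warn "copy failed for $url: $!\n"; return };
    $saved{"$url"} = $target;               # remember where the copy lives
    return;
}

# With the module installed, the hook would be registered like this:
#   $robot->addHook('invoke-on-contents', \&keep_contents);

# Stand-alone demonstration with a throwaway file in place of the robot's.
my ($fh, $tmp) = tempfile(UNLINK => 1);
print {$fh} "<html><title>demo</title></html>\n";
close $fh or die "close: $!";
keep_contents(undef, 'invoke-on-contents', 'http://www.foobar.com/',
              undef, undef, $tmp);
print "saved copy at $saved{'http://www.foobar.com/'}\n";
```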
invoke-on-link
sub hook { my($robot, $hook_name, $from_url, $to_url) = @_; }
This hook function is invoked for all links seen as the robot traverses. When the robot is parsing a page ($from_url) for links, for every link seen the invoke-on-link hook is invoked with the URL of the source page, and the destination URL. The destination URL is in canonical form.
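As an illustration, an invoke-on-link hook can be used to build a simple link graph. In this sketch the names record_link and %links_from are hypothetical, and the hook is called by hand with plain strings where the robot would pass URL objects.

```perl
use strict;
use warnings;

my %links_from;    # source URL => { destination URL => times seen }

sub record_link {
    my ($robot, $hook, $from_url, $to_url) = @_;
    $links_from{"$from_url"}{"$to_url"}++;
    return;
}

# With the module installed, the hook would be registered like this:
#   $robot->addHook('invoke-on-link', \&record_link);

my $home = 'http://www.foobar.com/';
record_link(undef, 'invoke-on-link', $home, $home . 'a.html') for 1 .. 2;
record_link(undef, 'invoke-on-link', $home, $home . 'b.html');

for my $from (sort keys %links_from) {
    for my $to (sort keys %{ $links_from{$from} }) {
        print "$from -> $to ($links_from{$from}{$to})\n";
    }
}
```

Because this hook fires every time a link is seen, the counts record repeated links that invoke-on-all-url would report only once.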
continue-test
sub hook { my($robot) = @_; }
This hook is invoked at the end of the robot's main iterative loop. If the hook function returns non-zero, then the robot will continue execution with the next URL. If the hook function returns zero, then the Robot will terminate the main loop, and close down after invoking the following two hooks (save-state and generate-report).
If no continue-test hook function is provided, then the robot will keep looping until there are no more URLs in its queue.
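A common termination condition is a page budget. The sketch below pairs a counting hook with a continue-test hook. The names count_page, keep_going, $max_pages and $pages_seen are hypothetical, and the loop at the end only simulates the robot's end-of-iteration check.

```perl
use strict;
use warnings;

my $max_pages  = 3;    # hypothetical budget
my $pages_seen = 0;

sub count_page {       # intended for the invoke-on-contents hook
    $pages_seen++;
    return;
}

sub keep_going {       # intended for the continue-test hook
    my ($robot) = @_;
    return $pages_seen < $max_pages ? 1 : 0;
}

# With the module installed, the hooks would be registered like this:
#   $robot->addHook('invoke-on-contents', \&count_page);
#   $robot->addHook('continue-test',      \&keep_going);

# Simulate the end-of-loop check after each of up to five pages.
for my $n (1 .. 5) {
    count_page();
    my $verdict = keep_going(undef) ? 'continue' : 'stop';
    print "after page $n: $verdict\n";
    last if $verdict eq 'stop';
}
```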
save-state
sub hook { my($robot) = @_; }
This hook is used to save any state information required by the robot application.
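A save-state hook is naturally paired with a restore-state hook. The sketch below uses the core Storable module to persist an application-level hash between runs. The names save_state, restore_state, %state and $state_file are hypothetical, and the state shown is the application's own record of visited URLs, not the robot's internal queue.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);
use Storable qw(store retrieve);

# Hypothetical location; a real robot would use a stable path.
my $state_file = File::Spec->catfile(tempdir(CLEANUP => 1), 'robot.state');
my %state = (visited => {});

sub save_state {
    my ($robot) = @_;
    store(\%state, $state_file);
    return;
}

sub restore_state {
    my ($robot, $hook) = @_;
    %state = %{ retrieve($state_file) } if -e $state_file;
    return;
}

# With the module installed, the hooks would be registered like this:
#   $robot->addHook('restore-state', \&restore_state);
#   $robot->addHook('save-state',    \&save_state);

$state{visited}{'http://www.foobar.com/'} = time;
save_state(undef);
%state = (visited => {});                  # pretend a new run has started
restore_state(undef, 'restore-state');
print "restored ", scalar(keys %{ $state{visited} }), " visited URL(s)\n";
```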
generate-report
sub hook { my($robot) = @_; }
This hook is used to generate a report for the run of the robot, if such is desired.
modified-since
If you provide this hook function, it will be invoked for each URL before the robot actually requests it. The function can return a time to use with the If-Modified-Since HTTP header. This can be used by a robot to only process those pages which have changed since the last visit.
Your hook function should be declared as follows:
sub modified_since_hook
{
    my $robot = shift; # instance of Robot module
    my $hook = shift; # name of hook invoked
    my $url = shift; # URI::URL for the url in question
    # ... calculate time ...
    return $time;
}
If your function returns anything other than undef, then an If-Modified-Since: field will be added to the request header.
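One way to fill in the time calculation is a lookup table of earlier visits. This is a sketch under assumptions: the %last_visit table is hypothetical, and the time is assumed to be expressed in seconds since the epoch, which the text above does not state.

```perl
use strict;
use warnings;

# Hypothetical record of earlier visits: URL => time of last visit
# (assumed here to be seconds since the epoch).
my %last_visit = (
    'http://www.foobar.com/' => time - 24 * 60 * 60,
);

sub modified_since_hook {
    my ($robot, $hook, $url) = @_;
    # undef means "no earlier visit", so no If-Modified-Since field is added.
    return $last_visit{"$url"};
}

# With the module installed, the hook would be registered like this:
#   $robot->addHook('modified-since', \&modified_since_hook);

for my $url ('http://www.foobar.com/', 'http://www.foobar.com/new.html') {
    my $time = modified_since_hook(undef, 'modified-since', $url);
    print "$url: ", (defined $time ? "last visited at $time" : "never visited"), "\n";
}
```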
invoke-after-get
This hook function is invoked immediately after the robot makes each GET request. This means your hook function will see every type of response, not just successful GETs. In addition to the standard $robot and $hook arguments, the hook function is passed two arguments: the $url we tried to GET, and the $response which resulted.
If you provide a modified-since hook, then you should also provide an invoke-after-get function, and look for status code 304 (or RC_NOT_MODIFIED if you are using HTTP::Status, which you should be :-):
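use HTTP::Status;    # exports RC_NOT_MODIFIED

sub after_get_hook
{
    my($robot, $hook, $url, $response) = @_;

    if ($response->code == RC_NOT_MODIFIED)
    {
        # page unchanged since the time returned by the modified-since hook
    }
}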
EXAMPLES
This section illustrates use of the Robot module, with code snippets from several sample Robot applications. The code here is not intended to show the right way to code a web robot, but just illustrates the API for using the Robot.
Validating Robot
This is a simple robot which you could use to validate your web site. The robot uses weblint to check the contents of URLs of type text/html.
#!/usr/bin/perl
require 5.002;
use WWW::Robot;

$rootDocument = $ARGV[0];

$robot = new WWW::Robot('NAME'    => 'Validator',
                        'VERSION' => '1.000',
                        'EMAIL'   => 'fred@foobar.com');

$robot->addHook('follow-url-test', \&follow_test);
$robot->addHook('invoke-on-contents', \&validate_contents);
$robot->run($rootDocument);

#-------------------------------------------------------
sub follow_test {
    my($robot, $hook, $url) = @_;

    return 0 unless $url->scheme eq 'http';
    return 0 if $url =~ /\.(gif|jpg|png|xbm|au|wav|mpg)$/;

    #---- we're only interested in pages on our site ----
    #---- (\Q...\E stops characters in the URL acting as regex syntax)
    return $url =~ /^\Q$rootDocument\E/;
}

#-------------------------------------------------------
sub validate_contents {
    #---- $structure precedes $filename in the hook arguments ----
    my($robot, $hook, $url, $response, $structure, $filename) = @_;

    return unless $response->content_type eq 'text/html';

    print STDERR "\n$url\n";

    #---- run weblint on local copy of URL contents -----
    #---- (list form keeps the filename away from the shell)
    system('weblint', '-s', $filename);
}
If you are behind a firewall, then you will have to add something like the following, just before calling the run() method:
$robot->proxy(['ftp', 'http', 'wais', 'gopher'],
              'http://firewall:8080/');
MODULE DEPENDENCIES
The Robot.pm module builds on a lot of existing Net, WWW and other Perl modules. Some of the modules are part of the core Perl distribution, and the latest versions of all modules are available from the Comprehensive Perl Archive Network (CPAN). The modules used are:
- HTTP::Request
-
This module is used to construct HTTP requests, when retrieving the contents of a URL, or using the HEAD request to see if a URL exists.
- HTML::Parse
-
This module builds a tree data structure from the contents of an HTML page. This is used to extract the URLs from the links on a page. This is also used to check for page-specific Robot exclusion commands, using the META element.
- URI::URL
-
This module implements a class for URL objects, providing resolution of relative URLs, and access to the different components of a URL.
- LWP::RobotUA
-
This is a wrapper around the LWP::UserAgent class. A UserAgent is used to connect to servers over the network, and make requests. The RobotUA module provides transparent compliance with the Robot Exclusion Protocol.
- HTTP::Status
-
This has definitions for HTTP response codes, so you can say RC_NOT_MODIFIED instead of 304.
All of these modules are available as part of the libwww-perl5 distribution, which is also available from CPAN.
SEE ALSO
- The SAS Group Home Page
-
http://www.cre.canon.co.uk/sas.html
This is the home page of the Group at Canon Research Centre Europe who are responsible for Robot.pm.
- Robot Exclusion Protocol
-
http://info.webcrawler.com/mak/projects/robots/norobots.html
This is a de facto standard which defines how a `well behaved' Robot client should interact with web servers and web pages.
- Guidelines for Robot Writers
-
http://info.webcrawler.com/mak/projects/robots/guidelines.html
Guidelines and suggestions for those who are (considering) developing a web robot.
- Weblint Home Page
-
http://www.cre.canon.co.uk/~neilb/weblint/
Weblint is a perl script which is used to check HTML for syntax errors and stylistic problems, in the same way lint is used to check C.
- Comprehensive Perl Archive Network (CPAN)
-
http://www.perl.com/perl/CPAN/
This is a well-organized collection of Perl resources, such as modules, documents, and scripts. CPAN is mirrored at FTP sites around the world.
VERSION
This documentation describes version 0.010 of the Robot module. The module requires at least version 5.002 of Perl.
AUTHOR
SAS Group, Canon Research Centre Europe
COPYRIGHT
Copyright (C) 1997, Canon Research Centre Europe.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.