NAME
WWW::CheckSite::Spider - A base class for spidering the web
SYNOPSIS
    use WWW::CheckSite::Spider;

    my $sp = WWW::CheckSite::Spider->new(
        uri => 'http://www.test-smoke.org',
    );

    while ( my $page = $sp->get_page ) {
        # $page is a hashref with basic information
    }
or to spider a site behind HTTP basic authentication:
    package BA_Mech;
    use base 'WWW::Mechanize';

    sub get_basic_credentials { ( 'abeltje', '********' ) }

    package main;
    use WWW::CheckSite::Spider;

    my $sp = WWW::CheckSite::Spider->new(
        ua_class => 'BA_Mech',
        uri      => 'http://your.site.with.ba/',
    );

    while ( my $page = $sp->get_page ) {
        # $page is a hashref with basic information
    }
DESCRIPTION
This module implements a basic web spider, based on WWW::Mechanize. It takes care of putting pages on the "still-to-fetch" stack. Only URIs with the same origin as the start URI are stacked, and the robots rules on the server are taken into account.
CONSTANTS & EXPORTS
The following constants are exported on demand with the :const tag.
- WCS_UNKNOWN
- WCS_FOLLOWED
- WCS_SPIDERED
- WCS_TOSPIDER
- WCS_TOFOLLOW
- WCS_NOCONTENT
- WCS_OUTSCOPE
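For example (a minimal sketch; the print is purely illustrative):

    use WWW::CheckSite::Spider qw( :const );

    # the WCS_* constants describe the fetch state of a page
    print "to-spider state: ", WCS_TOSPIDER, "\n";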
METHODS
WWW::CheckSite::Spider->new( %opts )
Currently supported options (any other options are set on the object, but not used):

    uri      => <start_uri> || <\@start_uri> [mandatory]
    ua_class => <ua_class> (default: WWW::Mechanize)
    exclude  => <exclude_re> (default: qr/[#?].*$/)
    myrules  => <\@disallow> (robots.txt-style rules)
    lang     => <languages> to pass in the Accept-Language: header
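A constructor call using all of these options might look like this sketch (all values are illustrative):

    use WWW::CheckSite::Spider;

    my $sp = WWW::CheckSite::Spider->new(
        uri      => 'http://www.example.org/',
        ua_class => 'WWW::Mechanize',
        exclude  => qr/\.(?:pdf|zip)$/i,
        myrules  => [ 'Disallow: /cgi-bin' ],
        lang     => 'en, nl',
    );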
$spider->get_page
Fetch the page and do some bookkeeping. It returns the result of $spider->process_page().
$spider->process_page( $uri )
Override this method to make the spider do something useful. By default it returns a hashref with:

    org_uri  The URI used for the request
    ret_uri  The URI returned by the server
    depth    The depth in the browse tree
    status   The return status from the server
    success  Shortcut for status == 200
    is_html  Shortcut for ct eq 'text/html'
    title    The contents of the <TITLE></TITLE> section
    ct       The content-type
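A subclass override might look like this sketch; the word_count key is hypothetical, everything else comes from the default implementation:

    package My::Spider;
    use base 'WWW::CheckSite::Spider';

    sub process_page {
        my( $self, $uri ) = @_;

        # start from the default bookkeeping hashref
        my $page = $self->SUPER::process_page( $uri );

        # hypothetical extra field, only for HTML pages
        $page->{word_count} = () = $self->current_agent->content =~ /\S+/g
            if $page->{is_html};

        return $page;
    }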
$spider->links_filtered
Filter out the URIs that will fail:
qr!^(?:mailto:|mms://|javascript:)!i
$spider->filter_link( $uri )
Return the URI to be spidered, or undef to skip it.
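A subclass can add its own skip logic here; a sketch with an illustrative rule that skips PDF links:

    package My::Spider;
    use base 'WWW::CheckSite::Spider';

    sub filter_link {
        my( $self, $uri ) = @_;

        # illustrative rule: never spider PDF documents
        return undef if $uri =~ /\.pdf$/i;

        # fall back to the default filtering
        return $self->SUPER::filter_link( $uri );
    }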
$spider->strip_uri( $uri )
Strip the fragment part from $uri.
USERAGENT METHODS
$spider->agent
Returns a standard name for this user agent.
$spider->init_agent
Initialise the agent that is used to fetch pages. The default class is WWW::Mechanize, but any class that has the same methods will do. The ua_class needs to support the following methods (see WWW::Mechanize for more information about these):
- new
- get
- base
- uri
- status
- success
- ct
- is_html
- title
- links
- HEAD (for WWW::CheckSite::Validator)
- content (for WWW::CheckSite::Validator)
- images (for WWW::CheckSite::Validator)
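As the SYNOPSIS shows with BA_Mech, subclassing WWW::Mechanize is the easiest way to satisfy this interface. A sketch of another hypothetical agent class, one that throttles its requests:

    package Throttled_Mech;
    use base 'WWW::Mechanize';

    sub get {
        my $self = shift;
        sleep 1;    # illustrative: wait a second between fetches
        return $self->SUPER::get( @_ );
    }

Pass ua_class => 'Throttled_Mech' to the constructor to use it.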
$spider->current_agent
Return the current user agent.
$spider->new_agent
Create a new agent and return it.
ROBOTRULES METHODS
The Spider uses the robot rules mechanism. This means that it will always fetch the /robots.txt file from the root of the web server to see if we are allowed (actually "not disallowed") to access pages as a robot.
You can add rules for disallowing pages by specifying a list of lines in robots.txt syntax in @{ $self->{myrules} }.
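A minimal sketch of passing such rules through the myrules option (the paths are illustrative):

    my $sp = WWW::CheckSite::Spider->new(
        uri     => 'http://www.example.org/',
        myrules => [
            'Disallow: /private',
            'Disallow: /tmp',
        ],
    );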
$spider->more_rrules( $url )
Check whether the robots.txt file for this $url has already been loaded. If not, fetch the file and add its rules to the $self->{_r_rules} object.
$spider->uri_ok( $uri )
This will determine whether this URI should be spidered. The rules are simple:
- Has the same base URI as the one we started with
- Is not excluded by the $self->{exclude} regex
- Is not excluded by the robots.txt mechanism
$spider->allowed( $uri )
Checks the $uri against the robot rules.
$spider->init_robotrules( )
This will set up a WWW::RobotRules object. @{ $self->{myrules} } is used to add rules, which should be in the robots.txt format. These rules are added to the ones found in robots.txt.
$spider->current_rrules
Returns the current RobotRules object.
AUTHOR
Abe Timmerman, <abeltje@cpan.org>
BUGS
Please report any bugs or feature requests to bug-www-checksite@rt.cpan.org, or through the web interface at http://rt.cpan.org. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
COPYRIGHT & LICENSE
Copyright MMV Abe Timmerman, All Rights Reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.