NAME
WWW::CheckSite::Spider - A base class for spidering web sites
SYNOPSIS
    use WWW::CheckSite::Spider;

    my $sp = WWW::CheckSite::Spider->new(
        ua_class => 'WWW::Mechanize', # or Win32::IE::Mechanize
        uri      => 'http://www.test-smoke.org',
    );

    while ( my $page = $sp->get_page ) {
        # $page is a hashref with basic information
    }
DESCRIPTION
This module implements a basic web-spider, based on WWW::Mechanize. It takes care of putting pages on the "still-to-fetch" stack. Only URIs with the same origin will be stacked, taking the robot rules on the server into account.
METHODS
CONSTANTS & EXPORTS
The following constants are exported on demand with the :const tag.
- WCS_UNKNOWN
- WCS_FOLLOWED
- WCS_SPIDERED
- WCS_TOSPIDER
- WCS_NOCONTENT
- WCS_OUTSCOPE
WWW::CheckSite::Spider->new( %opts )
Currently supported options (any other options are passed through); see the example after this list:
    uri      => <start_uri>  [mandatory]
    ua_class => 'WWW::Mechanize' or 'Win32::IE::Mechanize'
    exclude  => <exclude_re>  (default: qr/[#?].*$/)
    myrules  => <\@disallow>
    lang     => languages to pass in the Accept-Language: header
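For instance, a constructor call that sets all of the documented options might look like the sketch below; the exclude pattern is the documented default, while the myrules line and the lang value are purely illustrative:

    use WWW::CheckSite::Spider;

    my $spider = WWW::CheckSite::Spider->new(
        uri      => 'http://www.test-smoke.org',   # mandatory start URI
        ua_class => 'WWW::Mechanize',              # or 'Win32::IE::Mechanize'
        exclude  => qr/[#?].*$/,                   # the documented default pattern
        myrules  => [ 'Disallow: /cgi-bin/' ],     # illustrative extra robots.txt-style rule
        lang     => 'en, nl',                      # illustrative Accept-Language: value
    );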
get_page
Fetch the page and do some bookkeeping. It returns a hashref with some basic information; a usage sketch follows the list:
    org_uri  Used for the request
    ret_uri  The URI returned by the server
    depth    The depth in the browse tree
    status   The return status from the server
    success  Shortcut for status == 200
    is_html  Shortcut for ct eq 'text/html'
    title    What's in the <TITLE></TITLE> section
    ct       The content-type
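A minimal sketch of how these fields might be used inside the fetch loop (only the field names listed above are assumed):

    while ( my $page = $spider->get_page ) {
        next unless $page->{success};    # only 200 responses
        next unless $page->{is_html};    # only HTML pages

        printf "%s (depth %d) [%s]: %s\n",
            $page->{ret_uri},            # the URI the server returned
            $page->{depth},              # depth in the browse tree
            $page->{ct},                 # content-type
            defined $page->{title} ? $page->{title} : '(no title)';
    }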
_process
Private method that keeps track of which pages have been seen, so no page is requested more than once.
update_stack
This is what the spider is all about. It will examine $self->{_agent}->links() to filter the links to follow.
process_page( $uri )
Override this method to make the spider do something useful.
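As a sketch, a subclass could collect page titles by overriding process_page(); the package name and the _titles attribute are made up for this example, and $self->{_agent} is the user agent referred to under update_stack:

    package My::TitleSpider;
    use base 'WWW::CheckSite::Spider';

    # Record the title of every page the spider fetches.
    sub process_page {
        my( $self, $uri ) = @_;

        my $agent = $self->{_agent};    # the WWW::Mechanize-like agent
        push @{ $self->{_titles} }, {
            uri   => $uri,
            title => $agent->success ? $agent->title : undef,
        };
    }

    1;

Whether the base class does anything with the return value of process_page() is not specified here; this sketch only stores the titles for later inspection.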
links_filtered
Filter out the URIs that will fail:
    qr!^(?:mailto:|mms://|javascript:)!i
UserAgent
agent
Returns a standard name for this UserAgent.
init_agent
Initialise the agent that is used to fetch pages. The default class is WWW::Mechanize, but any class that provides the same methods will do.
new_agent
Create a new agent and return it.
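As a hedged sketch, a subclass could override new_agent() to adjust the agent before it is used; the timeout setting is just an example, and timeout() is inherited by WWW::Mechanize from LWP::UserAgent:

    package My::PatientSpider;
    use base 'WWW::CheckSite::Spider';

    # Create the agent as usual, then allow slow servers more time.
    sub new_agent {
        my $self  = shift;
        my $agent = $self->SUPER::new_agent( @_ );
        $agent->timeout( 120 );    # timeout() comes from LWP::UserAgent
        return $agent;
    }

    1;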
Robot Rules
The Spider uses the robot rules mechanism. This means that it will always get the robots.txt file from the root of the webserver to see if we are allowed (actually "not disallowed") to access pages as a robot.
You can add rules for disallowing pages by specifying a list of lines in the robots.txt syntax to @{ $self->{myrules} }.
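For example, extra Disallow lines can be supplied at construction time through the myrules option; the paths below are illustrative:

    my $spider = WWW::CheckSite::Spider->new(
        uri     => 'http://www.test-smoke.org',
        myrules => [
            'Disallow: /cgi-bin/',    # illustrative
            'Disallow: /drafts/',     # illustrative
        ],
    );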
uri_ok( $uri )
This will determine whether a URI should be spidered. The rules are simple (see the illustration after this list); a URI is spidered only when it:
- has the same base URI as the one we started with
- is not excluded by the $self->{exclude} regex
- is not excluded by the robots.txt mechanism
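As an illustration, assuming uri_ok() accepts plain URI strings and the spider was started on http://www.test-smoke.org:

    # Same base URI as the start URI and not excluded: should be ok.
    print "in scope\n"  if $spider->uri_ok( 'http://www.test-smoke.org/news.html' );

    # Different base URI: out of scope for this spider.
    print "out of scope\n" unless $spider->uri_ok( 'http://www.example.com/' );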
allowed( $uri )
Checks the URI against the robot rules.
init_robotrules( )
This will set up a WWW::RobotRules object. @{ $self->{myrules} } is used to add rules, which should be in the robots.txt syntax. These rules are added to the ones found in robots.txt.
AUTHOR
Abe Timmerman, <abeltje@cpan.org>
BUGS
Please report any bugs or feature requests to bug-www-checksite@rt.cpan.org, or through the web interface at http://rt.cpan.org. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
COPYRIGHT & LICENSE
Copyright MMV Abe Timmerman, All Rights Reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.