NAME

WWW::CheckSite::Spider - Implement a base class for spidering

SYNOPSIS

use WWW::CheckSite::Spider;

my $sp = WWW::CheckSite::Spider->new(
    ua_class => 'WWW::Mechanize', # or Win32::IE::Mechanize
    uri      => 'http://www.test-smoke.org',
);

while ( my $page = $sp->get_page ) {
    # $page is a hashref with basic information
}

DESCRIPTION

This module implements a basic web-spider, based on WWW::Mechanize. It takes care of putting pages on the "still-to-fetch" stack. Only uri's with the same origin will be stacked, taking the robots-rules on the server into account.

METHODS

CONSTANTS & EXPORTS

The following constants are exported on demand with the :const tag; an import sketch follows the list.

WCS_UNKNOWN
WCS_FOLLOWED
WCS_SPIDERED
WCS_TOSPIDER
WCS_NOCONTENT
WCS_OUTSCOPE
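
A minimal import sketch. The constant values themselves are internal to the module; the assignment below is purely illustrative:

    use WWW::CheckSite::Spider qw( :const );

    # Illustrative only: the WCS_* values mark the state of a uri on the
    # spider's stack (unknown, followed, spidered, to-spider, no content,
    # out of scope).
    my $state = WCS_TOSPIDER;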

WWW::CheckSite::Spider->new( %opts )

Currently supported options (the rest is passed on); a constructor sketch follows this list:

  • uri => <start_uri> [mandatory]

  • ua_class => 'WWW::Mechanize' or 'Win32::IE::Mechanize'

  • exclude => <exclude_re> (default: qr/[#?].*$/)

  • myrules => <\@disallow>

  • lang => languages to pass to Accept-Language: header
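
A constructor sketch using these options. The exclude pattern is the default shown above; the lang value is illustrative:

    my $sp = WWW::CheckSite::Spider->new(
        uri      => 'http://www.test-smoke.org',
        ua_class => 'WWW::Mechanize',
        exclude  => qr/[#?].*$/,   # skip fragments and query strings
        lang     => 'en, nl',      # passed to the Accept-Language: header
    );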

get_page

Fetch the page and do some bookkeeping. It returns a hashref with some basic information (a usage sketch follows the list):

  • org_uri Used for the request

  • ret_uri The uri returned by the server

  • depth The depth in the browse tree

  • status The HTTP status code returned by the server

  • success shortcut for status == 200

  • is_html shortcut for ct eq 'text/html'

  • title What's in the <TITLE></TITLE> section

  • ct The content-type
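
A sketch that walks a site and reports the documented fields; the output format is illustrative:

    while ( my $page = $sp->get_page ) {
        next unless $page->{success} && $page->{is_html};
        printf "%-50s depth=%d %s\n",
               $page->{org_uri}, $page->{depth}, $page->{title} || '(no title)';
    }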

_process

Private method that helps avoid requesting pages more than once.

update_stack

This is what the spider is all about. It examines $self->{_agent}->links() to decide which links to follow.

process_page( $uri )

Override this method to make the spider do something useful.
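
A minimal subclass sketch. The package name and the check are illustrative; it assumes the WWW::Mechanize-compatible agent is reachable as $self->{_agent}, as used by update_stack:

    package My::LinkChecker;
    use parent 'WWW::CheckSite::Spider';

    # Report pages that did not come back successfully.
    sub process_page {
        my( $self, $uri ) = @_;
        my $agent = $self->{_agent};
        printf "NOT OK (%s): %s\n", $agent->status, $uri
            unless $agent->success;
    }

    1;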

URIs that are certain to fail are filtered out with:

qr!^(?:mailto:|mms://|javascript:)!i

UserAgent

agent

Returns a standard name for this UserAgent.

init_agent

Initialise the agent that is used to fetch pages. The default class is WWW::Mechanize, but any class that provides the same methods will do.

new_agent

Create a new agent and return it.

Robot Rules

The Spider uses the robot rules mechanism. This means that it will always get the robots.txt file from the root of the webserver to see if we are allowed (actually "not disallowed") to access pages as a robot.

You can add rules for disallowing pages by specifying a list of lines in the robots.txt syntax to @{ $self->{myrules} }.
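
For example, passing two extra disallow rules at construction time (the paths are illustrative):

    my $sp = WWW::CheckSite::Spider->new(
        uri     => 'http://www.test-smoke.org',
        myrules => [
            'Disallow: /cgi-bin/',
            'Disallow: /tmp/',
        ],
    );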

uri_ok( $uri )

This determines whether a uri should be spidered. The rules are simple (an example follows the list):

  • Has the same base uri as the one we started with

  • Is not excluded by the $self->{exclude} regex.

  • Is not excluded by the robots.txt mechanism
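
An illustrative check against a single uri found on a page:

    # Decide whether a link should be followed
    if ( $sp->uri_ok( $uri ) ) {
        print "in scope: $uri\n";
    }
    else {
        print "skipped:  $uri\n";
    }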

allowed( $uri )

Checks the uri against the robot rules.

init_robotrules( )

This will set up a WWW::RobotRules object. @{ $self->{myrules} } is used to add rules and should be in the RobotRules format. These rules are added to the ones found in robots.txt.

AUTHOR

Abe Timmerman, <abeltje@cpan.org>

BUGS

Please report any bugs or feature requests to bug-www-checksite@rt.cpan.org, or through the web interface at http://rt.cpan.org. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

COPYRIGHT & LICENSE

Copyright MMV Abe Timmerman, All Rights Reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.