NAME

WWW::CheckSite::Spider - Implement a base class for spidering

SYNOPSIS

use WWW::CheckSite::Spider;

my $sp = WWW::CheckSite::Spider->new(
    ua_class => 'WWW::Mechanize', # or Win32::IE::Mechanize
    uri      => 'http://www.test-smoke.org',
);

while ( my $page = $sp->get_page ) {
    # $page is a hashref with basic information
}

DESCRIPTION

This module implements a basic web-spider, based on WWW::Mechanize. It takes care of putting pages on the "still-to-fetch" stack. Only uri's with the same origin will be stacked, taking the robots-rules on the server into account.

METHODS

CONSTANTS & EXPORTS

The following constants are exported on demand with the :const tag; an import sketch follows the list.

WCS_UNKNOWN
WCS_FOLLOWED
WCS_SPIDERED
WCS_TOSPIDER
WCS_NOCONTENT
WCS_OUTSCOPE
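
A minimal import sketch. The constant values themselves are internal to the module; the assignment below is purely illustrative:

    use WWW::CheckSite::Spider qw( :const );

    # Illustrative only: the WCS_* values mark the state of a uri on the
    # spider's stack (unknown, followed, spidered, to-spider, no content,
    # out of scope).
    my $state = WCS_TOSPIDER;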

WWW::CheckSite::Spider->new( %opts )

Currently supported options (the rest is passed on); a constructor sketch follows this list:

  • uri => <start_uri> [mandatory]

  • ua_class => 'WWW::Mechanize' or 'Win32::IE::Mechanize'

  • exclude => <exclude_re> (default: qr/[#?].*$/)

  • myrules => <\@disallow>

  • lang => languages to pass to Accept-Language: header
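
A constructor sketch using these options. The exclude pattern is the default shown above; the lang value is illustrative:

    my $sp = WWW::CheckSite::Spider->new(
        uri      => 'http://www.test-smoke.org',
        ua_class => 'WWW::Mechanize',
        exclude  => qr/[#?].*$/,   # skip fragments and query strings
        lang     => 'en, nl',      # passed to the Accept-Language: header
    );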

get_page

Fetch the page and do some bookkeeping. It returns a hashref with some basic information (a usage sketch follows the list):

  • org_uri Used for the request

  • ret_uri The uri returned by the server

  • depth The depth in the browse tree

  • status The HTTP status code returned by the server

  • success shortcut for status == 200

  • is_html shortcut for ct eq 'text/html'

  • title What's in the <TITLE></TITLE> section

  • ct The content-type
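
A sketch that walks a site and reports the documented fields; the output format is illustrative:

    while ( my $page = $sp->get_page ) {
        next unless $page->{success} && $page->{is_html};
        printf "%-50s depth=%d %s\n",
               $page->{org_uri}, $page->{depth}, $page->{title} || '(no title)';
    }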

_process

Private method that helps avoid requesting pages more than once.

update_stack

This is what the spider is all about. It examines $self->{_agent}->links() to decide which links to follow.

process_page( $uri )

Override this method to make the spider do something useful.
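
A minimal subclass sketch. The package name and the check are illustrative; it assumes the WWW::Mechanize-compatible agent is reachable as $self->{_agent}, as used by update_stack:

    package My::LinkChecker;
    use parent 'WWW::CheckSite::Spider';

    # Report pages that did not come back successfully.
    sub process_page {
        my( $self, $uri ) = @_;
        my $agent = $self->{_agent};
        printf "NOT OK (%s): %s\n", $agent->status, $uri
            unless $agent->success;
    }

    1;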

URIs that are certain to fail are filtered out with:

qr!^(?:mailto:|mms://|javascript:)!i

UserAgent

agent

Returns a standard name for this UserAgent.

init_agent

Initialise the agent that is used to fetch pages. The default class is WWW::Mechanize, but any class that provides the same methods will do.

new_agent

Create a new agent and return it.

Robot Rules

The Spider uses the robot rules mechanism. This means that it will always get the robots.txt file from the root of the webserver to see if we are allowed (actually "not disallowed") to access pages as a robot.

You can add rules for disallowing pages by specifying a list of lines in the robots.txt syntax to @{ $self->{myrules} }.
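
For example, passing two extra disallow rules at construction time (the paths are illustrative):

    my $sp = WWW::CheckSite::Spider->new(
        uri     => 'http://www.test-smoke.org',
        myrules => [
            'Disallow: /cgi-bin/',
            'Disallow: /tmp/',
        ],
    );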

uri_ok( $uri )

This determines whether a uri should be spidered. The rules are simple (an example follows the list):

  • Has the same base uri as the one we started with

  • Is not excluded by the $self->{exclude} regex.

  • Is not excluded by the robots.txt mechanism
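
An illustrative check against a single uri found on a page:

    # Decide whether a link should be followed
    if ( $sp->uri_ok( $uri ) ) {
        print "in scope: $uri\n";
    }
    else {
        print "skipped:  $uri\n";
    }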

allowed( $uri )

Checks the uri against the robot rules.

init_robotrules( )

This will set up a WWW::RobotRules object. @{ $self->{myrules} } is used to add rules and should be in the RobotRules format. These rules are added to the ones found in robots.txt.

AUTHOR

Abe Timmerman, <abeltje@cpan.org>

BUGS

Please report any bugs or feature requests to bug-www-checksite@rt.cpan.org, or through the web interface at http://rt.cpan.org. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

COPYRIGHT & LICENSE

Copyright MMV Abe Timmerman, All Rights Reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.