NAME

WWW::Sitemapper - Create site map by scanning a web site.

VERSION

Version 0.01

SYNOPSIS

package MyWebSite::Map;
use Moose;

use base qw( WWW::Sitemapper );

sub _build_robot_config {
    my $self = shift;

    return {
        NAME => 'MyRobot',
        EMAIL => 'me@domain.tld',
    };
}

# you need to provide a follow-url-test hook in your subclass
sub url_test : Hook('follow-url-test') {
    my $self = shift;
    my ($robot, $hook_name, $uri) = @_;

    my @restricted = (
        qr{^/cat/login},
        qr{^/cat/events},
        qr{\?_search_string=},
    );

    my $url = $uri->path_query;

    if ( $self->site->host eq $uri->host ) {
        for my $re ( @restricted ) {
            if ( $url =~ /$re/ ) {
                return 0;
            }
        }

        return 1;
    }

    return 0;
}

sub run_till_first_auto_save : Hook('continue-test') {
    my $self = shift;
    my ($robot) = @_;

    if ( $self->_run_started_time + $self->auto_save < DateTime->now ) {
        return 0;
    }
    return 1;
}


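# as this is your class you may define your own methods as well
use LWP::UserAgent;
use URI;

sub ping_google {
    my $self = shift;

    # pass the sitemap location as a query parameter of the ping URL
    my $ping = URI->new('http://www.google.com/webmasters/sitemaps/ping');
    $ping->query_form( sitemap => $self->site . 'google-sitemap.xml.gz' );

    return LWP::UserAgent->new->get($ping);
}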

and then

package main;

my $mapper = MyWebSite::Map->new(
    site => 'http://mywebsite.com/',
    status_storage => 'sitemap.data.storable',
    auto_save => 10,
);

$mapper->run;


open( my $html, '>', './sitemap.html' ) or die "Cannot create sitemap.html: $!";
print {$html} $mapper->html_sitemap;
close($html);

my $xml_sitemap = $mapper->xml_sitemap(
    priority => '0.7',
    changefreq => 'weekly',
);

$xml_sitemap->write('google-sitemap.xml.gz');

# call your own method
$mapper->ping_google();

and while the mapper is still running, take a peek at what has been mapped so far

my $mapper = MyWebSite::Map->new(
    site => 'http://mywebsite.com/',
    status_storage => 'sitemap.data.storable',
);

$mapper->restore_state();

print $mapper->txt_sitemap();

ATTRIBUTES

site

Home page of the website to be mapped.

tree

Tree structure of the web site.

robot_config

WWW::Robot configuration options.

Your subclass must define the builder method _build_robot_config, which has to return a hashref with at least one option:

  • EMAIL

    A valid email address which can be used to contact the Robot's owner, for example by someone who wishes to complain about the behavior of your robot.

For other options please see "ROBOT_ATTRIBUTES" in WWW::Robot.

status_storage

Status storage for saving the result of the web crawl. If defined, Storable will be used to store the current state.

auto_save

Auto save current status every N minutes (defaults to 0 - do not auto save).

Note: "status_storage" has to be defined.

html_sitemap_template

Template Toolkit HTML sitemap template used by the helper method "html_sitemap".

Default value for HTML sitemap template:

<html>
<head>
<title>Sitemap for [% site %]</title>
</head>
<body>
<ul>
[%- INCLUDE branch node = node -%]
</ul>
</body>
</html>

[%- BLOCK branch -%]
<li><a href="[% node.loc %]">[% node.title || node.loc %]</a>
[%     IF node.children.size -%]
<ul>
[%-
            FOREACH child IN node.children;
                INCLUDE branch node = child;
            END;
-%]
</ul>
[%     END -%]
</li>
[% END -%]

METHODS

run

print $mapper->run();

Creates a WWW::Robot object and starts to map the "site".

Scans your subclass for methods with :Hook(...) attributes and adds them to the robot object.

Please see "SUPPORTED_HOOKS" in WWW::Robot for the full list.

txt_sitemap

print $mapper->txt_sitemap();

Creates a plain text sitemap.

Accepts the following parameters:

with_id => 0|1
print $mapper->txt_sitemap( with_id => 1 );

Use id of each node instead of '*'.

Defaults to 0.

with_title => 0|1
print $mapper->txt_sitemap( with_title => 1 );

Add node title after node location.

Defaults to 0.

html_sitemap

print $mapper->html_sitemap(%TT_CONF);

Creates an HTML sitemap using the template defined in "html_sitemap_template".

Accepts Template configuration options as parameters.

xml_sitemap

my $sitemap = $mapper->xml_sitemap();

# print xml
print $sitemap->xml();

# write to file
$sitemap->write('sitemap.xml');

Creates an XML sitemap (http://www.sitemaps.org). Returns a Search::Sitemap object.

Accepts the following parameters:

  • split_by

    my @sitemaps = $mapper->xml_sitemap(
        split_by => [
            '^/doc',
            '^/cat',
            '^/ila',
        ],
    );

    Arrayref of regular expressions used to split the resulting sitemap based on the page location. If this option is supplied, "xml_sitemap" will return an array of Search::Sitemap objects, plus one additional object for any URLs not matched by the expressions provided.

  • priority

    my $sitemap = $mapper->xml_sitemap(
        priority => 0.6,
    );

    or

    my $sitemap = $mapper->xml_sitemap(
        priority => {
            '^/doc/' => '+0.2', # same as 0.7
            '^/ila/' => 0.4,
            '^/cat/' => 0.9,
            '^/$' => 1,
        },
    );

    If priority is a scalar value it will be used as a default for all pages. If it is a hashref every link will be tested against the keys and the value will be assigned if matched. Supports relative values which will be added/subtracted to/from final priority.

    The final priority will be set to 0.0 if the calculated one is negative, and to 1.0 if the calculated one is higher than 1.0.

    Default priority is 0.5.

  • changefreq

    my $sitemap = $mapper->xml_sitemap(
        changefreq => 'daily',
    );

    or

    my $sitemap = $mapper->xml_sitemap(
        changefreq => {
            '^/doc/' => 'weekly',
            '^/ila/' => 'yearly',
            '^/cat/' => 'daily',
            '^/$' => 'always',
        },
    );

    If changefreq is a scalar value it will be used as a default for all pages.

    If it is a hashref every link will be tested against the keys and the value will be assigned if matched.

    Valid values are:

      • always
      • hourly
      • daily
      • weekly
      • monthly
      • yearly
      • never

    Default changefreq is 'weekly'.
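
The priority rules described above (scalar default, hashref of patterns, relative values, clamping to 0.0 .. 1.0) can be sketched in plain Perl. This is only an illustration and not the code used by "xml_sitemap": the helper name resolve_priority is hypothetical, and the order in which hash keys are tried (sorted here) is an assumption, since the documentation does not specify it.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WWW::Sitemapper: resolve the priority
# of one URL path from a scalar or a hashref of pattern => value.
sub resolve_priority {
    my ( $path, $spec ) = @_;

    my $priority = 0.5;    # documented default

    if ( ref $spec eq 'HASH' ) {
        # Assumption: keys are tried in sorted order. Values with a
        # leading sign adjust the running priority, others replace it.
        for my $re ( sort keys %$spec ) {
            next unless $path =~ /$re/;
            my $value = $spec->{$re};
            if ( $value =~ /^[+-]/ ) {
                $priority += $value;
            }
            else {
                $priority = $value;
            }
        }
    }
    elsif ( defined $spec ) {
        $priority = $spec;    # scalar: default for all pages
    }

    # Clamp to the documented range 0.0 .. 1.0.
    $priority = 0 if $priority < 0;
    $priority = 1 if $priority > 1;

    return sprintf '%.1f', $priority;
}

my %spec = (
    '^/doc/' => '+0.2',
    '^/ila/' => 0.4,
    '^/cat/' => 0.9,
    '^/$'    => 1,
);

for my $path ( '/doc/intro', '/ila/x', '/cat/y', '/', '/other' ) {
    printf "%-12s %s\n", $path, resolve_priority( $path, \%spec );
}
```

With the sample hash, '/doc/intro' resolves to 0.7 (the default 0.5 plus 0.2) and '/other', which matches no pattern, keeps the default 0.5.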

PREDEFINED HOOKS

restore_state

Restore state from "status_storage" using "retrieve" in Storable.

Uses "restore-state" in WWW::Robot.

save_state

Save state into "status_storage" using "store" in Storable.

Uses "save-state" in WWW::Robot.
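
The save and restore steps rely on the Storable functions "store" and "retrieve". A minimal round trip is sketched below, assuming a made-up %state hash as a stand-in for the crawl state; the actual structure WWW::Sitemapper keeps in "status_storage" is not documented here.

```perl
use strict;
use warnings;
use Storable qw( store retrieve );
use File::Temp qw( tempfile );

# Hypothetical stand-in for the crawl state; the real structure stored
# by WWW::Sitemapper may differ.
my %state = (
    site    => 'http://mywebsite.com/',
    visited => [ '/', '/doc/', '/cat/' ],
);

# Temporary file playing the role of "status_storage".
my ( $fh, $file ) = tempfile( UNLINK => 1 );
close $fh;

store( \%state, $file );           # roughly what a save step does
my $restored = retrieve($file);    # roughly what a restore step does

printf "%d pages restored for %s\n",
    scalar @{ $restored->{visited} }, $restored->{site};
```

Because "retrieve" returns a reference to a deep copy of the stored data, a second process can read the file while the crawl continues, which is what the "take a peek" example in the SYNOPSIS relies on.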

AUTHOR

Alex J. G. Burzyński, <ajgb@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2010 by Alex J. G. Burzyński

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.0 or, at your option, any later version of Perl 5 you may have available.