NAME
WWW::Sitemapper - Create site map by scanning a web site.
VERSION
Version 0.01
SYNOPSIS
package MyWebSite::Map;
use Moose;
use base qw( WWW::Sitemapper );

sub _build_robot_config {
    my $self = shift;
    return {
        NAME  => 'MyRobot',
        EMAIL => 'me@domain.tld',
    };
}
# you need to provide a follow-url-test hook in your subclass
sub url_test : Hook('follow-url-test') {
    my $self = shift;
    my ($robot, $hook_name, $uri) = @_;

    my @restricted = (
        qr{^/cat/login},
        qr{^/cat/events},
        qr{\?_search_string=},
    );

    my $url = $uri->path_query;

    if ( $self->site->host eq $uri->host ) {
        for my $re ( @restricted ) {
            if ( $url =~ /$re/ ) {
                return 0;
            }
        }
        return 1;
    }
    return 0;
}
sub run_till_first_auto_save : Hook('continue-test') {
    my $self = shift;
    my ($robot) = @_;

    if ( $self->_run_started_time + $self->auto_save < DateTime->now ) {
        return 0;
    }
    return 1;
}
# as this is your class you may define your own methods as well
use LWP::UserAgent;
use URI;

sub ping_google {
    my $self = shift;

    my $ua  = LWP::UserAgent->new;
    my $url = URI->new('http://www.google.com/webmasters/sitemaps/ping');
    $url->query_form( sitemap => $self->site . 'google-sitemap.xml.gz' );

    return $ua->get( $url );
}
and then
package main;
my $mapper = MyWebSite::Map->new(
    site           => 'http://mywebsite.com/',
    status_storage => 'sitemap.data.storable',
    auto_save      => 10,
);

$mapper->run;
open(my $html, '>', './sitemap.html') or die "Cannot create sitemap.html: $!";
print {$html} $mapper->html_sitemap;
close($html);
my $xml_sitemap = $mapper->xml_sitemap(
    priority   => '0.7',
    changefreq => 'weekly',
);
$xml_sitemap->write('google-sitemap.xml.gz');
# call your own method
$mapper->ping_google();
and while the mapper is still running, take a peek at what has been mapped so far
my $mapper = MyWebSite::Map->new(
    site           => 'http://mywebsite.com/',
    status_storage => 'sitemap.data.storable',
);

$mapper->restore_state();

print $mapper->txt_sitemap();
ATTRIBUTES
site
Home page of the website to be mapped.
tree
Tree structure of the web site.
robot_config
WWW::Robot configuration options.
You need to define a builder method "_build_robot_config" in your subclass, which must return a hashref with at least one option:
EMAIL
A valid email address which can be used to contact the Robot's owner, for example by someone who wishes to complain about the behavior of your robot.
For other options please refer to "ROBOT ATTRIBUTES" in WWW::Robot.
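For example, extra crawler options can be passed straight through to WWW::Robot. A minimal sketch, assuming the DELAY and TRAVERSAL attributes documented there:

sub _build_robot_config {
    my $self = shift;
    return {
        NAME      => 'MyRobot',
        EMAIL     => 'me@domain.tld',
        # DELAY and TRAVERSAL are WWW::Robot attributes; verify them
        # against "ROBOT ATTRIBUTES" in your installed WWW::Robot
        DELAY     => 1,        # minutes to wait between requests
        TRAVERSAL => 'depth',  # depth-first traversal
    };
}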
status_storage
Status storage for saving the results of the web crawl. If defined, Storable will be used to store the current state.
auto_save
Auto-save the current state every N minutes (defaults to 0 - do not auto-save).
Note: "status_storage" has to be defined.
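For example (a minimal sketch reusing the subclass from the SYNOPSIS):

my $mapper = MyWebSite::Map->new(
    site           => 'http://mywebsite.com/',
    status_storage => 'sitemap.data.storable', # required for auto_save
    auto_save      => 5,                       # save state every 5 minutes
);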
html_sitemap_template
Template Toolkit HTML sitemap template used by the helper method "html_sitemap".
The default template is:
<html>
<head>
    <title>Sitemap for [% site %]</title>
</head>
<body>
    <ul>
    [%- INCLUDE branch node = node -%]
    </ul>
</body>
</html>
[%- BLOCK branch -%]
    <li><a href="[% node.loc %]">[% node.title || node.loc %]</a>
    [% IF node.children.size -%]
        <ul>
        [%-
            FOREACH child IN node.children;
                INCLUDE branch node = child;
            END;
        -%]
        </ul>
    [% END -%]
    </li>
[% END -%]
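To use a different layout, override the attribute default in your subclass. A minimal sketch, assuming the standard Moose way of redefining an inherited attribute's default:

package MyWebSite::Map;
use Moose;
use base qw( WWW::Sitemapper );

# replace the default template with a flat list of links
has '+html_sitemap_template' => (
    default => <<'EOT',
<html>
<body>
[%- INCLUDE branch node = node -%]
[%- BLOCK branch -%]
<a href="[% node.loc %]">[% node.title || node.loc %]</a><br/>
[%- FOREACH child IN node.children; INCLUDE branch node = child; END -%]
[% END -%]
</body>
</html>
EOT
);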
METHODS
run
$mapper->run();
Creates a WWW::Robot object and starts to map the "site".
Scans your subclass for methods with ":Hook(...)" attributes, which are added as hooks on the robot object.
Please see "SUPPORTED HOOKS" in WWW::Robot for the full list.
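Any supported hook can be registered the same way. A hypothetical sketch that logs failed fetches; the hook name and argument list follow WWW::Robot's conventions, so verify them against its documentation:

sub log_errors : Hook('invoke-on-get-error') {
    my $self = shift;
    my ($robot, $hook_name, $url, $response) = @_;

    # $response is the HTTP::Response of the failed request
    warn "Failed to fetch $url: ", $response->status_line, "\n";
}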
txt_sitemap
print $mapper->txt_sitemap();
Creates a plain text sitemap.
Accepts the following parameters:
with_id => 0|1
    print $mapper->txt_sitemap( with_id => 1 );
Uses the id of each node instead of '*'.
Defaults to 0.
with_title => 0|1
    print $mapper->txt_sitemap( with_title => 1 );
Adds the node title after the node location.
Defaults to 0.
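Both options can be combined:

# ids instead of '*', with titles appended after each location
print $mapper->txt_sitemap( with_id => 1, with_title => 1 );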
html_sitemap
print $mapper->html_sitemap(%TT_CONF);
Creates an HTML sitemap using the template defined in "html_sitemap_template".
Accepts Template configuration options.
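For example, any Template constructor option can be passed through (PRE_CHOMP below is a standard Template Toolkit option):

open(my $html, '>', 'sitemap.html') or die "Cannot create sitemap.html: $!";
print {$html} $mapper->html_sitemap( PRE_CHOMP => 1 );
close($html);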
xml_sitemap
my $sitemap = $mapper->xml_sitemap();
# print xml
print $sitemap->xml();
# write to file
$sitemap->write('sitemap.xml');
Creates an XML sitemap (http://www.sitemaps.org). Returns a Search::Sitemap object.
Accepts the following parameters:
split_by
my @sitemaps = $mapper->xml_sitemap(
    split_by => [
        '^/doc',
        '^/cat',
        '^/ila',
    ],
);
Arrayref of regular expressions used to split the resulting sitemap based on the page location. If this option is supplied, "xml_sitemap" returns an array of Search::Sitemap objects, plus one additional sitemap for any URLs not matched by the patterns provided.
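A sketch of writing each returned sitemap to its own file; the file naming scheme here is hypothetical:

my @sitemaps = $mapper->xml_sitemap(
    split_by => [ '^/doc', '^/cat' ],
);

# one extra sitemap collects the URLs not matched by any pattern
my $i = 0;
$_->write( 'sitemap-' . $i++ . '.xml.gz' ) for @sitemaps;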
priority
my $sitemap = $mapper->xml_sitemap(
    priority => 0.6,
);
or
my $sitemap = $mapper->xml_sitemap(
    priority => {
        '^/doc/' => '+0.2', # same as 0.7
        '^/ila/' => 0.4,
        '^/cat/' => 0.9,
        '^/$'    => 1,
    },
);
If priority is a scalar value it will be used as the default for all pages. If it is a hashref, every link will be tested against the keys and the matching value will be assigned. Relative values are supported and will be added to or subtracted from the final priority.
The final priority will be set to 0.0 if the calculated value is negative, and to 1.0 if it is higher than 1.0.
The default priority is 0.5.
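To illustrate the clamping with the default priority of 0.5 (the paths here are hypothetical):

my $sitemap = $mapper->xml_sitemap(
    priority => {
        '^/old/'      => '-0.7', # 0.5 - 0.7 = -0.2, clamped to 0.0
        '^/featured/' => '+0.8', # 0.5 + 0.8 =  1.3, clamped to 1.0
    },
);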
changefreq
my $sitemap = $mapper->xml_sitemap(
    changefreq => 'daily',
);
or
my $sitemap = $mapper->xml_sitemap(
    changefreq => {
        '^/doc/' => 'weekly',
        '^/ila/' => 'yearly',
        '^/cat/' => 'daily',
        '^/$'    => 'always',
    },
);
If changefreq is a scalar value it will be used as the default for all pages.
If it is a hashref, every link will be tested against the keys and the matching value will be assigned.
Valid values are: always, hourly, daily, weekly, monthly, yearly, never.
The default changefreq is 'weekly'.
PREDEFINED HOOKS
restore_state
Restores state from "status_storage" using "retrieve" in Storable.
Uses "restore-state" in WWW::Robot.
save_state
Saves state into "status_storage" using "store" in Storable.
Uses "save-state" in WWW::Robot.
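Together these hooks make interrupted crawls resumable. A minimal sketch, assuming a previous run already wrote the storage file:

my $mapper = MyWebSite::Map->new(
    site           => 'http://mywebsite.com/',
    status_storage => 'sitemap.data.storable',
    auto_save      => 10,
);

# the restore-state hook reloads sitemap.data.storable if it exists,
# so a restarted run picks up from the last saved state
$mapper->run;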
AUTHOR
Alex J. G. Burzyński, <ajgb@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2010 by Alex J. G. Burzyński
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.0 or, at your option, any later version of Perl 5 you may have available.