NAME
WWW::Flatten - Flatten web pages deeply and make them portable
SYNOPSIS
use strict;
use warnings;
use utf8;
use 5.010;
use WWW::Flatten;
my $basedir = './github/';
mkdir($basedir);
my $bot = WWW::Flatten->new(
basedir => $basedir,
max_conn => 1,
max_conn_per_host => 1,
depth => 3,
filenames => {
'https://github.com' => 'index.html',
},
is_target => sub {
my $uri = shift->url;
if ($uri =~ qr{\.(css|png|gif|jpeg|jpg|pdf|js|json)$}i) {
return 1;
}
if ($uri->host eq 'assets-cdn.github.com') {
return 1;
}
return 0;
},
normalize => sub {
my $uri = shift;
...
return $uri;
}
);
$bot->crawl;
DESCRIPTION
WWW::Flatten is a web crawling tool for freezing pages into standalone, portable copies.
This software is considered to be alpha quality and isn't recommended for regular usage.
ATTRIBUTES
depth
Depth limitation. Defaults to 10.
$bot->depth(10);
filenames
URL-to-filename mapping table. Entries are added automatically during crawling, but you can predefine some beforehand.
$bot->filenames({
'http://example.com/index.html' => 'index.html',
'http://example.com/index2.html' => 'index2.html',
});
basedir
A directory path for output files.
$bot->basedir('./out');
is_target
Set the condition that indicates whether the job is a flatten target or not.
$bot->is_target(sub {
my $job = shift;
...
return 1; # or 0
});
normalize
A code reference that performs normalization of URLs. The callback takes a Mojo::URL instance.
$bot->normalize(sub {
my $url = shift;
my $modified = ...;
return $modified;
});
asset_name
A code reference that generates asset names. Defaults to the preset generator asset_number_generator, which generates 6-digit numbers. Another option is asset_hash_generator, which generates 6-character hashes.
$bot->asset_name(WWW::Flatten::asset_hash_generator(6));
max_retry
Maximum number of retry attempts in case the server is unresponsive. Defaults to 3.
METHODS
asset_number_generator
Returns a closure that generates numeric file names and keeps its own internal storage. See also the asset_name attribute.
$bot->asset_name(WWW::Flatten::asset_number_generator(3));
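The idea of a generator closure with its own storage can be sketched roughly as follows. This is only an illustration, not the module's implementation: make_number_generator is a hypothetical name, and how WWW::Flatten actually invokes the callback is an assumption here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a numeric name generator; not the module's code.
sub make_number_generator {
    my $digits = shift // 6;   # width of the zero-padded number
    my $count  = 0;            # private counter, kept alive by the closure
    return sub {
        # Each call hands out the next zero-padded number.
        return sprintf("%0${digits}d", ++$count);
    };
}

my $gen = make_number_generator(3);
print $gen->(), "\n" for 1 .. 3;
```

Because the counter lives inside the closure, two generators created separately never share state.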
asset_hash_generator
Returns a closure that generates hash-based file names and keeps its own internal storage. See also the asset_name attribute. This function automatically avoids name collisions by extending the given length.
If you want the names as short as possible, use the following setting.
$bot->asset_name(WWW::Flatten::asset_hash_generator(1));
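Collision avoidance by length extension might look like the sketch below. It is an assumption-laden illustration rather than the module's code: make_hash_generator is a hypothetical name, MD5 is an arbitrary choice of digest, and the generator is assumed to receive a string key such as the asset URL.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical sketch of collision avoidance by length extension.
sub make_hash_generator {
    my $len = shift // 6;   # preferred (minimum) name length
    my %taken;              # name => key that owns it
    return sub {
        my $key    = shift;
        my $digest = md5_hex($key);   # 32 hex characters
        # Try ever longer prefixes of the digest until one is usable.
        for my $n ($len .. length $digest) {
            my $name = substr($digest, 0, $n);
            # Accept the name if it is free or already belongs to this key.
            if (!exists $taken{$name} || $taken{$name} eq $key) {
                $taken{$name} = $key;
                return $name;
            }
        }
        die "could not find a unique name for $key";
    };
}

my $gen = make_hash_generator(1);
print $gen->("http://example.com/$_.png"), "\n" for 1 .. 5;
```

Starting from a length of 1 keeps names as short as possible, and a name grows only when its shorter prefix is already owned by a different key.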
init
Initialize the crawler.
get_href
Generate a new href from the old one.
flatten_html
Replace URLs in a Mojo::DOM instance, according to the filenames attribute.
flatten_css
Replace URLs in a CSS string, according to the filenames attribute.
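As a rough illustration of replacing URLs in a CSS string from a mapping table, consider the sketch below. It is not the module's implementation: rewrite_css_urls is a hypothetical helper, and it assumes that the references in the CSS match the keys of the table exactly (no relative URL resolution).

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite url(...) references using a URL => filename map.
sub rewrite_css_urls {
    my ($css, $map) = @_;
    $css =~ s{url\(\s*(['"]?)([^'")]+?)\1\s*\)}{
        my ($quote, $ref) = ($1, $2);
        # References missing from the table are left unchanged.
        my $new = exists $map->{$ref} ? $map->{$ref} : $ref;
        "url($quote$new$quote)";
    }gex;
    return $css;
}

my %filenames = ('http://example.com/bg.png' => '000001.png');
print rewrite_css_urls('body { background: url("http://example.com/bg.png") }', \%filenames), "\n";
```

The original quoting style of each reference is preserved, so the rewritten stylesheet differs from the input only in the mapped URLs.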
save
Save an HTTP response to a file.
AUTHOR
Sugama Keita, <sugama@jamadam.com>
COPYRIGHT AND LICENSE
Copyright (C) jamadam
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.