NAME

WWW::Flatten - Flatten web pages deeply and make them portable

SYNOPSIS

use strict;
use warnings;
use utf8;
use 5.010;
use WWW::Flatten;

my $basedir = './github/';
mkdir($basedir);

my $bot = WWW::Flatten->new(
    basedir => $basedir,
    max_conn => 1,
    max_conn_per_host => 1,
    depth => 3,
    filenames => {
        'https://github.com' => 'index.html',
    },
    is_target => sub {
        my $uri = shift->url;
        
        if ($uri =~ qr{\.(css|png|gif|jpeg|jpg|pdf|js|json)$}i) {
            return 1;
        }
        
        if ($uri->host eq 'assets-cdn.github.com') {
            return 1;
        }
        
        return 0;
    },
    normalize => sub {
        my $uri = shift;
        ...
        return $uri;
    }
);

$bot->crawl;

DESCRIPTION

WWW::Flatten is a web crawling tool that freezes pages into standalone local copies.

This software is considered to be alpha quality and isn't recommended for regular usage.

ATTRIBUTES

depth

Maximum crawl depth. Defaults to 10.

$bot->depth(10);

filenames

URL-to-filename mapping table. It grows automatically during crawling, but you can predefine some entries beforehand.

$bot->filenames({
    'http://example.com/index.html' => 'index.html',
    'http://example.com/index2.html' => 'index2.html',
});

basedir

A directory path for output files.

$bot->basedir('./out');

is_target

Set the condition that indicates whether a job is a flatten target or not.

$bot->is_target(sub {
    my ($job, $context) = @_;
    ...
    return 1; # or 0
});

normalize

A code reference that performs normalization of URLs. The callback takes a Mojo::URL instance.

$bot->normalize(sub {
    my $url = shift;
    my $modified = ...;
    return $modified;
});
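
As a minimal sketch of such a callback, the following lowercases the host and drops the fragment so that equivalent URLs collapse to one entry. It uses only Mojo::URL accessors (clone, host, fragment); the normalization policy itself and the $normalize variable are assumptions made for illustration, not the module's default behavior.

```perl
use strict;
use warnings;
use Mojo::URL;

# Hypothetical normalization policy: lowercase the host, drop the fragment.
my $normalize = sub {
    my $url = shift->clone;    # work on a copy, leave the caller's object alone
    $url->host(lc $url->host) if defined $url->host;
    $url->fragment(undef);
    return $url;
};

my $url = $normalize->(Mojo::URL->new('http://Example.COM/a/b.html?x=1#top'));
print "$url\n";
```

A callback like this could be passed as normalize => $normalize to the constructor, as in the SYNOPSIS.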

asset_name

A code reference that generates asset names. Defaults to the preset generator asset_number_generator, which generates 6-digit numbers. Another preset, asset_hash_generator, generates 6-character hashes.

$bot->asset_name(WWW::Flatten::asset_hash_generator(6));

max_retry

Maximum number of retry attempts when the server is unresponsive. Defaults to 3.

types

MIME types. Defaults to Mojolicious::Types.

METHODS

asset_number_generator

Returns a closure that generates numeric file names and keeps its own internal storage. See also the asset_name attribute.

$bot->asset_name(WWW::Flatten::asset_number_generator(3));
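
To illustrate what a closure with its own storage means here, the following is a rough stand-in written in plain Perl. It is not the module's implementation: the name number_generator, the starting value, and the zero-padding are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical stand-in, not the module's code: returns a closure that
# hands out zero-padded sequential names of the given width.
sub number_generator {
    my $digits = shift || 6;
    my $count  = 0;    # private state captured by the closure
    return sub {
        return sprintf("%0${digits}d", ++$count);
    };
}

my $gen   = number_generator(3);
my @names = map { $gen->() } 1 .. 3;
print "@names\n";    # prints "001 002 003"
```

Each call to number_generator creates an independent counter, so two generators never share state.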

asset_hash_generator

Returns a closure that generates hash-based file names and keeps its own internal storage. See also the asset_name attribute. The generator automatically avoids name collisions by extending the given length.

If you want the names as short as possible, use the following setting.

$bot->asset_name(WWW::Flatten::asset_hash_generator(1));
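
The collision-avoidance idea can be sketched in plain Perl as follows. This is only an illustration under assumptions: the name hash_generator, the use of MD5, and the choice to key names on a URL string are hypothetical and may differ from what the module actually does.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical stand-in, not the module's code: a name is a prefix of an MD5
# hex digest; when the prefix is already taken, it grows by one character.
sub hash_generator {
    my $min_len = shift || 6;
    my %taken;    # name => key that owns it
    return sub {
        my $key    = shift;
        my $digest = md5_hex($key);
        for my $len ($min_len .. length $digest) {
            my $name = substr($digest, 0, $len);
            if (!exists $taken{$name} || $taken{$name} eq $key) {
                $taken{$name} = $key;
                return $name;
            }
        }
        die "no free name for $key";
    };
}

my $gen = hash_generator(1);
my %seen;
for my $n (1 .. 50) {
    my $name = $gen->("http://example.com/asset$n.png");
    die "collision on $name" if $seen{$name}++;
}
print scalar(keys %seen), " unique names\n";    # prints "50 unique names"
```

Starting from length 1 keeps early names short, and later names only become longer when a shorter prefix is already in use.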

init

Initialize the crawler.

get_href

Generate a new href from the old one.

flatten_html

Replace URLs in a Mojo::DOM instance according to the filenames attribute.

flatten_css

Replace URLs in a CSS string according to the filenames attribute.
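
The kind of rewriting involved can be sketched with a single substitution over url(...) references. The helper name rewrite_css_urls is hypothetical, and the sketch assumes the references in the CSS already match the table keys exactly; it does not resolve relative URLs and is not the module's own code.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper, not the module's code: rewrite url(...) references
# in a CSS string using a URL-to-filename table. Unknown URLs are kept as-is.
sub rewrite_css_urls {
    my ($css, $map) = @_;
    $css =~ s{url\(\s*(['"]?)([^'")]+)\1\s*\)}
             {"url($1" . ($map->{$2} // $2) . "$1)"}ge;
    return $css;
}

my $css = q{body { background: url("http://example.com/bg.png"); } a { background: url(other.png) }};
my $out = rewrite_css_urls($css, { 'http://example.com/bg.png' => '000001.png' });
print "$out\n";
```

The quote style of each reference is preserved, and references without a table entry pass through unchanged.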

save

Save an HTTP response to a file.

AUTHOR

Sugama Keita, <sugama@jamadam.com>

COPYRIGHT AND LICENSE

Copyright (C) jamadam

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.