NAME
WWW::Crawler::Mojo - A web crawling framework for Perl
SYNOPSIS
use strict;
use warnings;
use WWW::Crawler::Mojo;
my $bot = WWW::Crawler::Mojo->new;
$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    $scrape->();
});
$bot->on(refer => sub {
    my ($bot, $enqueue, $job, $context) = @_;
    $enqueue->();
});
$bot->enqueue('http://example.com/');
$bot->crawl;
DESCRIPTION
WWW::Crawler::Mojo is a web crawling framework for those who are familiar with the Mojo::* APIs.
Note that the module is aimed at simple use cases that crawl a moderate number of web pages, so DO NOT use it for persistent crawler jobs.
ATTRIBUTES
WWW::Crawler::Mojo inherits all attributes from Mojo::EventEmitter and implements the following new ones.
ua
A Mojo::UserAgent instance.
my $ua = $bot->ua;
$bot->ua(Mojo::UserAgent->new);
ua_name
Name of crawler for User-Agent header.
$bot->ua_name('my-bot/0.01 (+https://example.com/)');
say $bot->ua_name; # 'my-bot/0.01 (+https://example.com/)'
active_conn
The number of currently active connections.
$bot->active_conn($bot->active_conn + 1);
say $bot->active_conn;
active_conns_per_host
The number of currently active connections per host.
$bot->active_conns_per_host($bot->active_conns_per_host + 1);
say $bot->active_conns_per_host;
depth
The maximum depth to crawl. Note that depth is the number of HTTP requests needed to reach a URI, starting from the first job. It does not mean the depth of the URI path as counted by slashes.
$bot->depth(5);
say $bot->depth; # 5
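To illustrate the difference between request depth and path depth, here is a minimal sketch that walks a hypothetical in-memory link graph (%links and reachable_within are made up for this example and are not part of the module). It assumes the first job counts as depth 0; the module's own counting convention is not stated here.

```perl
use strict;
use warnings;

# Hypothetical link graph standing in for real pages: URL => links found on it.
my %links = (
    '/'              => ['/a', '/b'],
    '/a'             => ['/a/deep/path/c'],
    '/a/deep/path/c' => ['/d'],
    '/b'             => [],
    '/d'             => [],
);

# Breadth-first walk that records how many hops each URL is from the start
# and stops following links once $max_depth hops have been used.
sub reachable_within {
    my ($start, $max_depth) = @_;
    my %seen  = ($start => 0);
    my @queue = ($start);
    while (defined(my $url = shift @queue)) {
        next if $seen{$url} >= $max_depth;
        for my $next (@{ $links{$url} || [] }) {
            next if exists $seen{$next};
            $seen{$next} = $seen{$url} + 1;
            push @queue, $next;
        }
    }
    return \%seen;
}

my $depths = reachable_within('/', 2);
print "$_ => $depths->{$_}\n" for sort keys %$depths;
```

In this sketch '/a/deep/path/c' is at depth 2 even though its path contains several slashes, and '/d' is never reached because it would need a third hop.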
fix
A hash whose keys are MD5 hashes of enqueued URLs.
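A hash keyed by URL digests is a common way to skip URLs that were already enqueued. The following is a minimal sketch of that idea using the core Digest::MD5 module; %fix here is a local stand-in and first_seen is a hypothetical helper, so this is not necessarily how the module maintains the attribute internally.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Local stand-in for the "fix" hash: MD5 hex digest => 1.
my %fix;

# Returns 1 the first time a URL is seen and 0 on every later call.
sub first_seen {
    my ($url) = @_;
    my $key = md5_hex($url);
    return 0 if exists $fix{$key};
    $fix{$key} = 1;
    return 1;
}

my @accepted = grep { first_seen($_) } (
    'http://example.com/',
    'http://example.com/a.html',
    'http://example.com/',          # duplicate, filtered out
);
print scalar(@accepted), " unique URLs\n";
```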
max_conn
The maximum number of connections.
$bot->max_conn(5);
say $bot->max_conn; # 5
max_conn_per_host
The maximum number of connections per host.
$bot->max_conn_per_host(5);
say $bot->max_conn_per_host; # 5
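The two limits can be thought of as a pair of checks made before a new request is started. The sketch below shows one way such a check could work with plain counters; the limits, %active, and the helper names are assumptions for illustration only, not the module's internals.

```perl
use strict;
use warnings;

my $max_conn          = 3;   # assumed global limit
my $max_conn_per_host = 2;   # assumed per-host limit
my %active;                  # hypothetical: host => open connection count

# True when neither the global nor the per-host limit has been reached.
sub can_connect {
    my ($host) = @_;
    my $total = 0;
    $total += $_ for values %active;
    return 0 if $total >= $max_conn;
    return 0 if ($active{$host} || 0) >= $max_conn_per_host;
    return 1;
}

sub open_conn  { my ($host) = @_; $active{$host}++; return }
sub close_conn { my ($host) = @_; $active{$host}-- if $active{$host}; return }

open_conn('a.example') for 1 .. 2;
print "a.example allowed: ", can_connect('a.example'), "\n";
print "b.example allowed: ", can_connect('b.example'), "\n";
```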
peeping_port
A port number for the peeping monitor. It is also evaluated as a boolean to enable or disable the feature. Defaults to undef, meaning disabled.
$bot->peeping_port(3001);
say $bot->peeping_port; # 3001
peeping_max_length
The maximum length of the peeping monitor content.
$bot->peeping_max_length(100000);
say $bot->peeping_max_length; # 100000
queue
A FIFO array containing WWW::Crawler::Mojo::Job objects.
push(@{$bot->queue}, WWW::Crawler::Mojo::Job->new(...));
my $job = shift @{$bot->queue};
shuffle
An interval in seconds at which to shuffle the job queue. It is also evaluated as a boolean to enable or disable the feature. Defaults to undef, meaning disabled.
$bot->shuffle(5);
say $bot->shuffle; # 5
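Shuffling reorders the pending jobs without adding or dropping any, which can spread requests across hosts when many URLs from one site were queued together. A minimal sketch of the reordering step alone, using the core List::Util module on a plain array of hypothetical URLs (the timer that would trigger it every few seconds is left out):

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Plain strings stand in for job objects in this sketch.
my @queue  = map { "http://example.com/page$_.html" } 1 .. 5;
my @before = @queue;

# Reorder in place; the set of queued items stays the same.
@queue = shuffle @queue;

print scalar(@queue), " jobs still queued\n";
```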
EVENTS
WWW::Crawler::Mojo inherits all events from Mojo::EventEmitter and implements the following new ones.
res
Emitted when the crawler gets a response from the server.
$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    if (...) {
        $scrape->();
    } else {
        # DO NOTHING
    }
});
refer
Emitted when a new URI is found. You can enqueue the URI conditionally with the callback.
$bot->on(refer => sub {
    my ($bot, $enqueue, $job, $context) = @_;
    if (...) {
        $enqueue->();
    } elsif (...) {
        $enqueue->(...); # maybe different url
    } else {
        # DO NOTHING
    }
});
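A typical condition for the refer handler is "only follow links that stay on the same host". The sketch below shows such a predicate in core Perl; host_of and same_host are hypothetical helpers, and the regex is a deliberate simplification (http/https only, no userinfo handling). How the two URLs are obtained from $job and $context is not covered here.

```perl
use strict;
use warnings;

# Extracts the lowercased host from an http(s) URL, or undef if there is none.
sub host_of {
    my ($url) = @_;
    return undef unless defined $url;
    return $url =~ m{\Ahttps?://([^/:?#]+)}i ? lc $1 : undef;
}

# True when both URLs are http(s) URLs on the same host.
sub same_host {
    my ($from, $to) = @_;
    my $h1 = host_of($from);
    my $h2 = host_of($to);
    return (defined $h1 && defined $h2 && $h1 eq $h2) ? 1 : 0;
}

print same_host('http://example.com/', 'http://example.com/a.html'), "\n";
print same_host('http://example.com/', 'http://other.example/'), "\n";
```

Inside a handler, the result of such a predicate would decide whether $enqueue->() is called at all.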
empty
Emitted when the queue length reaches zero. The length is checked every 5 seconds.
$bot->on(empty => sub {
    my ($bot) = @_;
    say "Queue is drained out.";
});
error
Emitted when the user agent returns no status code for a request. Possibly caused by network errors or unresponsive servers.
$bot->on(error => sub {
    my ($bot, $error, $job) = @_;
    say "error: $error";
    if (...) { # until failure occurs 3 times
        $bot->requeue($job);
    }
});
Note that HTTP error responses such as 404 or 500 cannot be caught with this event. Use the res event for that case instead.
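The "until failure occurs 3 times" condition in the handler above needs some bookkeeping. One possible approach is a per-URL failure counter, sketched here in core Perl; %failures, $max_failures, and should_retry are hypothetical names, and the key can be any string that identifies the job.

```perl
use strict;
use warnings;

my %failures;            # hypothetical: identifying string => failure count
my $max_failures = 3;    # assumed limit, matching the comment in the example

# Records one more failure for $key and returns 1 while a retry is still
# allowed, or 0 once the failure count has reached the limit.
sub should_retry {
    my ($key) = @_;
    return ++$failures{$key} < $max_failures ? 1 : 0;
}

my $url = 'http://example.com/flaky';
print "attempt $_: retry=", should_retry($url), "\n" for 1 .. 3;
```

With this helper, the first and second failures lead to a requeue and the third one gives up.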
start
Emitted right before the crawl starts.
$bot->on(start => sub {
    my $self = shift;
    ...
});
METHODS
WWW::Crawler::Mojo inherits all methods from Mojo::EventEmitter and implements the following new ones.
crawl
Starts the crawling loop.
$bot->crawl;
init
Initializes crawler settings.
$bot->init;
process_job
Processes a job.
$bot->process_job;
say_start
Displays starting messages to STDOUT.
$bot->say_start;
peeping_handler
Peeping API dispatcher.
$bot->peeping_handler($loop, $stream);
scrape
Parses a web page and discovers the links in it. Each link is appended to the FIFO array.
$bot->scrape($res, $job);
enqueue
Appends one or more URIs or WWW::Crawler::Mojo::Job objects.
$bot->enqueue('http://example.com/index1.html');
OR
$bot->enqueue($job1, $job2);
OR
$bot->enqueue(
    'http://example.com/index1.html',
    'http://example.com/index2.html',
    'http://example.com/index3.html',
);
requeue
Appends one or more URLs or jobs for retry. This accepts the same arguments as the enqueue method.
$bot->on(error => sub {
    my ($bot, $msg, $job) = @_;
    if (...) { # until failure occurs 3 times
        $bot->requeue($job);
    }
});
collect_urls_html
Collects URLs out of HTML.
WWW::Crawler::Mojo::collect_urls_html($dom, sub {
    my ($uri, $dom) = @_;
});
collect_urls_css
Collects URLs out of CSS.
WWW::Crawler::Mojo::collect_urls_css($dom, sub {
    my $uri = shift;
});
guess_encoding
Guesses the encoding of HTML or CSS from a given Mojo::Message::Response instance.
my $encode = WWW::Crawler::Mojo::guess_encoding($res) || 'utf-8';
resolve_href
Resolves URLs with a base URL.
WWW::Crawler::Mojo::resolve_href($base, $uri);
CONSTANTS
%tag_attributes
A catalog of HTML attribute names that may contain URLs, keyed by tag name.
script => ['src'],
link => ['href'],
a => ['href'],
img => ['src'],
area => ['href', 'ping'],
embed => ['src'],
frame => ['src'],
iframe => ['src'],
input => ['src'],
object => ['data'],
form => ['action'],
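A catalog like this can drive link extraction: for each element, look up its tag name and read only the listed attributes. The sketch below illustrates the lookup with an excerpt of the catalog and a hypothetical list of already-parsed elements (@elements and its hash layout are made up for this example; a real crawler would get elements from a DOM parser).

```perl
use strict;
use warnings;

# Excerpt of the catalog above.
my %tag_attributes = (
    a    => ['href'],
    img  => ['src'],
    area => ['href', 'ping'],
);

# Hypothetical pre-parsed elements: tag name plus attribute hash.
my @elements = (
    { tag => 'a',    attrs => { href  => '/index.html', title => 'Home' } },
    { tag => 'img',  attrs => { src   => '/logo.png',   alt   => 'Logo' } },
    { tag => 'area', attrs => { href  => '/map.html',   ping  => '/track' } },
    { tag => 'p',    attrs => { class => 'note' } },
);

# Collect the values of URL-bearing attributes only.
my @urls;
for my $el (@elements) {
    my $names = $tag_attributes{ $el->{tag} } or next;   # tag not in catalog
    for my $name (@$names) {
        push @urls, $el->{attrs}{$name} if defined $el->{attrs}{$name};
    }
}
print "$_\n" for @urls;
```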
EXAMPLE
https://github.com/jamadam/WWW-Flatten
AUTHOR
Sugama Keita, <sugama@jamadam.com>
COPYRIGHT AND LICENSE
Copyright (C) jamadam
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.