WWW-Crawler-Mojo
WWW::Crawler::Mojo is a web crawling framework written in Perl on top of the Mojolicious toolkit, allowing you to write your own crawler rapidly.
This software is considered to be alpha quality and isn't recommended for regular usage.
Features
- Easy to define rules for your crawler.
- Lets you use Mojo::URL for URL manipulation, Mojo::Message::Response for response handling, and Mojo::DOM for DOM inspection.
- Internally uses Mojo::UserAgent, which supports non-blocking I/O for HTTP and WebSocket with IPv6, TLS, SNI, IDNA, HTTP/SOCKS5 proxies, Comet (long polling), keep-alive, connection pooling, timeouts, cookies, multipart requests, and gzip compression.
- Throttles connections with max-connection and max-connection-per-host options.
- Depth detection.
- Tracks 301 HTTP redirects.
- Network error detection.
- Retries with your own rules.
- Shuffles the queue periodically.
- Peeping server for crawler development.
- Crawls beyond basic authentication.
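The throttling and shuffling features above are configured on the crawler object. A minimal sketch of such tuning; the attribute names `max_conn`, `max_conn_per_host`, and `shuffle` are assumptions inferred from the feature list, so check the module's POD before relying on them:

```perl
use WWW::Crawler::Mojo;

# Attribute names below are assumptions, not a verified API.
my $bot = WWW::Crawler::Mojo->new(
    max_conn          => 5,  # total concurrent connections
    max_conn_per_host => 1,  # limit per host, to stay polite
    shuffle           => 5,  # assumed: shuffle interval for the queue
);
```

In practice, a low per-host limit is the more important knob, since it keeps the crawler from hammering a single server even when the total connection budget is large.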
Requirements
- Perl 5.14 or higher
- Mojolicious 5.75 or higher
Usage
use WWW::Crawler::Mojo;

my $bot = WWW::Crawler::Mojo->new;

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    $scrape->() if (...); # collect URLs from this document
});

$bot->on(refer => sub {
    my ($bot, $enqueue, $job, $context) = @_;
    $enqueue->() if (...); # enqueue this job
});

$bot->enqueue('http://example.com/');
$bot->crawl;
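A common way to fill in the `(...)` conditions above is to restrict the crawl to the starting host. A hedged sketch following the synopsis; the `$res->code` and `$job->url` accessors are assumptions based on Mojo::Message::Response and the job object described above:

```perl
use WWW::Crawler::Mojo;
use Mojo::URL;

my $start = Mojo::URL->new('http://example.com/');
my $bot   = WWW::Crawler::Mojo->new;

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    # Only scrape URLs out of successful responses.
    $scrape->() if $res->code == 200;
});

$bot->on(refer => sub {
    my ($bot, $enqueue, $job, $context) = @_;
    # Only enqueue jobs that stay on the starting host.
    $enqueue->() if $job->url->host && $job->url->host eq $start->host;
});

$bot->enqueue($start->to_string);
$bot->crawl;
```

The `refer` handler is the natural place for this kind of scoping, since it runs before a URL enters the queue at all.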
Installation
$ cpanm WWW::Crawler::Mojo
Documentation
Other examples
- WWW-Flatten
- See the scripts under the example directory.
Copyright
Copyright (C) jamadam
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.