WWW-Crawler-Mojo

WWW::Crawler::Mojo is a web crawling framework written in Perl on top of the Mojolicious toolkit, allowing you to write your own crawler rapidly.

This software is considered to be of alpha quality and isn't recommended for regular use.

Features

Requirements

Synopsis

use WWW::Crawler::Mojo;

my $bot = WWW::Crawler::Mojo->new;

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;

    my $cb = sub {
        my ($bot, $enqueue, $job, $context) = @_;
        $enqueue->() if (...); # enqueue this job
    };

    $scrape->($cb) if (...); # collect URLs from this document
});

$bot->enqueue('http://example.com/');
$bot->crawl;

Installation

$ cpanm WWW::Crawler::Mojo

Documentation

Examples

Restricting scraping URLs by status code.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    return unless ($res->code == 200);
    $scrape->();
});

Restricting scraping URLs by host.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    return unless $job->url->host eq 'example.com';
    $scrape->();
});

Restricting following URLs by depth.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    
    $scrape->(sub {
        my ($bot, $enqueue, $job, $context) = @_;
        return unless ($job->depth < 5);
        $enqueue->();
    });
});

Restricting following URLs by host.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    
    $scrape->(sub {
        my ($bot, $enqueue, $job, $context) = @_;
        $enqueue->() if $job->url->host eq 'example.com';
    });
});

Restricting following URLs by referrer's host.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    
    $scrape->(sub {
        my ($bot, $enqueue, $job, $context) = @_;
        $enqueue->() if $job->referrer->url->host eq 'example.com';
    });
});

Excluding following URLs by path.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    
    $scrape->(sub {
        my ($bot, $enqueue, $job, $context) = @_;
        $enqueue->() unless ($job->url->path =~ qr{^/foo/});
    });
});

Crawling only preset URLs.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    
    $scrape->(sub {});
});

$bot->enqueue(
    'http://example.com/1',
    'http://example.com/3',
    'http://example.com/5',
);

$bot->crawl;

Speeding up.

$bot->max_conn(5);
$bot->max_conn_per_host(5);
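Putting the pieces together, here is a minimal sketch of a crawler using these connection limits; the settings are the two attributes shown above, the rest mirrors the Synopsis, and example.com is a placeholder host:

```perl
use WWW::Crawler::Mojo;

my $bot = WWW::Crawler::Mojo->new;

# Allow up to 5 concurrent connections overall, and up to 5 per host.
$bot->max_conn(5);
$bot->max_conn_per_host(5);

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    $scrape->() if $res->code == 200; # only scrape successful responses
});

$bot->enqueue('http://example.com/');
$bot->crawl;
```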

Authentication. The user agent automatically reuses the credentials for the host.

$bot->enqueue('http://jamadam:password@example.com');

You can fulfill any prerequisites, such as a login form submission, so that a login session is established via cookies or the like.

my $bot = WWW::Crawler::Mojo->new;
$bot->ua->post('http://example.com/admin/login', form => {
    username => 'jamadam',
    password => 'password',
});
$bot->enqueue('http://example.com/admin/');
$bot->crawl;

Other examples

Copyright (C) jamadam

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.