NAME

Scrappy - All Powerful Web Spidering, Scraping, Crawling Framework

VERSION

version 0.9111110

SYNOPSIS

#!/usr/bin/perl
use Scrappy;

my $scraper = Scrappy->new;
$scraper->crawl('search.cpan.org',
    '/recent' => {
        '#cpansearch li a' => sub {
            print $_[1]->{href}, "\n";
        }
    }
);

DESCRIPTION

Scrappy is an easy (and hopefully fun) way of scraping, spidering, and/or harvesting information from web pages, web services, and more. It is a feature-rich, flexible, intelligent web automation tool.

Scrappy (pronounced Scrap+Pee) == 'Scraper Happy' or 'Happy Scraper'. If you like, you may call it Scrapy (pronounced Scrape+Pee), although Python has a web scraping framework by that name, and this module is not a port of it.

METHODS

crawl

The crawl method is useful when you want to crawl all or part of a website. It automates creating a queue, fetching and parsing HTML pages, and establishing simple flow control. See the SYNOPSIS for a simplified example; the following is a more complex example.

my $scrappy = Scrappy->new;

$scrappy->crawl('http://search.cpan.org/recent',
    '/recent' => {
        
        '#cpansearch li a' => sub {
            my ($self, $item) = @_;
            # follow all recent modules from search.cpan.org
            $self->queue->add($item->{href});
        }
        
    },
    '/~:author/:name-:version/' => {
        
        'body' => sub {
            my ($self, $item, $args) = @_;
            
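            # extract the review-count link text from the info table
            my $reviews = $self
            ->select('.box table tr')->focus(3)->select('td.cell small a')
            ->data->[0]->{text};
            
            # fall back to '0 reviews' when no review count was found
            $reviews = defined $reviews && $reviews =~ /\d+ Reviews/ ?
                $reviews : '0 reviews';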
            
            print "found $args->{name} version $args->{version} ".
                "[$reviews] by $args->{author}\n";
            
        }
        
    }
);
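
The second route in the example above, '/~:author/:name-:version/', uses ':name' placeholders whose values reach the callback through $args. The following is a minimal sketch of how such a pattern could be matched against a URL path using named regex captures. It is not Scrappy's actual implementation: compile_route and match_route are hypothetical helper names, and the rule that a placeholder never matches a '/' is an assumption made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Scrappy): turn a route pattern such as
# '/~:author/:name-:version/' into a regex with one named capture per
# ':placeholder'.
sub compile_route {
    my ($pattern) = @_;
    my $regex = '';

    # The capturing group in split keeps the placeholders in the list,
    # so literal text and placeholders alternate.
    for my $part (split /(:[A-Za-z_]\w*)/, $pattern) {
        next if $part eq '';
        if ($part =~ /^:(\w+)$/) {
            # assumption: a placeholder value never contains a '/'
            $regex .= "(?<$1>[^/]+)";
        }
        else {
            # literal text is escaped so it matches itself
            $regex .= quotemeta $part;
        }
    }
    return qr/^$regex$/;
}

# Hypothetical helper: returns a hash reference of placeholder values,
# or nothing when the path does not match the pattern.
sub match_route {
    my ($pattern, $path) = @_;
    my $re = compile_route($pattern);
    return unless $path =~ $re;
    my %args = %+;    # copy the named captures before they go out of scope
    return \%args;
}

my $args = match_route('/~:author/:name-:version/',
    '/~awncorp/Scrappy-0.9111110/');

print "found $args->{name} version $args->{version} by $args->{author}\n"
    if $args;
```

Because each capture is greedy, a distribution name that itself contains hyphens (for example 'Foo-Bar-1.02') is split at the last hyphen, giving the name 'Foo-Bar' and the version '1.02'.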

AUTHOR

Al Newkirk <awncorp@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2010 by awncorp.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.