NAME
Scrappy - All Powerful Web Spidering, Scrapering, Crawling Framework
VERSION
version 0.9111110
SYNOPSIS
#!/usr/bin/perl
use Scrappy;
my $scraper = Scrappy->new;
$scraper->crawl('search.cpan.org',
'/recent' => {
'#cpansearch li a' => sub {
print $_[1]->{href}, "\n";
}
}
);
DESCRIPTION
Scrappy is an easy (and hopefully fun) way of scraping, spidering, and/or harvesting information from web pages, web services, and more. Scrappy is a feature rich, flexible, intelligent web automation tool.
Scrappy (pronounced Scrap+Pee) == 'Scraper Happy' or 'Happy Scraper'; If you like you may call it Scrapy (pronounced Scrape+Pee) although Python has a web scraping framework by that name and this module is not a port of that one.
METHODS
crawl
The crawl method is very useful when it is desired to crawl an entire website or at-least partially, it automates the tasks of creating a queue, fetching and parsing html pages, and establishing simple flow-control. See the SYNOPSIS for a simplified example, ... the following is a more complex example.
my $scrappy = Scrappy->new;
$scrappy->crawl('http://search.cpan.org/recent',
'/recent' => {
'#cpansearch li a' => sub {
my ($self, $item) = @_;
# follow all recent modules from search.cpan.org
$self->queue->add($item->{href});
}
},
'/~:author/:name-:version/' => {
'body' => sub {
my ($self, $item, $args) = @_;
my $reviews = $self
->select('.box table tr')->focus(3)->select('td.cell small a')
->data->[0]->{text};
$reviews = $reviews =~ /\d+ Reviews/ ?
$reviews : '0 reviews';
print "found $args->{name} version $args->{version} ".
"[$reviews] by $args->{author}\n";
}
}
);
AUTHOR
Al Newkirk <awncorp@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2010 by awncorp.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.