NAME
Scrappy - All Powerful Web Harvester, Spider, Scraper fully automated
VERSION
version 0.591
SYNOPSIS
Crawl an entire website or more with three lines of code.
#!/usr/bin/perl
use Scrappy qw/:syntax/;
crawl 'http://somewebsite.com', {
'a' => sub { queue shift->href },
'/*' => sub {
# do something
}
};
Spider, Scrape or Harvest data from websites like never before, with ease.
#!/usr/bin/perl
use Scrappy qw/:syntax/;
user_agent random_ua;
get 'http://search.cpan.org/recent';
if (loaded) {
var date => grab '.datecell b';
var modules => grab '#cpansearch li a', { name => 'TEXT', link => '@href' };
}
print $_->{name}, "\n" for list var->{modules};
Trace page fetches during crawling by using the 'Scrappy_Trace' environment variable, e.g. ...
$ENV{Scrappy_Trace} = 1;
crawl 'http://somewebsite.com', {
'a' => sub { queue shift->href },
'/*' => sub {
# do something
}
};
DESCRIPTION
Scrappy is an easy (and hopefully fun) way of scraping, spidering, and/or harvesting information from web pages. Internally Scrappy uses the awesome Web::Scraper and WWW::Mechanize modules, so it inherits their awesomeness. Scrappy is inspired by the fun and easy-to-use Dancer API. Beyond being a pretty API for WWW::Mechanize::Plugin::Web::Scraper, Scrappy also has its own feature set, which makes web scraping easier and more enjoyable.
Scrappy (pronounced Scrap+Pee) == 'Scraper Happy' or 'Happy Scraper'. If you like, you may call it Scrapy (pronounced Scrape+Pee), although Python has a web scraping framework by that name and this module is not a port of it.
METHODS
init
Builds the scraper application instance. This function should be called before issuing any other commands as this function creates the application instance all other functions will use. This function returns the current scraper application instance.
my $scraper = init;
reset
The reset method is an alias to the init method. This function should be called before issuing any other commands as this function creates the application instance all other functions will use. This function returns the current scraper application instance.
self
This method returns the current scraper application instance which can also be found in the package class variable $class_Instance.
init;
get $requested_url;
my $scraper = self;
user_agent
This method gets/sets the user-agent for the current scraper application instance.
init;
user_agent 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8';
var
This method sets a stash (shared) variable or returns a reference to the entire stash object.
var age => 31;
print var->{age};
# 31
my @array = (1..20);
var integers => @array;
var->{foo}->{bar} = 'baz';
# stash variable nesting ** deprecated ** not recommended **
var 'user/profile/name' => 'Mr. Foobar';
print var->{user}->{profile}->{name};
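The nested-key form above can be pictured as splitting the key on slashes and walking down into nested hashes. The following is only a sketch of that idea in plain Perl; stash_set and %stash are hypothetical names, not part of Scrappy, and the module's real implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Scrappy): store $value in the hash
# referenced by $stash under a slash-delimited path, creating the
# intermediate hashes as needed.
sub stash_set {
    my ($stash, $path, $value) = @_;
    my @keys = split m{/}, $path;
    my $last = pop @keys;
    my $node = $stash;
    $node = ($node->{$_} ||= {}) for @keys;   # descend, creating levels
    $node->{$last} = $value;
    return $value;
}

my %stash;
stash_set(\%stash, 'age', 31);
stash_set(\%stash, 'user/profile/name', 'Mr. Foobar');
print $stash{user}{profile}{name}, "\n";
```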
random_ua
This method returns a random user-agent string for use with the user_agent method. The user-agent header in your request is how an inquiring application might determine the browser and environment making the request. The first argument should be the name of the web browser; supported web browsers are any, chrome, ie or explorer, opera, safari, and firefox. Using the keyword `any` will select from any available browser. The second argument, which is optional, should be the name of the desired operating system; supported operating systems are windows, macintosh, and linux.
init;
user_agent random_ua;
# same as random_ua 'any';
For example, for a Linux-specific Google Chrome user-agent use the following...
init;
user_agent random_ua 'chrome', 'linux';
form
The form method is a shortcut to the WWW::Mechanize submit_form method. It takes exactly the same arguments.
init;
get $requested_login_url;
form fields => {
username => 'mrmagoo',
password => 'foobarbaz'
};
# or more specifically
form form_number => 1, fields => {
username => 'mrmagoo',
password => 'foobarbaz'
};
get
The get method is a shortcut to the WWW::Mechanize get method. This method takes a URL or URI and returns an HTTP::Response object.
post
The post method is a shortcut to the WWW::Mechanize post method. This method takes a URL or URI and a hashref of key/value pairs, then returns an HTTP::Response object. Alternatively, the post method can be used traditionally (ugly) and passed additional arguments:
# our pretty way
post $requested_url, {
query => 'some such stuff'
};
# traditionally
post $requested_url,
'Content-Type' => 'multipart/form-data',
'Content' => {
user => $facebook->{user},
profile_id => $prospect->{i},
message => '',
source => '',
src => 'top_bar',
submit => 1,
post_form_id => $post_formid,
fb_dtsg => 'u9MeI',
post_form_id_source => 'AsyncRequest'
};
Note! Our prettier version of the post method uses a content-type of application/x-www-form-urlencoded by default. To use multipart/form-data, please use the traditional style, sorry.
grab
The grab method is a shortcut to the Web::Scraper process method. It takes exactly the same arguments with a little bit of our own added magic: you can grab and return single-selections and even specify the return values. By default the return value of a single-selection is TEXT. Note! Use a hashref mapping to return a list of results; this may change in the future.
init;
get $requested_url;
grab '#profile li a'; # single-selection
grab '#profile li a', '@href'; # specifically returning href attribute
# meaning you can do cool stuff like...
var user_name => grab '#profile li a';
# the traditional use is to provide a selector and mappings/return values e.g.
grab '#profile li a', { name => 'TEXT', link => '@href' };
zoom
The zoom method is almost exactly the same as the Scrappy grab method, except that you specify the data to scrape, whereas the grab method parses the entire page. This is more of a drill-down utility. Note! Use a hashref mapping to return a list of results; this may change in the future.
init;
get $requested_url;
var items => grab '#find ul li', { id => '@id', content => 'HTML' };
foreach my $el (list var->{items}) {
var->{$el->{id}}->{title} = zoom $el->{content}, '.title';
}
# just a silly example but zoom has many very good uses
# it is more of a drill-down utility
loaded
The loaded method is a shortcut to the WWW::Mechanize success method. This method returns true/false based on whether the last request was successful.
init;
get $requested_url;
if (loaded) {
grab ...
}
status
The status method is a shortcut to the WWW::Mechanize status method. This method returns the 3-digit HTTP status code of the response.
init;
get $requested_url;
if (status == 200) {
grab ...
}
reload
The reload method is a shortcut to the WWW::Mechanize reload method. This method acts like the refresh button in a browser: it repeats the current request.
back
The back method is a shortcut to the WWW::Mechanize back method. This method is the equivalent of hitting the "back" button in a browser: it returns the previous page (response). It will not backtrack beyond the first request.
page
The page method is a shortcut to the WWW::Mechanize uri method. This method returns the URI of the current page as a URI object.
response
The response method is a shortcut to the WWW::Mechanize response method. This method returns the HTTP::Response object of the current page.
content_type
The content_type method is a shortcut to the WWW::Mechanize content_type method. This method returns the content_type of the current page.
domain
The domain method is a shortcut to the WWW::Mechanize base method. This method returns the URI host of the current page.
ishtml
The ishtml method is a shortcut to the WWW::Mechanize is_html method. This method returns true/false based on whether our content is HTML, according to the HTTP headers.
title
The title method is a shortcut to the WWW::Mechanize title method. This method returns the content of the title tag if the current page is HTML, otherwise returns undef.
text
The text method is a shortcut to the WWW::Mechanize content method using the format argument and returns a text representation of the last page having all HTML markup stripped.
html
The html method is a shortcut to the WWW::Mechanize content method. This method returns the content of the current page.
data
The data method is a shortcut to the WWW::Mechanize content method. This method returns the content of the current page. Additionally this method when passed data, updates the content of the current page with that data and returns the modified content.
www
The www method is an alias to the self method. This method returns the current scraper application instance.
store
The store method is a shortcut to the WWW::Mechanize save_content method. This method stores the contents of the current page into the specified file. If the content-type does not begin with 'text', the content is saved as binary data.
get $requested_url;
store '/tmp/foo.html';
download
The download method is passed a URI, a Download Directory Path, and optionally a File Path. It then follows the link and stores the response contents in the specified file without leaving the current page. Basically, it downloads the contents of the request (especially when the request pushes a file download). If a File Path is not specified, Scrappy will attempt to name the file automatically, resorting to a random 6-character string only if all else fails.
download $requested_url, '/tmp';
# supply your own file name
download $requested_url, '/tmp', 'somefile.txt';
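The automatic naming described above could work roughly as follows: use the last path segment of the URL when there is one, and fall back to six random characters otherwise. This is a sketch under that assumption. guess_filename is a hypothetical helper, and the rules Scrappy actually applies (for example, reading response headers) are not documented here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Scrappy): derive a file name from a URL,
# falling back to a random 6-character string when the URL has no usable
# last path segment.
sub guess_filename {
    my ($url) = @_;
    (my $path = $url) =~ s{[?#].*\z}{}s;          # drop query string and fragment
    $path =~ s{\A[a-z][a-z0-9+.-]*://[^/]*}{}i;   # drop scheme and host
    my ($name) = $path =~ m{([^/]+)\z};           # last non-empty path segment
    return $name if defined $name;

    my @chars = ('a' .. 'z', 0 .. 9);
    return join '', map { $chars[ rand @chars ] } 1 .. 6;
}

print guess_filename('http://example.com/files/report.pdf?dl=1'), "\n";
print guess_filename('http://example.com/'), "\n";
```

The first call yields the segment taken from the URL; the second has no segment to use, so it yields a random string.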
list
The list method is an aesthetically pleasing method of dereferencing an arrayref. This is useful when iterating over a scraped resultset. This method no longer dies if the argument is not an arrayref and instead returns an empty list.
foreach my $item (list var->{items}) {
...
}
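The behaviour described for list (dereference an arrayref, otherwise return an empty list) can be sketched in plain Perl as below. safe_list is a hypothetical stand-in, not Scrappy's own code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for list: dereference an arrayref, or return an
# empty list for anything that is not an arrayref.
sub safe_list {
    my ($ref) = @_;
    return ref($ref) eq 'ARRAY' ? @{$ref} : ();
}

my @items = safe_list([ { name => 'a' }, { name => 'b' } ]);
my @none  = safe_list(undef);       # no arrayref, so an empty list
my @also  = safe_list('a string');  # same for plain scalars

print scalar(@items), " items\n";
```

Returning an empty list rather than dying means a foreach loop over a failed scrape simply runs zero times.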
fst
The fst (first) method shifts the passed-in arrayref, returning the first element of the array and shortening it by one.
var foo => fst grab '.class', { name => 'TEXT' };
lst
The lst (last) method pops the passed-in arrayref, returning the last element of the array and shortening it by one.
var foo => lst grab '.class', { name => 'TEXT' };
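Because fst and lst are described as shifting and popping, the arrayref passed in is modified. The sketch below illustrates that side effect with plain Perl; first_of and last_of are hypothetical stand-ins for fst and lst.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for fst and lst, built on shift and pop.
sub first_of { my ($ref) = @_; return shift @{$ref} }
sub last_of  { my ($ref) = @_; return pop @{$ref} }

my $rows = [ { name => 'alpha' }, { name => 'beta' }, { name => 'gamma' } ];

my $first = first_of($rows);   # removes and returns the 'alpha' row
my $last  = last_of($rows);    # removes and returns the 'gamma' row

print "first: $first->{name}, last: $last->{name}, left: ", scalar(@{$rows}), "\n";
```

If the original result set is still needed afterwards, pass a copy such as [ @{$rows} ] instead.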
session
The session method provides a means for storing important data across executions. There is one special session variable, `_file`, whose value defines the file where session data will be stored. Please make sure the session file exists and is writable. The session file is stored as YAML. Cookies are automatically stored in and retrieved from your session file.
init;
session _file => '/tmp/foo_session.yml';
session foo => 'bar';
my $var = session->{foo};
# $var == 'bar'
Please make sure to create a valid session file. Use the following as an example, and note that there is a newline on the last line of the file:
# scrappy session file
---
config
The config method is an alias to the Scrappy session method for readability.
cookies
The cookies method is a shortcut to the automatically generated WWW::Mechanize cookie handler. This method returns an HTTP::Cookies object. Setting this as undefined using the _undef keyword will prevent cookies from being stored and subsequently read.
init;
get $requested_url;
my $cookies = cookies;
# prevent cookie storage
init;
cookies _undef;
proxy
The proxy method is a shortcut to the WWW::Mechanize proxy function. This method sets the proxy through which the next request will be tunneled. Setting this as undefined using the _undef keyword will reset the scraper application instance so that all subsequent requests will not use a proxy.
init;
proxy 'http', 'http://proxy.example.com:8000/';
get $requested_url;
init;
proxy 'http', 'ftp', 'http://proxy.example.com:8000/';
get $requested_url;
# best practice
use Try::Tiny;
init;
proxy 'http', 'ftp', 'http://proxy.example.com:8000/';
try {
get $requested_url
};
Note! When using a proxy to perform requests, be aware that if they fail your program will die unless you wrap your code in an eval statement or use a try/catch module. In the example above we use Try::Tiny to trap any errors that might occur when using a proxy.
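For readers who prefer the eval route mentioned in the note, here is a minimal sketch using only core Perl. failing_get is a hypothetical stand-in for a request that dies behind an unreachable proxy; in real code the body of the eval would be the get call.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a request that dies behind a dead proxy.
sub failing_get { die "proxy connection refused\n" }

# The trailing 1 makes eval return true only if nothing died.
my $ok    = eval { failing_get(); 1 };
my $error = $ok ? '' : ($@ || 'unknown error');

print "request failed: $error" unless $ok;
```

Checking the return value of eval, rather than testing $@ alone, avoids false negatives when $@ is cleared before it is inspected.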
pause
The pause method is an adaptation of the WWW::Mechanize::Sleepy module. This method sets breaks between your requests in an attempt to simulate human interaction.
init;
pause 20;
get $request_1;
get $request_2;
get $request_3;
There will be a break between each request made (get, post, request, etc.). You can also specify a range to have the pause method select from at random...
init;
pause 5,20;
get $request_1;
get $request_2;
# reset/turn it off
pause 0;
print "I slept for ", (pause), " seconds";
Note! The download method is exempt from any automatic pausing. To pause after a download, one could...
download $requested_url, '/tmp';
sleep pause();
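The range form of pause can be thought of as choosing a duration at random between the two bounds. The sketch below assumes whole seconds and inclusive bounds; pick_pause is a hypothetical helper, and Scrappy's actual selection logic may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Scrappy): with one argument, return it
# unchanged; with two, pick a whole number of seconds between $min and
# $max inclusive.
sub pick_pause {
    my ($min, $max) = @_;
    return $min unless defined $max;
    return $min + int(rand($max - $min + 1));
}

my $fixed  = pick_pause(20);      # always 20
my $random = pick_pause(5, 20);   # somewhere from 5 to 20

print "fixed: $fixed, random: $random\n";
```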
cursor
The cursor method is used internally by the crawl method to determine what pages in the queue should be fetched next after the completion of the current fetch. This method returns the position of the cursor in the queue.
queue
The queue method is used to add valid URIs to the page fetching queue used by the crawl method internally, or to return the list of added URIs in the order received/input.
queue $new_url;
my @urls = queue;
crawl
The crawl method is designed to automatically and systematically crawl, spider, or fetch webpages and perform actions on selected elements on each page.
crawl $starting_url, {
'a' => sub {
# find all links and add them to the queue to be crawled
queue shift->href;
},
'/*' => sub {
# /* simply matches the root node, same as using 'body' in
# html page context, maybe do something with shift->text or shift->html
},
'img' => sub {
# print all image URLs
print shift->src, "\n"
}
};
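The interplay of crawl, queue, and cursor can be illustrated without any network access. The sketch below walks a hypothetical in-memory "site" (a hash of URL to links) using a queue array and a cursor index. The %seen check that skips already-queued URLs is an assumption made for this example; the source does not say whether Scrappy de-duplicates the queue.

```perl
use strict;
use warnings;

# Hypothetical in-memory "site": each URL maps to the links on that page.
my %site = (
    'http://example.com/'  => [ 'http://example.com/a', 'http://example.com/b' ],
    'http://example.com/a' => [ 'http://example.com/b' ],
    'http://example.com/b' => [ 'http://example.com/' ],
);

my @queue  = ('http://example.com/');        # URLs in the order they were added
my %seen   = ('http://example.com/' => 1);   # assumption: never queue a URL twice
my $cursor = 0;                              # index of the next URL to fetch
my @visited;

while ($cursor < @queue) {
    my $url = $queue[$cursor++];             # "fetch" the page at the cursor
    push @visited, $url;
    for my $link (@{ $site{$url} || [] }) {  # what the 'a' handler would queue
        next if $seen{$link}++;
        push @queue, $link;
    }
}

print "$_\n" for @visited;
```

Keeping a cursor into an append-only queue, rather than shifting URLs off the front, preserves the full list of queued URLs in the order received.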
history
The history method returns a list of visited pages.
get $url_a;
get $url_b;
get $url_c;
print history;
denied
The denied method is a simple shortcut to determine whether the page you requested was loaded or you were redirected elsewhere. This method is very useful on systems that require authentication and redirect if you are not authorized. This function returns a boolean: 1 if the current page doesn't match the requested page.
get $url_to_dashboard;
if (denied) {
# do login, again
}
else {
# resume ...
}
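Conceptually, the check amounts to comparing the URL that was requested with the URL of the page that finally loaded. The sketch below shows that comparison with plain strings; was_denied is a hypothetical helper, the URLs are made up, and Scrappy may normalise or compare URLs differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Scrappy): 1 when the page that loaded is
# not the page that was requested, 0 otherwise.
sub was_denied {
    my ($requested, $current) = @_;
    return $requested eq $current ? 0 : 1;
}

my $redirected = was_denied('http://example.com/dashboard', 'http://example.com/login');
my $loaded     = was_denied('http://example.com/dashboard', 'http://example.com/dashboard');

print "redirected: $redirected, loaded: $loaded\n";
```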
AUTHOR
Al Newkirk <awncorp@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2010 by awncorp.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.