NAME
Scrappy - Simple Stupid Spider based on Web::Scraper, inspired by Dancer
VERSION
version 0.52
SYNOPSIS
#!/usr/bin/perl
use Scrappy qw/:syntax/;
init;
user_agent random_ua;
get 'http://google.com';
form fields => {
q => "what is perl"
};
var 'results' =>
grab '#search li h3 a', { name => 'TEXT', link => '@href' };
DESCRIPTION
Scrappy is an easy (and hopefully fun) way of scraping, spidering, and/or harvesting information from web pages. Internally Scrappy uses the awesome Web::Scraper and WWW::Mechanize modules, so Scrappy inherits their awesomeness. Scrappy is inspired by the fun and easy-to-use Dancer API. Beyond being a pretty API for WWW::Mechanize::Plugin::Web::Scraper, Scrappy also provides persistent cookie handling, session handling, and more.
Scrappy == 'Scraper Happy' or 'Happy Scraper'. If you like, you may call it Scrapy, although Python has a web scraping framework by that name and we don't plagiarize Python code here.
METHODS
init
Builds the scraper application instance. Call this function before issuing any other commands, because it creates the application instance that all other functions use. It returns the current scraper application instance.
my $scraper = init;
self
This method returns the current scraper application instance, which can also be found in the global class variable $class_Instance.
init;
get $requested_url;
my $scraper = self;
user_agent
This method sets the user-agent for the current scraper application instance.
init;
user_agent 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8';
var
This method sets a stash (shared) variable or returns the entire stash object.
var age => 31;
print var->{age};
# 31
my @array = (1..20);
var integers => @array;
# stash variable nesting
var 'user/profile/name' => 'Mr. Foobar';
print var->{user}->{profile}->{name};
# Mr. Foobar
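The nesting shown above can be pictured as splitting the key on `/` and creating one hash level per segment. The following is only a minimal sketch in plain Perl, not Scrappy's actual implementation; the helper name `stash_set` and the `%stash` hash are hypothetical and exist purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Scrappy): store $value in the hashref
# $stash under a slash-delimited key such as 'user/profile/name'.
sub stash_set {
    my ($stash, $path, $value) = @_;
    my @keys = split m{/}, $path;
    my $last = pop @keys;
    my $node = $stash;

    # Walk down the path, creating one nested hash per segment as needed.
    for my $key (@keys) {
        $node->{$key} ||= {};
        $node = $node->{$key};
    }

    $node->{$last} = $value;
    return $stash;
}

my %stash;
stash_set(\%stash, 'age', 31);
stash_set(\%stash, 'user/profile/name', 'Mr. Foobar');

print $stash{age}, "\n";
print $stash{user}{profile}{name}, "\n";
```

A key without a slash is stored at the top level, so simple and nested variables can share one stash.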
random_ua
This returns a random user-agent string for use with the user_agent method. The user-agent header in your request is how inquiring applications determine your browser and environment. The first argument should be the name of the web browser; supported web browsers are any, chrome, ie or explorer, opera, safari, and firefox. The keyword `any` selects from all available browsers. The second argument, which is optional, should be the name of the desired operating system; supported operating systems are windows, macintosh, and linux.
init;
user_agent random_ua;
# same as random_ua 'any';
For example, for a Linux-specific user-agent, use the following:
init;
user_agent random_ua 'chrome', 'linux';
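To make the browser and operating system arguments concrete, here is a rough sketch of how such a lookup could work. It is not Scrappy's code: the `pick_ua` function, the `%UA` table, and the user-agent strings in it are all made-up placeholders.

```perl
use strict;
use warnings;

# Placeholder table: browser => { os => [ user-agent strings ] }.
# These strings are invented for illustration; they are not Scrappy's data.
my %UA = (
    chrome => {
        linux   => ['ExampleChrome/1.0 (Linux)'],
        windows => ['ExampleChrome/1.0 (Windows)'],
    },
    firefox => {
        linux     => ['ExampleFirefox/1.0 (Linux)'],
        macintosh => ['ExampleFirefox/1.0 (Macintosh)'],
    },
);

# Hypothetical picker: the browser defaults to 'any', the OS is optional.
sub pick_ua {
    my ($browser, $os) = @_;
    $browser = defined $browser ? lc $browser : 'any';

    my @browsers = $browser eq 'any' ? sort keys %UA : ($browser);
    my @pool;
    for my $name (@browsers) {
        my $by_os = $UA{$name} or die "unsupported browser: $name\n";
        my @systems = defined $os ? (lc $os) : sort keys %$by_os;
        push @pool, @{ $by_os->{$_} || [] } for @systems;
    }

    die "no user-agent matches the request\n" unless @pool;
    return $pool[ int(rand(scalar @pool)) ];
}

print pick_ua('chrome', 'linux'), "\n";
print pick_ua(), "\n";
```

Collecting every candidate into one pool before choosing keeps the selection uniform across whichever browsers and systems match.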
form
The form method is a shortcut to the WWW::Mechanize submit_form method. It takes exactly the same arguments.
init;
get $requested_login_url;
form fields => {
username => 'mrmagoo',
password => 'foobarbaz'
};
get
The get method is a shortcut to the WWW::Mechanize get method. This method takes a URL or URI and returns an HTTP::Response object.
post
The post method is a shortcut to the WWW::Mechanize post method. This method takes a URL or URI and a hashref of key/value pairs, then returns an HTTP::Response object. Alternatively, the post method can be used traditionally (ugly) and passed additional arguments:
# our pretty way
post $requested_url, {
query => 'some such stuff'
};
# traditionally
post $requested_url,
'Content-Type' => 'multipart/form-data',
'Content' => {
user => $facebook->{user},
profile_id => $prospect->{i},
message => '',
source => '',
src => 'top_bar',
submit => 1,
post_form_id => $post_formid,
fb_dtsg => 'u9MeI',
post_form_id_source => 'AsyncRequest'
};
Note! Our prettier version of the post method uses a content-type of application/x-www-form-urlencoded by default. To use multipart/form-data, please use the traditional style, sorry.
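The difference between the two encodings can be inspected without sending anything, using HTTP::Request::Common from libwww-perl (the distribution WWW::Mechanize is built on). This is only a sketch of the two content types, not a description of how Scrappy builds its requests; the URL and field values are placeholders.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Placeholder URL; the requests are only constructed, never sent.
my $url = 'http://example.com/submit';

# A hashref of fields is encoded as application/x-www-form-urlencoded.
my $simple = POST $url, { query => 'some such stuff' };

# An explicit Content_Type of 'form-data' produces multipart/form-data.
my $multipart = POST $url,
    Content_Type => 'form-data',
    Content      => [ user => 'mrmagoo', submit => 1 ];

# In scalar context content_type returns the media type without parameters.
print scalar($simple->content_type), "\n";
print scalar($multipart->content_type), "\n";
```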
grab
The grab method is a shortcut to the Web::Scraper process method. It takes exactly the same arguments, with a little bit of our own added magic.
init;
get $requested_url;
grab '#profile li a';
# meaning you can do cool stuff like...
var user_name => grab '#profile li a';
# the traditional use is to provide a selector and mappings ..., e.g.
grab '#profile li', { name => 'TEXT', link => '@href' };
loaded
The loaded method is a shortcut to the WWW::Mechanize success method. This method returns true or false depending on whether the last request was successful.
init;
get $requested_url;
if (loaded) {
grab ...
}
status
The status method is a shortcut to the WWW::Mechanize status method. This method returns the 3-digit HTTP status code of the response.
init;
get $requested_url;
if (status == 200) {
grab ...
}
reload
The reload method is a shortcut to the WWW::Mechanize reload method. This method acts like the reload button in a browser: it repeats the current request.
back
The back method is a shortcut to the WWW::Mechanize back method. This method is the equivalent of hitting the "back" button in a browser. It returns the previous response (page) and will not backtrack beyond the first request.
page
The page method is a shortcut to the WWW::Mechanize uri method. This method returns the URI of the current page.
response
The response method is a shortcut to the WWW::Mechanize response method. This method returns the HTTP::Response object of the current page.
content_type
The content_type method is a shortcut to the WWW::Mechanize content_type method. This method returns the content type of the current page.
domain
The domain method is a shortcut to the WWW::Mechanize base method. This method returns the base URI of the current page.
ishtml
The ishtml method is a shortcut to the WWW::Mechanize is_html method. This method returns true or false depending on whether the content is HTML, according to the HTTP headers.
title
The title method is a shortcut to the WWW::Mechanize title method. This method returns the content of the title tag if the current page is HTML; otherwise it returns undef.
text
The text method is a shortcut to the WWW::Mechanize content method called with the format argument set to text. It returns a text representation of the last page with all HTML markup stripped.
html
The html method is a shortcut to the WWW::Mechanize content method. This method returns the content of the current page.
data
The data method is a shortcut to the WWW::Mechanize content method. This method returns the content of the current page. Additionally, when passed a single argument, this method updates the content of the current page with that data and returns the modified content.
www
The www method is an alias to the self method. This method returns the current scraper application instance.
store
The store method is a shortcut to the WWW::Mechanize save_content method. This method dumps the contents of the current page into the specified file. If the content-type does not begin with 'text', the content is saved as binary data. If the store method is passed a URI and a file path, it follows the link, stores the contents in the file, and returns to the previous page.
download
The download method is an alias to the store method.
list
The list method is an aesthetically pleasing way of dereferencing an arrayref. This method dies if the argument is not an arrayref.
foreach my $item (list var->{items}) {
...
}
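The behaviour described for list, dereference an arrayref and die on anything else, can be sketched in a few lines of plain Perl. The function name `list_of` is a hypothetical stand-in so the sketch runs without Scrappy; it is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for list: return the elements of an arrayref,
# and die when handed anything that is not an array reference.
sub list_of {
    my ($ref) = @_;
    die "list expects an array reference\n" unless ref($ref) eq 'ARRAY';
    return @$ref;
}

my $items = [ 'alpha', 'beta', 'gamma' ];

foreach my $item (list_of($items)) {
    print "$item\n";
}

# A non-arrayref argument is rejected instead of silently yielding nothing.
my $ok = eval { list_of('not an arrayref'); 1 };
print "rejected: $@" unless $ok;
```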
fst
The fst (first) method shifts the passed-in arrayref, returning the first element of the array and shortening it by one.
var foo => fst grab '.class', { name => 'TEXT' };
lst
The lst (last) method pops the passed-in arrayref, returning the last element of the array and shortening it by one.
var foo => lst grab '.class', { name => 'TEXT' };
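Because fst and lst are described in terms of shift and pop, it is worth remembering that both builtins modify the array they operate on. The plain-Perl sketch below shows that effect on an arrayref; the `$rows` data is invented for illustration and no Scrappy functions are called.

```perl
use strict;
use warnings;

# Sample data standing in for a list of scraped results.
my $rows = [ 'first', 'middle', 'last' ];

# shift removes and returns the first element of the referenced array.
my $head = shift @$rows;

# pop removes and returns the last element of the referenced array.
my $tail = pop @$rows;

print "head: $head\n";
print "tail: $tail\n";
print "remaining: ", scalar(@$rows), "\n";
```

If the full list is needed afterwards, take a copy of the arrayref's contents before shifting or popping.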
AUTHOR
Al Newkirk <awncorp@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2010 by awncorp.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.