NAME

Scrappy - All Powerful Web Harvester, Spider, Scraper at your service

VERSION

version 0.592

SYNOPSIS

Crawl an entire website (or more) with just a few lines of code.

#!/usr/bin/perl
use Scrappy qw/:syntax/;
    
crawl 'http://somewebsite.com', {
    'a' => sub { queue shift->href },
    '/*' => sub {
        # do something
    }
};

Spider, Scrape or Harvest data from websites like never before, with ease.

#!/usr/bin/perl
use Scrappy qw/:syntax/;
    
user_agent random_ua;

get 'http://search.cpan.org/recent';

if (loaded) {
    var date    => grab '.datecell b';
    var modules => grab '#cpansearch li a', { name => 'TEXT', link => '@href' };
}

print $_->{name}, "\n" for list var->{modules};

Trace page fetches during crawling by setting the 'Scrappy_Trace' environment variable, e.g. ...

$ENV{Scrappy_Trace} = 1;
crawl 'http://somewebsite.com', {
    'a' => sub { queue shift->href },
    '/*' => sub {
        # do something
    }
};

DESCRIPTION

Scrappy is an easy (and hopefully fun) way of scraping, spidering, and/or harvesting information from web pages. Internally, Scrappy uses the awesome Web::Scraper and WWW::Mechanize modules, and as such imports their awesomeness. Scrappy is inspired by the fun and easy-to-use Dancer API. Beyond being a pretty API for WWW::Mechanize::Plugin::Web::Scraper, Scrappy also has its own feature set which makes web scraping easier and more enjoyable.

Scrappy (pronounced Scrap+Pee) == 'Scraper Happy' or 'Happy Scraper'. If you like, you may call it Scrapy (pronounced Scrape+Pee), although Python has a web scraping framework by that name and this module is not a port of it.

METHODS

init

Builds the scraper application instance. Call this function before issuing any other commands, as it creates the application instance all other functions will use. It returns the current scraper application instance.

my $scraper = init;

reset

The reset method is an alias to the init method. Like init, it should be called before issuing any other commands; it creates and returns the scraper application instance all other functions will use.
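
Usage mirrors init, e.g. ...

my $scraper = reset;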

self

This method returns the current scraper application instance which can also be found in the package class variable $class_Instance.

init;
get $requested_url;
my $scraper = self;

user_agent

This method gets/sets the user-agent for the current scraper application instance.

init;
user_agent 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8';

var

This method sets a stash (shared) variable or returns a reference to the entire stash object.

var age => 31;
print var->{age};
# 31

my @array = (1..20);
var integers => @array;

var->{foo}->{bar} = 'baz';

# stash variable nesting ** deprecated ** not recommended **
var 'user/profile/name' => 'Mr. Foobar';
print var->{user}->{profile}->{name};

random_ua

This returns a random user-agent string for use with the user_agent method. The user-agent header in your request is how an inquiring application might determine the browser and environment making the request. The first argument should be the name of the web browser; supported web browsers are any, chrome, ie or explorer, opera, safari, and firefox. Obviously, using the keyword `any` will select from any available browser. The second argument, which is optional, should be the name of the desired operating system; supported operating systems are windows, macintosh, and linux.

init;
user_agent random_ua;
# same as random_ua 'any';

e.g. for a Linux-specific Google Chrome user-agent use the following...

init;
user_agent random_ua 'chrome', 'linux';

form

The form method is a shortcut to the WWW::Mechanize submit_form method. It takes exactly the same arguments, yada, yada.

init;
get $requested_login_url;
form fields => {
    username => 'mrmagoo',
    password => 'foobarbaz'
};

# or more specifically

form form_number => 1, fields => {
    username => 'mrmagoo',
    password => 'foobarbaz'
};

get

The get method is a shortcut to the WWW::Mechanize get method. This method takes a URL or URI and returns an HTTP::Response object.
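
For example (a minimal sketch; $requested_url is whatever URL you wish to fetch) ...

init;
my $res = get $requested_url;
print $res->status_line; # $res is an HTTP::Response object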

post

The post method is a shortcut to the WWW::Mechanize post method. This method takes a URL or URI and a hashref of key/value pairs, then returns an HTTP::Response object. Alternatively, the post method can be used traditionally (ugly) and passed additional arguments:

# our pretty way
post $requested_url, {
    query => 'some such stuff'
};

# traditionally
post $requested_url,
    'Content-Type' => 'multipart/form-data',
    'Content'      => {
        user                => $facebook->{user},
        profile_id          => $prospect->{i},
        message             => '',
        source              => '',
        src                 => 'top_bar',
        submit              => 1,
        post_form_id        => $post_formid,
        fb_dtsg             => 'u9MeI',
        post_form_id_source => 'AsyncRequest'
    };

Note! Our prettier version of the post method uses a content-type of application/x-www-form-urlencoded by default; to use multipart/form-data, please use the traditional style, sorry.

grab

The grab method is a shortcut to the Web::Scraper process method. It takes exactly the same arguments, with a little added magic of our own: you can grab single selections and even specify their return values; by default, the return value of a single selection is TEXT. Note! Use a hashref mapping to return a list of results; this may change in the future.

init;
get $requested_url;
grab '#profile li a'; # single-selection
grab '#profile li a', '@href'; # specifically returning href attribute

# meaning you can do cool stuff like...
var user_name => grab '#profile li a';

# the traditional use is to provide a selector and mappings/return values e.g.
grab '#profile li a', { name => 'TEXT', link => '@href' };

zoom

The zoom method is almost exactly the same as the Scrappy grab method, except that you pass it the markup to scrape rather than having it parse the entire page. This is more of a drill-down utility. Note! Use a hashref mapping to return a list of results; this may change in the future.

init;
get $requested_url;

var items => grab '#find ul li', { id => '@id', content => 'HTML' };

foreach my $el (list var->{items}) {
    var->{$el->{id}}->{title} = zoom $el->{content}, '.title';
}

# just a silly example but zoom has many very good uses
# it is more of a drill-down utility

loaded

The loaded method is a shortcut to the WWW::Mechanize success method. This method returns true/false based on whether the last request was successful.

init;
get $requested_url;
if (loaded) {
    grab ...
}

status

The status method is a shortcut to the WWW::Mechanize status method. This method returns the 3-digit HTTP status code of the response.

init;
get $requested_url;
if (status == 200) {
    grab ...
}

reload

The reload method is a shortcut to the WWW::Mechanize reload method. This method acts like the refresh button in a browser, repeating the current request.
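
e.g. ...

init;
get $requested_url;
reload; # repeat the request, like hitting refresh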

back

The back method is a shortcut to the WWW::Mechanize back method. This method is the equivalent of hitting the "back" button in a browser; it returns the previous page (response) and will not backtrack beyond the first request.
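
e.g. ...

init;
get $url_a;
get $url_b;
back; # returns the response for $url_a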

page

The page method is a shortcut to the WWW::Mechanize uri method. This method returns the URI of the current page as a URI object.
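
e.g. (the returned URI object stringifies to the current URL) ...

init;
get $requested_url;
print page;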

response

The response method is a shortcut to the WWW::Mechanize response method. This method returns the HTTP::Response object of the current page.
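
e.g. ...

init;
get $requested_url;
print response->status_line; # any HTTP::Response method works here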

content_type

The content_type method is a shortcut to the WWW::Mechanize content_type method. This method returns the content_type of the current page.
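
e.g. ...

init;
get $requested_url;
print content_type; # e.g. 'text/html'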

domain

The domain method is a shortcut to the WWW::Mechanize base method. This method returns the URI host of the current page.
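
e.g. ...

init;
get $requested_url;
print domain;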

ishtml

The ishtml method is a shortcut to the WWW::Mechanize is_html method. This method returns true/false based on whether our content is HTML, according to the HTTP headers.
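
e.g. ...

init;
get $requested_url;
if (ishtml) {
    # safe to use HTML-only helpers like title and grab
}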

title

The title method is a shortcut to the WWW::Mechanize title method. This method returns the content of the title tag if the current page is HTML, otherwise returns undef.
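
e.g. ...

init;
get $requested_url;
print title if ishtml;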

text

The text method is a shortcut to the WWW::Mechanize content method called with the format argument set to 'text'; it returns a text representation of the last page with all HTML markup stripped.
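
e.g. ...

init;
get $requested_url;
print text; # the page with markup stripped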

html

The html method is a shortcut to the WWW::Mechanize content method. This method returns the content of the current page.
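
e.g. ...

init;
get $requested_url;
my $markup = html;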

data

The data method is a shortcut to the WWW::Mechanize content method. This method returns the content of the current page. Additionally, when passed data, this method updates the content of the current page with that data and returns the modified content.
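
e.g. (the replacement markup below is purely illustrative) ...

init;
get $requested_url;
my $original = data;                # read the current content
my $modified = data '<p>hello</p>'; # replace it, returning the new content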

www

The www method is an alias to the self method. This method returns the current scraper application instance.
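
e.g. ...

my $scraper = www; # the same instance returned by self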

store

The store method is a shortcut to the WWW::Mechanize save_content method. This method stores the contents of the current page into the specified file. If the content-type does not begin with 'text', the content is saved as binary data.

get $requested_url;
store '/tmp/foo.html';

download

The download method is passed a URI, a download directory path, and optionally a file path; it will follow the link and store the response contents in the specified file without leaving the current page. Basically, it downloads the contents of the request (especially when the request pushes a file download). If a file path is not specified, Scrappy will attempt to name the file automatically, resorting to a random 6-character string only if all else fails.

download $requested_url, '/tmp';

# supply your own file name
download $requested_url, '/tmp', 'somefile.txt';

list

The list method is an aesthetically pleasing method of dereferencing an arrayref. This is useful when iterating over a scraped resultset. This method no longer dies if the argument is not an arrayref; instead, it returns an empty list.

foreach my $item (list var->{items}) {
    ...
}

fst

The fst (first) method shifts the passed-in arrayref, returning the first element of the array and shortening the array by one.

var foo => fst grab '.class', { name => 'TEXT' };

lst

The lst (last) method pops the passed-in arrayref, returning the last element of the array and shortening the array by one.

var foo => lst grab '.class', { name => 'TEXT' };

session

The session method provides a means for storing important data across executions. There is one special session variable, `_file`, whose value is used to define the file where session data will be stored. Please make sure the session file exists and is writable. As I am sure you've deduced from the example below, the session file is stored as YAML. Cookies are automatically stored in, and retrieved from, your session file.

init;
session _file => '/tmp/foo_session.yml';
session foo => 'bar';
my $var = session->{foo};
# $var == 'bar'

Please make sure to create a valid session file; use the following as an example and note that there is a newline on the last line of the file:

# scrappy session file
---

config

The config method is an alias to the Scrappy session method for readability.
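
Usage is identical to session, e.g. ...

init;
config _file => '/tmp/foo_session.yml';
config foo => 'bar';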

cookies

The cookies method is a shortcut to the automatically generated WWW::Mechanize cookie handler. This method returns an HTTP::Cookies object. Setting this as undefined using the _undef keyword will prevent cookies from being stored and subsequently read.

init;
get $requested_url;
my $cookies = cookies;

# prevent cookie storage
init;
cookies _undef;

proxy

The proxy method is a shortcut to the WWW::Mechanize proxy method. This method sets the proxy for the next request to be tunneled through. Setting this as undefined using the _undef keyword will reset the scraper application instance so that all subsequent requests will not use a proxy.

init;
proxy 'http', 'http://proxy.example.com:8000/';
get $requested_url;

init;
proxy 'http', 'ftp', 'http://proxy.example.com:8000/';
get $requested_url;

# best practice

use Try::Tiny;

init;
proxy 'http', 'ftp', 'http://proxy.example.com:8000/';

try {
    get $requested_url
};

Note! When using a proxy to perform requests, be aware that if they fail, your program will die unless you wrap your code in an eval statement or use a try/catch module. In the example above we use Try::Tiny to trap any errors that might occur when using a proxy.

pause

The pause method is an adaptation of the WWW::Mechanize::Sleep module. This method sets breaks between your requests in an attempt to simulate human interaction.

init;
pause 20;

get $request_1;
get $request_2;
get $request_3;

There will be a break between each request made (get, post, request, etc.). You can also specify a range to have the pause method select from at random...

init;
pause 5,20;

get $request_1;
get $request_2;

# reset/turn it off
pause 0;

print "I slept for ", (pause), " seconds";

Note! The download method is exempt from any automatic pausing; to pause after a download one could obviously...

download $requested_url, '/tmp';
sleep pause();

cursor

The cursor method is used internally by the crawl method to determine what pages in the queue should be fetched next after the completion of the current fetch. This method returns the position of the cursor in the queue.
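
e.g., to inspect the crawler's progress through the queue ...

my $position = cursor; # position within the fetch queue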

queue

The queue method is used to add valid URIs to the page-fetching queue used internally by the crawl method, or to return the list of added URIs in the order received.

queue $new_url;
my @urls = queue;

crawl

The crawl method is designed to automatically and systematically crawl, spider, or fetch webpages and perform actions on selected elements on each page.

crawl $starting_url, {
    'a' => sub {
        # find all links and add them to the queue to be crawled
        queue shift->href;
    },
    '/*' => sub {
        # /* simply matches the root node, same as using 'body' in
        # html page context, maybe do something with shift->text or shift->html
    },
    'img' => sub {
        # print all image URLs
        print shift->src, "\n"
    }
};

history

The history method returns a list of visited pages.

get $url_a;
get $url_b;
get $url_c;

print history;

denied

The denied method is a simple shortcut to determine whether the page you requested got loaded or was redirected. This method is very useful on systems that require authentication and redirect if not authorized. This function returns a boolean: 1 if the current page doesn't match the requested page.

get $url_to_dashboard;
if (denied) {
    # do login, again
}
else {
    # resume ...
}

AUTHOR

Al Newkirk <awncorp@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2010 by awncorp.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.