NAME

Scrappy - All Powerful Web Harvester, Spider, Scraper fully automated

VERSION

version 0.6

SYNOPSIS

Scrappy does it all, any way you like: object-oriented (OO) or as a DSL (domain-specific language). Let's look at a simple scraper in OO context.

#!/usr/bin/perl
use Scrappy;

my $spidy = Scrappy->new;

$spidy->crawl('http://search.cpan.org/recent', {
    '#cpansearch li a' => sub {
        print shift->text, "\n";
    }
});

Now let's run the same operation again in DSL context.

#!/usr/bin/perl
use Scrappy qw/:syntax/;

crawl 'http://search.cpan.org/recent', {
    '#cpansearch li a' => sub {
        print shift->text, "\n";
    }
};

DESCRIPTION

Scrappy is an easy (and hopefully fun) way of scraping, spidering, and/or harvesting information from web pages, web services, and more. Scrappy is a feature rich, flexible, intelligent web automation tool.

Scrappy (pronounced Scrap+Pee) == 'Scraper Happy' or 'Happy Scraper'. If you like, you may call it Scrapy (pronounced Scrape+Pee), although Python has a web scraping framework by that name and this module is not a port of it.

Scrappy is approaching version 1.0, taking on critical mass :}

METHODS

init

Builds the scraper application instance. This function is called automatically in DSL context and is otherwise irrelevant. It creates the application instance that all other functions will use in DSL context, and returns the current scraper application instance.

my $scraper = init;

reinit

The reinit method is an alias to the init method. Call it in DSL context when a new scraper application instance is desired. It returns the new scraper application instance. In OO context, one would simply use Scrappy->new to create a new instance.

my $new = reinit;

self

This method returns the current scraper application instance, which can also be found in the package class variable $class_Instance. This method has no practical purpose in OO context and is made available to return the current scraper application instance in DSL context only.

my $self = self;

user_agent

This method gets/sets the user-agent for the current scraper application instance.

user_agent 'Mozilla/5.0 (Windows; U; Windows NT ...';

var

This method sets a stash (shared) variable or returns a reference to the entire stash object.

var age => 31;
print var->{age};

my @array = (1..20);
var integers => @array;
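
As a rough illustration of the get/set behaviour described above, the stash can be pictured as one shared hash behind a single function. This is only a sketch and not Scrappy's implementation; `stash_var` and `%stash` are hypothetical names introduced for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a stash: one shared hash behind one function.
my %stash;

sub stash_var {
    my ($key, $value) = @_;

    # With no arguments, hand back a reference to the whole stash.
    return \%stash unless defined $key;

    # With a key and a value, store the value under that key.
    $stash{$key} = $value if @_ > 1;

    return $stash{$key};
}

stash_var(age => 31);
print stash_var()->{age}, "\n";
```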

random_ua

This returns a random user-agent string for use with the user_agent method. The user-agent header in your request is how an inquiring application might determine the browser and environment making the request. The first argument should be the name of the web browser; supported web browsers are any, chrome, ie or explorer, opera, safari, and firefox. Using the keyword `any` will select from all available browsers. The second argument, which is optional, should be the name of the desired operating system; supported operating systems are windows, macintosh, and linux.

user_agent random_ua;
# same as random_ua 'any';

e.g. for a Linux-specific Google Chrome user-agent use the following...

user_agent random_ua 'chrome', 'linux';
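
The browser and operating system arguments can be thought of as filters over a table of known user-agent strings. The sketch below shows one way such a lookup might work. It is not Scrappy's code: `pick_ua`, `%agents`, and the shortened placeholder strings are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical lookup table. The strings are shortened placeholders,
# not real user-agent strings and not Scrappy's own data.
my %agents = (
    chrome  => {
        linux   => [ 'Chrome/Linux A', 'Chrome/Linux B' ],
        windows => [ 'Chrome/Windows A' ],
    },
    firefox => {
        linux   => [ 'Firefox/Linux A' ],
        windows => [ 'Firefox/Windows A' ],
    },
);

sub pick_ua {
    my ($browser, $os) = @_;
    $browser = 'any' unless defined $browser;

    # 'any' widens the search to every browser in the table.
    my @browsers = $browser eq 'any' ? sort keys %agents : ($browser);

    my @pool;
    for my $name (@browsers) {
        my $by_os = $agents{$name} or next;

        # Without an operating system, consider every OS for that browser.
        my @systems = defined $os ? ($os) : sort keys %$by_os;
        push @pool, @{ $by_os->{$_} || [] } for @systems;
    }

    # Nothing matched: return nothing rather than guess.
    return unless @pool;
    return $pool[ int rand @pool ];
}

print pick_ua('chrome', 'linux'), "\n";
```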

form

The form method is a shortcut to the WWW::Mechanize submit_form method. It takes the exact same arguments.

form fields => {
    username => 'mrmagoo',
    password => 'foobarbaz'
};

# or more specifically

form form_number => 1, fields => {
    username => 'mrmagoo',
    password => 'foobarbaz'
};

get

The get method is a shortcut to the WWW::Mechanize get method. This method takes a URL or URI and returns an HTTP::Response object.

post

The post method is a shortcut to the WWW::Mechanize post method. This method takes a URL or URI and a hashref of key/value pairs, then returns an HTTP::Response object. Alternatively, the post method can be used traditionally (ugly) and passed additional arguments:

# our pretty way
post $requested_url, {
    query => 'some such stuff'
};

# traditionally
post $requested_url,
    'Content-Type' => 'multipart/form-data',
    'Content'      => {
        user                => $facebook->{user},
        profile_id          => $prospect->{i},
        message             => '',
        source              => '',
        src                 => 'top_bar',
        submit              => 1,
        post_form_id        => $post_formid,
        fb_dtsg             => 'u9MeI',
        post_form_id_source => 'AsyncRequest'
};

Note! Our prettier version of the post method uses a content-type of application/x-www-form-urlencoded by default. To use multipart/form-data, please use the traditional style.

param

The param method is used to retrieve querystring parameters from the current request. This includes any parameters defined using the match() method. This method is never used to set parameters.

my $url = 'http://search.cpan.org/search?query=Scrappy&mode=all';
get $url;

print param('query');
# Scrappy
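
To make the idea of reading querystring parameters concrete, here is a minimal sketch that parses a URL's query string using only core Perl. It is not how Scrappy implements param; `query_params` is a hypothetical helper, and the decoding rules it applies are the usual form-encoding conventions.

```perl
use strict;
use warnings;

# Hypothetical helper: split a URL's query string into a hashref of parameters.
sub query_params {
    my ($url) = @_;

    # Everything between '?' and an optional '#fragment'.
    my ($query) = $url =~ /\?([^#]*)/;

    my %params;
    for my $pair (split /[&;]/, defined $query ? $query : '') {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $key && length $key;
        $value = '' unless defined $value;

        # Decode '+' and %XX escapes in both key and value.
        for ($key, $value) {
            tr/+/ /;
            s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
        }
        $params{$key} = $value;
    }
    return \%params;
}

my $params = query_params('http://search.cpan.org/search?query=Scrappy&mode=all');
print $params->{query}, "\n";    # Scrappy
```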

grab

The grab method takes XPath or CSS3 selectors and returns the corresponding elements. It is a shortcut to the Web::Scraper process method and takes the exact same arguments, with a little of our own added magic: you can grab and return a single element and specify whether to return TEXT, HTML, or an @attribute. By default, the return value for a single element is TEXT. Whenever you specify a hashref mapping of attributes to grab, the results are returned as an arrayref; this may change in the future.

grab '#profile li a'; # return the inner text of the first encounter
grab '#profile li a', '@href'; # specifically returning href attribute of the first encounter

# the traditional use is to provide a selector and mappings/return values e.g.
grab '#profile li a', { name => 'TEXT', link => '@href' };

# feeling lazy, let Scrappy auto-discover the attributes for you
grab '#profile li a', ':all'; # returns an arrayref if more than one element is found

# Note! elements are returned as objects with accessors making it possible
# to do the following....

my $link = grab '#profile li a:first', ':all';
print $link->href;

grab 'a'; # returns inner text of the first match
grab 'a', 'html'; # returns inner html of the first match
grab 'a', '@href'; # returns the href attribute of the first match

grab 'a', ':all'; # returns an arrayref with all attributes including text, and html
grab 'a', { key => 'attr' }; # returns an arrayref with the specified attributes

Zoom in on specific chunks of HTML code, or pass your own, using the following method call:

grab 'element', ':all', $html_content;

loaded

The loaded method is a shortcut to the WWW::Mechanize success method. This method returns true/false based on whether the last request was successful.

get $requested_url;
if (loaded) {
    grab ...
}

status

The status method is a shortcut to the WWW::Mechanize status method. This method returns the 3-digit HTTP status code of the response.

get $requested_url;
if (status == 200) {
    grab ...
}

reload

The reload method is a shortcut to the WWW::Mechanize reload method. This method acts like the refresh button in a browser: it repeats the current request.

back

The back method is a shortcut to the WWW::Mechanize back method. This method is the equivalent of hitting the "back" button in a browser: it returns the previous page (response). It will not backtrack beyond the first request.

page

The page method is a shortcut to the WWW::Mechanize uri method. This method returns the URI of the current page as a URI object.

response

The response method is a shortcut to the WWW::Mechanize response method. This method returns the HTTP::Response object of the current page.

content_type

The content_type method is a shortcut to the WWW::Mechanize content_type method. This method returns the content_type of the current page.

domain

The domain method is a shortcut to the WWW::Mechanize base method. This method returns the URI host of the current page.

ishtml

The ishtml method is a shortcut to the WWW::Mechanize is_html method. This method returns true/false based on whether our content is HTML, according to the HTTP headers.

title

The title method is a shortcut to the WWW::Mechanize title method. This method returns the content of the title tag if the current page is HTML, otherwise returns undef.

text

The text method is a shortcut to the WWW::Mechanize content method using the format argument and returns a text representation of the last page having all HTML markup stripped.

html

The html method is a shortcut to the WWW::Mechanize content method. This method returns the content of the current page.

data

The data method is a shortcut to the WWW::Mechanize content method. This method returns the content of the current page, exactly as the html function does. Additionally, when passed data, this method updates the content of the current page with that data and returns the modified content.

www

The www method is an alias to the self method. This method returns the current scraper application instance.

store

The store method is a shortcut to the WWW::Mechanize save_content method. This method stores the contents of the current page into the specified file. If the content-type does not begin with 'text', the content is saved as binary data.

get $requested_url;
store '/tmp/foo.html';

download

The download method is passed a URI, a Download Directory Path, and optionally a File Path. It then follows the link and stores the response contents in the specified file without leaving the current page. Basically, it downloads the contents of the request (especially when the request pushes a file download). If a File Path is not specified, Scrappy will attempt to name the file automatically, resorting to a random 6-character string only if all else fails.

download $requested_url, '/tmp';

# supply your own file name
download $requested_url, '/tmp', 'somefile.txt';
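
The automatic naming with a random fallback can be sketched roughly as follows. How Scrappy actually derives the name is not documented here, so taking the last path segment of the URL is purely an assumption, and `guess_filename` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: derive a file name from a URL, falling back to a
# random 6-character string when the URL offers nothing usable.
sub guess_filename {
    my ($url) = @_;

    # Strip the scheme and host, then any query string or fragment.
    (my $path = $url) =~ s{^[a-z][a-z0-9+.-]*://[^/?#]*}{}i;
    $path =~ s/[?#].*\z//s;

    # Assumption: the last path segment is a reasonable file name.
    my ($name) = $path =~ m{([^/]+)\z};
    return $name if defined $name;

    # Fallback: six random lowercase letters or digits.
    my @chars = ('a' .. 'z', 0 .. 9);
    return join '', map { $chars[ rand @chars ] } 1 .. 6;
}

print guess_filename('http://example.com/files/report.pdf?x=1'), "\n";
print guess_filename('http://example.com/'), "\n";
```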

list

The list method is an aesthetically pleasing method of dereferencing an arrayref. This is useful when iterating over a scraped resultset. This method no longer dies if the argument is not an arrayref and instead returns an empty list.

foreach my $item (list var->{items}) {
    ...
}
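
The forgiving behaviour described above (an empty list instead of an exception) might look something like this sketch. `as_list` is a hypothetical name, not Scrappy's function.

```perl
use strict;
use warnings;

# Hypothetical helper: dereference an arrayref, or return an empty list
# when the argument is anything else (undef, a string, a hashref, ...).
sub as_list {
    my ($ref) = @_;
    return ref($ref) eq 'ARRAY' ? @$ref : ();
}

my @items = as_list([ 'a', 'b', 'c' ]);
my @none  = as_list('not an arrayref');

print scalar(@items), " items, ", scalar(@none), " from a non-arrayref\n";
```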

fst

The fst (first) method shifts the passed-in arrayref, returning the first element and shortening the array by one.

var foo => fst grab '.class', { name => 'TEXT' };

lst

The lst (last) method pops the passed-in arrayref, returning the last element and shortening the array by one.

var foo => lst grab '.class', { name => 'TEXT' };
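
Because fst and lst are described as shifting and popping, the arrayref passed in is modified. The following sketch, using the hypothetical helpers `first_of` and `last_of`, shows that side effect with plain Perl data.

```perl
use strict;
use warnings;

# Hypothetical helpers mirroring the described behaviour of fst and lst.
sub first_of { my ($ref) = @_; return shift @$ref; }
sub last_of  { my ($ref) = @_; return pop @$ref; }

my $rows = [ { name => 'first' }, { name => 'middle' }, { name => 'last' } ];

my $head = first_of($rows);    # removes and returns the first element
my $tail = last_of($rows);     # removes and returns the last element

print "$head->{name}, $tail->{name}, ", scalar(@$rows), " left\n";
```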

session

The session method provides a means for storing important data across executions. There is one special session variable, `_file`, whose value defines the file where session data will be stored. Please make sure the session file exists and is writable. As the example suggests, the session file is stored as YAML. Cookies are automatically stored in and retrieved from your session file.

init;
session _file => '/tmp/foo_session.yml';
session foo => 'bar';
my $var = session->{foo};
# $var eq 'bar'

Please make sure to create a valid session file. Use the following as an example, and note that there is a newline on the last line of the file:

# scrappy session file
---

config

The config method is an alias to the Scrappy session method for brevity.

cookies

The cookies method is a shortcut to the automatically generated WWW::Mechanize cookie handler. This method returns an HTTP::Cookies object. Setting this as undefined using the _undef keyword will prevent cookies from being stored and subsequently read.

get $requested_url;
my $cookies = cookies;

# prevent cookie storage
cookies _undef;

proxy

The proxy method is a shortcut to the WWW::Mechanize proxy function. This method sets the proxy for the next request to be tunneled through. Setting this as undefined using the _undef keyword will reset the scraper application instance so that all subsequent requests will not use a proxy.

proxy 'http', 'http://proxy.example.com:8000/';
get $requested_url;

proxy 'http', 'ftp', 'http://proxy.example.com:8000/';
get $requested_url;

# best practice

use Try::Tiny;

proxy 'http', 'ftp', 'http://proxy.example.com:8000/';

try {
    get $requested_url
};

Note! When using a proxy to perform requests, be aware that if they fail, your program will die unless you wrap your code in an eval statement or use a try/catch module. In the example above we use Try::Tiny to trap any errors that might occur when using a proxy.
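
For readers who prefer the plain eval route mentioned in the note, a minimal sketch with core Perl only could look like this. `risky_request` is a hypothetical stand-in for a request that dies; it is not part of Scrappy.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a request that dies when the proxy is unreachable.
sub risky_request {
    my ($url) = @_;
    die "proxy connection failed for $url\n";
}

# The block returns 1 only if nothing inside it died.
my $ok = eval {
    risky_request('http://example.com/');
    1;
};

unless ($ok) {
    my $error = $@ || 'unknown error';
    chomp $error;
    print "request failed: $error\n";
}
```

Checking the return value of eval, rather than testing $@ alone, avoids the case where $@ is cleared before it can be inspected.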

pause

The pause method is an adaptation of the WWW::Mechanize::Sleep module. This method sets breaks between your requests in an attempt to simulate human interaction.

pause 20;

get $request_1;
get $request_2;
get $request_3;

Given the above example, there will be a 20-second break between each request made (get, post, request, etc.). You can also specify a range for the pause method to select from at random:

pause 5,20;

get $request_1;
get $request_2;

# reset/turn it off
pause 0;

print "I slept for ", (pause), " seconds";

Note! The download method is exempt from any automatic pausing. To pause after a download, one could do the following:

download $requested_url, '/tmp';
sleep pause();
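
Choosing a delay from a range can be sketched as below. Whether Scrappy treats the range as inclusive is not stated here, so the inclusive bounds are an assumption, and `pick_delay` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the number of seconds to wait.
# One argument is a fixed delay; two arguments are treated as an
# inclusive range (an assumption made for this sketch).
sub pick_delay {
    my ($min, $max) = @_;
    return $min unless defined $max;

    # Tolerate the bounds being given in either order.
    ($min, $max) = ($max, $min) if $min > $max;

    return $min + int rand($max - $min + 1);
}

my $delay = pick_delay(5, 20);
print "would sleep for $delay seconds\n";
```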

history

The history method returns a list of visited pages.

get $url_a;
get $url_b;
get $url_c;

print history;

denied

The denied method is a simple shortcut to determine whether the page you requested was loaded or redirected. This method is very useful on systems that require authentication and redirect if not authorized. It returns a boolean: 1 if the current page doesn't match the requested page.

get $url_to_dashboard;

if (denied) {
    # do login, again
}
else {
    # resume ...
}
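
Conceptually, the check compares the URL that was requested with the URL that was finally loaded. The sketch below illustrates that comparison with plain strings. It is not Scrappy's implementation: `looks_denied` and `normalise_url` are hypothetical, and ignoring fragments and a trailing slash is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the fragment and a trailing slash before comparing.
sub normalise_url {
    my ($url) = @_;
    $url =~ s/#.*\z//s;
    $url =~ s{/\z}{};
    return $url;
}

# Returns 1 when the page we landed on is not the page we asked for, else 0.
sub looks_denied {
    my ($requested, $current) = @_;
    return normalise_url($requested) ne normalise_url($current) ? 1 : 0;
}

print looks_denied('http://example.com/dashboard', 'http://example.com/login'), "\n";
print looks_denied('http://example.com/dashboard', 'http://example.com/dashboard/'), "\n";
```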

new

The new method creates a new OO (object-oriented) Scrappy instance. It is worth mentioning that Scrappy can be used in both OO (object-oriented) and DSL (domain-specific) fashion. Both styles have advantages and drawbacks, we have both so that settles that. Please note that a Scrappy instance is created automatically on-the-fly for those using DSL syntax.

my $spidy = Scrappy->new;

cursor

The cursor method is used internally by the crawl method to determine what pages in the queue should be fetched next after the completion of the current fetch. This method returns the position of the cursor in the queue.

queue

The queue method is used to add valid URIs to the page fetching queue used by the crawl method internally, or to return the list of added URIs in the order received/input.

queue $new_url;
my @urls = queue;

match

The match method checks the passed-in URL (or the URL of the current page if left empty) against the defined URL pattern (route). If the URL is a match, it returns the parameters of that match, much in the same way a modern web application framework processes URL routes.

my $url = 'http://somesite.com/tags/awesomeness';
...

# match against the current page
if (match '/tags/:tag') {
    print param('tag');
    # prints awesomeness
}

.. or ..

# match against the passed url
my $this = match '/tags/:tag', $url, {
    host => 'somesite.com'
};

if ($this) {
    print "This is the ", $this->{tag}, " page";
    # prints This is the awesomeness page
}
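
Route matching of this kind is commonly done by turning the pattern into a regular expression with one capture group per `:name` placeholder. The sketch below shows that general technique. It is not Scrappy's code, `match_route` is a hypothetical helper, and it ignores extra options such as `host`.

```perl
use strict;
use warnings;

# Hypothetical helper: match a URL's path against a route such as
# '/tags/:tag'. Returns a hashref of captured parameters on success,
# or nothing when the route does not match.
sub match_route {
    my ($route, $url) = @_;

    # Keep only the path portion of the URL.
    (my $path = $url) =~ s{^[a-z][a-z0-9+.-]*://[^/?#]*}{}i;
    $path =~ s/[?#].*\z//s;

    # Build a regex: literal chunks are escaped, :name becomes a capture.
    my @names;
    my $pattern = '';
    for my $chunk (split /(:\w+)/, $route) {
        if ($chunk =~ /\A:(\w+)\z/) {
            push @names, $1;
            $pattern .= '([^/]+)';
        }
        else {
            $pattern .= quotemeta $chunk;
        }
    }

    my @values = $path =~ m{\A$pattern/?\z} or return;

    my %params;
    @params{@names} = @values;
    return \%params;
}

my $found = match_route('/tags/:tag', 'http://somesite.com/tags/awesomeness');
print "This is the ", $found->{tag}, " page\n" if $found;
```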

crawl

The crawl method is designed to automatically and systematically crawl, spider, or fetch webpages and perform actions on selected elements on each page. This method will start by GETting the initial URL passed, it then iterates over each selector executing the corresponding routine for each matched element.

crawl $starting_url, {
    'a' => sub {
        # find all links and add them to the queue to be crawled
        queue shift->href;
    },
    '/*' => sub {
        # /* simply matches the root node, same as using 'body' in
        # html page context, maybe do something with shift->text or shift->html
    },
    'img' => sub {
        # print all image URLs
        print shift->src, "\n"
    }
};

Let's take it a step further: as opposed to matching elements on every page we encounter, let's perform actions on elements that appear on specific types of pages. We do this by utilizing URL pattern matching (also known as URL routing in web application framework context).

crawl 'http://search.cpan.org/recent', {
    'a' => sub {
        my $link = shift;
        queue $link->href if
        match '/~:author/:dist/', $link->href;
    },
    '/~:author/:dist/' => {
        'body', sub {
            print "Howdy, I'm looking at " . param('author') . "\n";
        },
    }
};

Just to recap, the above example starts crawling at http://search.cpan.org/recent. For the first page and every page crawled thereafter, Scrappy will look for 'a' tags (links) and place them in the queue only if they match the defined URL pattern. You'll also notice the slightly different structure of the second action, which denotes a page action. It basically reads: if the current page matches this URL pattern, apply the corresponding element actions.
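
The interplay of the queue and the cursor during a crawl can be pictured with a small offline simulation. Nothing below touches the network or uses Scrappy: `%site` is a hypothetical in-memory stand-in for pages and their links, and skipping URLs that were already queued is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical in-memory "site": each URL maps to the links found on that page.
my %site = (
    '/'  => [ '/a', '/b' ],
    '/a' => [ '/b', '/c' ],
    '/b' => [ '/' ],
    '/c' => [],
);

my @queue  = ('/');         # URLs in the order they were added
my %queued = ('/' => 1);    # URLs already placed in the queue
my $cursor = 0;             # position of the next URL to fetch

while ($cursor < @queue) {
    my $url = $queue[ $cursor++ ];
    print "fetching $url\n";

    for my $link (@{ $site{$url} || [] }) {
        # Only queue URLs that have not been seen before.
        next if $queued{$link}++;
        push @queue, $link;
    }
}
```

The loop ends once the cursor catches up with the end of the queue, which is what keeps the crawl finite even though the pages link back to each other.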

AUTHOR

Al Newkirk <awncorp@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2010 by awncorp.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.