ATTRIBUTES
reader
this attribute access your reader class instance
writer
this attribute accesses your writer class instance
benchmark
not ready, i want a catalyst type of method tree like debug for each method for each request
chache
the cache works, with CHI however its only useful right now for GET requests for specific urls. using the cache you will not need to download the page each time, so its good for dev
log
the log is not ready yet however it will be log4perl
parser
The default parser reads content types:
- text/html with HTML::TreeBuilder::XPath
which is in file: lib/HTML/Robot/Scrapper/Parser/HTML/TreeBuilder/XPath.pm
- text/xml with XML::XPath
which is in file: lib/HTML/Robot/Scrapper/Parser/XML/XPath.pm
and the parser is:
-base: lib/HTML/Robot/Scrapper/Parser/Base.pm
override with:
my $robot = HTML::Robot::Scrapper->new (
....
log => {
base_class => 'HTML::Robot::Scrapper::Log::Base', #optional, your custom base class
class => 'Default' #or HTML::Robot::Scrapper::Log::Default
},
...
)
-default: lib/HTML/Robot/Scrapper/Parser/Default.pm
queue
base_class: lib/HTML/Robot/Scrapper/Queue/Base.pm
default class: lib/HTML/Robot/Scrapper/Queue/Default.pm (Simple Instance Array)
you can override the whole thing using a custom base_class, or simply use
a different class
my $robot = HTML::Robot::Scrapper->new (
....
queue => {
base_class => 'HTML::Robot::Scrapper::Queue::Base',
class => 'HTML::Robot::Scrapper::Queue::Default'
},
...
)
useragent
encoding
new
HTML::Robot::Scrapper->new({
#you must create your own reader/parser
reader => {class => 'HTML::Robot::Scrapper::Reader::TestReader',
args => {}, },# or [] or any object
#you must create your own writer to save your collected data
writer => {class => 'HTML::Robot::Scrapper::Writer::TestWriter',
args => {}, },
benchmark => {class => 'Default',
args => {}, },
cache => {class => 'Default',
args => {}, },
log => {class => 'Default',
args => {}, },
parser => {class => 'Default',
args => {}, },
queue => {class => 'Default',
args => {}, },
useragent => {class => 'Default',
args => {}, },
});
before 'start'
- give access to this class inside other custom classes
NAME
HTML::Robot::Scrapper - Your robot to parse webpages
SYNOPSIS
See a working example under the module: WWW::Tabela::Fipe ( search on cpan ).
The class
HTML::Robot::Scrapper::Parser::Default
handles only text/html and text/xml by default
So i need to add an extra option for text/plain and tell it to use
the same method that already parses text/html, here is an example:
* im using the code from the original as base class for this:
HTML::Robot::Scrapper::Parser::Default
Here i will redefine that class and tell my $robot to favor it
...
parser => {class => 'WWW::Tabela::Fipe::Parser'},
...
See below:
package WWW::Tabela::Fipe::Parser;
use Moose;
with('HTML::Robot::Scrapper::Parser::HTML::TreeBuilder::XPath');
with('HTML::Robot::Scrapper::Parser::XML::XPath');
sub content_types {
my ( $self ) = @_;
return {
'text/html' => [
{
parse_method => 'parse_xpath',
description => q{
The method above 'parse_xpath' is inside class:
HTML::Robot::Scrapper::Parser::HTML::TreeBuilder::XPath
These content type related methods will be called inside:
HTML::Robot::Scrapper::UserAgent::Default around:
$robot->parser->engine->$parse_method( $robot, $self->content )
},
}
],
'text/plain' => [
{
parse_method => 'parse_xpath',
description => q{
Here i define which prase_method will treat the page content.
Based on content type for that page. I tell it to use the
same method its using to parse html... because i know in this
case text/plain should be text/html instead.
},
}
],
'text/xml' => [
{
parse_method => 'parse_xml'
},
],
};
}
1;
package FIPE;
use HTML::Robot::Scrapper;
# use CHI;
use HTTP::Tiny;
use HTTP::CookieJar;
my $robot = HTML::Robot::Scrapper->new (
reader => { # REQ
class => 'WWW::Tabela::Fipe',
},
writer => {class => 'WWW::Tabela::FipeWrite',}, #REQ
benchmark => {class => 'Default'},
# cache => {
# class => 'Default',
# args => {
# is_active => 0,
# engine => CHI->new(
# driver => 'BerkeleyDB',
# root_dir => "/home/catalyst/WWW-Tabela-Fipe/cache/",
# ),
# },
# },
log => {class => 'Default'},
parser => {class => 'WWW::Tabela::Fipe::Parser'}, #custom for tb fipe. because they reply with text/plain content type
queue => {class => 'Default'},
useragent => {
class => 'Default',
args => {
ua => HTTP::Tiny->new( cookie_jar => HTTP::CookieJar->new),
}
},
encoding => {class => 'Default'},
);
$robot->start();
DESCRIPTION
This cralwer has been created to be extensible. Scalable with redis queue.
The main idea is: i need a queue of urls to be crawled, it can be an array living during
my instance (not scalable)... or it can be a Redis queue ( scallable ), being acessed by
many HTML::Robot::Scrapper instances.
Each request inserted into the queue is suposed to be independent. So this thing can scale. I mean,
Supose i need to create an object using stuff from page1, page2 and page3... that will be 3 requests
so, the first request will access page1 and collect data into $colleted_data, then, i will append another
request for page2 with $collected_data from page 1. So the request for page2 will collect some more data
and merge with $collected_data from page1 generating $collected_data_from_page1_and_page2, and then i will insert
a new request into my queue for page3 that will collect data and merge with $collected_data_from_page1_and_page2
and create the final object: $collected_data_complete.
Basicly, you need to create a
- reader: to read/parse your webpages and
and a
- writer: to save data you reader collected.
You 'might' need to add other content types also, creating your custom class based on:
HTML::Robot::Scrapper::Parser::Default
See it and you will understand.. by default it handles:
- text/html
- text/xml
READER ( you create this )
Reader: Its where the parsing logic for a specific site lives.
You customize the reader telling it where the nodes are, etc..
The reader class is where you create your parser.
WRITER ( you create this )
Writer: Its the class that will save the data the reader collects.
ie: You can create a method "save" that receives an object and simply writes into your DB.
Or you can make it write into DB + elastic search .. etc.. whatever you want
CONTENT TYPES AND PARSING METHODS ( you might need to extend this )
For example, after making a request call ( HTML::Robot::Scrapper::UserAgent::Default )
it will need to parse data.. and will use the response content type to parse that data
by default the class that handles that is:
package HTML::Robot::Scrapper::Parser::Default;
use Moose;
with('HTML::Robot::Scrapper::Parser::HTML::TreeBuilder::XPath'); #gives parse_xpath
with('HTML::Robot::Scrapper::Parser::XML::XPath'); #gives parse_xml
sub content_types {
my ( $self ) = @_;
return {
'text/html' => [
{
parse_method => 'parse_xpath',
description => q{
The method above 'parse_xpath' is inside class:
HTML::Robot::Scrapper::Parser::HTML::TreeBuilder::XPath
},
}
],
'text/xml' => [
{
parse_method => 'parse_xml'
},
],
};
}
1;
WWW::Tabela::FIPE has a custom Parser class and you can see it as an example.
If you need to download images, you will need to create a custom parser class adding 'image/png' as content type for example.
QUEUE OF REQUESTS
Another example is the Queue system, it has an api: HTML::Robot::Scrapper::Queue::Base and by default
uses: HTML::Robot::Scrapper::Queue::Array which works fine for 1 local instance. However, lets say i want a REDIS queue, so i could
implement HTML::Robot::Scrapper::Queue::Redis and make the crawler access a remote queue.. this way i can share a queue between many crawlers independently.
Just so you guys know, i have a redis module almost ready, it needs litle refactoring because its from another personal project. It will be released asap when i got time.
So, if that does not fit you, or you want something else to handle those content types, just create a new class and pass it on to the HTML::Robot::Scrapper constructor. ie:
see the SYNOPSIS
By default it uses HTTP Tiny and useragent related stuff is in:
HTML::Robot::Scrapper::UserAgent::Default
Project Status
The crawling works as expected, and works great. And the api will not change probably.
TODO
Implement the REDIS Queue to give as option for the Array queue. Array queue runs local/per instance.. and the redis queue can be shared and accessed by multiple machines!
Still need to implement the Log, proper Benchmark with subroutine tree and timing.
Allow parameters to be passed in to UserAgent (HTTP::Tiny on this case)
Better tests and docs.
Example 1 - Append some urls and extract some data
On this first example, it shows how to make a simple crawler... by simple i mean simple GET requests following urls... and grabbing some data.
package HTML::Robot::Scrapper::Reader::TestReader;
use Moose;
with 'HTML::Robot::Scrapper::Reader';
use Data::Printer;
use Digest::SHA qw(sha1_hex);
## The commented stuff is useful as example
has startpage => (
is => 'rw',
default => sub { return 'http://www.bbc.co.uk/'} ,
);
has array_of_data => ( is => 'rw', default => sub { return []; } );
has counter => ( is => 'rw', default => sub { return 0; } );
sub on_start {
my ( $self ) = @_;
$self->append( search => $self->startpage );
$self->append( search => 'http://www.zap.com.br/' );
$self->append( search => 'http://www.uol.com.br/' );
$self->append( search => 'http://www.google.com/' );
}
sub search {
my ( $self ) = @_;
my $title = $self->robot->parser->engine->tree->findvalue( '//title' );
my $h1 = $self->robot->parser->engine->tree->findvalue( '//h1' );
warn $title;
warn p $self->robot->useragent->url ;
push( @{ $self->array_of_data } ,
{ title => $title, url => $self->robot->useragent->url, h1 => $h1 }
);
}
sub on_link {
my ( $self, $url ) = @_;
return if $self->counter( $self->counter + 1 ) > 3;
if ( $url =~ m{^http://www.bbc.co.uk}ig ) {
$self->prepend( search => $url ); # append url on end of list
}
}
sub detail {
my ( $self ) = @_;
}
sub on_finish {
my ( $self ) = @_;
$self->robot->writer->save_data( $self->array_of_data );
}
1;
Example 2 - Tabela FIPE ( append custom request calls )
See the working version at: https://github.com/hernan604/WWW-Tabela-Fipe
This example show an asp website that has those '__EVENTVALIDATION' and '__VIEWSTATE' which must be sent back again on each request... here is the example of such crawler for such website...
This example also demonstrates how one could easily login into a website and crawl it also.
package WWW::Tabela::Fipe;
use Moose;
with 'HTML::Robot::Scrapper::Reader';
use Data::Printer;
use utf8;
use HTML::Entities;
use HTTP::Request::Common qw(POST);
has [ qw/marcas viewstate eventvalidation/ ] => ( is => 'rw' );
has veiculos => ( is => 'rw' , default => sub { return []; });
has referer => ( is => 'rw' );
sub start {
my ( $self ) = @_;
}
has startpage => (
is => 'rw',
default => sub {
return [
{
tipo => 'moto',
url => 'http://www.fipe.org.br/web/indices/veiculos/default.aspx?azxp=1&v=m&p=52'
},
{
tipo => 'carro',
url => 'http://www.fipe.org.br/web/indices/veiculos/default.aspx?p=51'
},
{
tipo => 'caminhao',
url => 'http://www.fipe.org.br/web/indices/veiculos/default.aspx?v=c&p=53'
},
]
},
);
sub on_start {
my ( $self ) = @_;
foreach my $item ( @{ $self->startpage } ) {
$self->append( search => $item->{ url }, {
passed_key_values => {
tipo => $item->{ tipo },
referer => $item->{ url },
}
} );
}
}
sub _headers {
my ( $self , $url, $form ) = @_;
return {
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding' => 'gzip, deflate',
'Accept-Language' => 'en-US,en;q=0.5',
'Cache-Control' => 'no-cache',
'Connection' => 'keep-alive',
'Content-Length' => length( POST('url...', [], Content => $form)->content ),
'Content-Type' => 'application/x-www-form-urlencoded; charset=utf-8',
'DNT' => '1',
'Host' => 'www.fipe.org.br',
'Pragma' => 'no-cache',
'Referer' => $url,
'User-Agent' => 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:20.0) Gecko/20100101 Firefox/20.0',
'X-MicrosoftAjax' => 'Delta=true',
};
}
sub _form {
my ( $self, $args ) = @_;
return [
ScriptManager1 => $args->{ script_manager },
__ASYNCPOST => 'true',
__EVENTARGUMENT => '',
__EVENTTARGET => $args->{ event_target },
__EVENTVALIDATION => $args->{ event_validation },
__LASTFOCUS => '',
__VIEWSTATE => $args->{ viewstate },
ddlAnoValor => ( !exists $args->{ano} ) ? 0 : $args->{ ano },
ddlMarca => ( !exists $args->{marca} ) ? 0 : $args->{ marca },
ddlModelo => ( !exists $args->{modelo} ) ? 0 : $args->{ modelo },
ddlTabelaReferencia => 154,
txtCodFipe => '',
];
}
sub search {
my ( $self ) = @_;
my $marcas = $self->tree->findnodes( '//select[@name="ddlMarca"]/option' );
my $viewstate = $self->tree->findnodes( '//form[@id="form1"]//input[@id="__VIEWSTATE"]' )->get_node->attr('value');
my $event_validation = $self->tree->findnodes( '//form[@id="form1"]//input[@id="__EVENTVALIDATION"]' )->get_node->attr('value');
foreach my $marca ( $marcas->get_nodelist ) {
my $form = $self->_form( {
script_manager => 'UdtMarca|ddlMarca',
event_target => 'ddlMarca',
event_validation=> $event_validation,
viewstate => $viewstate,
marca => $marca->attr( 'value' ),
} );
$self->prepend( busca_marca => 'url' , {
passed_key_values => {
marca => $marca->as_text,
marca_id => $marca->attr( 'value' ),
tipo => $self->robot->reader->passed_key_values->{ tipo },
referer => $self->robot->reader->passed_key_values->{referer },
},
request => [
'POST',
$self->robot->reader->passed_key_values->{ referer },
{
headers => $self->_headers( $self->robot->reader->passed_key_values->{ referer } , $form ),
content => POST('url...', [], Content => $form)->content,
}
]
} );
}
}
sub busca_marca {
my ( $self ) = @_;
my ( $captura1, $viewstate ) = $self->robot->useragent->content =~ m/hiddenField\|__EVENTTARGET(.+)__VIEWSTATE\|([^\|]+)\|/g;
my ( $captura_1, $event_validation ) = $self->robot->useragent->content =~ m/hiddenField\|__EVENTTARGET(.+)__EVENTVALIDATION\|([^\|]+)\|/g;
my $modelos = $self->tree->findnodes( '//select[@name="ddlModelo"]/option' );
foreach my $modelo ( $modelos->get_nodelist ) {
next unless $modelo->as_text !~ m/selecione/ig;
my $kv={};
$kv->{ modelo_id } = $modelo->attr( 'value' );
$kv->{ modelo } = $modelo->as_text;
$kv->{ marca_id } = $self->robot->reader->passed_key_values->{ marca_id };
$kv->{ marca } = $self->robot->reader->passed_key_values->{ marca };
$kv->{ tipo } = $self->robot->reader->passed_key_values->{ tipo };
$kv->{ referer } = $self->robot->reader->passed_key_values->{ referer };
my $form = $self->_form( {
script_manager => 'updModelo|ddlModelo',
event_target => 'ddlModelo',
event_validation=> $event_validation,
viewstate => $viewstate,
marca => $kv->{ marca_id },
modelo => $kv->{ modelo_id },
} );
$self->prepend( busca_modelo => '', {
passed_key_values => $kv,
request => [
'POST',
$self->robot->reader->passed_key_values->{ referer },
{
headers => $self->_headers( $self->robot->reader->passed_key_values->{ referer } , $form ),
content => POST( 'url...', [], Content => $form )->content,
}
]
} );
}
}
sub busca_modelo {
my ( $self ) = @_;
my $anos = $self->tree->findnodes( '//select[@name="ddlAnoValor"]/option' );
foreach my $ano ( $anos->get_nodelist ) {
my $kv = {};
$kv->{ ano_id } = $ano->attr( 'value' );
$kv->{ ano } = $ano->as_text;
$kv->{ modelo_id } = $self->robot->reader->passed_key_values->{ modelo_id };
$kv->{ modelo } = $self->robot->reader->passed_key_values->{ modelo };
$kv->{ marca_id } = $self->robot->reader->passed_key_values->{ marca_id };
$kv->{ marca } = $self->robot->reader->passed_key_values->{ marca };
$kv->{ tipo } = $self->robot->reader->passed_key_values->{ tipo };
$kv->{ referer } = $self->robot->reader->passed_key_values->{ referer };
next unless $ano->as_text !~ m/selecione/ig;
my ( $captura1, $viewstate ) = $self->robot->useragent->content =~ m/hiddenField\|__EVENTTARGET(.*)__VIEWSTATE\|([^\|]+)\|/g;
my ( $captura_1, $event_validation ) = $self->robot->useragent->content =~ m/hiddenField\|__EVENTTARGET(.*)__EVENTVALIDATION\|([^\|]+)\|/g;
my $form = $self->_form( {
script_manager => 'updAnoValor|ddlAnoValor',
event_target => 'ddlAnoValor',
event_validation=> $event_validation,
viewstate => $viewstate,
marca => $kv->{ marca_id },
modelo => $kv->{ modelo_id },
ano => $kv->{ ano_id },
} );
$self->prepend( busca_ano => '', {
passed_key_values => $kv,
request => [
'POST',
$self->robot->reader->passed_key_values->{ referer },
{
headers => $self->_headers( $self->robot->reader->passed_key_values->{ referer } , $form ),
content => POST( 'url...', [], Content => $form )->content,
}
]
} );
}
}
sub busca_ano {
my ( $self ) = @_;
my $item = {};
$item->{ mes_referencia } = $self->tree->findvalue('//span[@id="lblReferencia"]') ;
$item->{ cod_fipe } = $self->tree->findvalue('//span[@id="lblCodFipe"]');
$item->{ marca } = $self->tree->findvalue('//span[@id="lblMarca"]');
$item->{ modelo } = $self->tree->findvalue('//span[@id="lblModelo"]');
$item->{ ano } = $self->tree->findvalue('//span[@id="lblAnoModelo"]');
$item->{ preco } = $self->tree->findvalue('//span[@id="lblValor"]');
$item->{ data } = $self->tree->findvalue('//span[@id="lblData"]');
$item->{ tipo } = $self->robot->reader->passed_key_values->{ tipo } ;
warn p $item;
push( @{$self->veiculos}, $item );
}
sub on_link {
my ( $self, $url ) = @_;
}
sub on_finish {
my ( $self ) = @_;
warn "Terminou.... exportando dados.........";
$self->robot->writer->write( $self->veiculos );
}
DESCRIPTION
AUTHOR
Hernan Lopes
CPAN ID: HERNAN
perldelux / movimentoperl
hernan@cpan.org
http://github.com/hernan604
COPYRIGHT
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
SEE ALSO
perl(1).