NAME
WWW::ContentRetrieval - WWW robot plus text analyzer
SYNOPSIS
use WWW::ContentRetrieval;
use Data::Dumper;
$robot = WWW::ContentRetrieval->new($desc);
print Dumper $robot->retrieve( $query );
DESCRIPTION
WWW::ContentRetrieval combines the power of a web robot and a text analyzer. It can fetch a series of web pages that share common attributes. Users write a description file, and WWW::ContentRetrieval does the fetching and extracts the desired data.
METHODS
new
# with site's description only
$s = new WWW::ContentRetrieval($desc);

# or
$s = new WWW::ContentRetrieval($desc_filename);

# or with full argument list
$s = new WWW::ContentRetrieval(
       DESC       => $desc,                    # site's description
       TIMEOUT    => 3,                        # default is 10 secs.
       HTTP_PROXY => 'http://fooproxy:2345/',
     );
retrieve
$s->retrieve($query) returns an anonymous array of retrieved data.
You may use Data::Dumper to see it.
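The exact keys of each record depend on the variables your policies export or your callbacks return; the field names below (link, name) are only illustrative.
my $results = $s->retrieve("foobar");
# $results is an array reference; each element is a hash of
# exported fields, e.g.
#   [
#     { link => "http://foo.bar/item1.html", name => "Foobar Gadget" },
#     { link => "http://foo.bar/item2.html", name => "Foobar Gizmo" },
#   ]
for my $row (@$results) {
    print "$row->{name}\t$row->{link}\n";
}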
EXPORT
gentmpl
generates a description template.
Users can do it in a command line,
perl -MWWW::ContentRetrieval -e'print gentmpl'
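The same can be done from a script, assuming gentmpl returns the template text as the one-liner above suggests; the output filename here is only an example.
use WWW::ContentRetrieval;   # gentmpl is exported
open my $fh, '>', 'site.desc' or die "Cannot write site.desc: $!";
print {$fh} gentmpl();       # "site.desc" is just an example filename
close $fh;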
DESC FILE TUTORIAL
OVERVIEW
WWW::ContentRetrieval uses a POD-like language called CRL (content retrieval language) for users to define a site's description. See WWW::ContentRetrieval::CRL for details.
Now, suppose the product query URL of "foobar technology" is http://foo.bar/query.pl?encoding=UTF8&product=blahblahblah; then the description looks like the following.
$desc = <<'...';
=crl foobar tech.
=fetch
=url http://foo.bar/
=method PLAIN
=param encoding
utf-8
=key product
=case m/./
product
=policy product
mainmatch=m,<a href=(["'])(.+?)\1>(.+?)</a>,sg
link="http://foo.bar/".$2
name=$3
export=link name
=policy nexturls
blah blah looking for urls
=callback
sub {
    my ($textref, $thisurl) = @_;
    blah blah ...
    write your filter code here
}
=LRC
...
crl
Beginning of site's description. It is followed by the site's name.
fetch
Beginning of fetching block.
url
The web page you are dealing with.
method
PLAIN | GET | POST
param
Web script's parameters. It is followed by key and value.
key
The variable part of the parameters. The argument passed to the retrieve method will be joined with the key.
Both param and key are order-sensitive; that is, the order in which they appear in the description file determines their order in the request URL.
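As a rough illustration using the param and key lines of the example above, and assuming a GET-style request, a query would be joined into a URL roughly like this (the actual request depends on the chosen method):
$robot->retrieve("foobar");
# param "encoding" keeps its fixed value "utf-8", and the query string
# "foobar" is joined with the key "product", in declaration order:
#   http://foo.bar/query.pl?encoding=utf-8&product=foobar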
case
It takes two arguments: a regular expression and the name of a page filter.
If the page's URL matches the pattern, the corresponding filter is invoked.
See the policy and callback parts for details.
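For instance, a description with two page filters might route query pages and result listings to different filters with one case per filter; this is only a hypothetical sketch, assuming several case blocks may appear:
=case m/query\.pl/
product
=case m/page=\d+/
nexturls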
policy and callback
Policy and callback are the guts of this module; they extract data from pages.
policy
Policy takes two parameters: a regular expression and lines of data manipulation sublanguage. Here is an example.
mainmatch=m,<a href=(["'])(.+?)\1>(.+?)</a>,sg
link="http://foobar:/$2"
name=$3
match(link)=m,^http://(.+?),
site=$1
replace(link)=s/http/ptth/
reject(name)=m/^bb/
export=link name site
First, use mainmatch to look for the desired pattern. Then, users can assign values to self-defined variables and go deeper to capture values using match(variable). replace can modify extracted text, and reject discards values matching some pattern. Finally, users have to specify which variables to export using export.
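To make the data flow concrete, here is a hypothetical page fragment and the rows the tutorial's =policy product block (which exports link and name) would produce from it; both the input and the output are illustrative only.
# page content:
#   <a href="item1.html">Foobar Gadget</a>
#   <a href='item2.html'>Foobar Gizmo</a>
#
# rows exported by =policy product:
[
  { link => "http://foo.bar/item1.html", name => "Foobar Gadget" },
  { link => "http://foo.bar/item2.html", name => "Foobar Gizmo" },
]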
callback
If users have to write callback functions for more complex cases, here is the guideline:
WWW::ContentRetrieval passes two parameters to a callback function: a reference to page's content and page's url.
E.g.
sub my_callback {
    my ($textref, $thisurl) = @_;
    while( $$textref =~ /blahblah/g ){
        do some blahblahs here.
    }
    return an array of hashes, with keys and extracted information.
}
N.B. Callback's return value should be like the following
[
  {
    PRODUCT => "foobar",
    PRICE   => 256,
  },
  {
    ...
  },
];
If users need to retrieve the next page, e.g. when dealing with several pages of search results, push an anonymous hash with only one entry, _DTLURL:
{
  _DTLURL => next url,
}
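Putting the two together, a callback that extracts records and also queues the next page of results could look roughly like this sketch; the patterns, URLs, and field names are placeholders.
sub result_page_filter {
    my ($textref, $thisurl) = @_;
    my @rows;
    # collect product rows; the pattern is only a placeholder
    while ( $$textref =~ m{<a href="(.+?)">(.+?)</a>}g ) {
        push @rows, { link => "http://foo.bar/$1", name => $2 };
    }
    # queue the next page of results, if the page links to one
    if ( $$textref =~ m{<a href="(query\.pl\?[^"]*page=\d+)">next</a>}i ) {
        push @rows, { _DTLURL => "http://foo.bar/$1" };
    }
    return \@rows;
}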
See also t/extract.t, t/robot.t
SEE ALSO
WWW::ContentRetrieval::Spider, WWW::ContentRetrieval::Extract, WWW::ContentRetrieval::CRL
CAVEATS
This module is still alpha, and the interface is subject to change. The source code is distributed without warranty.
Use it at your own risk.
TO DO
Login and logout simulation
COPYRIGHT
xern <xern@cpan.org>
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.