NAME
WWW::ContentRetrieval - WWW robot plus text analyzer
SYNOPSIS
    use WWW::ContentRetrieval;
    use Data::Dumper;
    $robot = WWW::ContentRetrieval->new($desc,
        {
            TIMEOUT    => 3,
            HTTP_PROXY => 'http://fooproxy:2345/',
        });
    print Dumper $robot->retrieve( $query );
DESCRIPTION
WWW::ContentRetrieval combines the power of a WWW robot with a text analyzer. It can fetch a series of web pages that share some attributes in common, such as a product catalogue. Users write a description file, and WWW::ContentRetrieval fetches the pages and extracts the desired data. This can be applied to price comparison or meta search, for instance.
METHODS
new
    $s = WWW::ContentRetrieval->new(
        $desc,
        {
            TIMEOUT    => 3,                       # default is 10 seconds
            HTTP_PROXY => 'http://fooproxy:2345/',
            DEBUG      => 1,                       # non-zero to print out debugging messages
        });
retrieve
$s->retrieve($query) returns an anonymous array of hashes containing the retrieved data.
You may use Data::Dumper to inspect it.
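A minimal sketch of walking through the result; the PRODUCT and PRICE keys are assumptions taken from the return-value example below:

    my $results = $s->retrieve($query);
    for my $item (@$results) {
        printf "%s: %s\n", $item->{PRODUCT}, $item->{PRICE};
    }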
EXPORT
genDescTmpl
generates a description template.
Users can generate one from the command line:

    perl -MWWW::ContentRetrieval -e 'print genDescTmpl'
DESC FILE TUTORIAL
OVERVIEW
WWW::ContentRetrieval uses YAML for users to define a site's description. YAML is a portable, editable, readable, and extensible data language. It can serve as an alternative to Data::Dumper, and it is designed to express data structures in a friendly way, which is why it was adopted.
Now, suppose the query URL for the product "foobar technology" is http://foo.bar/query.pl?encoding=UTF8&product=blahblahblah. The description then looks like the following:
    # callback function
    sub callback {
        my ($textref, $thisurl) = @_;
        # blah blah
    }

    # a small processing language
    $items = <<'ITEMS';
    match=<a href="(.+?)">(.+)</a>
    site=$1
    url=$2
    replace(url)=s/http/ptth/
    match=<img src="(.+?)">
    photo="http://foo.bar/".$1
    ITEMS

    # site's description
    $desc = <<'...';
    NAME: site's name
    FETCH:
      QHANDL : 'http://foo.bar/query.pl'
      METHOD : GET
      PARAM:
        encoding : UTF8
      KEY: product
    POLICY:
      - m/foo\.bar/ => $items
      - m/foo\.bar/ => &callback
    NEXT:
      - m/./ => m/<a href="(.+?)">.+<\/a>/
      - m/./ => $next
    ...
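The resulting $desc is then handed to the constructor, as in the SYNOPSIS; the query string here is an assumption for illustration:

    my $robot   = WWW::ContentRetrieval->new($desc, { TIMEOUT => 3 });
    my $results = $robot->retrieve("blahblahblah");   # the product query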
NAME
Name of the site.
POLICY
POLICY stores information for extracting data from a certain page. It is composed of pairs of either (this URL's pattern => callback function) or (this URL's pattern => retrieval settings). If the current URL matches /this pattern/, the module invokes the corresponding callback or extracts data according to the retrieval settings given by the user.
In simple cases, users only need to write retrieval settings instead of a callback function. Retrieval settings consist of lines of instructions in a /key=value/ format. Here's an example.
    # use a leading # for comments
    $setting = <<'SETTING';
    match=<a href="(.+?)">(.+?)</a>
    url=$1
    desc="<".$2.">"
    replace(url)=s/http/ptth/
    SETTING
The module then tries to match each pattern against the retrieved page and assigns the captured values to the corresponding keys. A replace line takes a substitution expression, which transforms the specified extracted field.
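For instance, given the setting above and a page containing the made-up anchor <a href="http://foo.bar/x">Foo</a>, the resulting record would be:

    # $1 = "http://foo.bar/x", $2 = "Foo"
    # replace(url) then turns "http" into "ptth"
    {
        url  => 'ptth://foo.bar/x',
        desc => '<Foo>',
    }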
If users have to write callback functions for more complex cases, here is the guideline:
WWW::ContentRetrieval passes two parameters to a callback function: a reference to the page's content and the page's URL.
E.g.
    sub my_callback {
        my ($textref, $thisurl) = @_;
        my @results;
        # the markup pattern below is hypothetical
        while ( $$textref =~ m{<dt>(.+?)</dt>\s*<dd>(\d+)</dd>}g ) {
            push @results, { PRODUCT => $1, PRICE => $2 };
        }
        # return an anonymous array of hashes, keyed by the extracted fields
        return \@results;
    }
N.B. Callback's return value should be like the following
    [
        {
            PRODUCT => "foobar",
            PRICE   => 256,
        },
        {
            ...
        },
    ];
If users need WWW::ContentRetrieval to retrieve the next page, e.g. when dealing with several pages of search results, push an anonymous hash with only one entry, _DTLURL, onto the returned array:

    {
        _DTLURL => $next_url,   # URL of the next page
    }
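A hedged sketch of a callback that both extracts items and schedules the next results page; both markup patterns are assumptions for illustration:

    sub paged_callback {
        my ($textref, $thisurl) = @_;
        my @results;
        while ( $$textref =~ m{<li class="item">(.+?)</li>}g ) {
            push @results, { PRODUCT => $1 };
        }
        if ( $$textref =~ m{<a href="(.+?)">Next</a>} ) {
            push @results, { _DTLURL => $1 };   # one-entry hash, as described above
        }
        return \@results;
    }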
See also t/extract.t and t/robot.t.
NEXT
Represents URLs to be retrieved in the next cycle.
Likewise, the module tries to match the left-hand side against the current URL. If they match, the code on the right-hand side is invoked.
In addition to callback functions and retrieval settings, users can put a regular expression on the right-hand side. The page text is searched with that pattern; don't forget to capture the desired URLs with parentheses.
N.B. Different right-hand sides can be attached to the same left-hand side, which means users can process one web page with multiple strategies, as in the snippet below.
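For example, two strategies attached to the same page (the URL and link patterns here are hypothetical):

    NEXT:
      - m/query\.pl/ => m/<a href="(.+?)">next<\/a>/
      - m/query\.pl/ => &my_next_callback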
METHOD
Request method: GET, POST, or PLAIN.
QHANDL
Query Handler: the URL of the query script.
PARAM
Constant script parameters, excluding user's queries.
KEY
Key for the user's query strings, e.g. product names.
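Putting these together: with the sample description above, a GET query for "blahblahblah" is assembled as

    http://foo.bar/query.pl?encoding=UTF8&product=blahblahblah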
SEE ALSO
WWW::ContentRetrieval::Spider, WWW::ContentRetrieval::Extract
COPYRIGHT
xern <xern@cpan.org>
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.