NAME
WWW::ContentRetrieval - WWW robot plus text analyzer
SYNOPSIS
use WWW::ContentRetrieval;
use Data::Dumper;
$robot = WWW::ContentRetrieval->new($desc);
print Dumper $robot->retrieve( $query );
DESCRIPTION
WWW::ContentRetrieval combines the power of a web robot and a text analyzer. It can fetch a series of web pages that share common attributes. Users write a description file, and WWW::ContentRetrieval does the fetching and extracts the desired data.
METHODS
new
# with site's description only
$s = new WWW::ContentRetrieval($desc);

# or
$s = new WWW::ContentRetrieval($desc_filename);

# or with full argument list
$s = new WWW::ContentRetrieval(
       DESC       => $desc,                    # site's description
       TIMEOUT    => 3,                        # default is 10 secs.
       HTTP_PROXY => 'http://fooproxy:2345/',
     );
retrieve
$s->retrieve($query) returns an anonymous array of retrieved data.
You may use Data::Dumper to see it.
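The exact keys of each record depend on the variables your policies export or your callbacks return; the field names below (link, name) are only illustrative.
my $results = $s->retrieve("foobar");
# $results is an array reference; each element is a hash of
# exported fields, e.g.
#   [
#     { link => "http://foo.bar/item1.html", name => "Foobar Gadget" },
#     { link => "http://foo.bar/item2.html", name => "Foobar Gizmo" },
#   ]
for my $row (@$results) {
    print "$row->{name}\t$row->{link}\n";
}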
EXPORT
gentmpl
generates a description template.
Users can do it in a command line,
perl -MWWW::ContentRetrieval -e'print gentmpl'
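The same can be done from a script, assuming gentmpl returns the template text as the one-liner above suggests; the output filename here is only an example.
use WWW::ContentRetrieval;   # gentmpl is exported
open my $fh, '>', 'site.desc' or die "Cannot write site.desc: $!";
print {$fh} gentmpl();       # "site.desc" is just an example filename
close $fh;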
DESC FILE TUTORIAL
OVERVIEW
WWW::ContentRetrieval uses a POD-like language called CRL (content retrieval language) for users to define a site's description. See WWW::ContentRetrieval::CRL for details.
Now, suppose the product query URL of "foobar technology" is http://foo.bar/query.pl?encoding=UTF8&product=blahblahblah; then the description looks like the following.
$desc = <<'...';
=crl foobar tech.
=fetch
=url http://foo.bar/
=method PLAIN
=param encoding
utf-8
=key product
=case m/./
product
=policy product
mainmatch=m,<a href=(["'])(.+?)\1>(.+?)</a>,sg
link="http://foo.bar/".$2
name=$3
export=link name
=policy nexturls
blah blah looking for urls
=callback
sub {
    my ($textref, $thisurl) = @_;
    blah blah ...
    write your filter code here
}
=LRC
...
crl
Beginning of site's description. It is followed by the site's name.
fetch
Beginning of fetching block.
url
The web page you are dealing with.
method
PLAIN | GET | POST
param
Web script's parameters. It is followed by key and value.
key
The variable part of the parameters. The argument passed to the retrieve method will be joined with the key.
Both param and key are order-sensitive; that is, the order in which they appear in the description file determines their order in the request URL.
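As a rough illustration using the param and key lines of the example above, and assuming a GET-style request, a query would be joined into a URL roughly like this (the actual request depends on the chosen method):
$robot->retrieve("foobar");
# param "encoding" keeps its fixed value "utf-8", and the query string
# "foobar" is joined with the key "product", in declaration order:
#   http://foo.bar/query.pl?encoding=utf-8&product=foobar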
case
It takes two arguments: a regular expression and the name of a page filter.
If the page's URL matches the pattern, the corresponding filter is invoked.
See the policy and callback parts for details.
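For instance, a description with two page filters might route query pages and result listings to different filters with one case per filter; this is only a hypothetical sketch, assuming several case blocks may appear:
=case m/query\.pl/
product
=case m/page=\d+/
nexturls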
policy and callback
Policy and callback are the guts of this module; they extract data from pages.
policy
Policy takes two parameters: a regular expression and lines of data manipulation sublanguage. Here is an example.
mainmatch=m,<a href=(["'])(.+?)\1>(.+?)</a>,sg
link="http://foobar:/$2"
name=$3
match(link)=m,^http://(.+?),
site=$1
replace(link)=s/http/ptth/
reject(name)=m/^bb/
export=link name site
First, use mainmatch to look for the desired pattern. Then, users can assign values to self-defined variables and go deeper to capture values using match(variable). replace can modify extracted text, and reject discards values matching some pattern. Finally, users have to specify which variables to export using export.
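To make the data flow concrete, here is a hypothetical page fragment and the rows the tutorial's =policy product block (which exports link and name) would produce from it; both the input and the output are illustrative only.
# page content:
#   <a href="item1.html">Foobar Gadget</a>
#   <a href='item2.html'>Foobar Gizmo</a>
#
# rows exported by =policy product:
[
  { link => "http://foo.bar/item1.html", name => "Foobar Gadget" },
  { link => "http://foo.bar/item2.html", name => "Foobar Gizmo" },
]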
callback
If users have to write callback functions for more complex cases, here is the guideline:
WWW::ContentRetrieval passes two parameters to a callback function: a reference to page's content and page's url.
E.g.
sub my_callback {
    my ($textref, $thisurl) = @_;
    while( $$textref =~ /blahblah/g ){
        do some blahblahs here.
    }
    return an array of hashes, with keys and extracted information.
}
N.B. Callback's return value should be like the following
[
  {
    PRODUCT => "foobar",
    PRICE   => 256,
  },
  {
    ...
  },
];
If users need to retrieve the next page, e.g. when dealing with several pages of search results, push an anonymous hash with only one entry, _DTLURL:
{
  _DTLURL => next url,
}
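Putting the two together, a callback that extracts records and also queues the next page of results could look roughly like this sketch; the patterns, URLs, and field names are placeholders.
sub result_page_filter {
    my ($textref, $thisurl) = @_;
    my @rows;
    # collect product rows; the pattern is only a placeholder
    while ( $$textref =~ m{<a href="(.+?)">(.+?)</a>}g ) {
        push @rows, { link => "http://foo.bar/$1", name => $2 };
    }
    # queue the next page of results, if the page links to one
    if ( $$textref =~ m{<a href="(query\.pl\?[^"]*page=\d+)">next</a>}i ) {
        push @rows, { _DTLURL => "http://foo.bar/$1" };
    }
    return \@rows;
}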
See also t/extract.t, t/robot.t
SEE ALSO
WWW::ContentRetrieval::Spider, WWW::ContentRetrieval::Extract, WWW::ContentRetrieval::CRL
CAVEATS
This module is still alpha, and the interface is subject to change. The source code is distributed without warranty.
Use it at your own risk.
TO DO
Login and logout simulation
COPYRIGHT
xern <xern@cpan.org>
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.