NAME

WWW::ContentRetrieval - WWW robot plus text analyzer

SYNOPSIS

use WWW::ContentRetrieval;
use Data::Dumper;
$robot = WWW::ContentRetrieval->new($desc);
print Dumper $robot->retrieve( $query );

DESCRIPTION

WWW::ContentRetrieval combines the power of a web robot with a text analyzer. It can fetch a series of web pages that share some attributes in common. Users write a description file, and WWW::ContentRetrieval does the fetching and extracts the desired data.

METHODS

new

  # with site's description only
  $s = new WWW::ContentRetrieval($desc);

  # or
  $s = new WWW::ContentRetrieval($desc_filename);


  # or with full argument list
  $s =
    new WWW::ContentRetrieval(
			      DESC       => $desc,
			      # site's description

			      TIMEOUT    => 3,
			      # default is 10 secs.

			      HTTP_PROXY => 'http://fooproxy:2345/',

			      );

retrieve

$s->retrieve($query) returns an anonymous array of retrieved records.

You may use Data::Dumper to inspect it.
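For instance, assuming $desc holds a CRL description like the one in the tutorial below, and that its policy exports the fields link and name (as in the tutorial's example), a minimal retrieval loop might look like this sketch:

  use WWW::ContentRetrieval;

  # $desc holds the site's CRL description (see the tutorial below)
  my $robot  = WWW::ContentRetrieval->new($desc);
  my $result = $robot->retrieve("foobar");

  foreach my $record (@$result) {
      # each record carries the fields exported by the matching
      # policy, or the keys returned by a callback
      print "$record->{name}\t$record->{link}\n";
  }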

EXPORT

gentmpl

Generates a description template.

Users can generate one from the command line:

perl -MWWW::ContentRetrieval -e'print gentmpl'
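For example, the output can be redirected into a file (the filename site.desc here is arbitrary) and edited from there:

perl -MWWW::ContentRetrieval -e'print gentmpl' > site.desc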

DESC FILE TUTORIAL

OVERVIEW

WWW::ContentRetrieval uses a POD-like language called CRL (Content Retrieval Language) for users to define a site's description. See WWW::ContentRetrieval::CRL for details.

Now, suppose the product query URL of "foobar technology" is http://foo.bar/query.pl?encoding=UTF8&product=blahblahblah; then the description looks like the following.

$desc = <<'...';

=crl foobar tech.

=fetch

=url http://foo.bar/

=method PLAIN

=param encoding

utf-8

=key product

=case m/./

product

=policy product

mainmatch=m,<a href=(["'])(.+?)\1>(.+?)</a>,sg
link="http://foo.bar/".$2
name=$3
export=link name

=policy nexturls

blah blah looking for urls

=callback

sub {
    my ($textref, $thisurl) = @_;
    # blah blah ...
    # write your filter code here
}

=LRC

...

crl

Beginning of the site's description. It is followed by the site's name.

fetch

Beginning of the fetching block.

url

The web page you are dealing with.

method

PLAIN | GET | POST

param

The web script's parameters. It is followed by a key and a value.

key

The variable part of the parameters. The argument passed to the retrieve method will be joined with this key.

Both param and key are order-sensitive; that is, the order in which they appear in the description file determines the order of parameters in the request URL.
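For instance, given the "foobar technology" description above, calling $s->retrieve("blahblahblah") should produce a request along the lines of the query URL shown earlier (this is only a sketch; the exact form depends on the =method setting):

  http://foo.bar/query.pl?encoding=UTF8&product=blahblahblah

with encoding preceding product because =param appears before =key in the description file.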

case

It takes two arguments: a regular expression and the name of a page filter.

If the page's URL matches the pattern, the corresponding filter will be invoked.

See the policy and callback sections for details.
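For example, a site whose product pages and result listings need different filters might declare something like the following (the URL patterns are purely illustrative; product and nexturls are the filter names defined by the =policy blocks in the tutorial's example):

  =case m/query\.pl/

  product

  =case m/list\.pl/

  nexturls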

policy and callback

Policies and callbacks are the guts of this module; they help extract data from pages.

  • policy

    Policy takes two parameters: a regular expression and lines of a data-manipulation sublanguage. Here is an example.

    mainmatch=m,<a href=(["'])(.+?)\1>(.+?)</a>,sg
    link="http://foobar:/$2"
    name=$3
    match(link)=m,^http://(.+?),
    site=$1
    replace(link)=s/http/ptth/
    reject(name)=m/^bb/
    export=link name site

    First, use mainmatch to look for the desired pattern. Then users can assign values to self-defined variables and go deeper, capturing values with match(variable). replace can modify extracted text, and reject discards values that match a given pattern. Finally, users must specify which variables to export using export.

  • callback

    If users need to write callback functions for more complex cases, here are the guidelines:

    WWW::ContentRetrieval passes two parameters to a callback function: a reference to the page's content and the page's URL.

    E.g.

    sub my_callback{
        my ($textref, $thisurl) = @_;
        my @records;
        while( $$textref =~ /blahblah/g ){
             # do some blahblahs here, pushing a hash of the
             # extracted fields onto @records
        }
        # return a reference to an array of hashes, with keys
        # naming the extracted information
        return \@records;
    }

    N.B. The callback's return value should look like the following:

    [
     {
      PRODUCT => "foobar",
      PRICE => 256,
     },
     {
      ...
     }
    ];

    If users need to retrieve the next page, e.g. when dealing with several pages of search results, push an anonymous hash with only one entry, _DTLURL:

    {
     _DTLURL => $next_url,   # URL of the next page to fetch
    }

    See also t/extract.t, t/robot.t
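    Putting the pieces together, a complete callback might look like the following sketch (the patterns and field names are purely illustrative; only the argument list, the return structure, and the _DTLURL convention come from this module):

    sub my_filter {
        my ($textref, $thisurl) = @_;
        my @records;

        # one record per matching anchor; the pattern is illustrative
        while( $$textref =~ m,<a href="([^"]+)">([^<]+)</a>,g ){
            push @records, { LINK => $1, NAME => $2 };
        }

        # if the page points to a further page of results, ask the
        # robot to follow it by pushing a hash whose only key is _DTLURL
        if( $$textref =~ m,<a href="([^"]+)">Next</a>, ){
            push @records, { _DTLURL => $1 };
        }

        return \@records;
    }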

SEE ALSO

WWW::ContentRetrieval::Spider, WWW::ContentRetrieval::Extract, WWW::ContentRetrieval::CRL

CAVEATS

This module is still alpha, and the interface is subject to change. The source code is distributed without warranty.

Use it at your own risk.

TO DO

Login and logout simulation

COPYRIGHT

xern <xern@cpan.org>

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
