NAME

WWW::ContentRetrieval - WWW robot plus text analyzer

SYNOPSIS

  use WWW::ContentRetrieval;
  use Data::Dumper;
  $robot = WWW::ContentRetrieval->new($desc,
				 {
				     TIMEOUT    => 3,
				     HTTP_PROXY => 'http://fooproxy:2345/',
				 });
  print Dumper $robot->retrieve( $query );

DESCRIPTION

WWW::ContentRetrieval combines the power of a WWW robot and a text analyzer. It can fetch a series of web pages that share some attributes in common, for example, a product catalogue. Users write a description file, and WWW::ContentRetrieval does the fetching and extracts the desired data. This can be applied to price comparison or meta search, for instance.

METHODS

new

  $s =
    new WWW::ContentRetrieval(
			      $desc,
			      {
				  TIMEOUT    => 3,
				  # default is 10 seconds.

				  HTTP_PROXY => 'http://fooproxy:2345/',

				  DEBUG      => 1,
				  # non-zero to print out debugging msgs
			      });

retrieve

$s->retrieve($query) returns a reference to an array of retrieved data.

You may use Data::Dumper to inspect it.

EXPORT

genDescTmpl

generates a description template.

Users can run it from the command line:

  perl -MWWW::ContentRetrieval -e'print genDescTmpl'

DESC FILE TUTORIAL

OVERVIEW

Currently, this module uses Perl's native anonymous arrays and hashes for writing site descriptions. Let's see an example.

Suppose the product query URL of "foobar technology" is http://foo.bar/query.pl?encoding=UTF8&product=blahblahblah. Then the description looks like the following:

  $desc = {
      NAME   => "foobar tech.",
      NEXT   => [
          'query.pl' => 'detail.pl',
      ],
      POLICY => [
          'http://foo.bar/foobarproduct.pl' => \&extraction_callback,
      ],
      METHOD => 'GET',
      QHANDL => 'http://foo.bar/query.pl',
      PARAM  => [
          ['encoding', 'UTF8'],
      ],
      KEY    => 'product',
  };

NAME

The name of the site.

NEXT

NEXT is an anonymous array containing pairs of (this pattern => next pattern). If the current URL matches /this pattern/, then the page text is searched for URLs matching /next pattern/, and those URLs are queued for the next round of retrieval.
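The queueing behavior can be sketched as follows. This is an illustrative re-implementation, not the module's actual code; the sample URL and page markup are made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# NEXT pairs: if the current URL matches the left pattern,
# links matching the right pattern are queued for the next round.
my @next = (
    'query.pl' => 'detail.pl',
);

my $current_url = 'http://foo.bar/query.pl?product=x';
my $page = '<a href="detail.pl?id=1">1</a> <a href="about.html">about</a>';

my @queued;
for (my $i = 0; $i < @next; $i += 2) {
    my ($this_pat, $next_pat) = @next[$i, $i + 1];
    next unless $current_url =~ /$this_pat/;
    # collect every link on the page that matches the "next" pattern
    while ($page =~ /href="([^"]+)"/g) {
        push @queued, $1 if $1 =~ /$next_pat/;
    }
}
print "@queued\n";   # detail.pl?id=1
```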

POLICY

POLICY is an anonymous array containing pairs of (this pattern => callback). If the current url matches /this pattern/, then the corresponding callback will be invoked.

WWW::ContentRetrieval passes two parameters to a callback function: a reference to the page's content and the page's URL.

E.g.

  sub my_callback {
      my ($textref, $thisurl) = @_;
      while ( $$textref =~ /blahblah/g ) {
          # ... extract fields from the matched text here ...
      }
      # return a reference to an array of hashes holding
      # the extracted information
  }

N.B. A callback's return value should look like the following:

  [
    {
      PRODUCT => "foobar",
      PRICE   => 256,
    },
    {
      ...
    },
  ];

If users need WWW::ContentRetrieval to retrieve a further page, e.g. when a search returns several pages of results, push an anonymous hash with only one entry, _DTLURL, onto the result array:

  {
    _DTLURL => $next_url,   # URL of the next page to fetch
  }

See also t/extract.t, t/robot.t
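A complete callback in this style might look like the following sketch. The page markup, field names, and regular expression are hypothetical; a real callback would match whatever the target site actually emits:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical callback: extracts product/price pairs from a made-up
# markup and returns the array-of-hashes structure described above.
sub my_callback {
    my ($textref, $thisurl) = @_;
    my @results;
    while ($$textref =~ m{<item>(.+?)\s*:\s*(\d+)</item>}g) {
        push @results, { PRODUCT => $1, PRICE => $2 };
    }
    # To continue with another result page, one would also push
    # { _DTLURL => $next_page_url } onto @results here.
    return \@results;
}

my $page = '<item>foobar : 256</item><item>bazqux : 128</item>';
my $rows = my_callback(\$page, 'http://foo.bar/foobarproduct.pl');
print "$_->{PRODUCT}=$_->{PRICE}\n" for @$rows;
# foobar=256
# bazqux=128
```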

METHOD

Request method: GET, POST, or PLAIN.

QHANDL

Query handler: the URL of the site's query script.

PARAM

Constant script parameters, excluding the user's query itself.

KEY

Key to the user's query strings, e.g. product names.
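For a GET request, QHANDL, PARAM, and KEY combine into the final query URL. The sketch below only illustrates how the fields fit together (it omits URI escaping, which a real request would need); the module builds the actual request itself:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The relevant fields from the description, as in the tutorial example.
my %desc = (
    METHOD => 'GET',
    QHANDL => 'http://foo.bar/query.pl',
    PARAM  => [ ['encoding', 'UTF8'] ],
    KEY    => 'product',
);

my $query = 'blahblahblah';    # the user's query string

# Append the user's query under KEY after the constant PARAM pairs.
my @pairs = ( @{ $desc{PARAM} }, [ $desc{KEY}, $query ] );
my $url   = $desc{QHANDL} . '?'
          . join '&', map { "$_->[0]=$_->[1]" } @pairs;

print "$url\n";
# http://foo.bar/query.pl?encoding=UTF8&product=blahblahblah
```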

TO DO

  • A small language for site descriptions

SEE ALSO

WWW::ContentRetrieval::Spider, WWW::ContentRetrieval::Extract

COPYRIGHT

xern <xern@cpan.org>

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.