NAME

WWW::SpiTract - WWW robot plus text analyzer

SYNOPSIS

  use WWW::SpiTract;
  use Data::Dumper;
  $spitract = WWW::SpiTract->new($desc, {
      TIMEOUT    => 1,
      HTTP_PROXY => 'http://fooproxy:2345/',
  });
  print Dumper $spitract->spitract( $query );

DESCRIPTION

WWW::SpiTract combines a WWW robot with a text analyzer. It can fetch a series of web pages that share a common structure, for example a product catalogue. Users write a description file, and WWW::SpiTract fetches the pages and extracts the desired data. This can be applied to price comparison or meta search, for instance.

METHODS

new

  $s = WWW::SpiTract->new($desc, {
      TIMEOUT    => 1,
      HTTP_PROXY => 'http://fooproxy:2345/',
  });

TIMEOUT defaults to 10 seconds.

spitract

$s->spitract() returns an anonymous array of retrieved data.

You may use Data::Dumper to inspect it.

OTHER TOOLS

WWW::SpiTract::bldTree(htmltext)

Builds a textual HTML tree from the given HTML text. See also HTML::TreeBuilder.

WWW::SpiTract::genDescTmpl

Automatically generates a description template.

DESC FILE TUTORIAL

OVERVIEW

Currently, site descriptions are written as native Perl anonymous arrays and hashes. Let's see an example. Suppose the product query URL of "foobar technology" is http://foo.bar/query.pl?encoding=UTF8&product=blahblahblah

  {
      SITE => {
          NAME   => "foobar tech.",
          NEXT   => [
              'query.pl' => 'detail.pl',
          ],
          POLICY => [
              'http://foo.bar/detail.pl' => [
                  [ "PRODUCT" => "0.1.1.0.0.5.1" ],
                  [ "PRICE"   => "0.1.1.0.0.5.1.0" ],
              ],
          ],
          METHOD => 'GET',
          QHANDL => 'http://foo.bar/query.pl',
          PARAM  => [
              [ 'encoding', 'UTF8' ],
          ],
          KEY    => 'product',
      }
  };

SITE

Key to the settings.

NAME

The name of the site.

NEXT

NEXT is an anonymous array containing pairs of (this pattern => next pattern). If the current URL matches /this pattern/, the page text is searched for URLs that match /next pattern/, and those URLs are queued for retrieval.
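
The matching rule can be illustrated with a minimal sketch. This is not the module's own code: next_urls is a hypothetical helper, the href scan is a deliberately naive stand-in for real HTML parsing, and relative URLs are returned as found rather than resolved against the current URL.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::SpiTract): given the current URL,
# the page text, and a NEXT list of (this pattern => next pattern) pairs,
# return the URLs that would be queued.
sub next_urls {
    my ($current_url, $text, $next) = @_;
    my @queue;
    # Walk the flat list two items at a time.
    for (my $i = 0; $i < $#$next; $i += 2) {
        my ($this, $follow) = @{$next}[$i, $i + 1];
        next unless $current_url =~ /$this/;
        # Naive link scan; an actual robot would parse the HTML properly.
        while ($text =~ /href\s*=\s*["']([^"']+)["']/gi) {
            my $url = $1;
            push @queue, $url if $url =~ /$follow/;
        }
    }
    return @queue;
}

my $page = <<'HTML';
<a href="detail.pl?id=1">one</a>
<a href="about.html">about</a>
<a href="detail.pl?id=2">two</a>
HTML

my @found = next_urls('http://foo.bar/query.pl?product=x', $page,
                      [ 'query.pl' => 'detail.pl' ]);
print "$_\n" for @found;
```

With the sample page above, only the two detail.pl links are kept; about.html is skipped because it does not match the next pattern.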

POLICY

POLICY is an anonymous array containing pairs of (this pattern => node settings). If the current URL matches /this pattern/, data at the given nodes is retrieved. Each node setting (a slice) has this format:

[ NODE_NAME =>
  STARTING_NODE,
  [ VARIABLE INDEX ],
  [ STEPSIZE ],
  [ ENDING ],
  [ sub{FILTER here} ]
 ]

NODE_NAME is the output key for the node data. VARIABLE INDEX is an array of integers giving the positions of the digits in STARTING_NODE that vary. Taking the Cartesian product over those positions, each varying digit advances by its STEPSIZE at a time until the digits at VARIABLE INDEX equal those given in ENDING.

FILTER is a user-written callback function that processes the retrieved data.

All fields except NODE_NAME and STARTING_NODE are optional.

See also t/extract.pl

  • POLICY example

    [ "PRODUCT" => "0.0.0.0", [ 1, 3 ], [ 1, 2 ], [ 3, 4 ], sub { local $_ = shift; s/\s//g; $_ } ]

    Index 1 runs from 0 to 3 in steps of 1 and index 3 runs from 0 to 4 in steps of 2, so data at 0.0.0.0, 0.0.0.2, 0.0.0.4, 0.1.0.0, 0.1.0.2, 0.1.0.4, 0.2.0.0, 0.2.0.2, 0.2.0.4, 0.3.0.0, 0.3.0.2, and 0.3.0.4 will be extracted with all whitespace removed.
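
The expansion in the example above can be reproduced with a short sketch. expand_nodes is a hypothetical helper written for illustration, not the module's implementation; it assumes zero-based indices, positive step sizes, and that stepping stops once a digit would exceed its ENDING value.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::SpiTract): expand a starting node
# into every node address obtained by stepping the digits at the variable
# indices from their starting value up to the matching ENDING value.
sub expand_nodes {
    my ($start, $var_idx, $step, $end) = @_;
    my @nodes = ([ split /\./, $start ]);
    for my $k (0 .. $#$var_idx) {
        my $pos = $var_idx->[$k];
        die "STEPSIZE must be positive\n" unless $step->[$k] > 0;
        my @expanded;
        # Cartesian product: every node so far is combined with every
        # value the digit at $pos can take.
        for my $node (@nodes) {
            for (my $v = $node->[$pos]; $v <= $end->[$k]; $v += $step->[$k]) {
                my @copy = @$node;
                $copy[$pos] = $v;
                push @expanded, \@copy;
            }
        }
        @nodes = @expanded;
    }
    return map { join '.', @$_ } @nodes;
}

my @nodes = expand_nodes("0.0.0.0", [ 1, 3 ], [ 1, 2 ], [ 3, 4 ]);
print join(", ", @nodes), "\n";

# The FILTER from the example: strips all whitespace from the data.
my $filter = sub { local $_ = shift; s/\s//g; $_ };
print $filter->(" 1 234 USD\n"), "\n";
```

The first print lists the same twelve node addresses as the example, in the same order; the second prints 1234USD.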

METHOD

Request method: GET, POST, or PLAIN.

QHANDL

"Query Handler": the URL of the query script.

PARAM

Constant script parameters, excluding the user's query.

KEY

The parameter name under which the user's query string (e.g. a product name) is sent.
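
How QHANDL, PARAM, and KEY relate to the query URL in the overview can be sketched as follows. build_query_url and escape are hypothetical helpers, not the module's API; the sketch covers only the GET case and assumes byte-string values.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode every byte outside the URI
# "unreserved" set (letters, digits, '-', '_', '.', '~').
sub escape {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord $1)/ge;
    return $s;
}

# Hypothetical helper: constant PARAM pairs first, then KEY => user query.
sub build_query_url {
    my ($site, $query) = @_;
    my @pairs = (@{ $site->{PARAM} || [] }, [ $site->{KEY}, $query ]);
    my $qs = join '&', map { escape($_->[0]) . '=' . escape($_->[1]) } @pairs;
    return "$site->{QHANDL}?$qs";
}

# Settings taken from the SITE example in the overview.
my $site = {
    METHOD => 'GET',
    QHANDL => 'http://foo.bar/query.pl',
    PARAM  => [ [ 'encoding', 'UTF8' ] ],
    KEY    => 'product',
};

print build_query_url($site, 'blahblahblah'), "\n";
```

For the settings above this prints http://foo.bar/query.pl?encoding=UTF8&product=blahblahblah, the query URL given in the overview.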

AUTHOR

xern <xern@cpan.org>

LICENSE

Released under The Artistic License.

SEE ALSO

WWW::SpiTract::Spider, WWW::SpiTract::Extract, LWP, WWW::Search