NAME
KeywordsSpider::Core - core for web spider searching for keywords
SYNOPSIS
use KeywordsSpider::Core;
my $spider = KeywordsSpider::Core->new(
  output_file => $opened_filehandle,
  links => \%links,
  keywords => \@keywords,
  allowed_keywords => \%allowed_keywords,
  debug_enabled => 1,
  web_depth => 5,
);
DESCRIPTION
KeywordsSpider::Core is core for web spider, which spiders links, and matches their content against keywords. Keyword trigger ALERT to output_file. Allowed keywords do not trigger ALERT.
Websites are defined by 'want_spider' parameter in the links hash. The are spidered to 'web_depth' (default 3), and links in their content are added to links hash. Other links are just checked for keywords, no spidering.
ARGUMENTS
- output_file
 - 
opened file handle
 - keywords
 - 
array of keywords you want to find
 - allowed_keywords
 - 
hash of keywords which do not trigger ALERT. Like:
my %allowed_keywords = ( wuord1 => 1, ); - links
 - 
websites and referer urls you want to spider. Like:
my %links = ( 'http://website.sk' => { 'want_spider' => 1, 'depth' => 0, }, 'http://referer.sk' => { 'depth' => 0, }, );note, that links hash is changed, when running the spider
 - debug_enabled
 - 
prints debug messages to standard output
 - web_depth
 - 
depth to which website will be scanned. Default is 3.
 
METHODS
- spider_links
 - 
main method
 - settle_website WEBSITE
 - 
makes necessary settings to spider website
 - spider_website
 - 
scans website according to settings
 - check_website
 - 
checks if url's content matches keywords
 - add_links_from_root
 - 
add links in url's content to links hash
 - debug
 - 
if debug enabled, prints string to standard output
 
SAMPLE OUTPUT
SPIDER http://domain.sk
this IS NOT counted as alerted
----------------------------------------------------------------------
SPIDER LINKS
SPIDER http://trololo.sk
ERROR:404 Not Found
this IS NOT counted as alerted
SPIDER LINKS
SPIDER http://domain.sk/old.html
possible bad content http://domain.sk/old.html word2
found keywords: 1
fetching http://domain.sk/new.html
ALERT possible bad content http://domain.sk/new.html  wuord1 word2
found keywords: 2
fetching http://domain.sk/lala.txt
SKIPPING because of content type or length
SPIDER http://domain.sk
this IS counted as alerted
SEE ALSO
KeywordsSpider -- takes files as arguments and prepares attributes for KeywordsSpider::Core
COPYRIGHT
Copyright 2013 Katarina Durechova
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.