NAME
KeywordsSpider::Core - core of a web spider that searches for keywords
SYNOPSIS
use KeywordsSpider::Core;
my $spider = KeywordsSpider::Core->new(
output_file => $opened_filehandle,
links => \%links,
keywords => \@keywords,
allowed_keywords => \%allowed_keywords,
debug_enabled => 1,
web_depth => 5,
);
DESCRIPTION
KeywordsSpider::Core is the core of a web spider: it fetches links and matches their content against keywords. A matched keyword writes an ALERT to output_file. Allowed keywords do not trigger an ALERT.
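The matching rule above can be sketched as follows. This is only an illustration, not the module's actual code: the helper name match_keywords is hypothetical, and the case-insensitive literal substring match is an assumption, since the documentation does not say how keywords are compared.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KeywordsSpider::Core's documented API.
# Returns the keywords found in $content and a flag saying whether any
# of them is missing from the allowed-keywords hash.
sub match_keywords {
    my ($content, $keywords, $allowed) = @_;

    # Assumption: literal, case-insensitive substring match.
    # \Q...\E escapes regex metacharacters inside the keyword.
    my @found = grep { $content =~ /\Q$_\E/i } @$keywords;

    # Assumption: an ALERT fires when at least one found keyword is not allowed.
    my $alert = grep { !$allowed->{$_} } @found;

    return (\@found, $alert ? 1 : 0);
}

my @keywords         = ('word1', 'word2');
my %allowed_keywords = (word2 => 1);

my ($found, $alert) =
    match_keywords('text with word2 only', \@keywords, \%allowed_keywords);
printf "%s%s\n", ($alert ? 'ALERT ' : ''), join(' ', @$found);
```

In this sketch a page containing only an allowed keyword is still reported as a match, but without the ALERT flag.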
Websites are marked by the 'want_spider' parameter in the links hash. They are spidered to 'web_depth' (default 3), and links found in their content are added to the links hash. Other links are only checked for keywords, without spidering.
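A depth-limited traversal of this kind might look like the sketch below. It is not the module's implementation: spider_to_depth and %fake_web are hypothetical names, the fake page graph replaces real HTTP fetching, and the exact way depth is counted against 'web_depth' is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for fetching a page and extracting its links;
# a real spider would make an HTTP request here.
my %fake_web = (
    'http://website.sk'        => ['http://website.sk/a.html'],
    'http://website.sk/a.html' => ['http://website.sk/b.html'],
    'http://website.sk/b.html' => ['http://website.sk/c.html'],
    'http://website.sk/c.html' => [],
);

# Hypothetical function: follows links only from entries marked want_spider,
# adding discovered pages to the links hash until $web_depth is reached.
sub spider_to_depth {
    my ($links, $web_depth) = @_;
    my @queue = sort grep { $links->{$_}{want_spider} } keys %$links;

    while (defined(my $url = shift @queue)) {
        my $depth = $links->{$url}{depth};
        next if $depth >= $web_depth;    # assumption: stop following at the limit

        for my $found (@{ $fake_web{$url} || [] }) {
            next if exists $links->{$found};    # already known
            $links->{$found} = { depth => $depth + 1 };
            push @queue, $found;
        }
    }
    return $links;
}

my %links = (
    'http://website.sk' => { want_spider => 1, depth => 0 },
    'http://referer.sk' => { depth => 0 },
);
spider_to_depth(\%links, 2);
print "$_ (depth $links{$_}{depth})\n" for sort keys %links;
```

With a limit of 2, the sketch adds a.html at depth 1 and b.html at depth 2, never reaches c.html, and leaves http://referer.sk untouched because it lacks 'want_spider'.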
ARGUMENTS
- output_file
-
opened file handle
- keywords
-
array of keywords you want to find
- allowed_keywords
-
hash of keywords that do not trigger an ALERT, for example:
my %allowed_keywords = ( wuord1 => 1, );
- links
-
websites you want to spider and referer URLs you want to check, for example:
my %links = (
'http://website.sk' => { 'want_spider' => 1, 'depth' => 0, },
'http://referer.sk' => { 'depth' => 0, },
);
Note that the links hash is modified while the spider runs.
- debug_enabled
-
if true, prints debug messages to standard output
- web_depth
-
depth to which website will be scanned. Default is 3.
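The arguments above could be prepared along these lines. This is a sketch under stated assumptions: the @websites and @referers lists, the %args hash, and the in-memory output buffer are illustrative choices, not something the module requires. The constructor call is left commented out because it needs KeywordsSpider::Core to be installed.

```perl
use strict;
use warnings;

# Hypothetical input lists; the names are illustrative only.
my @websites = ('http://website.sk');
my @referers = ('http://referer.sk');

# Websites get want_spider; referers are only checked for keywords.
my %links;
$links{$_} = { want_spider => 1, depth => 0 } for @websites;
$links{$_} ||= { depth => 0 } for @referers;

# An in-memory filehandle stands in for a real output file here.
open(my $out, '>', \my $buffer) or die "cannot open output: $!";

my %args = (
    output_file      => $out,
    links            => \%links,
    keywords         => ['word1', 'word2'],
    allowed_keywords => { word2 => 1 },
    debug_enabled    => 0,
    web_depth        => 3,
);

# my $spider = KeywordsSpider::Core->new(%args);   # requires the module
print join(', ', sort keys %args), "\n";
```

For real use, the in-memory handle would be replaced by a handle opened on a file path for writing.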
METHODS
- spider_links
-
main method
- settle_website WEBSITE
-
makes necessary settings to spider website
- spider_website
-
scans website according to settings
- check_website
-
checks whether the URL's content matches keywords
- add_links_from_root
-
adds links found in the URL's content to the links hash
- debug
-
if debug is enabled, prints a string to standard output
SAMPLE OUTPUT
SPIDER http://domain.sk
this IS NOT counted as alerted
----------------------------------------------------------------------
SPIDER LINKS
SPIDER http://trololo.sk
ERROR:404 Not Found
this IS NOT counted as alerted
SPIDER LINKS
SPIDER http://domain.sk/old.html
possible bad content http://domain.sk/old.html word2
found keywords: 1
fetching http://domain.sk/new.html
ALERT possible bad content http://domain.sk/new.html wuord1 word2
found keywords: 2
fetching http://domain.sk/lala.txt
SKIPPING because of content type or length
SPIDER http://domain.sk
this IS counted as alerted
SEE ALSO
KeywordsSpider -- takes files as arguments and prepares attributes for KeywordsSpider::Core
COPYRIGHT
Copyright 2013 Katarina Durechova
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.