Changes for version 0.67
- Spider now uses LWP::RobotUA to respect robots.txt. Dependency on WWW::Mechanize is removed.
- Spider authorization features now work. Added a bona fide test suite for spidering.
- Expanded Queue API: added remove() and clean(), plus internal locking on get()
- Added Spider->modified_since feature to allow for incremental crawls
- Added new class SWISH::Prog::Aggregator::Spider::Response and refactored the appropriate UA methods into it, since WWW::Mechanize was intentionally blurring the logical distinction between user agent and response.
- Spider->file_rules (new feature) follows the same code path as Aggregator::FS.
- Added Utils::write_log and Utils::write_log_line methods for standardizing debug output
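
For readers unfamiliar with the robots.txt handling mentioned above: LWP::RobotUA relies on WWW::RobotRules to decide whether a URL may be fetched. The following is a minimal offline sketch of that check, not the Spider's actual code. The agent name, the example.com URLs, and the robots.txt content are all made up for illustration.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical agent name; a real crawl would use the spider's own agent string.
my $rules = WWW::RobotRules->new('ExampleSpider/0.1');

# Hypothetical robots.txt body. LWP::RobotUA would normally fetch this itself.
my $robots_txt = <<'EOF';
User-agent: *
Disallow: /private/
EOF

# Register the rules for the host the robots.txt came from.
$rules->parse( 'http://example.com/robots.txt', $robots_txt );

# allowed() returns a false value only for URLs the parsed rules forbid.
for my $url ( 'http://example.com/index.html',
              'http://example.com/private/notes.html' ) {
    printf "%s => %s\n", $url, $rules->allowed($url) ? 'allowed' : 'blocked';
}
```

Note that allowed() returns -1 (a true value) for hosts whose robots.txt has not been parsed yet, so rules must be registered before a URL on that host is checked.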
Modules
information retrieval application framework
document aggregation base class
index DB records with Swish-e
crawl a filesystem
crawl a mail box
crawl a filesystem of email messages
index Perl objects with Swish-e
web aggregator
spider response
spider user agent
simple in-memory cache class
base class for SWISH::Prog classes
read/write Swish-e config files
Document object class for passing to SWISH::Prog::Indexer
create document headers for Swish-e -S prog
base indexer class
base class for Swish-e inverted indexes
read/write InvIndex metadata
wrapper around Swish-e binary
the native Swish-e index format
result class for SWISH::API::Object
wrapper for SWISH::API::Object
simple in-memory FIFO queue class
filename mangler
base result class
base results class
base searcher class
test indexer class
utility variables and methods