NAME
OurNet::FuzzyIndex - Inverted search for double-byte characters
SYNOPSIS
use OurNet::FuzzyIndex;
my $idxfile = 'test.idx'; # Name of the database file
my $pagesize = undef; # Page size (twice the size of an average record)
my $cache = undef; # Cache size (undef to use default)
my $subdbs = 0; # Number of child dbs; 0 for none
# Initiate the DB from scratch
unlink $idxfile if -e $idxfile;
my $db = OurNet::FuzzyIndex->new($idxfile, $pagesize, $cache, $subdbs);
# Index a record: key = 'Doc1', content = 'Some text here'
$db->insert('Doc1', 'Some text here');
# Alternatively, parse the content first with different weights
my %words = $db->parse("Some other text here", 5);
%words = $db->parse_xs("Some more texts here", 2, \%words);
# Then index the resulting hash with 'Doc2' as its key
$db->insert('Doc2', %words);
# Perform a query: the 2nd argument is the match flag ($MATCH_FUZZY here)
my %result = $db->query('search for some text', $MATCH_FUZZY);
# Combine the result with another query
%result = $db->query('more please', $MATCH_NOT, \%result);
# Dump the results; note you have to call $db->getkey each time
foreach my $idx (sort { $result{$b} <=> $result{$a} } keys %result) {
    my $val = $result{$idx};
    print "Matched: " . $db->getkey($idx) . " (score $val)\n";
}
# Set database variables
$db->setvar('variable', "fetch success!\n");
print $db->getvar('variable');
# Get all records: the optional 0 says we want an array of keys
print "These records are indexed:\n";
print join(',', $db->getkeys(0));
# Alternatively, get it with its internal index number
my %allkeys = $db->getkeys(1);
DESCRIPTION
OurNet::FuzzyIndex implements a simple consecutive-letter indexing mechanism specifically designed for multi-byte encoding maps, e.g. big-5 or utf8.
It uses DB_File to create an associative mapping from each character to the one that follows it, using DB_BTREE's duplicate-key feature to speed up queries. Its scoring algorithm is also geared to reduce the impact of redundant words on a query's result.
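The consecutive-character idea can be illustrated with a minimal pure-Perl sketch. It is not the module's implementation: it uses an in-memory hash instead of DB_File, and the helper names char_pairs, index_text and search_text are hypothetical. Counting each distinct query pair only once is likewise an assumption about one way to limit the effect of repeated words, not the module's actual scoring.

```perl
use strict;
use warnings;

# Hypothetical helper: return every pair of adjacent characters in $text,
# skipping pairs that contain whitespace.
sub char_pairs {
    my ($text) = @_;
    my @chars = split //, $text;
    my @pairs;
    for my $i (0 .. $#chars - 1) {
        my ($first, $next) = @chars[$i, $i + 1];
        next if "$first$next" =~ /\s/;
        push @pairs, [$first, $next];
    }
    return @pairs;
}

# Record each (character, following character) pair under the record key.
# $index is a hash ref: first char => next char => record key => count.
sub index_text {
    my ($index, $key, $text) = @_;
    $index->{ $_->[0] }{ $_->[1] }{$key}++ for char_pairs($text);
    return;
}

# Score records by how many distinct query pairs they contain.
sub search_text {
    my ($index, $text) = @_;
    my (%score, %seen);
    for my $pair (char_pairs($text)) {
        my ($first, $next) = @$pair;
        next if $seen{"$first$next"}++;          # count each pair once
        my $followers = $index->{$first} or next;
        my $records   = $followers->{$next} or next;
        $score{$_}++ for keys %$records;
    }
    return %score;
}
```

After indexing 'some text here' under the key Doc1, searching for 'text' scores Doc1 on the three pairs 'te', 'ex' and 'xt'. The same code works on multi-byte text as long as the strings have been decoded into Perl characters first.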
This module also supports a distributed database option, which optimizes each query to access only a small portion of the database.
Although this module currently supports only the big-5 and latin-1 encodings internally, you can override the parse.c module to extend it, or add your own translation maps.
KNOWN ISSUES
The query() function uses a time-consuming callback function, _parse_q(), to parse the query string; it is expected to be changed to a simple function that returns the whole processed list. (Fortunately, most query strings won't be long enough to make a significant difference.)
The MATCH_EXACT flag is misleading; FuzzyIndex cannot tell whether a query matches the content exactly from the information stored in the index file alone. You are encouraged to write your own grep-like post filter.
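A grep-like post filter could look like the sketch below. The exact_filter function and its $fetch callback are hypothetical and not part of this module; the sketch assumes you can retrieve a record's original text by its key (from a file, a database, and so on) and that the hits are already keyed by record key, e.g. after translating each index with getkey().

```perl
use strict;
use warnings;

# Hypothetical helper, not part of OurNet::FuzzyIndex.
#   $hits   - hash ref of record key => score
#   $phrase - literal phrase that must appear in the record's content
#   $fetch  - code ref returning the original text for a record key
# Returns the subset of hits whose content contains $phrase verbatim.
sub exact_filter {
    my ($hits, $phrase, $fetch) = @_;
    my %kept;
    for my $key (keys %$hits) {
        my $text = $fetch->($key);
        next unless defined $text;               # content not available
        $kept{$key} = $hits->{$key} if index($text, $phrase) >= 0;
    }
    return %kept;
}
```

Using index() rather than a regular expression keeps the phrase literal, so characters such as '.' or '*' in the query are not treated as pattern syntax.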
TODO
Internal handling of locale/unicode mappings
Boolean / selective search using combined MATCH_* flags
Fix bugs concerning sub_dbs
SEE ALSO
fzindex, fzquery, OurNet::ChatBot
AUTHORS
Autrijus Tang <autrijus@autrijus.org>, Chia-Liang Kao <clkao@clkao.org>.
COPYRIGHT
Copyright 2001 by Autrijus Tang <autrijus@autrijus.org>, Chia-Liang Kao <clkao@clkao.org>.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
See http://www.perl.com/perl/misc/Artistic.html