NAME
OurNet::FuzzyIndex - Inverted index for double-byte characters
SYNOPSIS
use OurNet::FuzzyIndex;
my $idxfile = 'test.idx'; # Name of the database file
my $pagesize = undef; # Page size (twice the size of an average record)
my $cache = undef; # Cache size (undef to use default)
my $subdbs = 0; # Number of child dbs; 0 for none
# Initialize the DB from scratch
unlink $idxfile if -e $idxfile;
my $db = OurNet::FuzzyIndex->new($idxfile, $pagesize, $cache, $subdbs);
# Index a record: key = 'Doc1', content = 'Some text here'
$db->insert('Doc1', 'Some text here');
# Alternatively, parse the content first with different weights
my %words = $db->parse("Some other text here", 5);
%words = $db->parse_xs("Some more texts here", 2, \%words);
# Then index the resulting hash with 'Doc2' as its key
$db->insert('Doc2', %words);
# Perform a query: the 2nd argument is a MATCH_* flag
my %result = $db->query('search for some text', $MATCH_FUZZY);
# Combine the result with another query
%result = $db->query('more please', $MATCH_NOT, \%result);
# Dump the results; note you have to call $db->getkey each time
foreach my $idx (sort { $result{$b} <=> $result{$a} } keys %result) {
    my $val = $result{$idx};
    print "Matched: " . $db->getkey($idx) . " (score $val)\n";
}
# Set database variables
$db->setvar('variable', "fetch success!\n");
print $db->getvar('variable');
# Get all records: the optional 0 says we want an array of keys
print "These records are indexed:\n";
print join(',', $db->getkeys(0));
# Alternatively, get it with its internal index number
my %allkeys = $db->getkeys(1);
DESCRIPTION
OurNet::FuzzyIndex implements a simple consecutive-letter indexing mechanism specifically designed for multi-byte encoding maps, e.g. big-5 or utf8.
It uses DB_File to create an associative mapping from each character to the character that follows it, utilizing DB_BTREE's duplicate key feature to speed up queries. Its scoring algorithm is also geared to reduce the impact of redundant words on the query result.
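As a rough illustration of the consecutive-character idea, the sketch below indexes character pairs in a plain in-memory hash. It is not the module's implementation: the names %index, pairs, index_text and query_text are hypothetical, the hash stands in for the DB_File storage, and the scoring (one point per distinct matching pair) is an assumed simplification. It also assumes the text has already been decoded into characters.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the on-disk index:
# $index{first_char}{second_char}{doc_key} = occurrence count
my %index;

# Split text into words and return every pair of consecutive characters.
sub pairs {
    my ($text) = @_;
    my @pairs;
    for my $word (split ' ', lc $text) {
        my @chars = split //, $word;
        push @pairs, [ @chars[ $_, $_ + 1 ] ] for 0 .. $#chars - 1;
    }
    return @pairs;
}

# Record each consecutive-character pair of $text under $key.
sub index_text {
    my ($key, $text) = @_;
    $index{ $_->[0] }{ $_->[1] }{$key}++ for pairs($text);
}

# Score documents by the number of distinct query pairs they contain.
sub query_text {
    my ($text) = @_;
    my (%score, %seen);
    for my $pair (pairs($text)) {
        my ($first, $second) = @$pair;
        next if $seen{"$first$second"}++;    # a repeated pair counts once
        my $docs = ($index{$first} || {})->{$second} or next;
        $score{$_}++ for keys %$docs;
    }
    return %score;
}

index_text('Doc1', 'some text here');
index_text('Doc2', 'other words');
my %hits = query_text('text');
```

Counting a repeated pair only once is one simple way to keep a repeated query word from inflating a score; the module's own scoring may differ.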
Although this module currently supports only big-5 and latin-1 encodings internally, you can override the parse.c module to extend it, or add your own translation maps.
KNOWN ISSUES
The query() function uses a time-consuming callback function, _parse_q, to parse the query string; it is expected to be replaced by a simple function that returns the whole processed list. (Fortunately, most query strings won't be long enough to make a significant difference.)
The MATCH_EXACT flag is misleading; FuzzyIndex cannot tell whether a query matches the content exactly from the information stored in the index file alone. You are encouraged to write your own grep-like post filter.
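A post filter of that kind might look like the following sketch. The function name exact_filter and the $fetch callback are hypothetical; how the original text of a record is retrieved is left to the caller, since the index itself does not store it. The sample %docs and %result hashes only imitate the shape of a query() result (internal index to score).

```perl
use strict;
use warnings;

# Keep only the hits whose original text really contains $phrase.
#   $result - hash ref of internal index => score
#   $fetch  - hypothetical callback returning the record's text for an index
sub exact_filter {
    my ($phrase, $result, $fetch) = @_;
    my %kept;
    for my $idx (keys %$result) {
        my $text = $fetch->($idx);
        next unless defined $text;
        # index() does a literal substring search, so no regex escaping is needed
        $kept{$idx} = $result->{$idx} if index($text, $phrase) >= 0;
    }
    return %kept;
}

# Made-up sample data standing in for stored documents and a query result
my %docs   = (1 => 'Some text here', 2 => 'text comes in some order');
my %result = (1 => 20, 2 => 15);
my %exact  = exact_filter('Some text', \%result, sub { $docs{ $_[0] } });
```

In real use the callback would typically map the internal index to a record key with $db->getkey($idx) and then load that record's text from wherever it is kept.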
TODO
* Internal handling of locale/unicode mappings
* Boolean / selective search using combined MATCH_* flags
* Fix bugs concerning sub_dbs
AUTHORS
Autrijus Tang <autrijus@autrijus.org>
COPYRIGHT
Copyright 2001 by Autrijus Tang <autrijus@autrijus.org>.
All rights reserved. You can redistribute and/or modify this module under the same terms as Perl itself.